Examples of TextExtractor

au.id.jericho.lib.html.TextExtractor
it.unimi.dsi.parser.callback.TextExtractor
net.htmlparser.jericho.TextExtractor
... Apache Lucene, especially when the {@link #setIncludeAttributes(boolean) IncludeAttributes} property has been set to true.
Use one of the following methods to obtain the output:
- {@link #writeTo(Writer)}
- {@link #appendTo(Appendable)}
- {@link #toString()}
- {@link CharStreamSourceUtil#getReader(CharStreamSource) CharStreamSourceUtil.getReader(this)}
The process removes all of the tags and {@linkplain CharacterReference#decodeCollapseWhiteSpace(CharSequence) decodes the result, collapsing all white space}. A space character is included in the output where a normal tag is present in the source, unless the tag belongs to an {@linkplain HTMLElements#getInlineLevelElementNames() inline-level} element. An exception to this is the {@link HTMLElementName#BR BR} element, which is also converted to a space despite being an inline-level element.
Text inside {@link HTMLElementName#SCRIPT SCRIPT} and {@link HTMLElementName#STYLE STYLE} elements contained within this segment is ignored.
Setting the {@link #setExcludeNonHTMLElements(boolean) ExcludeNonHTMLElements} property results in the exclusion of any content within a non-HTML element.
See the {@link #excludeElement(StartTag)} method for details on how to implement a more complex mechanism to determine whether the {@linkplain Element#getContent() content} of each {@link Element} is to be excluded from the output.
All tags that are not normal tags, such as {@linkplain TagType#isServerTag() server tags}, {@linkplain StartTagType#COMMENT comments} etc., are removed from the output without adding white space to the output.
Note that segments on which the {@link Segment#ignoreWhenParsing()} method has been called are treated as text rather than markup, resulting in their inclusion in the output. To remove specific segments before extracting the text, create an {@link OutputDocument} and call its {@link OutputDocument#remove(Segment) remove(Segment)} or {@link OutputDocument#replaceWithSpaces(int,int) replaceWithSpaces(int begin, int end)} method for each segment to be removed. Then create a new source document using {@link Source#Source(CharSequence) new Source(outputDocument.toString())} and perform the text extraction on this new source object.
Extracting the text from an entire {@link Source} object performs a {@linkplain Source#fullSequentialParse() full sequential parse} automatically.
To perform a simple rendering of HTML markup into text, which is more readable than the output of this class, use the {@link Renderer} class instead.

Example:

Using the default settings, the source segment:
"<div><b>O</b>ne</div><div title="Two"><b>Th</b><script>//a script </script>ree</div>"
produces the text "One Two Three".
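
The tag-handling rules described above can be illustrated with a small standalone sketch. This is not Jericho's implementation: it uses regular expressions instead of a parser, skips character reference decoding, and the class name `TagStripSketch`, the `includeTitle` flag, and the short inline-level list are all assumptions made for illustration.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the documented rules; NOT the Jericho TextExtractor.
class TagStripSketch {
    // Illustrative subset only; the library exposes the real list via
    // HTMLElements.getInlineLevelElementNames().
    private static final Set<String> INLINE =
            Set.of("a", "b", "i", "em", "strong", "span", "code", "u", "br");

    private static final Pattern IGNORED =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)\\b[^>]*>");
    private static final Pattern TITLE =
            Pattern.compile("(?i)\\btitle\\s*=\\s*\"([^\"]*)\"");

    static String extract(String html, boolean includeTitle) {
        // SCRIPT and STYLE content is dropped; comments vanish without adding a space.
        String s = IGNORED.matcher(html).replaceAll("");
        s = COMMENT.matcher(s).replaceAll("");

        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(s);
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            String name = m.group(1).toLowerCase();
            // A tag becomes a space unless it is inline-level; BR is the exception.
            if (!INLINE.contains(name) || name.equals("br")) {
                out.append(' ');
            }
            // Assumption: only double-quoted title attributes are picked up.
            if (includeTitle) {
                Matcher t = TITLE.matcher(m.group());
                if (t.find()) {
                    out.append(' ').append(t.group(1)).append(' ');
                }
            }
            last = m.end();
        }
        out.append(s, last, s.length());
        // Collapse runs of white space and trim the ends.
        return out.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Applied to the source segment above, `TagStripSketch.extract(html, true)` returns "One Two Three", and `extract(html, false)` returns "One Three" because the sketch then skips the title attribute.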
org.apache.jackrabbit.extractor.TextExtractor
Interface for extracting text content from binary streams.
org.modeshape.jcr.api.text.TextExtractor
An abstraction for components that are able to extract text content from an input stream.
org.pdfclown.tools.TextExtractor
fanochizzolini.it) @since 0.0.8 @version 0.1.0
org.textmining.extraction.TextExtractor

Examples of au.id.jericho.lib.html.TextExtractor

  // Returns the text content of the given HTML, or null if there is none.
  private static String extractText(String htmlContent) {
    if (htmlContent != null && htmlContent.length() > 0) {
      Source source = new Source(htmlContent);
      TextExtractor extractor = new TextExtractor(source);
      extractor.setConvertNonBreakingSpaces(true);
      extractor.setExcludeNonHTMLElements(false);
      extractor.setIncludeAttributes(false);
      String output = extractor.toString();
      if (output != null && output.length() > 0) {
        return output;
      }
    }
    return null;
  }


Examples of it.unimi.dsi.parser.callback.TextExtractor


  private void init() {
    this.parser = new BulletParser();
    
    ComposedCallbackBuilder composedBuilder = new ComposedCallbackBuilder();
    composedBuilder.add( this.textExtractor = new TextExtractor() );
    composedBuilder.add( this.anchorExtractor = new AnchorExtractor( maxPreAnchor, maxAnchor, maxPostAnchor ) ); 
    parser.setCallback( composedBuilder.compose() );
    // ... (remainder of the method is elided in the excerpt)
  }


Examples of it.unimi.dsi.parser.callback.TextExtractor


  private Set<String> urls;


  public HTMLParser() {
    bulletParser = new BulletParser();
    textExtractor = new TextExtractor();
    linkExtractor = new LinkExtractor();
    
    linkExtractor.setIncludeImagesSources(Configurations
        .getBooleanProperty("crawler.include_images", false));
  }


Examples of net.htmlparser.jericho.TextExtractor

            //Search for primary field if present
            try {
                String itemName = getPrimaryNodeType().getPrimaryItemName();
                if (itemName != null) {
                    String s = getProperty(itemName).getValue().getString();
                    title = new TextExtractor(new Source(s != null ? s : getName())).toString();
                }
            } catch (RepositoryException e1) {
                title = null;
            }
        }


Examples of org.apache.jackrabbit.extractor.TextExtractor

    /**
     * Factory method to create the <code>TextExtractor</code> instance.
     *
     * @return the <code>TextExtractor</code> instance this index should use.
     */
    protected TextExtractor createTextExtractor() {
        TextExtractor txtExtr = new JackrabbitTextExtractor(textFilterClasses);
        if (extractorPoolSize > 0) {
            // wrap with pool
            txtExtr = new PooledTextExtractor(txtExtr, extractorPoolSize,
                    extractorBackLog, extractorTimeout);
        }
        return txtExtr;
    }

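
The "wrap with pool" step in the excerpt above follows a decorator pattern: an extractor is wrapped by another one that runs it on a bounded thread pool and stops waiting after a timeout. The sketch below shows that idea using only the JDK. The interface `StreamTextExtractor` and the classes `PlainTextExtractor` and `PooledExtractorSketch` are hypothetical stand-ins, not the Jackrabbit or ModeShape API, and returning an empty string on timeout is an assumption rather than the documented behaviour of `PooledTextExtractor`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical interface for extracting text from a binary stream.
interface StreamTextExtractor {
    String extractText(InputStream stream, String encoding) throws IOException;
}

// Trivial implementation: decodes the whole stream as plain text.
class PlainTextExtractor implements StreamTextExtractor {
    public String extractText(InputStream stream, String encoding) throws IOException {
        return new String(stream.readAllBytes(), encoding);
    }
}

// Decorator that runs the wrapped extractor on a bounded pool with a timeout.
class PooledExtractorSketch implements StreamTextExtractor {
    private final StreamTextExtractor delegate;
    private final ExecutorService pool;
    private final long timeoutMillis;

    // backLog must be at least 1 (ArrayBlockingQueue rejects a capacity of 0).
    PooledExtractorSketch(StreamTextExtractor delegate, int poolSize,
                          int backLog, long timeoutMillis) {
        this.delegate = delegate;
        this.timeoutMillis = timeoutMillis;
        // Daemon threads, so an idle pool never keeps the JVM alive.
        ThreadFactory daemon = r -> {
            Thread t = new Thread(r, "extractor");
            t.setDaemon(true);
            return t;
        };
        // When the queue is full, the calling thread runs the task itself.
        this.pool = new ThreadPoolExecutor(poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(backLog), daemon,
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    public String extractText(InputStream stream, String encoding) throws IOException {
        Callable<String> task = () -> delegate.extractText(stream, encoding);
        Future<String> result = pool.submit(task);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true);
            return ""; // assumption: a slow extraction yields no text
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while extracting", e);
        } catch (ExecutionException e) {
            throw new IOException("extraction failed", e.getCause());
        }
    }
}
```

A caller would build it much like the excerpt does, for example `new PooledExtractorSketch(new PlainTextExtractor(), 2, 4, 1000L)`, and then use it wherever a plain extractor is expected.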


All source code are property of their respective owners. Java is a trademark of Sun Microsystems, Inc and owned by ORACLE Inc. Contact coftware#gmail.com.