Examples of org.apache.lucene.analysis.TokenStream

org.apache.lucene.analysis.TokenStream
A TokenStream enumerates the sequence of tokens, either from {@link Field}s of a {@link Document} or from query text.
This is an abstract class; concrete subclasses are:
- {@link Tokenizer}, a TokenStream whose input is a Reader; and
- {@link TokenFilter}, a TokenStream whose input is another TokenStream.
A new TokenStream API has been introduced with Lucene 2.9. This API has moved from being {@link Token}-based to {@link Attribute}-based. While {@link Token} still exists in 2.9 as a convenience class, the preferred wayto store the information of a {@link Token} is to use {@link AttributeImpl}s.
TokenStream now extends {@link AttributeSource}, which provides access to all of the token {@link Attribute}s for the TokenStream. Note that only one instance per {@link AttributeImpl} is created and reusedfor every token. This approach reduces object creation and allows local caching of references to the {@link AttributeImpl}s. See {@link #incrementToken()} for further details.
The workflow of the new TokenStream API is as follows:
1. Instantiation of TokenStream/ {@link TokenFilter}s which add/get attributes to/from the {@link AttributeSource}.
2. The consumer calls {@link TokenStream#reset()}.
3. The consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.
4. The consumer calls {@link #incrementToken()} until it returns falseconsuming the attributes after each call.
5. The consumer calls {@link #end()} so that any end-of-stream operationscan be performed.
6. The consumer calls {@link #close()} to release any resource when finishedusing the TokenStream.
To make sure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in {@link #incrementToken()}.
You can find some example code for the new API in the analysis package level Javadoc.
Sometimes it is desirable to capture a current state of a TokenStream, e.g., for buffering purposes (see {@link CachingTokenFilter}, {@link TeeSinkTokenFilter}). For this usecase {@link AttributeSource#captureState} and {@link AttributeSource#restoreState}can be used.


  public void assertExpectedHighlightCount(final int maxNumFragmentsRequired,
      final int expectedHighlights) throws Exception {
    for (int i = 0; i < hits.totalHits; i++) {
      String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
      TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));
      QueryScorer scorer = new QueryScorer(query, FIELD_NAME);
      Highlighter highlighter = new Highlighter(this, scorer);


      highlighter.setTextFragmenter(new SimpleFragmenter(40));

View Full Code Here

     *   and {@link RussianStemFilter}
     */
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        TokenStream result = new RussianLetterTokenizer(reader);
        result = new LowerCaseFilter(result);
        result = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
                                result, stopSet);
        result = new RussianStemFilter(result);
        return result;

View Full Code Here

   *         filtered with {@link StandardFilter}, {@link StopFilter}, 
   *         {@link FrenchStemFilter} and {@link LowerCaseFilter}
   */
  @Override
  public final TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(matchVersion, reader);
    result = new StandardFilter(result);
    result = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
                            result, stoptable);
    result = new FrenchStemFilter(result, excltable);
    // Convert to lowercase after stemming!

View Full Code Here

    this.outputUnigrams = outputUnigrams;
  }


  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream wrapped;
    try {
      wrapped = defaultAnalyzer.reusableTokenStream(fieldName, reader);
    } catch (IOException e) {
      wrapped = defaultAnalyzer.tokenStream(fieldName, reader);
    }

View Full Code Here

      streams = new SavedStreams();
      streams.wrapped = defaultAnalyzer.reusableTokenStream(fieldName, reader);
      streams.shingle = new ShingleFilter(streams.wrapped);
      setPreviousTokenStream(streams);
    } else {
      TokenStream result = defaultAnalyzer.reusableTokenStream(fieldName, reader);
      if (result == streams.wrapped) {
        /* the wrapped analyzer reused the stream */
        streams.shingle.reset(); 
      } else {
        /* the wrapped analyzer did not, create a new shingle around the new one */

View Full Code Here

   *         {@link ArabicNormalizationFilter},
   *         {@link PersianNormalizationFilter} and Persian Stop words
   */
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new ArabicLetterTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new ArabicNormalizationFilter(result);
    /* additional persian-specific normalization */
    result = new PersianNormalizationFilter(result);
    /*

View Full Code Here

   *   filtered with {@link StandardFilter}, {@link StopFilter}, 
   *   and {@link DutchStemFilter}
   */
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(matchVersion, reader);
    result = new StandardFilter(result);
    result = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
                            result, stoptable);
    result = new DutchStemFilter(result, excltable, stemdict);
    return result;

View Full Code Here

  }
  
  public void testTokenStream() throws Exception {
    QueryAutoStopWordAnalyzer a = new QueryAutoStopWordAnalyzer(Version.LUCENE_CURRENT, new WhitespaceAnalyzer());
    a.addStopWords(reader, 10);
    TokenStream ts = a.tokenStream("repetitiveField", new StringReader("this boring"));
    TermAttribute termAtt = ts.getAttribute(TermAttribute.class);
    assertTrue(ts.incrementToken());
    assertEquals("this", termAtt.term());
    assertFalse(ts.incrementToken());
  }

View Full Code Here

    * @return  A {@link TokenStream} built from a {@link ChineseTokenizer} 
    *   filtered with {@link ChineseFilter}.
    */
    @Override
    public final TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new ChineseTokenizer(reader);
        result = new ChineseFilter(result);
        return result;
    }

View Full Code Here

    this.collator = collator;
  }


  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new KeywordTokenizer(reader);
    result = new ICUCollationKeyFilter(result, collator);
    return result;
  }

View Full Code Here

0 1 2 3 4 5 6 7 8 9

TOP

Related Classes of org.apache.lucene.analysis.TokenStream

com.code972.elasticsearch.rest.action.RestHebrewAnalyzerCheckWordAction

com.gentics.cr.lucene.analysis.CustomPatternAnalyzerTest

com.github.pmerienne.trident.ml.preprocessing.EnglishTokenizer

com.o19s.RegexPathHierarchyTokenizerTest

com.stimulus.archiva.search.FileNameAnalyzer

com.stimulus.archiva.search.FileNameAnalyzer$LowercaseDelimiterAnalyzer

ivory.core.tokenize.LuceneAnalyzer

ivory.core.tokenize.LuceneArabicAnalyzer

ivory.core.tokenize.LuceneTokenizer

org.apache.jackrabbit.core.query.lucene.SearchIndex

All source code are property of their respective owners. Java is a trademark of Sun Microsystems, Inc and owned by ORACLE Inc. Contact coftware#gmail.com.