Examples of TokenStream


Examples of org.apache.lucene.analysis.TokenStream

        TopDocs hits = indexSearcher.search(phraseQuery, 1);
        assertEquals(1, hits.totalHits);
        final Highlighter highlighter = new Highlighter(
            new SimpleHTMLFormatter(), new SimpleHTMLEncoder(),
            new QueryScorer(phraseQuery));
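        // Rebuild the token stream from the stored term vector (positions required)
        // rather than re-analyzing the original text.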
        final TokenStream tokenStream = TokenSources
            .getTokenStream(
                (TermPositionVector) indexReader.getTermFreqVector(0, FIELD),
                false);
        assertEquals("<B>the fox</B> did not jump",
            highlighter.getBestFragment(tokenStream, TEXT));
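
The example above rebuilds a TokenStream from stored term vectors so the Highlighter can walk the tokens without re-analyzing the text. However it is obtained, a TokenStream is consumed through its attribute API; the following minimal sketch shows that pattern against the 3.x API used in these examples (the analyzer, Version constant, field name, and sample text are illustrative, not taken from the snippet above):

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.util.Version;

    public class TokenStreamWalkthrough {
      public static void main(String[] args) throws IOException {
        TokenStream stream = new StandardAnalyzer(Version.LUCENE_36)
            .tokenStream("body", new StringReader("the quick brown fox"));

        // Attributes are registered once and reused for every token.
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsets = stream.addAttribute(OffsetAttribute.class);

        stream.reset();                   // prepare the stream for consumption
        while (stream.incrementToken()) { // advance to the next token
          System.out.println(term + " [" + offsets.startOffset() + "," + offsets.endOffset() + ")");
        }
        stream.end();                     // record the final offset state
        stream.close();
      }
    }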

Examples of org.apache.lucene.analysis.TokenStream

  public void testPreAnalyzedField() throws IOException {
    IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(
        TEST_VERSION_CURRENT, new MockAnalyzer(random)));
    Document doc = new Document();
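    // Index a pre-analyzed field: the anonymous TokenStream below replays a
    // fixed array of terms instead of running an analyzer over text.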
   
    doc.add(new Field("preanalyzed", new TokenStream() {
      private String[] tokens = new String[] {"term1", "term2", "term3", "term2"};
      private int index = 0;
     
      private CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
     
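The excerpt above stops before the incrementToken override. A stream of this kind typically replays its fixed token array through the attribute it registered; the class below is a hypothetical, self-contained sketch of that pattern, not the original test's continuation:

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    final class ReplayTokenStream extends TokenStream {
      private final String[] tokens = {"term1", "term2", "term3", "term2"};
      private int index = 0;
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

      @Override
      public boolean incrementToken() {
        if (index == tokens.length) {
          return false;                 // stream exhausted
        }
        clearAttributes();              // wipe attribute state from the previous token
        termAtt.append(tokens[index++]);
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        index = 0;                      // allow the stream to be consumed again
      }
    }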

Examples of org.apache.lucene.analysis.TokenStream

    for (int i = 0; i < hits.scoreDocs.length; i++) {
      Document doc = searcher.doc(hits.scoreDocs[i].doc);
      String storedField = doc.get(FIELD_NAME);
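      // getAnyTokenStream below prefers the field's term vector when one is stored
      // and falls back to re-analyzing the stored value with the supplied analyzer.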

      TokenStream stream = TokenSources.getAnyTokenStream(searcher
          .getIndexReader(), hits.scoreDocs[i].doc, FIELD_NAME, doc, analyzer);

      Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);

      highlighter.setTextFragmenter(fragmenter);

Examples of org.apache.lucene.analysis.TokenStream

   * This method is intended for use with <tt>testHighlightingWithDefaultField()</tt>.
   * @throws InvalidTokenOffsetsException
   */
  private static String highlightField(Query query, String fieldName, String text)
      throws IOException, InvalidTokenOffsetsException {
    TokenStream tokenStream = new StandardAnalyzer(TEST_VERSION_CURRENT)
        .tokenStream(fieldName, new StringReader(text));
    // Assuming "<B>", "</B>" used to highlight
    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();
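    // FIELD_NAME serves as the default field for query terms that do not name a field.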
    QueryScorer scorer = new QueryScorer(query, fieldName, FIELD_NAME);
    Highlighter highlighter = new Highlighter(formatter, scorer);
    highlighter.setTextFragmenter(new SimpleFragmenter(Integer.MAX_VALUE));

Examples of org.apache.lucene.analysis.TokenStream

    QueryScorer scorer = new QueryScorer(query, FIELD_NAME);
    Highlighter highlighter = new Highlighter(scorer);
   
    for (int i = 0; i < hits.totalHits; i++) {
      String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
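      // Re-analyze the stored field text so the highlighter can walk its tokens.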
      TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME,
          new StringReader(text));
      highlighter.setTextFragmenter(new SimpleFragmenter(40));

      String result = highlighter.getBestFragments(tokenStream, text, maxNumFragmentsRequired,
          "...");

Examples of org.apache.lucene.analysis.TokenStream

    QueryScorer scorer = new QueryScorer(query, FIELD_NAME);
    Highlighter highlighter = new Highlighter(this, scorer);
   
    for (int i = 0; i < hits.totalHits; i++) {
      String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
      TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));

      highlighter.setTextFragmenter(new SimpleFragmenter(40));

      String result = highlighter.getBestFragments(tokenStream, text, maxNumFragmentsRequired,
          "...");
      if (VERBOSE) System.out.println("\t" + result);
    }

    assertTrue("Failed to find correct number of highlights " + numHighlights + " found",
        numHighlights == 3);
   
    numHighlights = 0;
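    // Second pass: highlight matches for an exact phrase query.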
    doSearching("\"This piece of text refers to Kennedy\"");

    maxNumFragmentsRequired = 2;

    scorer = new QueryScorer(query, FIELD_NAME);
    highlighter = new Highlighter(this, scorer);
   
    for (int i = 0; i < hits.totalHits; i++) {
      String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
      TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));

      highlighter.setTextFragmenter(new SimpleFragmenter(40));

      String result = highlighter.getBestFragments(tokenStream, text, maxNumFragmentsRequired,
          "...");
      if (VERBOSE) System.out.println("\t" + result);
    }

    assertTrue("Failed to find correct number of highlights " + numHighlights + " found",
        numHighlights == 4);
   
    numHighlights = 0;
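    // Third pass: a phrase query built from heavily repeated terms.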
    doSearching("\"lets is a the lets is a the lets is a the lets\"");

    maxNumFragmentsRequired = 2;

    scorer = new QueryScorer(query, FIELD_NAME);
    highlighter = new Highlighter(this, scorer);
   
    for (int i = 0; i < hits.totalHits; i++) {
      String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
      TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));

      highlighter.setTextFragmenter(new SimpleFragmenter(40));

      String result = highlighter.getBestFragments(tokenStream, text, maxNumFragmentsRequired,
          "...");

Examples of org.apache.lucene.analysis.TokenStream

    QueryScorer scorer = new QueryScorer(query, FIELD_NAME);
    Highlighter highlighter = new Highlighter(this, scorer);
   
    for (int i = 0; i < hits.totalHits; i++) {
      String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
      TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));

      highlighter.setTextFragmenter(new SimpleFragmenter(40));

      String result = highlighter.getBestFragments(tokenStream, text, maxNumFragmentsRequired,
          "...");

Examples of org.apache.lucene.analysis.TokenStream

    QueryScorer scorer = new QueryScorer(query, FIELD_NAME);
    Highlighter highlighter = new Highlighter(this, scorer);
   
    for (int i = 0; i < hits.totalHits; i++) {
      String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
      TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));

      highlighter.setTextFragmenter(new SimpleFragmenter(40));

      String result = highlighter.getBestFragments(tokenStream, text, maxNumFragmentsRequired,
          "...");

Examples of org.apache.lucene.analysis.TokenStream

    QueryScorer scorer = new QueryScorer(query, FIELD_NAME);
    Highlighter highlighter = new Highlighter(this, scorer);
   
    for (int i = 0; i < hits.totalHits; i++) {
      String text = searcher.doc(hits.scoreDocs[i].doc).get(NUMERIC_FIELD_NAME);
      TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));

      highlighter.setTextFragmenter(new SimpleFragmenter(40));

      // Return value not captured in this excerpt.
      highlighter.getBestFragments(tokenStream, text, maxNumFragmentsRequired, "...");