Examples of TokenStream

antlr.TokenStream
com.dotcms.repackage.org.antlr.runtime.TokenStream
com.floreysoft.jmte.token.TokenStream
com.google.gwt.dev.js.rhino.TokenStream
This class implements the JavaScript scanner. It is based on the C source files jsscan.c and jsscan.h in the jsref package. @see org.mozilla.javascript.Parser @author Mike McCabe @author Brendan Eich
edu.buffalo.cse.ir.wikiindexer.tokenizer.TokenStream
This class represents a stream of tokens as the name suggests. It wraps the token stream and provides utility methods to manipulate it @author nikhillo
org.allspice.parser.parsetable.TokenStream
Represents a stream of token. @author egoff
org.antlr.runtime.TokenStream
A stream of tokens accessing tokens from a TokenSource
org.antlr.runtime3_3_0.TokenStream
A stream of tokens accessing tokens from a TokenSource
org.antlr.v4.runtime.TokenStream
An {@link IntStream} whose symbols are {@link Token} instances.
org.apache.lucene.analysis.TokenStream
A TokenStream enumerates the sequence of tokens, either from {@link Field}s of a {@link Document} or from query text.
This is an abstract class; concrete subclasses are:
- {@link Tokenizer}, a TokenStream whose input is a Reader; and
- {@link TokenFilter}, a TokenStream whose input is another TokenStream.
A new TokenStream API has been introduced with Lucene 2.9. This API has moved from being {@link Token}-based to {@link Attribute}-based. While {@link Token} still exists in 2.9 as a convenience class, the preferred wayto store the information of a {@link Token} is to use {@link AttributeImpl}s.
TokenStream now extends {@link AttributeSource}, which provides access to all of the token {@link Attribute}s for the TokenStream. Note that only one instance per {@link AttributeImpl} is created and reusedfor every token. This approach reduces object creation and allows local caching of references to the {@link AttributeImpl}s. See {@link #incrementToken()} for further details.
The workflow of the new TokenStream API is as follows:
1. Instantiation of TokenStream/ {@link TokenFilter}s which add/get attributes to/from the {@link AttributeSource}.
2. The consumer calls {@link TokenStream#reset()}.
3. The consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.
4. The consumer calls {@link #incrementToken()} until it returns falseconsuming the attributes after each call.
5. The consumer calls {@link #end()} so that any end-of-stream operationscan be performed.
6. The consumer calls {@link #close()} to release any resource when finishedusing the TokenStream.
To make sure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in {@link #incrementToken()}.
You can find some example code for the new API in the analysis package level Javadoc.
Sometimes it is desirable to capture a current state of a TokenStream, e.g., for buffering purposes (see {@link CachingTokenFilter}, {@link TeeSinkTokenFilter}). For this usecase {@link AttributeSource#captureState} and {@link AttributeSource#restoreState}can be used.
org.jboss.dna.common.text.TokenStream
e the columns empty to signal wildcard } else { // Read names until we see a ',' do { String columnName = tokens.consume(); if (tokens.canConsume("AS")) { String columnAlias = tokens.consume(); columns.add(new Column(columnName, columnAlias)); } else { columns.add(new Column(columnName, null)); } } while (tokens.canConsume(',')); } return columns; } protected Delete parseDelete( TokenStream tokens ) throws ParsingException { tokens.consume("DELETE", "FROM"); String tableName = tokens.consume(); tokens.consume("WHERE"); String lhs = tokens.consume(); tokens.consume('='); String rhs = tokens.consume(); return new Delete(tableName, new Criteria(lhs, rhs)); } } public abstract class Statement { ... } public class Query extends Statement { ... } public class Delete extends Statement { ... } public class Column { ... } This example shows an idiomatic way of writing a parser that is stateless and thread-safe. The parse(...) method takes the input as a parameter, and returns the domain-specific representation that resulted from the parsing. All other methods are utility methods that simply encapsulate common logic or make the code more readable.

In the example, the parse(...) first creates a TokenStream object (using a Tokenizer implementation that is not shown), and then loops as long as there are more tokens to read. As it loops, if the next token is "SELECT", the parser calls the parseSelect(...) method which immediately consumes a "SELECT" token, the names of the columns separated by commas (or a '*' if there all columns are to be selected), a "FROM" token, and the name of the table being queried. The parseSelect(...) method returns a Select object, which then added to the list of statements in the parse(...) method. The parser handles the "DELETE" statements in a similar manner.

Case sensitivity

Very often grammars to not require the case of keywords to match. This can make parsing a challenge, because all combinations of case need to be used. The TokenStream framework provides a very simple solution that requires no more effort than providing a boolean parameter to the constructor.

When a false value is provided for the the caseSensitive parameter, the TokenStream performs all matching operations as if each token's value were in uppercase only. This means that the arguments supplied to the match(...), canConsume(...), and consume(...) methods should be upper-cased. Note that the actual value of each token remains the actual case as it appears in the input.

Of course, when the TokenStream is created with a true value for the caseSensitive parameter, the matching is performed using the actual value as it appears in the input content

Whitespace

Many grammars are independent of lines breaks or whitespace, allowing a lot of flexibility when writing the content. The TokenStream framework makes it very easy to ignore line breaks and whitespace. To do so, the Tokenizer implementation must simply not include the line break character sequences and whitespace in the token ranges. Since none of the tokens contain whitespace, the parser never has to deal with them.

Of course, many parsers will require that some whitespace be included. For example, whitespace within a quoted string may be needed by the parser. In this case, the Tokenizer should simply include the whitespace characters in the tokens.

Writing a Tokenizer

Each parser will likely have its own {@link Tokenizer} implementation that contains the parser-specific logic about how tobreak the content into token objects. Generally, the easiest way to do this is to simply iterate through the character sequence passed into the {@link Tokenizer#tokenize(CharacterStream,Tokens) tokenize(...)} method, and use a switch statement to decidewhat to do.

Here is the code for a very basic Tokenizer implementation that ignores whitespace, line breaks and Java-style (multi-line and end-of-line) comments, while constructing single tokens for each quoted string.
```
 public class BasicTokenizer implements Tokenizer { public void tokenize( CharacterStream input, Tokens tokens ) throws ParsingException { while (input.hasNext()) { char c = input.next(); switch (c) { case ' ': case '\t': case '\n': case '\r': // Just skip these whitespace characters ... break; case '-': case '(': case ')': case '{': case '}': case '*': case ',': case ';': case '+': case '%': case '?': case '$': case '[': case ']': case '!': case '<': case '>': case '|': case '=': case ':': tokens.addToken(input.index(), input.index() + 1, SYMBOL); break; case '.': tokens.addToken(input.index(), input.index() + 1, DECIMAL); break; case '\"': case '\"': int startIndex = input.index(); Position startingPosition = input.position(); boolean foundClosingQuote = false; while (input.hasNext()) { c = input.next(); if (c == '\\' && input.isNext('"')) { c = input.next(); // consume the ' character since it is escaped } else if (c == '"') { foundClosingQuote = true; break; } } if (!foundClosingQuote) { throw new ParsingException(startingPosition, "No matching closing double quote found"); } int endIndex = input.index() + 1; // beyond last character read tokens.addToken(startIndex, endIndex, DOUBLE_QUOTED_STRING); break; case '\'': startIndex = input.index(); startingPosition = input.position(); foundClosingQuote = false; while (input.hasNext()) { c = input.next(); if (c == '\\' && input.isNext('\'')) { c = input.next(); // consume the ' character since it is escaped } else if (c == '\'') { foundClosingQuote = true; break; } } if (!foundClosingQuote) { throw new ParsingException(startingPosition, "No matching closing single quote found"); } endIndex = input.index() + 1; // beyond last character read tokens.addToken(startIndex, endIndex, SINGLE_QUOTED_STRING); break; case '/': startIndex = input.index(); if (input.isNext('/')) { // End-of-line comment ... boolean foundLineTerminator = false; while (input.hasNext()) { c = input.next(); if (c == '\n' || c == '\r') { foundLineTerminator = true; break; } } endIndex = input.index(); // the token won't include the '\n' or '\r' character(s) if (!foundLineTerminator) ++endIndex; // must point beyond last char if (c == '\r' && input.isNext('\n')) input.next(); if (useComments) { tokens.addToken(startIndex, endIndex, COMMENT); } } else if (input.isNext('*')) { // Multi-line comment ... while (input.hasNext() && !input.isNext('*', '/')) { c = input.next(); } if (input.hasNext()) input.next(); // consume the '*' if (input.hasNext()) input.next(); // consume the '/' if (useComments) { endIndex = input.index() + 1; // the token will include the '/' and '*' characters tokens.addToken(startIndex, endIndex, COMMENT); } } else { // just a regular slash ... tokens.addToken(startIndex, startIndex + 1, SYMBOL); } break; default: startIndex = input.index(); // Read until another whitespace/symbol/decimal/slash is found while (input.hasNext() && !(input.isNextWhitespace() || input.isNextAnyOf("/.-(){}*,;+%?$[]!<>|=:"))) { c = input.next(); } endIndex = input.index() + 1; // beyond last character that was included tokens.addToken(startIndex, endIndex, WORD); } } } } 
```
Tokenizers with exactly this behavior can actually be created using the {@link #basicTokenizer(boolean)} method. So while this verybasic implementation is not meant to be used in all situations, it may be useful in some situations.
org.modeshape.common.text.TokenStream
e the columns empty to signal wildcard } else { // Read names until we see a ',' do { String columnName = tokens.consume(); if (tokens.canConsume("AS")) { String columnAlias = tokens.consume(); columns.add(new Column(columnName, columnAlias)); } else { columns.add(new Column(columnName, null)); } } while (tokens.canConsume(',')); } return columns; } protected Delete parseDelete( TokenStream tokens ) throws ParsingException { tokens.consume("DELETE", "FROM"); String tableName = tokens.consume(); tokens.consume("WHERE"); String lhs = tokens.consume(); tokens.consume('='); String rhs = tokens.consume(); return new Delete(tableName, new Criteria(lhs, rhs)); } } public abstract class Statement { ... } public class Query extends Statement { ... } public class Delete extends Statement { ... } public class Column { ... } This example shows an idiomatic way of writing a parser that is stateless and thread-safe. The parse(...) method takes the input as a parameter, and returns the domain-specific representation that resulted from the parsing. All other methods are utility methods that simply encapsulate common logic or make the code more readable.

In the example, the parse(...) first creates a TokenStream object (using a Tokenizer implementation that is not shown), and then loops as long as there are more tokens to read. As it loops, if the next token is "SELECT", the parser calls the parseSelect(...) method which immediately consumes a "SELECT" token, the names of the columns separated by commas (or a '*' if there all columns are to be selected), a "FROM" token, and the name of the table being queried. The parseSelect(...) method returns a Select object, which then added to the list of statements in the parse(...) method. The parser handles the "DELETE" statements in a similar manner.

Case sensitivity

Very often grammars to not require the case of keywords to match. This can make parsing a challenge, because all combinations of case need to be used. The TokenStream framework provides a very simple solution that requires no more effort than providing a boolean parameter to the constructor.

When a false value is provided for the the caseSensitive parameter, the TokenStream performs all matching operations as if each token's value were in uppercase only. This means that the arguments supplied to the match(...), canConsume(...), and consume(...) methods should be upper-cased. Note that the actual value of each token remains the actual case as it appears in the input.

Of course, when the TokenStream is created with a true value for the caseSensitive parameter, the matching is performed using the actual value as it appears in the input content

Whitespace

Many grammars are independent of lines breaks or whitespace, allowing a lot of flexibility when writing the content. The TokenStream framework makes it very easy to ignore line breaks and whitespace. To do so, the Tokenizer implementation must simply not include the line break character sequences and whitespace in the token ranges. Since none of the tokens contain whitespace, the parser never has to deal with them.

Of course, many parsers will require that some whitespace be included. For example, whitespace within a quoted string may be needed by the parser. In this case, the Tokenizer should simply include the whitespace characters in the tokens.

Writing a Tokenizer

Each parser will likely have its own {@link Tokenizer} implementation that contains the parser-specific logic about how tobreak the content into token objects. Generally, the easiest way to do this is to simply iterate through the character sequence passed into the {@link Tokenizer#tokenize(CharacterStream,Tokens) tokenize(...)} method, and use a switch statement to decidewhat to do.

Here is the code for a very basic Tokenizer implementation that ignores whitespace, line breaks and Java-style (multi-line and end-of-line) comments, while constructing single tokens for each quoted string.
```
 public class BasicTokenizer implements Tokenizer { public void tokenize( CharacterStream input, Tokens tokens ) throws ParsingException { while (input.hasNext()) { char c = input.next(); switch (c) { case ' ': case '\t': case '\n': case '\r': // Just skip these whitespace characters ... break; case '-': case '(': case ')': case '{': case '}': case '*': case ',': case ';': case '+': case '%': case '?': case '$': case '[': case ']': case '!': case '<': case '>': case '|': case '=': case ':': tokens.addToken(input.index(), input.index() + 1, SYMBOL); break; case '.': tokens.addToken(input.index(), input.index() + 1, DECIMAL); break; case '\"': case '\"': int startIndex = input.index(); Position startingPosition = input.position(); boolean foundClosingQuote = false; while (input.hasNext()) { c = input.next(); if (c == '\\' && input.isNext('"')) { c = input.next(); // consume the ' character since it is escaped } else if (c == '"') { foundClosingQuote = true; break; } } if (!foundClosingQuote) { throw new ParsingException(startingPosition, "No matching closing double quote found"); } int endIndex = input.index() + 1; // beyond last character read tokens.addToken(startIndex, endIndex, DOUBLE_QUOTED_STRING); break; case '\'': startIndex = input.index(); startingPosition = input.position(); foundClosingQuote = false; while (input.hasNext()) { c = input.next(); if (c == '\\' && input.isNext('\'')) { c = input.next(); // consume the ' character since it is escaped } else if (c == '\'') { foundClosingQuote = true; break; } } if (!foundClosingQuote) { throw new ParsingException(startingPosition, "No matching closing single quote found"); } endIndex = input.index() + 1; // beyond last character read tokens.addToken(startIndex, endIndex, SINGLE_QUOTED_STRING); break; case '/': startIndex = input.index(); if (input.isNext('/')) { // End-of-line comment ... boolean foundLineTerminator = false; while (input.hasNext()) { c = input.next(); if (c == '\n' || c == '\r') { foundLineTerminator = true; break; } } endIndex = input.index(); // the token won't include the '\n' or '\r' character(s) if (!foundLineTerminator) ++endIndex; // must point beyond last char if (c == '\r' && input.isNext('\n')) input.next(); if (useComments) { tokens.addToken(startIndex, endIndex, COMMENT); } } else if (input.isNext('*')) { // Multi-line comment ... while (input.hasNext() && !input.isNext('*', '/')) { c = input.next(); } if (input.hasNext()) input.next(); // consume the '*' if (input.hasNext()) input.next(); // consume the '/' if (useComments) { endIndex = input.index() + 1; // the token will include the '/' and '*' characters tokens.addToken(startIndex, endIndex, COMMENT); } } else { // just a regular slash ... tokens.addToken(startIndex, startIndex + 1, SYMBOL); } break; default: startIndex = input.index(); // Read until another whitespace/symbol/decimal/slash is found while (input.hasNext() && !(input.isNextWhitespace() || input.isNextAnyOf("/.-(){}*,;+%?$[]!<>|=:"))) { c = input.next(); } endIndex = input.index() + 1; // beyond last character that was included tokens.addToken(startIndex, endIndex, WORD); } } } } 
```
Tokenizers with exactly this behavior can actually be created using the {@link #basicTokenizer(boolean)} method. So while this verybasic implementation is not meant to be used in all situations, it may be useful in some situations.

Examples of org.apache.lucene.analysis.TokenStream

    return mmsegTokenizer;
  }


  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new MMSegTokenizer(newSeg(), reader);
    return ts;
  }

View Full Code Here

Examples of org.apache.lucene.analysis.TokenStream

        String line = null;
        long start = System.currentTimeMillis();
        while((line = reader.readLine()) != null) {
          bw.append("--------------------------").append("\r\n");;
          bw.append(line).append("\r\n");
          TokenStream ts = analyzer.tokenStream("text", new StringReader(line));
          for(Token t= new Token(); (t=TokenUtils.nextToken(ts, t)) !=null;) {
            bw.append(new String(t.termBuffer(), 0, t.termLength())).append(" | ");
          }
          bw.append("\r\n");
        }

View Full Code Here

Examples of org.apache.lucene.analysis.TokenStream

    for(int i=0; i<n; i++) {
      for(File txt : txts) {
        FileInputStream ftxt = new FileInputStream(txt);
        int s = ftxt.available();
        size += s;
        TokenStream ts = analyzer.tokenStream("text", new InputStreamReader(ftxt));
        OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(new File(txt.getAbsoluteFile()+"."+outputChipName+".word")));
        BufferedWriter bw = new BufferedWriter(osw);
        long start = System.currentTimeMillis();
        for(Token t= new Token(); (t=TokenUtils.nextToken(ts, t)) !=null;) {
          bw.append(new String(t.term())).append("\r\n");

View Full Code Here

Examples of org.apache.lucene.analysis.TokenStream

  public void testChineseAnalyzer() throws IOException {
    Token nt = new Token();
    Analyzer ca = new SmartChineseAnalyzer(true);
    Reader sentence = new StringReader("我购买了道具和服装。");
    String[] result = { "我", "购买", "了", "道具", "和", "服装" };
    TokenStream ts = ca.tokenStream("sentence", sentence);
    int i = 0;
    nt = ts.next(nt);
    while (nt != null) {
      assertEquals(result[i], nt.term());
      i++;
      nt = ts.next(nt);
    }
    ts.close();
  }

View Full Code Here

Examples of org.apache.lucene.analysis.TokenStream

        "我从小就不由自主地认为自己长大以后一定得成为一个象我父亲一样的画家, 可能是父母潜移默化的影响。其实我根本不知道作为画家意味着什么，我是否喜欢，最重要的是否适合我，我是否有这个才华。其实人到中年的我还是不确定我最喜欢什么，最想做的是什么？我相信很多人和我一样有同样的烦恼。毕竟不是每个人都能成为作文里的宇航员，科学家和大教授。知道自己适合做什么，喜欢做什么，能做好什么其实是个非常困难的问题。"
            + "幸运的是，我想我的孩子不会为这个太过烦恼。通过老大，我慢慢发现美国高中的一个重要功能就是帮助学生分析他们的专长和兴趣，从而帮助他们选择大学的专业和未来的职业。我觉得帮助一个未成形的孩子找到她未来成长的方向是个非常重要的过程。"
            + "美国高中都有专门的职业顾问，通过接触不同的课程，和各种心理，个性，兴趣很多方面的问答来帮助每个学生找到最感兴趣的专业。这样的教育一般是要到高年级才开始， 可老大因为今年上计算机的课程就是研究一个职业走向的软件项目，所以她提前做了这些考试和面试。看来以后这样的教育会慢慢由电脑来测试了。老大带回家了一些试卷，我挑出一些给大家看看。这门课她花了2个多月才做完，这里只是很小的一部分。"
            + "在测试里有这样的一些问题："
            + "你是个喜欢动手的人吗？ 你喜欢修东西吗？你喜欢体育运动吗？你喜欢在室外工作吗？你是个喜欢思考的人吗？你喜欢数学和科学课吗？你喜欢一个人工作吗？你对自己的智力自信吗？你的创造能力很强吗？你喜欢艺术，音乐和戏剧吗？  你喜欢自由自在的工作环境吗？你喜欢尝试新的东西吗？ 你喜欢帮助别人吗？你喜欢教别人吗？你喜欢和机器和工具打交道吗？你喜欢当领导吗？你喜欢组织活动吗？你什么和数字打交道吗？");
    TokenStream ts = ca.tokenStream("sentence", sentence);


    System.out.println("start: " + (new Date()));
    long before = System.currentTimeMillis();
    nt = ts.next(nt);
    while (nt != null) {
      System.out.println(nt.term());
      nt = ts.next(nt);
    }
    ts.close();
    long now = System.currentTimeMillis();
    System.out.println("time: " + (now - before) / 1000.0 + " s");
  }

View Full Code Here

Examples of org.apache.lucene.analysis.TokenStream

    this.stopWords = stopWords;
    wordSegment = new WordSegmenter();
  }


  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new SentenceTokenizer(reader);
    result = new WordTokenizer(result, wordSegment);
    // result = new LowerCaseFilter(result);
    // 不再需要LowerCaseFilter，因为SegTokenFilter已经将所有英文字符转换成小写
    // stem太严格了, This is not bug, this feature:)
    result = new PorterStemFilter(result);

View Full Code Here

Examples of org.apache.lucene.analysis.TokenStream

   *         {@link org.apache.lucene.document.Document}
   * @throws IOException if there was an error loading
   */
  public static TokenStream getAnyTokenStream(IndexReader reader, int docId,
      String field, Document doc, Analyzer analyzer) throws IOException {
    TokenStream ts = null;


    TermFreqVector tfv = reader.getTermFreqVector(docId, field);
    if (tfv != null) {
      if (tfv instanceof TermPositionVector) {
        ts = getTokenStream((TermPositionVector) tfv);

View Full Code Here

Examples of org.apache.lucene.analysis.TokenStream

   * @return null if field not stored correctly
   * @throws IOException
   */
  public static TokenStream getAnyTokenStream(IndexReader reader, int docId,
      String field, Analyzer analyzer) throws IOException {
    TokenStream ts = null;


    TermFreqVector tfv = reader.getTermFreqVector(docId, field);
    if (tfv != null) {
      if (tfv instanceof TermPositionVector) {
        ts = getTokenStream((TermPositionVector) tfv);

View Full Code Here

Examples of org.apache.lucene.analysis.TokenStream

        TopDocs hits = indexSearcher.search(query, 1);
        assertEquals(1, hits.totalHits);
        final Highlighter highlighter = new Highlighter(
            new SimpleHTMLFormatter(), new SimpleHTMLEncoder(),
            new QueryScorer(query));
        final TokenStream tokenStream = TokenSources
            .getTokenStream(
                (TermPositionVector) indexReader.getTermFreqVector(0, FIELD),
                false);
        assertEquals("<B>the fox</B> did not jump",
            highlighter.getBestFragment(tokenStream, TEXT));

View Full Code Here

Examples of org.apache.lucene.analysis.TokenStream

        TopDocs hits = indexSearcher.search(query, 1);
        assertEquals(1, hits.totalHits);
        final Highlighter highlighter = new Highlighter(
            new SimpleHTMLFormatter(), new SimpleHTMLEncoder(),
            new QueryScorer(query));
        final TokenStream tokenStream = TokenSources
            .getTokenStream(
                (TermPositionVector) indexReader.getTermFreqVector(0, FIELD),
                false);
        assertEquals("<B>the fox</B> did not jump",
            highlighter.getBestFragment(tokenStream, TEXT));

View Full Code Here

0 1 2 3 4 5 6

TOP

All source code are property of their respective owners. Java is a trademark of Sun Microsystems, Inc and owned by ORACLE Inc. Contact coftware#gmail.com.

Examples of TokenStream

Case sensitivity

Whitespace

Writing a Tokenizer

Case sensitivity

Whitespace

Writing a Tokenizer

Examples of org.apache.lucene.analysis.TokenStream

Examples of org.apache.lucene.analysis.TokenStream

Examples of org.apache.lucene.analysis.TokenStream

Examples of org.apache.lucene.analysis.TokenStream

Examples of org.apache.lucene.analysis.TokenStream

Examples of org.apache.lucene.analysis.TokenStream

Examples of org.apache.lucene.analysis.TokenStream

Examples of org.apache.lucene.analysis.TokenStream

Examples of org.apache.lucene.analysis.TokenStream

Examples of org.apache.lucene.analysis.TokenStream