Examples of Tokenizer

  • org.apache.jena.riot.tokens.Tokenizer
  • org.apache.lucene.analysis.Tokenizer
    A Tokenizer is a TokenStream whose input is a Reader.

    This is an abstract class.

    NOTE: subclasses must override {@link #incrementToken()} if the new TokenStream API is used, and {@link #next(Token)} or {@link #next()} if the old TokenStream API is used.

    NOTE: Subclasses overriding {@link #incrementToken()} must call {@link AttributeSource#clearAttributes()} before setting attributes. Subclasses overriding {@link #next(Token)} must call {@link Token#clear()} before setting Token attributes.

  • org.apache.myfaces.trinidadinternal.el.Tokenizer
    Converts an EL expression into tokens. @author The Oracle ADF Faces Team
  • org.apache.uima.lucas.indexer.Tokenizer
  • org.crsh.cli.impl.tokenizer.Tokenizer
  • org.eclipse.orion.server.cf.manifest.v2.Tokenizer
  • org.eclipse.osgi.framework.internal.core.Tokenizer
    Simple tokenizer class. Used to parse data.
  • org.exist.storage.analysis.Tokenizer
  • org.geoserver.ows.util.KvpUtils.Tokenizer
  • org.hsqldb.Tokenizer
    Provides the ability to tokenize SQL character sequences. Extensively rewritten and extended in successive versions of HSQLDB. @author Thomas Mueller (Hypersonic SQL Group) @version 1.8.0 @since Hypersonic SQL
  • org.jboss.dna.common.text.TokenStream.Tokenizer
  • org.jboss.forge.shell.command.parser.Tokenizer
    @author Lincoln Baxter, III
  • org.jstripe.tokenizer.Tokenizer
  • org.languagetool.tokenizers.Tokenizer
    Interface for classes that tokenize text into smaller units. @author Daniel Naber
  • org.modeshape.common.text.TokenStream.Tokenizer
  • org.openjena.riot.tokens.Tokenizer
  • org.radargun.utils.Tokenizer
    Tokenizer that allows string delimiters instead of char delimiters. @author Radim Vansa <rvansa@redhat.com>
  • org.sonatype.maven.polyglot.atom.parsing.Tokenizer
    Taken from the Loop programming language compiler pipeline. @author dhanji@gmail.com (Dhanji R. Prasanna)
  • org.spoofax.jsglr.client.imploder.Tokenizer
  • org.supercsv_voltpatches.tokenizer.Tokenizer
    Reads the CSV file, line by line. If you want the line-reading functionality of this class, but want to define your own implementation of {@link #readColumns(List)}, then consider writing your own Tokenizer by extending AbstractTokenizer. @author Kasper B. Graversen @author James Bassett
  • org.zkoss.selector.lang.Tokenizer
    @author simonpai
  • weka.core.tokenizers.Tokenizer
    A superclass for all tokenizer algorithms. @author FracPete (fracpete at waikato dot ac dot nz) @version $Revision: 1.3 $
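    The org.apache.lucene.analysis.Tokenizer entry above describes an incremental contract: each call to incrementToken() must first clear the per-token attribute state, then fill it for the next token. The following is a plain-Java sketch of that pattern, not Lucene's actual API; the class name WhitespaceTokenStream, the term() accessor, and the StringBuilder "attribute" are illustrative stand-ins for Lucene's attribute machinery.

    ```java
    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;

    // Hypothetical analogue of a Tokenizer: a stream over a Reader that
    // advances one token per call, clearing per-token state first
    // (mirroring the clearAttributes()-before-setting rule quoted above).
    public class WhitespaceTokenStream {
        private final Reader input;
        private final StringBuilder term = new StringBuilder(); // current token "attribute"

        public WhitespaceTokenStream(Reader input) {
            this.input = input;
        }

        /** Advances to the next token; returns false at end of input. */
        public boolean incrementToken() throws IOException {
            term.setLength(0); // clear per-token state before setting it
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (term.length() > 0) return true; // token complete
                } else {
                    term.append((char) c);
                }
            }
            return term.length() > 0; // emit the final token, if any
        }

        public String term() {
            return term.toString();
        }
    }
    ```

    The clear-then-fill step matters because the consumer reuses the same attribute objects across calls; forgetting to clear them leaks the previous token's state into the next one.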
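    The org.radargun.utils.Tokenizer entry above is motivated by a gap in the JDK: java.util.StringTokenizer treats each character of its delimiter argument as a separate single-char delimiter, so it cannot split on a multi-character delimiter such as "::". A minimal stdlib sketch of string-delimiter splitting (the class and method names are illustrative, not org.radargun.utils.Tokenizer's actual API):

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Splits on a whole delimiter string rather than on any of its characters.
    public class StringDelimTokenizer {
        public static List<String> tokenize(String input, String delim) {
            List<String> tokens = new ArrayList<>();
            int from = 0;
            int at;
            while ((at = input.indexOf(delim, from)) != -1) {
                tokens.add(input.substring(from, at)); // may be empty between adjacent delimiters
                from = at + delim.length();
            }
            tokens.add(input.substring(from)); // trailing token after the last delimiter
            return tokens;
        }
    }
    ```

    For example, tokenize("a::b::c", "::") yields [a, b, c], whereas StringTokenizer with "::" would split on every single ':' and drop the empty tokens.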

    Examples of nu.validator.htmlparser.impl.Tokenizer

        if (isFragment) {
          builder.setFragmentContext(null);
        }
        builder.setDoctypeExpectation(DoctypeExpectation.NO_DOCTYPE_ERRORS);
        try {
          builder.startTokenization(new Tokenizer(builder));
        } catch (SAXException ex) {
          throw new SomethingWidgyHappenedError(ex);
        }
        builder.setErrorHandler(
            new ErrorHandler() {

    Examples of opennlp.ccg.lexicon.Tokenizer

            // load grammar
            URL grammarURL = new File(grammarfile).toURI().toURL();
            System.out.println("Loading grammar from URL: " + grammarURL);
            Grammar grammar = new Grammar(grammarURL);
            Tokenizer tokenizer = grammar.lexicon.tokenizer;
            System.out.println();
           
            // set up parser
            Parser parser = new Parser(grammar);
            // instantiate scorer
            try {
                System.out.println("Instantiating parsing sign scorer from class: " + parseScorerClass);
                SignScorer parseScorer = (SignScorer) Class.forName(parseScorerClass).newInstance();
                parser.setSignScorer(parseScorer);
                System.out.println();
            } catch (Exception exc) {
                throw (RuntimeException) new RuntimeException().initCause(exc);
            }
            // instantiate supertagger
            try {
              Supertagger supertagger;
              if (supertaggerClass != null) {
                    System.out.println("Instantiating supertagger from class: " + supertaggerClass);
                    supertagger = (Supertagger) Class.forName(supertaggerClass).newInstance();
              }
              else {
                System.out.println("Instantiating supertagger from config file: " + stconfig);
                supertagger = WordAndPOSDictionaryLabellingStrategy.supertaggerFactory(stconfig);
              }
                parser.setSupertagger(supertagger);
                System.out.println();
            } catch (Exception exc) {
                throw (RuntimeException) new RuntimeException().initCause(exc);
            }
           
            // loop through input
            BufferedReader in = new BufferedReader(new FileReader(inputfile));
            String line;
            Map<String,String> predInfoMap = new HashMap<String,String>();
            System.out.println("Parsing " + inputfile);
            System.out.println();
            int count = 1;
        while ((line = in.readLine()) != null) {
            String id = "s" + count;
            try {
                // parse it
                System.out.println(line);
                parser.parse(line);
                int numParses = Math.min(nbestListSize, parser.getResult().size());
                for (int i = 0; i < numParses; i++) {
                    Sign thisParse = parser.getResult().get(i);
                    // convert lf
                    Category cat = thisParse.getCategory();
                    LF convertedLF = null;
                    String predInfo = null;
                    if (cat.getLF() != null) {
                        // convert LF
                        LF flatLF = cat.getLF();
                        cat = cat.copy();
                        Nominal index = cat.getIndexNominal();
                        convertedLF = HyloHelper.compactAndConvertNominals(flatLF, index, thisParse);
                        // get pred info
                        predInfoMap.clear();
                        Testbed.extractPredInfo(flatLF, predInfoMap);
                        predInfo = Testbed.getPredInfo(predInfoMap);
                    }
                    // add test item, sign
                    Element item = RegressionInfo.makeTestItem(grammar, line, 1, convertedLF);
                    String actualID = (nbestListSize == 1) ? id : id + "-" + (i + 1);
                    item.setAttribute("info", actualID);
                    outRoot.addContent(item);
                    signMap.put(actualID, thisParse);
                    // Add parsed words as a separate LF element
                    Element fullWordsElt = new Element("full-words");
                    fullWordsElt.addContent(tokenizer.format(thisParse.getWords()));
                    item.addContent(fullWordsElt);
                    if (predInfo != null) {
                        Element predInfoElt = new Element("pred-info");
                        predInfoElt.setAttribute("data", predInfo);
                        item.addContent(predInfoElt);

    Examples of opennlp.tools.tokenize.Tokenizer

              new FileInputStream(
                  new File(modelDir, "en-ner-" + names[mi] + ".bin")
              )));
        }

    Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
    for (int si = 0; si < sentences.length; si++) {
      List<Annotation> allAnnotations = new ArrayList<Annotation>();
      String[] tokens = tokenizer.tokenize(sentences[si]);
      for (int fi = 0; fi < finders.length; fi++) {
        Span[] spans = finders[fi].find(tokens);
        double[] probs = finders[fi].probs(spans);
        for (int ni = 0; ni < spans.length; ni++) {
          allAnnotations.add(

    Examples of org.apache.cocoon.util.Tokenizer

        protected Query getQuery( int i ) {
            return (Query) queries.elementAt( i );
        }

        private String replaceCharWithString( String in, char c, String with ) {
            Tokenizer tok;
            StringBuffer replaced = null;
            if ( in.indexOf( c ) > -1 ) {
                tok = new Tokenizer( in, c );
                replaced = new StringBuffer();
                while ( tok.hasMoreTokens() ) {
                    replaced.append( tok.nextToken() );
                    if ( tok.hasMoreTokens() )
                        replaced.append( with );
                }
            }
            if ( replaced != null ) {
                return replaced.toString();

    Examples of org.apache.ctakes.core.nlp.tokenizer.Tokenizer

       * The file is delimited with "|" and has two fields:<br>
       * hyphen-term|frequency
       */
      public HyphenTextModifierImpl(String hyphenfilename, int windowSize) {
        iv_windowSize = windowSize;
        iv_tokenizer = new Tokenizer();
        BufferedReader br;
        try {
          br = new BufferedReader(new FileReader(new File(hyphenfilename)));

          String line = "";

    Examples of org.apache.felix.gogo.runtime.Tokenizer

        }

        // hello world
        private void testHello(CharSequence text) throws Exception
        {
            Tokenizer t = new Tokenizer(text);
            assertEquals(Type.WORD, t.next());
            assertEquals("hello", t.value().toString());
            assertEquals(Type.WORD, t.next());
            assertEquals("world", t.value().toString());
            assertEquals(Type.NEWLINE, t.next());
            assertEquals(Type.EOT, t.next());
        }

    Examples of org.apache.hadoop.hbase.codec.prefixtree.encode.tokenize.Tokenizer

        this.qualifierDeduplicator = USE_HASH_COLUMN_SORTER ? new ByteRangeHashSet()
            : new ByteRangeTreeSet();
        this.timestampEncoder = new LongEncoder();
        this.mvccVersionEncoder = new LongEncoder();
        this.cellTypeEncoder = new CellTypeEncoder();
        this.rowTokenizer = new Tokenizer();
        this.familyTokenizer = new Tokenizer();
        this.qualifierTokenizer = new Tokenizer();
        this.rowWriter = new RowSectionWriter();
        this.familyWriter = new ColumnSectionWriter();
        this.qualifierWriter = new ColumnSectionWriter();

        reset(outputStream, includeMvccVersion);

    Examples of org.apache.jena.riot.tokens.Tokenizer

            totalTuples += n ;
        }
       
        protected Tokenizer makeTokenizer(InputStream in)
        {
            Tokenizer tokenizer = TokenizerFactory.makeTokenizerUTF8(in) ;
            return tokenizer ;
        }

    Examples of org.apache.lucene.analysis.Tokenizer

        return new SimpleTokenizer(reader);
      }

      public TokenStream reusableTokenStream(String fieldName, Reader reader)
          throws IOException {
        Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
        if (tokenizer == null) {
          tokenizer = new SimpleTokenizer(reader);
          setPreviousTokenStream(tokenizer);
        } else {
          tokenizer.reset(reader);
        }
        return tokenizer;
      }

    Examples of org.apache.lucene.analysis.Tokenizer

          }
        }

        @Override
        public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
          Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
          if (tokenizer == null) {
            tokenizer = new SingleCharTokenizer(reader);
            setPreviousTokenStream(tokenizer);
          } else
            tokenizer.reset(reader);
          return tokenizer;
        }