Examples of com.github.pmerienne.trident.ml.preprocessing.EnglishTokenizer$EnglishSpecialAnalyzer

Package com.github.pmerienne.trident.ml.preprocessing


com.github.pmerienne.trident.ml.preprocessing.EnglishTokenizer
Issue LUCENE-3335 causing sigsegv! @author pmerienne

  private final static String LILIUM_WIKI = "Lilium (members of which are true lilies) is a genus of herbaceous flowering plants growing from bulbs, all with large prominent flowers. Lilies are a group of flowering plants which are important in culture and literature in much of the world. Most species are native to the temperate northern hemisphere, though their range extends into the northern subtropics. Many other plants have \"lily\" in their common name but are not related to true lilies.";
  private final static String ROSE_WIKI = "A rose is a woody perennial of the genus Rosa, within the family Rosaceae. There are over 100 species. They form a group of plants that can be erect shrubs, climbing or trailing with stems that are often armed with sharp prickles. Flowers vary in size and shape and are usually large and showy, in colours ranging from white through yellows and reds. Most species are native to Asia, with smaller numbers native to Europe, North America, and northwest Africa. Species, cultivars and hybrids are all widely grown for their beauty and often are fragrant. Rose plants range in size from compact, miniature roses, to climbers that can reach 7 meters in height. Different species hybridize easily, and this has been used in the development of the wide range of garden roses.";


  @Test
  public void testWithSmallWiki() {
    EnglishTokenizer tokenizer = new EnglishTokenizer();


    KLDClassifier kldClassifier = new KLDClassifier(2);
    kldClassifier.update(0, tokenizer.tokenize(NOSQL_WIKI));
    kldClassifier.update(0, tokenizer.tokenize(MYSQL_WIKI));
    kldClassifier.update(1, tokenizer.tokenize(LILIUM_WIKI));
    kldClassifier.update(1, tokenizer.tokenize(ROSE_WIKI));


    assertEquals(0, (int) kldClassifier.classify(tokenizer.tokenize(DATABASE_WIKI)));
    assertEquals(1, (int) kldClassifier.classify(tokenizer.tokenize(FLOWER_WIKI)));
  }
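
The test above only shows how the classifier is called. As a rough illustration of the idea behind a Kullback-Leibler divergence classifier, here is a minimal sketch in plain Java. The class `KldSketch`, its fields, and the add-one smoothing are assumptions made for this example; this is not the implementation of trident-ml's `KLDClassifier`, and only the `update`/`classify` call shape is borrowed from the test.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a KL-divergence text classifier (not trident-ml's KLDClassifier). */
final class KldSketch {
    private final List<Map<String, Integer>> counts = new ArrayList<>();
    private final int[] totals;
    private final Set<String> vocabulary = new HashSet<>();

    KldSketch(int nbClasses) {
        totals = new int[nbClasses];
        for (int i = 0; i < nbClasses; i++) {
            counts.add(new HashMap<>());
        }
    }

    /** Adds the tokens of one training document to the word counts of a class. */
    void update(int classIndex, List<String> tokens) {
        Map<String, Integer> classCounts = counts.get(classIndex);
        for (String token : tokens) {
            classCounts.merge(token, 1, Integer::sum);
            vocabulary.add(token);
        }
        totals[classIndex] += tokens.size();
    }

    /** Returns the class whose smoothed word distribution has the lowest KL(doc || class). */
    int classify(List<String> tokens) {
        Map<String, Integer> doc = new HashMap<>();
        for (String token : tokens) {
            doc.merge(token, 1, Integer::sum);
        }
        // Add-one smoothing (an assumption); the extra +1 reserves some mass for unseen tokens.
        int v = vocabulary.size() + 1;
        int best = -1;
        double bestDivergence = Double.POSITIVE_INFINITY;
        for (int k = 0; k < counts.size(); k++) {
            double divergence = 0.0;
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                double p = e.getValue() / (double) tokens.size();
                double q = (counts.get(k).getOrDefault(e.getKey(), 0) + 1.0) / (totals[k] + v);
                divergence += p * Math.log(p / q);
            }
            if (divergence < bestDivergence) {
                bestDivergence = divergence;
                best = k;
            }
        }
        return best;
    }
}
```

Smoothing matters here: without it, a single token unseen in a class would make the divergence infinite for that class.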



  @SuppressWarnings("unchecked")
  @Test
  public void test() {
    // Given
    TextTokenizer tokenizer = new EnglishTokenizer();
    List<String> d1 = tokenizer.tokenize(DATABASE_WIKI);
    List<String> d2 = tokenizer.tokenize(NOSQL_WIKI);
    List<String> d3 = tokenizer.tokenize(MYSQL_WIKI);
    List<String> d4 = tokenizer.tokenize(FLOWER_WIKI);
    List<String> d5 = tokenizer.tokenize(LILIUM_WIKI);
    List<String> d6 = tokenizer.tokenize(ROSE_WIKI);
    List<List<String>> training = Arrays.asList(d1, d2, d4, d5);


    TFIDF tfidf = new TFIDF();


    // When
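
The excerpt above stops before the TF-IDF computation itself. For orientation, here is a minimal sketch of one common TF-IDF variant over tokenized documents. The class `TfIdfSketch` and its methods are hypothetical and do not reflect the API of trident-ml's `TFIDF` class; the weighting used (relative term frequency times the natural log of N over document frequency) is also an assumption, since the source does not show which variant the library uses.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical TF-IDF sketch (not trident-ml's TFIDF class). */
final class TfIdfSketch {

    /** Relative frequency of the term within one tokenized document. */
    static double tf(String term, List<String> doc) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        return Collections.frequency(doc, term) / (double) doc.size();
    }

    /** ln(N / df); defined as 0 when no document contains the term. */
    static double idf(String term, List<List<String>> corpus) {
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0.0 : Math.log(corpus.size() / (double) df);
    }

    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }
}
```

With this variant, a term that appears in every training document gets a weight of zero, so only terms that discriminate between documents contribute.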


public class EnglishTokenizerTest {


  @Test
  public void testTokenize() {
    // Given
    EnglishTokenizer tokenizer = new EnglishTokenizer();
    String text = "I can't argue with some arguments on argus with argues";


    // When
    List<String> actualTokens = tokenizer.tokenize(text);


    // Then
    List<String> expectedTokens = Arrays.asList("i", "can't", "argu", "some", "argument", "argu", "argu");
    assertEquals(expectedTokens, actualTokens);
  }
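
The expected tokens in the test suggest three steps: lowercasing, stop-word removal ("with" and "on" disappear), and stemming ("argue", "argus" and "argues" all become "argu"). The sketch below illustrates only the first two steps in plain Java. `SimpleTokenizerSketch` and its stop-word list are hypothetical; the real `EnglishTokenizer` is built on Lucene analysis classes, and this sketch does no stemming, so its output differs from the expected tokens above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch; not trident-ml's EnglishTokenizer, and without stemming. */
final class SimpleTokenizerSketch {
    // Tiny illustrative stop-word list (an assumption, not the list used by the library).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "on", "the", "to", "with"));

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Lowercase, then split on anything that is not an ASCII letter or an apostrophe.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

For the sentence used in the test, this yields the unstemmed tokens "i", "can't", "argue", "some", "arguments", "argus", "argues".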


  }


  protected static void loadReutersData() throws IOException {
    REUTERS_SAMPLES = new ArrayList<TextInstance<Integer>>();


    EnglishTokenizer tokenizer = new EnglishTokenizer();
    Map<String, Integer> topics = new HashMap<String, Integer>();


    FileInputStream is = new FileInputStream(REUTEURS_FILE);
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    try {
      String line;
      while ((line = br.readLine()) != null) {
        try {
          // Get class index
          String topic = line.split(",")[0];
          if (!topics.containsKey(topic)) {
            topics.put(topic, topics.size());
          }
          Integer classIndex = topics.get(topic);


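          // Get text: everything from the " - " separator onward, minus the last character.
          // If the separator is missing, indexOf returns -1 and substring throws,
          // so the sample is skipped by the catch block below.
          int startIndex = line.indexOf(" - ");
          String text = line.substring(startIndex, line.length() - 1);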


          REUTERS_SAMPLES.add(new TextInstance<Integer>(classIndex, tokenizer.tokenize(text)));
        } catch (Exception ex) {
          System.err.println("Skipped Reuters sample because it can't be parsed : " + line);
        }
      }
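
The per-line parsing in the loop above can be isolated into two small helpers, which makes it easier to test. The sketch below is a variant, not the original code: `ReutersLineSketch` is a hypothetical class, it skips the " - " separator instead of keeping it, it keeps the last character of the line, and it returns `null` rather than throwing when the separator is missing. The line layout (topic before the first comma, text after " - ") is taken from the loop above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper mirroring the per-line parsing of loadReutersData. */
final class ReutersLineSketch {
    private static final String SEPARATOR = " - ";

    // Maps each topic to a class index, in order of first appearance.
    private final Map<String, Integer> topics = new LinkedHashMap<>();

    /** Returns the class index of the line's topic, assigning the next free index to a new topic. */
    int classIndex(String line) {
        String topic = line.split(",")[0];
        topics.putIfAbsent(topic, topics.size());
        return topics.get(topic);
    }

    /** Returns the text after the first " - ", or null when the separator is missing. */
    static String text(String line) {
        int start = line.indexOf(SEPARATOR);
        return start < 0 ? null : line.substring(start + SEPARATOR.length());
    }
}
```

Returning `null` for an unparseable line lets the caller decide whether to skip or log it, instead of relying on a broad `catch (Exception ex)`.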



Related Classes of com.github.pmerienne.trident.ml.preprocessing.EnglishTokenizer$EnglishSpecialAnalyzer

com.github.pmerienne.trident.ml.nlp.KLDClassifierTest

com.github.pmerienne.trident.ml.nlp.TFIDFTest

com.github.pmerienne.trident.ml.preprocessing.EnglishTokenizerTest

com.github.pmerienne.trident.ml.testing.data.Datasets

org.apache.lucene.analysis.Analyzer

org.apache.lucene.analysis.en.EnglishPossessiveFilter

org.apache.lucene.analysis.KeywordMarkerFilter

org.apache.lucene.analysis.LowerCaseFilter

org.apache.lucene.analysis.shingle.ShingleFilter

org.apache.lucene.analysis.standard.StandardFilter
