Examples of net.sf.regain.crawler.preparator.html.HtmlPathExtractor

Package net.sf.regain.crawler.preparator.html

net.sf.regain.crawler.preparator.html.HtmlPathExtractor
Extracts from an HTML document the path through which it can be reached. @author Til Schneider, www.murfman.de

      String pathEndRegex = (String) sectionArr[i].get("endRegex");
      String pathNodeRegex = (String) sectionArr[i].get("pathNodeRegex");
      int pathNodeUrlGroup = getIntParam(sectionArr[i], "pathNodeRegex.urlGroup");
      int pathNodeTitleGroup = getIntParam(sectionArr[i], "pathNodeRegex.titleGroup");

      // prefix and pathStartRegex are defined outside this excerpt (not shown here)
      mPathExtractorArr[i] = new HtmlPathExtractor(prefix, pathStartRegex,
        pathEndRegex, pathNodeRegex, pathNodeUrlGroup,
        pathNodeTitleGroup);
    }
  }
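
The constructor arguments above suggest how a path section is located and split into nodes: a start regex, an end regex, and a node regex with one capture group for the URL and one for the title. The following is a minimal sketch of that idea, not the regain implementation. It uses java.util.regex rather than org.apache.regexp.RE, and the names PathExtractionSketch, Node and extractPath are hypothetical. Falling back to the end of the document when the end regex does not match is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the regain implementation. */
public class PathExtractionSketch {

  /** Hypothetical stand-in for a path element: a URL plus a title. */
  public static final class Node {
    public final String url;
    public final String title;

    Node(String url, String title) {
      this.url = url;
      this.title = title;
    }
  }

  public static List<Node> extractPath(String html, String startRegex,
      String endRegex, String nodeRegex, int urlGroup, int titleGroup) {
    List<Node> path = new ArrayList<>();

    // The path section begins right after the first match of the start regex
    Matcher start = Pattern.compile(startRegex).matcher(html);
    if (!start.find()) {
      return path; // no path section in this document
    }
    int from = start.end();

    // The section ends at the first end-regex match after the start.
    // Assumption: if there is none, read to the end of the document.
    Matcher end = Pattern.compile(endRegex).matcher(html);
    int to = end.find(from) ? end.start() : html.length();

    // Every node-regex match inside the section becomes one path node
    Matcher node = Pattern.compile(nodeRegex).matcher(html.substring(from, to));
    while (node.find()) {
      path.add(new Node(node.group(urlGroup), node.group(titleGroup)));
    }
    return path;
  }

  public static void main(String[] args) {
    String html = "<div class=\"crumbs\"><a href=\"/\">Home</a> &gt; "
        + "<a href=\"/docs\">Docs</a></div><a href=\"/other\">Other</a>";
    List<Node> path = extractPath(html, "<div class=\"crumbs\">", "</div>",
        "<a href=\"([^\"]*)\">([^<]*)</a>", 1, 2);
    for (Node n : path) {
      System.out.println(n.title + " -> " + n.url);
    }
  }
}
```

In the sample document the link to /other lies outside the start and end markers, so it is not reported as part of the path.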


      // Set the headlines
      setHeadlines(headlines);
    }


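    // Find the path extractor that is responsible for this document.
    // The loop does not break, so if several extractors accept the
    // document, the last one in the array is used.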
    HtmlPathExtractor pathExtractor = null;
    if (mPathExtractorArr != null) {
      for (int i = 0; i < mPathExtractorArr.length; i++) {
        if (mPathExtractorArr[i].accepts(rawDocument)) {
          pathExtractor = mPathExtractorArr[i];
        }
      }
    }


    // Extract the path from the document
    if (pathExtractor != null) {
      PathElement[] path = pathExtractor.extractPath(rawDocument);
      setPath(path);
    }
  }
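
The selection loop above can be tried in isolation with the sketch below. It is an illustration under assumptions, not regain code: PrefixExtractor and select are hypothetical names, and treating accepts as a URL prefix check is a guess based on the prefix argument passed to the HtmlPathExtractor constructor. The sketch keeps the behaviour of the excerpt, where a later accepting extractor overrides an earlier one.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; names are hypothetical, not regain classes. */
public class ExtractorSelectionSketch {

  /** Hypothetical extractor that accepts URLs starting with a prefix. */
  public static final class PrefixExtractor {
    public final String prefix;

    public PrefixExtractor(String prefix) {
      this.prefix = prefix;
    }

    public boolean accepts(String url) {
      return url.startsWith(prefix);
    }
  }

  /** Returns the last accepting extractor, mirroring a loop without break. */
  public static PrefixExtractor select(List<PrefixExtractor> extractors,
      String url) {
    PrefixExtractor chosen = null;
    for (PrefixExtractor e : extractors) {
      if (e.accepts(url)) {
        chosen = e; // keep going: a later match overrides an earlier one
      }
    }
    return chosen;
  }

  public static void main(String[] args) {
    List<PrefixExtractor> extractors = Arrays.asList(
        new PrefixExtractor("http://example.com/"),
        new PrefixExtractor("http://example.com/docs/"));
    PrefixExtractor hit = select(extractors, "http://example.com/docs/a.html");
    System.out.println(hit == null ? "none" : hit.prefix);
  }
}
```

With this ordering, listing the more specific prefix last lets it take precedence over the general one. If the first match should win instead, a break after the assignment would do that, but that would differ from the excerpt.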

Related Classes of net.sf.regain.crawler.preparator.html.HtmlPathExtractor

net.sf.regain.crawler.document.PathElement

net.sf.regain.crawler.preparator.HtmlPreparator

org.apache.regexp.RE

java.util.ArrayList

net.sf.regain.RegainException
