Examples of HTMLParser

appl.Portal.Utils.LinkSearch.HtmlParser
br.com.caelum.tubaina.parser.html.HtmlParser
br.com.caelum.tubaina.parser.html.desktop.HtmlParser
cn.edu.hfut.dmic.webcollector.parser.HtmlParser
Default web page parser. @author hu
com.flaptor.util.parser.HtmlParser
com.google.dart.engine.html.parser.HtmlParser
Instances of the class {@code HtmlParser} are used to parse tokens into an AST structure comprised of {@link XmlNode}s. @coverage dart.engine.html
com.google.gwt.thirdparty.streamhtmlparser.HtmlParser
com.salas.bb.utils.htmlparser.HtmlParser
Simplified and fast parser of HTML that detects text, tags and entities separately.
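
As a rough illustration of what "detects text, tags and entities separately" can mean, here is a minimal sketch of a scanner that splits a string into those three token kinds. It is not the com.salas.bb implementation; the class name HtmlTokens, the Kind enum and the tokenize method are all hypothetical, and the sketch makes no attempt to handle comments, scripts or a '>' inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splits HTML into TAG, ENTITY and TEXT tokens.
public class HtmlTokens {
    public enum Kind { TEXT, TAG, ENTITY }

    public static final class Token {
        public final Kind kind;
        public final String value;

        Token(Kind kind, String value) {
            this.kind = kind;
            this.value = value;
        }

        @Override
        public String toString() {
            return kind + ":" + value;
        }
    }

    public static List<Token> tokenize(String html) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        int n = html.length();
        while (i < n) {
            char c = html.charAt(i);
            int tagEnd = (c == '<') ? html.indexOf('>', i) : -1;
            int entEnd = (c == '&') ? entityEnd(html, i) : -1;
            if (tagEnd >= 0) {
                // Everything from '<' through the next '>' is one tag token.
                out.add(new Token(Kind.TAG, html.substring(i, tagEnd + 1)));
                i = tagEnd + 1;
            } else if (entEnd >= 0) {
                // "&name;" or "&#123;" is one entity token.
                out.add(new Token(Kind.ENTITY, html.substring(i, entEnd + 1)));
                i = entEnd + 1;
            } else {
                // Plain text runs until the next '<' or '&'. A stray '<' or '&'
                // that starts no tag or entity is kept as text.
                int start = i++;
                while (i < n && html.charAt(i) != '<' && html.charAt(i) != '&') {
                    i++;
                }
                out.add(new Token(Kind.TEXT, html.substring(start, i)));
            }
        }
        return out;
    }

    // Returns the index of the closing ';' of an entity starting at amp, or -1.
    private static int entityEnd(String s, int amp) {
        int j = amp + 1;
        while (j < s.length()
                && (Character.isLetterOrDigit(s.charAt(j)) || s.charAt(j) == '#')) {
            j++;
        }
        return (j > amp + 1 && j < s.length() && s.charAt(j) == ';') ? j : -1;
    }
}
```

Calling HtmlTokens.tokenize("<p>a &amp; b</p>") yields a tag, a text run, an entity, another text run and the closing tag, in that order.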
com.scraper.parser.HTMLParser
com.substanceofcode.utils.HTMLParser
Simple and lightweight HTML parser without complete error handling. @author Irving Bunton
de.mhus.lib.parser.HtmlParser
@author hummel
de.spotnik.util.html.HTMLParser
HTMLParser. @author Jens Rehpöhler @since 26.08.2006
edu.stanford.nlp.web.HTMLParser
Parses an HTML document and returns the plain text (and title). The main thing that HTMLParser is used for is the parse(String url) method, which will return a String with the contents of an HTML page, without the tags. After calling parse, you can get the HTML title (contents of the TITLE tag) by calling title(). Subclasses may override the handleText(), handleComment(), handleStartTag(), etc. methods so that parse(String url) returns something other than the text of the web page. (For example, one may be interested in returning only part of the text, or only the links.) @author Sepandar Kamvar (sdkamvar@stanford.edu)
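
The callback names in this description (handleText, handleComment, handleStartTag) match those of the JDK's javax.swing.text.html.HTMLEditorKit.ParserCallback. The sketch below uses that JDK class directly to show the same idea: collect text without tags and keep the TITLE contents separately. It is an assumption that the Stanford class works this way internally. The class name PlainTextExtractor is hypothetical, and its parse method takes an HTML string rather than a URL so that the example runs without network access.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical sketch built on the JDK's Swing HTML parser callbacks.
public class PlainTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder title = new StringBuilder();
    private boolean inTitle;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = true;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = false;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Title text goes to its own buffer; everything else is body text.
        if (inTitle) {
            title.append(data);
        } else {
            text.append(data).append(' ');
        }
    }

    // Parses an HTML string and returns its text content without tags.
    public String parse(String html) throws IOException {
        text.setLength(0);
        title.setLength(0);
        inTitle = false;
        new ParserDelegator().parse(new StringReader(html), this, true);
        return text.toString().trim();
    }

    // Contents of the TITLE element seen by the last call to parse.
    public String title() {
        return title.toString().trim();
    }
}
```

A subclass that wants only part of the page can override handleText or handleStartTag and filter there, which is the extension point the description refers to.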
nu.validator.htmlparser.sax.HtmlParser
This class implements an HTML5 parser that exposes data through the SAX2 interface.
By default, when using the constructor without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. To make the parser support non-conforming HTML fully per the HTML 5 spec while on the other hand potentially violating the SAX2 API contract, set the general XML violation policy to ALLOW. It is possible to treat XML 1.0 infoset violations as fatal by setting the general XML violation policy to FATAL.
By default, this parser doesn't do true streaming but buffers everything first. The parser can be made truly streaming by calling setStreamabilityViolationPolicy(XmlViolationPolicy.FATAL). This has the consequence that errors that require non-streamable recovery are treated as fatal.
By default, in order to make the parse events emulate the parse events for a DTDless XML document, the parser does not report the doctype through LexicalHandler. Doctype reporting through LexicalHandler can be turned on by calling setReportingDoctype(true). @version $Id$ @author hsivonen
org.ajax4jsf.webapp.HtmlParser
org.apache.droids.parse.html.HtmlParser
@version 1.0
org.apache.jmeter.protocol.http.parser.HTMLParser
HtmlParsers can parse HTML content to obtain URLs.
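
To make "parse HTML content to obtain URLs" concrete, here is a minimal sketch that collects href values from A elements and src values from IMG elements. It does not use the JMeter HTMLParser API; it relies on the JDK's Swing HTML parser, and the class name LinkCollector and its extract method are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical sketch: gathers link and image URLs from an HTML string.
public class LinkCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> urls = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        collect(tag, attrs);
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // Empty elements such as IMG are reported here, not in handleStartTag.
        collect(tag, attrs);
    }

    private void collect(HTML.Tag tag, MutableAttributeSet attrs) {
        HTML.Attribute key = null;
        if (tag == HTML.Tag.A) {
            key = HTML.Attribute.HREF;
        } else if (tag == HTML.Tag.IMG) {
            key = HTML.Attribute.SRC;
        }
        if (key == null) {
            return;
        }
        Object value = attrs.getAttribute(key);
        if (value != null) {
            urls.add(value.toString());
        }
    }

    // Returns the URLs in document order; relative URLs are not resolved.
    public static List<String> extract(String html) throws IOException {
        LinkCollector collector = new LinkCollector();
        new ParserDelegator().parse(new StringReader(html), collector, true);
        return collector.urls;
    }
}
```

A real crawler or load-test tool would also resolve relative URLs against the page's base URL and look at further elements such as LINK, SCRIPT and FRAME.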
org.apache.lenya.lucene.html.HTMLParser
HTML Parser
org.apache.lenya.lucene.parser.HTMLParser
org.apache.lucene.demo.html.HTMLParser
org.apache.nutch.parse.html.HtmlParser
org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlParser
HtmlParser.java @author Walter Kasper
org.apache.tika.parser.html.HtmlParser
HTML parser. Uses TagSoup to turn the input document to HTML SAX events, and post-processes the events to produce XHTML and metadata expected by Tika clients.
org.jasen.interfaces.HTMLParser

Parses the HTML part of a message.
@author Jason Polites
org.lobobrowser.html.parser.HtmlParser
rabbit.html.HtmlParser
This is a class that is used to parse a block of HTML code into separate tokens. This parser uses a recursive descent approach. @author Robert Olofsson
railo.runtime.search.lucene2.html.HTMLParser
saveReddit.parser.htmlParser
uk.ac.ucl.panda.utility.parser.HTMLParser
HTML Parsing Interface for test purposes
vmcreative.htmlparser.HTMLParser

Examples of org.apache.lenya.lucene.html.HTMLParser

     * @return the content of the file.
     * @throws FileNotFoundException if the file does not exist.
     * @throws IOException if something else went wrong.
     */
    protected String readHtmlFile(File file) throws FileNotFoundException, IOException {
        java.io.Reader reader = new HTMLParser(file).getReader();
        char[] chars = new char[1024];
        int chars_read;
        java.io.Writer writer = new java.io.StringWriter();


        while ((chars_read = reader.read(chars)) > 0) {
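
The loop above is cut off in this listing. As a hedged sketch of how such a loop is typically completed, the helper below drains any java.io.Reader (for example the one returned by getReader()) into a String. The class name ReaderText and the method readAll are hypothetical and are not part of Lenya or Lucene.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringWriter;

// Hypothetical helper: copies everything a Reader produces into a String.
public class ReaderText {
    public static String readAll(Reader reader) throws IOException {
        StringWriter writer = new StringWriter();
        char[] buffer = new char[1024];
        int charsRead;
        // try-with-resources closes the reader even if read() throws.
        try (Reader in = reader) {
            // read() returns -1 at end of stream, so test against -1.
            while ((charsRead = in.read(buffer)) != -1) {
                writer.write(buffer, 0, charsRead);
            }
        }
        return writer.toString();
    }
}
```

Writing only the first charsRead characters of the buffer matters: the tail of the array may still hold data from an earlier iteration.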


Examples of org.apache.lenya.lucene.html.HTMLParser

        // This field is not stored with document, it is indexed, but it is not
        // tokenized prior to indexing.
        doc.add(new Field("uid", uid(f, htdocsDumpDir), false, true, false));


        //HtmlDocument htmlDoc = new HtmlDocument(f);
        HTMLParser parser = new HTMLParser(f);


        // Add the summary as an UnIndexed field, so that it is stored and returned
        // with hit documents for display.
        // Add the title as a separate Text field, so that it can be searched separately.
        /*
                String title = htmlDoc.getTitle();


                if (title != null) {
                    doc.add(Field.Text("title", title));
                } else {
                    doc.add(Field.Text("title", ""));
                }
        */
        doc.add(Field.Text("title", parser.getTitle()));


        //System.out.println("HTMLDocument.getLuceneDocument(): title field added: " + title);
        // Add the tag-stripped contents as a Reader-valued Text field so it will get tokenized and indexed.
        /*
                String body = htmlDoc.getBody();
                String contents = "";


                if ((body != null) && (title != null)) {
                    contents = title + " " + body;
                    doc.add(Field.Text("contents", title + body));
                }


                doc.add(Field.Text("contents", contents));
        */
        doc.add(Field.Text("contents", parser.getReader()));


        return doc;
    }


Examples of org.apache.lenya.lucene.parser.HTMLParser

     * @return DOCUMENT ME!
     *
     * @throws Exception DOCUMENT ME!
     */
    public static String getBodyText(File file) throws Exception {
        HTMLParser parser = HTMLParserFactory.newInstance(file);
        parser.parse(file);


        Reader reader = parser.getReader();
        Writer writer = new StringWriter();


        int c;


        while ((c = reader.read()) != -1)


Examples of org.apache.lenya.lucene.parser.HTMLParser

     * @throws Exception DOCUMENT ME!
     */
    public Document getDocument(File file, File htdocsDumpDir) throws Exception {
        Document document = super.getDocument(file, htdocsDumpDir);


        HTMLParser parser = HTMLParserFactory.newInstance(file);
        parser.parse(file);


        document.add(Field.Text("title", parser.getTitle()));
        document.add(Field.Text("keywords", parser.getKeywords()));
        document.add(Field.Text("contents", parser.getReader()));


        return document;
    }


Examples of org.apache.lucene.demo.html.HTMLParser

    // Add the uid as a field, so that index can be incrementally maintained.
    // This field is not stored with document, it is indexed, but it is not
    // tokenized prior to indexing.
    doc.add(new Field("uid", uid(f), false, true, false));


    HTMLParser parser = new HTMLParser(f);


    // Add the tag-stripped contents as a Reader-valued Text field so it will
    // get tokenized and indexed.
    doc.add(Field.Text("contents", parser.getReader()));


    // Add the summary as an UnIndexed field, so that it is stored and returned
    // with hit documents for display.
    doc.add(Field.UnIndexed("summary", parser.getSummary()));


    // Add the title as a separate Text field, so that it can be searched
    // separately.
    doc.add(Field.Text("title", parser.getTitle()));


    // return the document
    return doc;
  }


Examples of org.apache.lucene.demo.html.HTMLParser

    // 5. skip until end of doc header
    read("</DOCHDR>",null,false,false); 
    // 6. collect until end of doc
    sb = read("</DOC>",null,false,true);
    // this is the next document, so parse it  
    HTMLParser p = new HTMLParser(new StringReader(sb.toString()));
    // title
    String title = p.getTitle();
    // properties 
    Properties props = p.getMetaTags(); 
    // body
    Reader r = p.getReader();
    char c[] = new char[1024];
    StringBuffer bodyBuf = new StringBuffer();
    int n;
    while ((n = r.read(c)) >= 0) {
      if (n>0) {


Examples of org.apache.lucene.demo.html.HTMLParser

    // This field is not stored with document, it is indexed, but it is not
    // tokenized prior to indexing.
    doc.add(new Field("uid", uid(f), Field.Store.NO, Field.Index.UN_TOKENIZED));


    FileInputStream fis = new FileInputStream(f);
    HTMLParser parser = new HTMLParser(fis);
      
    // Add the tag-stripped contents as a Reader-valued Text field so it will
    // get tokenized and indexed.
    doc.add(new Field("contents", parser.getReader()));


    // Add the summary as a field that is stored and returned with
    // hit documents for display.
    doc.add(new Field("summary", parser.getSummary(), Field.Store.YES, Field.Index.NO));


    // Add the title as a field so that it can be searched and is stored.
    doc.add(new Field("title", parser.getTitle(), Field.Store.YES, Field.Index.TOKENIZED));


    // return the document
    return doc;
  }


Examples of org.apache.nutch.parse.html.HtmlParser

  private Configuration conf;
  private Parser parser;


  public TestHtmlParser() { 
    conf = NutchConfiguration.create();
    parser = new HtmlParser();
    parser.setConf(conf);
  }


Examples of org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlParser

        }
        catch (InitializationException e) {
            LOG.error("Registry Initialization Error: " + e.getMessage());
            throw new IOException(e.getMessage());
        }
        parser = new HtmlParser();


    }


Examples of org.apache.tika.parser.html.HtmlParser

    super(UrlDatum.FIELDS);
  }


  private synchronized void init() {
    if (_parser == null) {
      _parser = new HtmlParser();
    }
    
    if (_handler == null) {
      _handler = new DefaultHandler() {



All source code is the property of its respective owners. Java is a trademark of Sun Microsystems, Inc. and is owned by Oracle. Contact coftware#gmail.com.