Examples of HtmlParser

appl.Portal.Utils.LinkSearch.HtmlParser
br.com.caelum.tubaina.parser.html.HtmlParser
br.com.caelum.tubaina.parser.html.desktop.HtmlParser
cn.edu.hfut.dmic.webcollector.parser.HtmlParser
Default web page parser. @author hu
com.flaptor.util.parser.HtmlParser
com.google.dart.engine.html.parser.HtmlParser
Instances of the class {@code HtmlParser} are used to parse tokens into an AST structure comprised of {@link XmlNode}s. @coverage dart.engine.html
com.google.gwt.thirdparty.streamhtmlparser.HtmlParser
com.salas.bb.utils.htmlparser.HtmlParser
Simplified and fast parser of HTML that detects text, tags and entities separately.
com.scraper.parser.HTMLParser
com.substanceofcode.utils.HTMLParser
Simple and lightweight HTML parser without complete error handling. @author Irving Bunton
de.mhus.lib.parser.HtmlParser
@author hummel
de.spotnik.util.html.HTMLParser
HTMLParser. @author Jens Rehpöhler @since 26.08.2006
edu.stanford.nlp.web.HTMLParser
Parses an HTML document and returns the plain text (and title). The main thing that HTMLParser is used for is the parse(String url) method, which will return a String with the contents of an HTML page, without the tags. After calling parse, you can get the HTML title (contents of the TITLE tag) by calling title(). Subclasses may override the handleText(), handleComment(), handleStartTag(), etc. methods so that parse(String url) returns something other than the text of the web page. (For example, one may be interested in returning only part of the text, or only the links.) @author Sepandar Kamvar (sdkamvar@stanford.edu)
nu.validator.htmlparser.sax.HtmlParser
This class implements an HTML5 parser that exposes data through the SAX2 interface.
By default, when using the constructor without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. To make the parser support non-conforming HTML fully per the HTML 5 spec while on the other hand potentially violating the SAX2 API contract, set the general XML violation policy to ALLOW. It is possible to treat XML 1.0 infoset violations as fatal by setting the general XML violation policy to FATAL.
By default, this parser doesn't do true streaming but buffers everything first. The parser can be made truly streaming by calling setStreamabilityViolationPolicy(XmlViolationPolicy.FATAL). This has the consequence that errors that require non-streamable recovery are treated as fatal.
By default, in order to make the parse events emulate the parse events for a DTDless XML document, the parser does not report the doctype through LexicalHandler. Doctype reporting through LexicalHandler can be turned on by calling setReportingDoctype(true). @version $Id$ @author hsivonen
org.ajax4jsf.webapp.HtmlParser
org.apache.droids.parse.html.HtmlParser
@version 1.0
org.apache.jmeter.protocol.http.parser.HTMLParser
HtmlParsers can parse HTML content to obtain URLs.
org.apache.lenya.lucene.html.HTMLParser
HTML Parser
org.apache.lenya.lucene.parser.HTMLParser
org.apache.lucene.demo.html.HTMLParser
org.apache.nutch.parse.html.HtmlParser
org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlParser
HtmlParser.java @author Walter Kasper
org.apache.tika.parser.html.HtmlParser
HTML parser. Uses TagSoup to turn the input document to HTML SAX events, and post-processes the events to produce XHTML and metadata expected by Tika clients.
org.jasen.interfaces.HTMLParser

Parses the HTML part of a message.
@author Jason Polites
org.lobobrowser.html.parser.HtmlParser
rabbit.html.HtmlParser
This is a class that is used to parse a block of HTML code into separate tokens. This parser uses a recursive descent approach. @author Robert Olofsson
railo.runtime.search.lucene2.html.HTMLParser
saveReddit.parser.htmlParser
uk.ac.ucl.panda.utility.parser.HTMLParser
HTML parsing interface for test purposes
vmcreative.htmlparser.HTMLParser
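
The handleText(), handleStartTag() and handleComment() hooks named in the edu.stanford.nlp.web.HTMLParser entry above share their names with javax.swing.text.html.HTMLEditorKit.ParserCallback in the JDK. The sketch below is not the Stanford class. It assumes a similar callback design and uses only the JDK's Swing HTML parser to collect the title and the tag-free text of a page. The class name PlainTextExtractor and its parse(), title() and text() methods are hypothetical, and it parses an HTML string rather than fetching a URL.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical callback-based extractor: collects the title and the plain text. */
public class PlainTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder title = new StringBuilder();
    private boolean inTitle;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = true;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = false;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inTitle) {
            title.append(data);
        } else {
            // Separate text chunks with a space so adjacent words do not fuse.
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(data);
        }
    }

    /** Parses the given HTML string and returns the populated extractor. */
    public static PlainTextExtractor parse(String html) throws IOException {
        PlainTextExtractor callback = new PlainTextExtractor();
        try (Reader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(reader, callback, true);
        }
        return callback;
    }

    public String title() {
        return title.toString().trim();
    }

    public String text() {
        return text.toString().trim();
    }
}
```

A subclass that wants only part of the page, such as the links, could override handleStartTag() and inspect the attributes of HTML.Tag.A instead of accumulating text.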

Examples of org.apache.tika.parser.html.HtmlParser

            fetcherPolicy.setFetcherMode(FetcherMode.EFFICIENT);
            
            // We only care about mime types that the Tika HTML parser can handle,
            // so restrict it to the same.
            Set<String> validMimeTypes = new HashSet<String>();
            Set<MediaType> supportedTypes = new HtmlParser().getSupportedTypes(new ParseContext());
            for (MediaType supportedType : supportedTypes) {
                validMimeTypes.add(String.format("%s/%s", supportedType.getType(), supportedType.getSubtype()));
            }
            fetcherPolicy.setValidMimeTypes(validMimeTypes);

Examples of org.apache.tika.parser.html.HtmlParser

    InputStream input = new ByteArrayInputStream(html.getBytes(Charset.forName("UTF-8")));
    ContentHandler text = new BodyContentHandler(); // collects the body text
    LinkContentHandler links = new LinkContentHandler(); // collects the links
    ContentHandler handler = new TeeContentHandler(links, text); // feeds both handlers
    Metadata metadata = new Metadata(); // receives the title and other metadata
    Parser parser = new HtmlParser();
    ParseContext context = new ParseContext();
    parser.parse(input, handler, metadata, context);
    System.out.println("Title: " + metadata.get(Metadata.TITLE));
    System.out.println("Body: " + text.toString());
    System.out.println("Links: " + links.getLinks());

Examples of org.apache.tika.parser.html.HtmlParser

    private String extract(byte[] byteObject) throws TikaException {// throws IOException
        StringBuilder wBuf = new StringBuilder();
        InputStream stream = null;
        Metadata metadata = new Metadata();
        HtmlParser htmlParser = new HtmlParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        ParseContext parser = new ParseContext();
        try {
            stream = new ByteArrayInputStream(byteObject);
            htmlParser.parse(stream, handler, metadata, parser);
            wBuf.append(handler.toString()
                    + System.getProperty("line.separator"));
        } catch (SAXException e) {
            throw new RuntimeException(e);
        } catch (IOException e) {

Examples of org.apache.tika.parser.html.HtmlParser

  @SuppressWarnings("serial")
  public static AutoDetectParser createParser() {
    final AutoDetectParser parser = new AutoDetectParser();


    Map<MediaType,Parser> parsers = parser.getParsers();
    parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
    parser.setParsers(parsers);


    parser.setFallback(new Parser() {
      public Set<MediaType> getSupportedTypes(ParseContext parseContext) {
        return parser.getSupportedTypes(parseContext);

Examples of org.apache.tika.parser.html.HtmlParser

        StringTokenizer tokenizer = new StringTokenizer(classes, ", \t\n\r\f");
        while (tokenizer.hasMoreTokens()) {
            String name = tokenizer.nextToken();
            if (name.equals(
                    "org.apache.jackrabbit.extractor.HTMLTextExtractor")) {
                parsers.put(MediaType.text("html"), new HtmlParser());
            } else if (name.equals("org.apache.jackrabbit.extractor.MsExcelTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put(MediaType.application("vnd.ms-excel"), parser);
                parsers.put(MediaType.application("msexcel"), parser);
                parsers.put(MediaType.application("excel"), parser);

Examples of org.apache.tika.parser.html.HtmlParser

        StringTokenizer tokenizer = new StringTokenizer(classes, ", \t\n\r\f");
        while (tokenizer.hasMoreTokens()) {
            String name = tokenizer.nextToken();
            if (name.equals(
                    "org.apache.jackrabbit.extractor.HTMLTextExtractor")) {
                parsers.put("text/html", new HtmlParser());
            } else if (name.equals(
                    "org.apache.jackrabbit.extractor.MsExcelTextExtractor")) {
                Parser parser = new OfficeParser();
                parsers.put("application/vnd.ms-excel", parser);
                parsers.put("application/msexcel", parser);

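The two truncated snippets above split a configured list of legacy extractor class names with StringTokenizer and register a parser per MIME type. The following is a minimal, JDK-only sketch of that dispatch. It stores parser names as strings instead of Tika Parser instances, and the class ExtractorRegistry, the method mimeTypesFor() and the choice to skip unknown class names are assumptions, since the original code is cut off before that point.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

/** Hypothetical stand-in for the parser registry built in the snippets above. */
public class ExtractorRegistry {

    /** Maps each MIME type to the name of the parser chosen for it. */
    public static Map<String, String> mimeTypesFor(String classes) {
        Map<String, String> registry = new LinkedHashMap<>();
        // Same delimiter set as the original: commas and any whitespace.
        StringTokenizer tokenizer = new StringTokenizer(classes, ", \t\n\r\f");
        while (tokenizer.hasMoreTokens()) {
            String name = tokenizer.nextToken();
            if (name.equals("org.apache.jackrabbit.extractor.HTMLTextExtractor")) {
                registry.put("text/html", "HtmlParser");
            } else if (name.equals("org.apache.jackrabbit.extractor.MsExcelTextExtractor")) {
                registry.put("application/vnd.ms-excel", "OfficeParser");
                registry.put("application/msexcel", "OfficeParser");
            }
            // Assumption: class names without a known mapping are skipped.
        }
        return registry;
    }
}
```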

Examples of org.apache.tika.parser.html.HtmlParser

      Multipart mp = (Multipart) p.getContent();
      int count = mp.getCount();
      for (int i = 0; i < count; i++)
        content.append(getContentFromHTML(mp.getBodyPart(i)));
    } else if (p.isMimeType("text/html")) {
      HtmlParser parser = new HtmlParser();
      Metadata met = new Metadata();
      TextContentHandler handler = new TextContentHandler(
          new BodyContentHandler());
      parser.parse(new ByteArrayInputStream(((String) p.getContent())
          .getBytes()), handler, met);
      content.append(handler.toString());
    } else {
      Object obj = p.getContent();
      if (obj instanceof Part)

Examples of org.apache.tika.parser.html.HtmlParser

        // Read in link
        StringBuffer sb = new StringBuffer("");
        while (scanner.hasNext())
          sb.append(scanner.nextLine());


        HtmlParser parser = new HtmlParser();
        Metadata met = new Metadata();
        LinkContentHandler handler = new LinkContentHandler();


        parser.parse(new ByteArrayInputStream(sb.toString().getBytes()),
            handler, met);
        List<Link> links = handler.getLinks();
        children = new LinkedList<ProtocolFile>();
        for (Link link : links) {
          String href = link.getUri();


Examples of org.apache.tika.parser.html.HtmlParser

                 data = ((ByteChunk)htmlChunk).getValue();
              } else if(htmlChunk instanceof StringChunk) {
                 data = ((StringChunk)htmlChunk).getRawValue();
              }
              if(data != null) {
                 HtmlParser htmlParser = new HtmlParser();
                 htmlParser.parse(
                       new ByteArrayInputStream(data),
                       new EmbeddedContentHandler(new BodyContentHandler(xhtml)), 
                       new Metadata(), new ParseContext()
                 );
                 doneBody = true;
