Examples of HtmlParser

appl.Portal.Utils.LinkSearch.HtmlParser
br.com.caelum.tubaina.parser.html.HtmlParser
br.com.caelum.tubaina.parser.html.desktop.HtmlParser
cn.edu.hfut.dmic.webcollector.parser.HtmlParser
Default web page parser. @author hu
com.flaptor.util.parser.HtmlParser
com.google.dart.engine.html.parser.HtmlParser
Instances of the class {@code HtmlParser} are used to parse tokens into an AST structure composed of {@link XmlNode}s. @coverage dart.engine.html
com.google.gwt.thirdparty.streamhtmlparser.HtmlParser
com.salas.bb.utils.htmlparser.HtmlParser
Simplified, fast HTML parser that detects text, tags and entities separately.
com.scraper.parser.HTMLParser
com.substanceofcode.utils.HTMLParser
Simple and lightweight HTML parser without complete error handling. @author Irving Bunton
de.mhus.lib.parser.HtmlParser
@author hummel
de.spotnik.util.html.HTMLParser
HTMLParser. @author Jens Rehpöhler @since 26.08.2006
edu.stanford.nlp.web.HTMLParser
Parses an HTML document and returns the plain text (and title). The main thing that HTMLParser is used for is the parse(String url) method, which will return a String with the contents of an HTML page, without the tags. After calling parse, you can get the HTML title (contents of the TITLE tag) by calling title(). Subclasses may override the handleText(), handleComment(), handleStartTag(), etc. methods so that parse(String url) returns something other than the text of the web page. (For example, one may be interested in returning only part of the text, or only the links.) @author Sepandar Kamvar (sdkamvar@stanford.edu)
nu.validator.htmlparser.sax.HtmlParser
This class implements an HTML5 parser that exposes data through the SAX2 interface.
By default, when using the constructor without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. To make the parser fully support non-conforming HTML per the HTML 5 spec, at the cost of potentially violating the SAX2 API contract, set the general XML violation policy to ALLOW. It is possible to treat XML 1.0 infoset violations as fatal by setting the general XML violation policy to FATAL.
By default, this parser doesn't do true streaming but buffers everything first. The parser can be made truly streaming by calling setStreamabilityViolationPolicy(XmlViolationPolicy.FATAL). This has the consequence that errors that require non-streamable recovery are treated as fatal.
By default, in order to make the parse events emulate the parse events for a DTDless XML document, the parser does not report the doctype through LexicalHandler. Doctype reporting through LexicalHandler can be turned on by calling setReportingDoctype(true). @version $Id$ @author hsivonen
org.ajax4jsf.webapp.HtmlParser
org.apache.droids.parse.html.HtmlParser
@version 1.0
org.apache.jmeter.protocol.http.parser.HTMLParser
HtmlParsers can parse HTML content to obtain URLs.
org.apache.lenya.lucene.html.HTMLParser
HTML Parser
org.apache.lenya.lucene.parser.HTMLParser
org.apache.lucene.demo.html.HTMLParser
org.apache.nutch.parse.html.HtmlParser
org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlParser
HtmlParser.java @author Walter Kasper
org.apache.tika.parser.html.HtmlParser
HTML parser. Uses TagSoup to turn the input document to HTML SAX events, and post-processes the events to produce XHTML and metadata expected by Tika clients.
org.jasen.interfaces.HTMLParser

Parses the HTML part of a message.
@author Jason Polites
org.lobobrowser.html.parser.HtmlParser
rabbit.html.HtmlParser
This is a class that is used to parse a block of HTML code into separate tokens. This parser uses a recursive descent approach. @author Robert Olofsson
railo.runtime.search.lucene2.html.HTMLParser
saveReddit.parser.htmlParser
uk.ac.ucl.panda.utility.parser.HTMLParser
HTML parsing interface for test purposes
vmcreative.htmlparser.HTMLParser
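
The edu.stanford.nlp.web.HTMLParser summary above describes a parser that returns a page's plain text and title, with overridable handleText(), handleComment() and handleStartTag() callbacks. Those callback names match the JDK's javax.swing.text.html.HTMLEditorKit.ParserCallback, so the idea can be sketched with the JDK alone. This is an illustrative sketch, not the Stanford implementation: the class name PlainTextExtractor is hypothetical, and it takes an HTML string rather than a URL.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper (not the Stanford class): collects plain text and the title. */
final class PlainTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder title = new StringBuilder();
    private boolean inTitle;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = true; // text seen from here on belongs to the title
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = false;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inTitle) {
            title.append(data);
        } else {
            // Separate text chunks with a single space.
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(data);
        }
    }

    String text() {
        return text.toString();
    }

    String title() {
        return title.toString();
    }

    /** Parses an HTML string and returns the populated extractor. */
    static PlainTextExtractor parse(String html) {
        PlainTextExtractor extractor = new PlainTextExtractor();
        try (StringReader reader = new StringReader(html)) {
            // ignoreCharSet = true: do not fail on charset declarations in the markup.
            new ParserDelegator().parse(reader, extractor, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return extractor;
    }
}
```

Calling PlainTextExtractor.parse(html).text() gives the text without tags, and title() gives the contents of the TITLE element. To return only part of the page (for example only links), a subclass would override handleStartTag() and inspect the attribute set, as the summary suggests.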

Examples of appl.Portal.Utils.LinkSearch.HtmlParser

        this.setSearchTerm(someSearchWords[0]);


        if(this.buildSearchUrl(someSearchWords)) {
            // create a new instance of the mHtmlParser with
            // the searchenginespecific parameters
            mHtmlParser = new HtmlParser();
            mHtmlParser.setRegExpFrame(mRegExpFrame);
            mHtmlParser.setRegExpItemSet(mRegExpItemSet);
            mHtmlParser.setRegExpItem(mRegExpItem);
            mHtmlParser.setNames(mNames);


Examples of br.com.caelum.tubaina.parser.html.HtmlParser

        LOG.warn(e.getMessage());
      }
    }


    if (html) {
      HtmlParser htmlParser = new HtmlParser(conf.read("/regex.properties", "/html.properties"), noAnswer);
      HtmlGenerator generator = new HtmlGenerator(htmlParser, strictXhtml, templateDir);
      File file = new File(outputDir, "html");
      FileUtils.forceMkdir(file);
      try {
        generator.generate(b, file);


Examples of br.com.caelum.tubaina.parser.html.desktop.HtmlParser

    public void setUp() throws IOException {
        Configuration cfg = new Configuration();
        cfg.setDirectoryForTemplateLoading(new File(TubainaBuilder.DEFAULT_TEMPLATE_DIR, "kindle"));
        cfg.setObjectWrapper(new BeansWrapper());


        Parser parser = new HtmlParser(new RegexConfigurator().read("/regex.properties",
                "/kindle.properties"));


        partToKindle = new PartToKindle(parser, cfg, new ArrayList<String>());
    }


Examples of cn.edu.hfut.dmic.webcollector.parser.HtmlParser

    public Parser createParser(String url, String contentType) throws Exception {
        if (contentType == null) {
            return null;
        }
        if (contentType.contains("text/html")) {
            return new HtmlParser(Config.topN);
        }
        return null;
    }


Examples of com.flaptor.util.parser.HtmlParser

        switch (docType) {
            case HTML:
                Config conf = Config.getConfig("crawler.properties");
                String removedXPathElements = conf.getString("HtmlParser.removedXPath");
                String[] separatorTags = conf.getStringArray("HtmlParser.separatorTags");
                parser = new HtmlParser(removedXPathElements,separatorTags);
                break;
            case PDF:
                parser = new PdfParser();
                break;
        }


Examples of com.google.dart.engine.html.parser.HtmlParser

      AbstractScanner scanner = new StringScanner(source, content);
      scanner.setPassThroughElements(new String[] {TAG_SCRIPT});
      Token token = scanner.tokenize();
      lineInfo = new LineInfo(scanner.getLineStarts());
      final RecordingErrorListener errorListener = new RecordingErrorListener();
      unit = new HtmlParser(source, errorListener).parse(token, lineInfo);
      unit.accept(new RecursiveXmlVisitor<Void>() {
        @Override
        public Void visitHtmlScriptTagNode(HtmlScriptTagNode node) {
          resolveScriptDirectives(node.getScript(), errorListener);
          return null;


Examples of com.google.gwt.thirdparty.streamhtmlparser.HtmlParser

   *
   * @param html the HTML to check.
   * @return true if the provided HTML string is complete.
   */
  public static boolean isCompleteHtml(String html) {
    HtmlParser htmlParser = HtmlParserFactory.createParser();
    try {
      htmlParser.parse(html);
    } catch (ParseException e) {
      return false;
    }
    return htmlParser.getState() == HtmlParser.STATE_TEXT
        && !htmlParser.inJavascript() && !htmlParser.inCss();
  }


Examples of com.salas.bb.utils.htmlparser.HtmlParser

    static String process(String aText, int sizeLimit, boolean html)
    {
        if (aText == null) return null;


        IHtmlParserListener listener;
        HtmlParser parser = new HtmlParser(true);


        StringBuilderListener bufListener = new StringBuilderListener(aText.length(), sizeLimit);
        listener = html ? new SwingHtmlFilter(bufListener) : new SwingPlainFilter(bufListener);


        try
        {
            parser.parse(new StringReader(aText), listener);
        } catch (IOException e)
        {
            // OK. Buffer will be empty.
        }
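
The class summary for com.salas.bb.utils.htmlparser.HtmlParser says it detects text, tags and entities separately, and the fragment above feeds those events to a listener. A minimal sketch of that kind of splitting follows. It is not the library's code: SimpleHtmlTokenizer, Kind and Token are hypothetical names, tokens are returned as a list instead of being pushed to a listener, and comments, scripts and attribute values containing '>' are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tokenizer: splits HTML into text, tag and entity tokens. */
final class SimpleHtmlTokenizer {
    enum Kind { TEXT, TAG, ENTITY }

    static final class Token {
        final Kind kind;
        final String value;

        Token(Kind kind, String value) {
            this.kind = kind;
            this.value = value;
        }

        @Override
        public String toString() {
            return kind + ":" + value;
        }
    }

    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int end = -1;      // index of the last character of a tag or entity
            Kind kind = null;
            if (c == '<') {
                end = html.indexOf('>', i + 1);
                kind = Kind.TAG;
            } else if (c == '&') {
                end = entityEnd(html, i);
                kind = Kind.ENTITY;
            }
            if (end < 0) {
                // Ordinary character, or an unterminated '<' or '&': keep it as text.
                text.append(c);
                i++;
                continue;
            }
            flushText(text, tokens);
            tokens.add(new Token(kind, html.substring(i, end + 1)));
            i = end + 1;
        }
        flushText(text, tokens);
        return tokens;
    }

    /** Returns the index of the ';' ending the entity that starts at start, or -1. */
    private static int entityEnd(String html, int start) {
        int j = start + 1;
        if (j < html.length() && html.charAt(j) == '#') {
            j++; // numeric entity such as &#160;
        }
        int first = j;
        while (j < html.length() && Character.isLetterOrDigit(html.charAt(j))) {
            j++;
        }
        return (j > first && j < html.length() && html.charAt(j) == ';') ? j : -1;
    }

    private static void flushText(StringBuilder text, List<Token> tokens) {
        if (text.length() > 0) {
            tokens.add(new Token(Kind.TEXT, text.toString()));
            text.setLength(0);
        }
    }
}
```

A bare ampersand, as in "a & b", stays part of the surrounding text token because no terminating ';' follows a run of letters or digits.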


Examples of com.scraper.parser.HTMLParser

  }


  @Test
  public void testParser()
  {
    HTMLParser parser = HTMLParser.parseImages("not a url");
    assertTrue(parser.hasError());
    parser = HTMLParser.parseImages("www.yahoocom");
    assertTrue(parser.hasError());
    parser = HTMLParser.parseImages("www.yahoo.com");
    assertTrue(!parser.hasError());
    parser = HTMLParser.parseImages("http://www.yahoo.com");
    assertTrue(!parser.hasError());
    parser = HTMLParser.parseImages("https://www.yahoo.com/");
    assertTrue(!parser.hasError());
    assertTrue(parser.hasNextImage());
  }


Examples of com.substanceofcode.utils.HTMLParser

    throws IOException, CauseMemoryException, CauseException, Exception {
        /** Initialize item collection */
        Vector rssFeeds = new Vector();
        
        /** Initialize XML parser and parse OPML XML */
        HTMLParser parser = new HTMLParser(encodingUtil);
        try {
            
      // The first element is the main tag.
            int elementType = parser.parse();
      // If we found the prologue, get the next entry.
      if( elementType == XmlParser.PROLOGUE ) {
        elementType = parser.parse();
      }
      if (elementType == XmlParser.END_DOCUMENT ) {
        return null;
      }
            
      boolean bodyFound = false;
            do {
        if (elementType == HTMLParser.REDIRECT_URL) {
          RssItunesFeed [] feeds = new RssItunesFeed[1];
          feeds[0] = new RssItunesFeed("", parser.getRedirectUrl(),
              "", "");
          return feeds;
        }
        /** RSS item properties */
        String title = "";
        String link = "";
                        
        String tagName = parser.getName();
        //#ifdef DLOGGING
        if (finerLoggable) {logger.finer("tagname: " + tagName);}
        //#endif
        if (tagName.length() == 0) {
          continue;
        }
        switch (tagName.charAt(0)) {
          case 'm':
          case 'M':
            if (bodyFound) {
              break;
            }
            break;
          case 'b':
          case 'B':
            if (!bodyFound) {
              bodyFound = parser.isBodyFound();
            }
            break;
          case 'a':
          case 'A':
            //#ifdef DLOGGING
            if (finerLoggable) {logger.finer("Parsing <a> tag");}
            //#endif
            
            title = parser.getText();
            // Title can be 0 as this is used also for
            // getting 
            title = title.trim();
            title = StringUtil.removeHtml( title );


            if (((link = parser.getAttributeValue( "href" ))
                  == null) || ( link.length() == 0 )) {
              continue;
            }
            link = link.trim();
            if ( link.length() == 0 ) {
              continue;
            }
            if (link.indexOf("://") >= 0) {
              if (!link.startsWith("http:") &&
                !link.startsWith("https:") &&
                !link.startsWith("file:") &&
                 !link.startsWith("jar:")) {
                //#ifdef DLOGGING
                if (finerLoggable) {logger.finer("Not support for protocol or no protocol=" + link);}
                //#endif
                continue;
              }
            } else {
              if (link.charAt(0) == '/') {
                int purl = url.indexOf("://");
                if ((purl + 4) >= url.length()) {
                  //#ifdef DLOGGING
                  if (finerLoggable) {logger.finer("Url too short=" + url + "," + purl);}
                  //#endif
                  continue;
                }
                int pslash = url.indexOf("/", purl + 3);
                String burl = url;
                if (pslash >= 0) {
                  burl = url.substring(0, pslash);
                }
                link = burl + link;
              } else {
                link = url + "/" + link;
              }
            }
            
            /** Debugging information */
            //#ifdef DLOGGING
            if (finerLoggable) {logger.finer("Title:       " + title);}
            if (finerLoggable) {logger.finer("Link:        " + link);}
            //#endif
            if (( feedURLFilter != null) &&
              ( link.toLowerCase().indexOf(feedURLFilter) < 0)) {
              continue;
            }
            
            if (( feedNameFilter != null) &&
              ((title != null) &&
              (title.toLowerCase().indexOf(feedNameFilter) < 0))) {
              continue;
            }
            RssItunesFeed feed = new RssItunesFeed(title, link, "", "");
            rssFeeds.addElement( feed );
            break;
          default:
        }
            }
            while( (elementType = parser.parse()) != XmlParser.END_DOCUMENT );
            
        } catch (CauseMemoryException ex) {
      CauseMemoryException cex = new CauseMemoryException(
          "Out of memory error while parsing HTML Link feed " + url,
          ex);
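
The fragment above turns each href into an absolute link by string concatenation and keeps only the http, https, file and jar schemes. The same step can be sketched with java.net.URI from the JDK. This is an alternative technique, not the original author's code, and LinkResolver is a hypothetical name. Note one behavioural difference: URI.resolve follows RFC 3986, so a relative href replaces the last path segment of the base URL, whereas the fragment appends "/" plus the href to the whole URL.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper: resolves an href against a base URL and filters by scheme. */
final class LinkResolver {
    static Optional<String> resolve(String baseUrl, String href) {
        if (href == null) {
            return Optional.empty();
        }
        String trimmed = href.trim();
        if (trimmed.isEmpty()) {
            return Optional.empty();
        }
        try {
            // Absolute hrefs are returned unchanged; relative ones are resolved
            // against the base URL according to RFC 3986.
            URI resolved = new URI(baseUrl).resolve(new URI(trimmed));
            String scheme = resolved.getScheme();
            if (scheme == null) {
                return Optional.empty();
            }
            switch (scheme.toLowerCase(Locale.ROOT)) {
                case "http":
                case "https":
                case "file":
                case "jar":
                    return Optional.of(resolved.toString());
                default:
                    return Optional.empty(); // unsupported protocol
            }
        } catch (URISyntaxException e) {
            return Optional.empty(); // malformed base URL or href
        }
    }
}
```

With the base URL http://example.com/feeds/index.html, the href /rss.xml resolves to http://example.com/rss.xml and the href news.xml resolves to http://example.com/feeds/news.xml; an ftp link yields an empty Optional.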


