Examples of HTMLParser

appl.Portal.Utils.LinkSearch.HtmlParser
br.com.caelum.tubaina.parser.html.HtmlParser
br.com.caelum.tubaina.parser.html.desktop.HtmlParser
cn.edu.hfut.dmic.webcollector.parser.HtmlParser
Default web page parser. @author hu
com.flaptor.util.parser.HtmlParser
com.google.dart.engine.html.parser.HtmlParser
Instances of the class {@code HtmlParser} are used to parse tokens into an AST structure comprised of {@link XmlNode}s. @coverage dart.engine.html
com.google.gwt.thirdparty.streamhtmlparser.HtmlParser
com.salas.bb.utils.htmlparser.HtmlParser
Simplified and fast parser of HTML that detects text, tags and entities separately.
com.scraper.parser.HTMLParser
com.substanceofcode.utils.HTMLParser
Simple and lightweight HTML parser without complete error handling. @author Irving Bunton
de.mhus.lib.parser.HtmlParser
@author hummel
de.spotnik.util.html.HTMLParser
HTMLParser. @author Jens Rehpöhler @since 26.08.2006
edu.stanford.nlp.web.HTMLParser
Parses an HTML document and returns the plain text (and title). The main thing that HTMLParser is used for is the parse(String url) method, which will return a String with the contents of an HTML page, without the tags. After calling parse, you can get the HTML title (contents of the TITLE tag) by calling title(). Subclasses may override the handleText(), handleComment(), handleStartTag(), etc. methods so that parse(String url) returns something other than the text of the web page. (For example, one may be interested in returning only part of the text, or only the links.) @author Sepandar Kamvar (sdkamvar@stanford.edu)
nu.validator.htmlparser.sax.HtmlParser
This class implements an HTML5 parser that exposes data through the SAX2 interface.
By default, when using the constructor without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. To make the parser support non-conforming HTML fully per the HTML 5 spec, at the cost of potentially violating the SAX2 API contract, set the general XML violation policy to ALLOW. It is possible to treat XML 1.0 infoset violations as fatal by setting the general XML violation policy to FATAL.
By default, this parser doesn't do true streaming but buffers everything first. The parser can be made truly streaming by calling setStreamabilityViolationPolicy(XmlViolationPolicy.FATAL). This has the consequence that errors that require non-streamable recovery are treated as fatal.
By default, in order to make the parse events emulate the parse events for a DTDless XML document, the parser does not report the doctype through LexicalHandler. Doctype reporting through LexicalHandler can be turned on by calling setReportingDoctype(true). @version $Id$ @author hsivonen
org.ajax4jsf.webapp.HtmlParser
org.apache.droids.parse.html.HtmlParser
@version 1.0
org.apache.jmeter.protocol.http.parser.HTMLParser
HtmlParsers can parse HTML content to obtain URLs.
org.apache.lenya.lucene.html.HTMLParser
HTML Parser
org.apache.lenya.lucene.parser.HTMLParser
org.apache.lucene.demo.html.HTMLParser
org.apache.nutch.parse.html.HtmlParser
org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlParser
HtmlParser.java @author Walter Kasper
org.apache.tika.parser.html.HtmlParser
HTML parser. Uses TagSoup to turn the input document to HTML SAX events, and post-processes the events to produce XHTML and metadata expected by Tika clients.
org.jasen.interfaces.HTMLParser

Parses the HTML part of a message. @author Jason Polites
org.lobobrowser.html.parser.HtmlParser
rabbit.html.HtmlParser
This is a class that is used to parse a block of HTML code into separate tokens. This parser uses a recursive descent approach. @author Robert Olofsson
railo.runtime.search.lucene2.html.HTMLParser
saveReddit.parser.htmlParser
uk.ac.ucl.panda.utility.parser.HTMLParser
HTML Parsing Interface for test purposes
vmcreative.htmlparser.HTMLParser

Examples of com.substanceofcode.utils.HTMLParser

    throws IOException, CauseMemoryException, CauseException, Exception {
        /** Initialize item collection */
        Vector rssFeeds = new Vector();
        
        /** Initialize the HTML parser; the page is scanned for feed <link> tags */
        HTMLParser parser = new HTMLParser(encodingUtil);
        try {
            
      // The first element is the main tag.
            int elementType = parser.parse();
      // If we found the prologue, get the next entry.
      if( elementType == XmlParser.PROLOGUE ) {
        elementType = parser.parse();
      }
      if (elementType == XmlParser.END_DOCUMENT ) {
        return null;
      }
            
      boolean windows = parser.isWindows();
      boolean utf = parser.isUtf();
      boolean process = true;
      boolean bodyFound = false;
            do {
        /** RSS item properties */
        String title = "";
        String link = "";
                        
        String tagName = parser.getName();
        //#ifdef DLOGGING
        if (finerLoggable) {logger.finer("tagname: " + tagName);}
        //#endif
        switch (tagName.charAt(0)) {
          case 'b':
          case 'B':
            if (bodyFound) {
              continue;
            }
            bodyFound = parser.isBodyFound();
            if (bodyFound) {
              windows = parser.isWindows();
              utf = parser.isUtf();
            }
            // If looking for OPML link, it is in header.
            if ((!needRss || needFirstRss) && bodyFound) {
              process = false;
              break;
            }
            break;
          case 'l':
          case 'L':
            if (!tagName.toLowerCase().equals("link")) {
              break;
            }
            //#ifdef DLOGGING
            if (finerLoggable) {logger.finer("Parsing <link> tag");}
            //#endif
            
            // TODO base
            String type = parser.getAttributeValue( "type" );
            if (type == null) {
              continue;
            }
            if (!needRss && (type.toLowerCase().indexOf("opml") < 0)) {
              continue;
            }
            if (needRss &&
                ((type.toLowerCase().indexOf("rss") < 0) &&
                (type.toLowerCase().indexOf("atom") < 0))) {
              continue;
            }
            title = parser.getAttributeValue( "title" );
            // Allow null title so that the caller can
            // check if it needs to get the title another way.
            if (title != null) {
              title = EncodingUtil.replaceAlphaEntities(true,
                  title);
              title = EncodingUtil.replaceNumEntity(title);
              // Replace special chars like left quote, etc.
              // Since we have already converted to unicode, we want
              // to replace with uni chars.
              title = encodingUtil.replaceSpChars(title);


              title = StringUtil.removeHtml(title);
            }
            if (((link = parser.getAttributeValue( "href" ))
                  == null) || ( link.length() == 0 )) {
              continue;
            }
            if (link.charAt(0) == '/') {
              link = url + link;
            }
            
            /** Debugging information */
            System.out.println("Title:       " + title);
            System.out.println("Link:        " + link);
            
            /** 
             * Create new RSS item and add it to the RSS document's item
             * collection.  Account for wrong OPML which is an
             * OPML composed of other OPML.  These have url attribute
             * instead of link attribute.
             */
            if (!needRss || needFirstRss) {
              RssItunesFeed feed = new RssItunesFeed(title, link, "", "");
              rssFeeds.addElement( feed );
              process = false;
              break;
            }
            if (( feedURLFilter != null) &&
              ( link.toLowerCase().indexOf(feedURLFilter) < 0)) {
              continue;
            }
            if (( feedNameFilter != null) &&
              ((title != null) &&
              (title.toLowerCase().indexOf(feedNameFilter) < 0))) {
              continue;
            }
            RssItunesFeed feed = new RssItunesFeed(title, link, "", "");
            rssFeeds.addElement( feed );
            break;
          default:
        }
      }
            while( process && (parser.parse() != XmlParser.END_DOCUMENT) );
            
        } catch (CauseMemoryException ex) {
      CauseMemoryException cex = new CauseMemoryException(
          "Out of memory error while parsing HTML auto link feed " +
          url, ex);


Examples of com.substanceofcode.utils.HTMLParser

    throws IOException, CauseMemoryException, CauseException, Exception {
        /** Initialize item collection */
        Vector rssFeeds = new Vector();
        
        /** Initialize the HTML parser; the page is scanned for <a> links */
        HTMLParser parser = new HTMLParser(encodingUtil);
        try {
            
      // The first element is the main tag.
            int elementType = parser.parse();
      // If we found the prologue, get the next entry.
      if( elementType == XmlParser.PROLOGUE ) {
        elementType = parser.parse();
      }
      if (elementType == XmlParser.END_DOCUMENT ) {
        return null;
      }
            
      boolean bodyFound = false;
            do {
        if (elementType == HTMLParser.REDIRECT_URL) {
          RssItunesFeed [] feeds = new RssItunesFeed[1];
          feeds[0] = new RssItunesFeed("", parser.getRedirectUrl(),
              "", "");
          return feeds;
        }
        /** RSS item properties */
        String title = "";
        String link = "";
                        
        String tagName = parser.getName();
        //#ifdef DLOGGING
//@        if (finerLoggable) {logger.finer("tagname: " + tagName);}
        //#endif
        if (tagName.length() == 0) {
          continue;
        }
        switch (tagName.charAt(0)) {
          case 'm':
          case 'M':
            if (bodyFound) {
              break;
            }
            break;
          case 'b':
          case 'B':
            if (!bodyFound) {
              bodyFound = parser.isBodyFound();
            }
            break;
          case 'a':
          case 'A':
            //#ifdef DLOGGING
//@            if (finerLoggable) {logger.finer("Parsing <a> tag");}
            //#endif
            
            title = parser.getText();
            // Title can be 0 as this is used also for
            // getting 
            title = title.trim();
            title = StringUtil.removeHtml( title );


            if (((link = parser.getAttributeValue( "href" ))
                  == null) || ( link.length() == 0 )) {
              continue;
            }
            link = link.trim();
            if ( link.length() == 0 ) {
              continue;
            }
            if (link.indexOf("://") >= 0) {
              if (!link.startsWith("http:") &&
                !link.startsWith("https:") &&
                !link.startsWith("file:") &&
                 !link.startsWith("jar:")) {
                //#ifdef DLOGGING
//@                if (finerLoggable) {logger.finer("Not support for protocol or no protocol=" + link);}
                //#endif
                continue;
              }
            } else {
              if (link.charAt(0) == '/') {
                int purl = url.indexOf("://");
                if ((purl + 4) >= url.length()) {
                  //#ifdef DLOGGING
//@                  if (finerLoggable) {logger.finer("Url too short=" + url + "," + purl);}
                  //#endif
                  continue;
                }
                int pslash = url.indexOf("/", purl + 3);
                String burl = url;
                if (pslash >= 0) {
                  burl = url.substring(0, pslash);
                }
                link = burl + link;
              } else {
                link = url + "/" + link;
              }
            }
            
            /** Debugging information */
            //#ifdef DLOGGING
//@            if (finerLoggable) {logger.finer("Title:       " + title);}
//@            if (finerLoggable) {logger.finer("Link:        " + link);}
            //#endif
            if (( feedURLFilter != null) &&
              ( link.toLowerCase().indexOf(feedURLFilter) < 0)) {
              continue;
            }
            
            if (( feedNameFilter != null) &&
              ((title != null) &&
              (title.toLowerCase().indexOf(feedNameFilter) < 0))) {
              continue;
            }
            RssItunesFeed feed = new RssItunesFeed(title, link, "", "");
            rssFeeds.addElement( feed );
            break;
          default:
        }
            }
            while( (elementType = parser.parse()) != XmlParser.END_DOCUMENT );
            
        } catch (CauseMemoryException ex) {
      CauseMemoryException cex = new CauseMemoryException(
          "Out of memory error while parsing HTML Link feed " + url,
          ex);
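
The snippet above turns relative href values into absolute URLs by hand: it keeps links that already carry a supported scheme, prefixes root-relative links with the scheme and host of url, and appends everything else to url. As a point of comparison, here is a minimal sketch of the same step using the JDK's java.net.URI instead of string slicing. LinkResolver and its resolve method are hypothetical names for illustration and are not part of com.substanceofcode.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of com.substanceofcode.
final class LinkResolver {
    private LinkResolver() {}

    /** Resolves an href found on the page at pageUrl into an absolute URL. */
    static String resolve(String pageUrl, String href) {
        // URI.resolve applies the standard relative-reference rules:
        // absolute hrefs are returned unchanged, "/x" replaces the whole path,
        // and "x" replaces the last path segment of the page URL.
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/blog/index.html";
        System.out.println(resolve(page, "/feed.xml"));                 // http://example.com/feed.xml
        System.out.println(resolve(page, "atom.xml"));                  // http://example.com/blog/atom.xml
        System.out.println(resolve(page, "https://other.example/rss")); // https://other.example/rss
    }
}
```

Two differences from the original logic are worth noting. A plain relative link replaces the last path segment of the page URL here, whereas the snippet appends it after url + "/". Also, URI.create throws IllegalArgumentException for hrefs containing illegal characters such as spaces, so real pages would need a guard around the call.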


Examples of de.mhus.lib.parser.HtmlParser

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

import junit.framework.TestCase;


public class HtmlParserTest extends TestCase {


  public void testParser() throws IOException {
    HtmlParser parser = new HtmlParser();
    parser.setTrim(true);
    
    HtmlListener listener = new HtmlListener();
    
    Reader in = new InputStreamReader( MSystem.locateResource(this, getClass().getSimpleName() + ".xml").openStream() );
    
    parser.parse(in, listener);
    
    assertEquals("xml version=\"1.0\" encoding=\"ISO-8859-1\"", listener.pi.getFirst());
    assertEquals("Edited", listener.note.getFirst());
    assertEquals(listener.open.size(), listener.close.size());
    assertEquals(1, listener.single.size());


Examples of de.spotnik.util.html.HTMLParser

     * @param body the html message body
     * @return the content of the html without tags
     */
    public static String getText( String body)
    {
        HTMLParser parser = new HTMLParser(new StringReader(body));
        StringBuffer text = new StringBuffer();
        String line;
        
        try
        {
            BufferedReader reader = new BufferedReader(parser.getReader());
            
            while( (line = reader.readLine()) != null)
            {
                text.append(line).append('\n');
            }


Examples of edu.stanford.nlp.web.HTMLParser

  public List<Word> getWordsFromHTML(String fileOrURL) throws IOException {
    return getWordsFromHTML(fileOrURLToReader(fileOrURL));
  }


  public List<Word> getWordsFromHTML(Reader input) {
    HTMLParser parser = new HTMLParser();
    try {
      String s = parser.parse(input);
      return getWordsFromText(new StringReader(s));
    } catch (IOException e) {
      System.err.println("IOException: " + e.getMessage());
    }
    return null;
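
The class description for edu.stanford.nlp.web.HTMLParser mentions handleText(), handleComment() and handleStartTag() hooks plus a title() accessor. Those hook names resemble the JDK's HTMLEditorKit.ParserCallback, so the following sketch shows the same idea (tag-free text plus the page title) using only JDK classes. PlainTextExtractor is a hypothetical stand-in written for this page; it is not the Stanford class and makes no claim about how that class is implemented.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical stand-in built on JDK classes only; not edu.stanford.nlp.web.HTMLParser.
final class PlainTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder title = new StringBuilder();
    private boolean inTitle;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = true;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = false;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Title text and body text are collected separately.
        if (inTitle) {
            title.append(data);
        } else {
            text.append(data).append(' ');
        }
    }

    String text() {
        return text.toString().trim();
    }

    String title() {
        return title.toString().trim();
    }

    /** Parses the given markup and returns the filled-in callback. */
    static PlainTextExtractor extract(String html) {
        PlainTextExtractor callback = new PlainTextExtractor();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback;
    }

    public static void main(String[] args) {
        PlainTextExtractor result = extract(
            "<html><head><title>Hello</title></head>"
            + "<body><p>Some <b>bold</b> text.</p></body></html>");
        System.out.println("title: " + result.title());
        System.out.println("text:  " + result.text());
    }
}
```

Overriding a different hook, for example collecting href attributes in handleStartTag, is how a subclass would return only the links rather than the text.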


Examples of nu.validator.htmlparser.sax.HtmlParser

@SupportedFormat("text/html")
public class HTMLRDFaParser extends ClerezzaRDFaParser {


  @Override
  public XMLReader getReader() {
    HtmlParser reader = new HtmlParser();
    reader.setXmlPolicy(XmlViolationPolicy.ALLOW);
    reader.setXmlnsPolicy(XmlViolationPolicy.ALLOW);
    reader.setMappingLangToXmlLang(false);
    return reader;
  }
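
Because getReader() hands the configured HtmlParser back as an org.xml.sax.XMLReader, callers consume it through the ordinary SAX2 interface. The sketch below shows that consumption pattern with the JDK's built-in XML reader as a stand-in, so it runs without the validator.nu jar; the input is therefore well-formed XHTML, which the JDK reader requires. SaxElementLister and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; works with any SAX2 XMLReader.
final class SaxElementLister {
    private SaxElementLister() {}

    /** Returns the qualified names of all start tags reported by the reader. */
    static List<String> elementNames(XMLReader reader, String markup)
            throws IOException, SAXException {
        List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return names;
    }

    /** Convenience wrapper using the JDK's XML reader as a stand-in (assumption). */
    static List<String> elementNamesWithJdkReader(String markup) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return elementNames(reader, markup);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementNamesWithJdkReader(
            "<html><body><p>Hello</p></body></html>")); // [html, body, p]
    }
}
```

With the validator.nu jar on the classpath, the reader returned by getReader() above could be passed to elementNames in place of the JDK reader, and the markup would no longer need to be well-formed XML.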


Examples of org.ajax4jsf.webapp.HtmlParser

  private static ArrayStack _xhtmlParsersPool = new ArrayStack(STACK_SIZE);
  
  public NekkoXMLFilter() {}


  protected HtmlParser getParser(String mimetype, boolean isAjax, String viewId) {
    HtmlParser parser = null;
    if( isAjax ){
      parser = getXmlParser();
    } else if (mimetype.startsWith(TEXT_HTML) || mimetype.startsWith(APPLICATION_XHTML_XML)) {
      parser = new FastHtmlParser();
    } else {


Examples of org.apache.droids.parse.html.HtmlParser

public class DroidsFactory
{
  
  public static ParserFactory createDefaultParserFactory() {
    ParserFactory parserFactory = new ParserFactory();
    HtmlParser htmlParser = new HtmlParser();
    htmlParser.setElements(new HashMap<String, String>());
    htmlParser.getElements().put("a", "href");
    htmlParser.getElements().put("link", "href");
    htmlParser.getElements().put("img", "src");
    htmlParser.getElements().put("script", "src");
    parserFactory.setMap(new HashMap<String, Object>());
    parserFactory.getMap().put("text/html", htmlParser);
    return parserFactory;
  }
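
The elements map configured above pairs a tag name with the attribute that carries its outgoing link ("a" with "href", "img" with "src", and so on). How Droids uses the map internally is not shown on this page, so the following is only a guess at the intent: a sketch of map-driven link extraction written against the JDK's Swing HTML parser. ElementLinkExtractor is a hypothetical class and is not part of org.apache.droids.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical illustration of a tag -> attribute map; not the Droids implementation.
final class ElementLinkExtractor extends HTMLEditorKit.ParserCallback {
    private final Map<String, String> elements;
    private final List<String> links = new ArrayList<>();

    ElementLinkExtractor(Map<String, String> elements) {
        this.elements = elements;
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        collect(tag, attrs);
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // Empty elements such as <img> and <link> arrive here.
        collect(tag, attrs);
    }

    private void collect(HTML.Tag tag, MutableAttributeSet attrs) {
        String attrName = elements.get(tag.toString());
        if (attrName == null) {
            return; // tag is not configured as a link carrier
        }
        HTML.Attribute key = HTML.getAttributeKey(attrName);
        Object value = (key == null) ? null : attrs.getAttribute(key);
        if (value != null) {
            links.add(value.toString());
        }
    }

    /** Returns the values of the configured attributes, in document order. */
    static List<String> extract(String html, Map<String, String> elements) {
        ElementLinkExtractor callback = new ElementLinkExtractor(elements);
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.links;
    }

    public static void main(String[] args) {
        Map<String, String> elements = new HashMap<>();
        elements.put("a", "href");
        elements.put("link", "href");
        elements.put("img", "src");
        elements.put("script", "src");
        System.out.println(extract(
            "<html><head><link href=\"s.css\"></head>"
            + "<body><a href=\"/a\">x</a><img src=\"i.png\"></body></html>", elements));
    }
}
```

Keeping the tag-to-attribute pairs in a map means new link-bearing elements can be supported by adding an entry rather than changing the parsing code.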


Examples of org.apache.droids.parse.html.HtmlParser

    }
    String targetURL = args[0];
    
    // Create parser factory. Support basic HTML markup only
    ParserFactory parserFactory = new ParserFactory();
    HtmlParser htmlParser = new HtmlParser();
    htmlParser.setElements(new HashMap<String, String>());
    htmlParser.getElements().put("a", "href");
    htmlParser.getElements().put("link", "href");
    htmlParser.getElements().put("img", "src");
    htmlParser.getElements().put("script", "src");
    parserFactory.setMap(new HashMap<String, Object>());
    parserFactory.getMap().put("text/html", htmlParser);


    // Create protocol factory. Support HTTP/S only.
    ProtocolFactory protocolFactory = new ProtocolFactory();



All source code is the property of its respective owners. Java is a trademark of Sun Microsystems, Inc. and owned by ORACLE Inc. Contact coftware#gmail.com.