The parse(String url) method returns a String with the contents of an HTML page, without the tags. After calling parse, you can get the HTML title (the contents of the TITLE tag) by calling title(). Subclasses may override handleText(), handleComment(), handleStartTag(), and the other handler methods so that parse(String url) returns something other than the text of the web page, for example only part of the text, or only the links.
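The override pattern described above can be sketched with the JDK's own Swing HTML parser. This is a minimal illustration, not the documented class: TextExtractor, LinkExtractor, parseHtml and links() are hypothetical names, and parseHtml takes the markup itself rather than a URL so that the example stays self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical base class: collects page text and the TITLE contents. */
class TextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder title = new StringBuilder();
    private boolean inTitle = false;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TITLE) inTitle = true;
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) inTitle = false;
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Title text is kept apart from the body text.
        if (inTitle) title.append(data);
        else text.append(data).append(' ');
    }

    /** Assumption: parses markup passed as a string, not a URL. */
    public String parseHtml(String html) {
        text.setLength(0);
        title.setLength(0);
        inTitle = false;
        try {
            new ParserDelegator().parse(new StringReader(html), this, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }

    public String title() {
        return title.toString().trim();
    }
}

/** Hypothetical subclass: additionally records the href of every A tag. */
class LinkExtractor extends TextExtractor {
    private final List<String> links = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        super.handleStartTag(tag, attrs, pos);
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) links.add(href.toString());
        }
    }

    @Override
    public String parseHtml(String html) {
        links.clear();
        return super.parseHtml(html);
    }

    public List<String> links() {
        return links;
    }
}
```

After parseHtml returns, title() yields the TITLE contents, and a LinkExtractor additionally exposes the collected links. Overriding a single handler is enough to change what is gathered during the parse.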
@author Sepandar Kamvar (sdkamvar@stanford.edu)
By default, when using the constructor without arguments, this parser coerces XML 1.0-incompatible infosets into XML 1.0-compatible infosets. This corresponds to ALTER_INFOSET as the general XML violation policy. To make the parser support non-conforming HTML fully per the HTML 5 spec, at the cost of potentially violating the SAX2 API contract, set the general XML violation policy to ALLOW. It is possible to treat XML 1.0 infoset violations as fatal by setting the general XML violation policy to FATAL.
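The three policies can be pictured with a toy model. This is a sketch under loud assumptions, not the parser's code: ViolationPolicy, NameChecker, report and the underscore substitution are all hypothetical, and the name check is a simplified ASCII-only stand-in for the real XML 1.0 name rules.

```java
/** Toy policy enum; the constant names mirror the text above. */
enum ViolationPolicy { ALLOW, ALTER_INFOSET, FATAL }

/** Hypothetical checker that applies a policy to one element or attribute name. */
class NameChecker {
    private final ViolationPolicy policy;

    /** The default mirrors the no-argument behavior described above. */
    NameChecker() {
        this(ViolationPolicy.ALTER_INFOSET);
    }

    NameChecker(ViolationPolicy policy) {
        this.policy = policy;
    }

    /** Simplified assumption: ASCII letter or '_' first, then letters, digits, '_', '.', '-'. */
    static boolean isXmlCompatible(String name) {
        return name.matches("[A-Za-z_][A-Za-z0-9_.-]*");
    }

    /** Returns the name as it would be reported under the configured policy. */
    String report(String name) {
        if (isXmlCompatible(name)) return name;
        switch (policy) {
            case ALLOW:
                return name;                      // pass the violation through unchanged
            case FATAL:
                throw new IllegalArgumentException("Not an XML 1.0 name: " + name);
            default:
                return coerce(name);              // ALTER_INFOSET: make the name compatible
        }
    }

    /** Illustrative coercion only: replace offending characters with '_'. */
    private static String coerce(String name) {
        String s = name.replaceAll("[^A-Za-z0-9_.-]", "_");
        return s.matches("[A-Za-z_].*") ? s : "_" + s;
    }
}
```

With the default policy, new NameChecker().report("data@id") returns "data_id". The same call passes the name through unchanged under ALLOW and throws under FATAL.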
By default, this parser doesn't do true streaming but buffers everything first. The parser can be made truly streaming by calling setStreamabilityViolationPolicy(XmlViolationPolicy.FATAL). This has the consequence that errors that require non-streamable recovery are treated as fatal.
By default, in order to make the parse events emulate the parse events for a DTDless XML document, the parser does not report the doctype through LexicalHandler. Doctype reporting through LexicalHandler can be turned on by calling setReportingDoctype(true).
@version $Id$
@author hsivonen
Parses the HTML part of a message.
@author Jason Polites