List of fullSequentialParse() Examples

net.htmlparser.jericho.Source.fullSequentialParse()
Parses all of the tags in this source document sequentially from beginning to end.
Calling this method can greatly improve performance if most or all of the tags in the document need to be parsed.
Calling the getAllTags(), getAllStartTags(), getAllElements(), getChildElements(), iterator() or getNodeIterator() method on the Source object performs a full sequential parse automatically. There are, however, still circumstances where it should be called manually, such as when it is known that most or all of the tags in the document will need to be parsed, but none of the above-mentioned methods are used, or they are called only after one or more other tag search methods.
If this method is called manually, it should be called soon after the Source object is created, before any tag search methods are called.
By default, tags are parsed only as needed, which is referred to as parse on demand mode. In this mode, every call to a tag search method that is not returning previously cached tags must perform a relatively complex check to determine whether a potential tag is in a valid position (see TagType.isValidPosition(Source, int, int[])).
Generally speaking, a tag is in a valid position if it does not appear inside any other tag. Server tags (see TagType.isServerTag()) can appear anywhere in a document, including inside other tags, so this relates only to non-server tags. Theoretically, checking whether a specified position in the document is enclosed in another tag is only possible if every preceding tag has been parsed; otherwise it is impossible to tell whether one of the delimiters of the enclosing tag was in fact enclosed by some other tag before it, thereby invalidating it.
When this method is called, each tag is parsed in sequence starting from the beginning of the document, making it easy to check whether each potential tag is in a valid position. In parse on demand mode a compromise technique must be used for this check, since the theoretical requirement of having parsed all preceding tags is no longer practical. This compromise involves only checking whether the position is enclosed by other tags of certain tag types (see TagType.getTagTypesIgnoringEnclosedMarkup()). The added complexity of this technique makes parsing each tag slower than in a full sequential parse, but when only a few tags need parsing this is an extremely beneficial trade-off.
The documentation of the TagType.isValidPosition(Source, int pos, int[] fullSequentialParseData) method, which is called internally by the parser to perform the valid position check, includes a more detailed explanation of the differences between the two modes of operation.
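The idea that a sequential pass makes the valid position check easy can be illustrated with a toy model. The sketch below is not Jericho's implementation: the class ValidPositionSketch, its methods, and the simplified notion of a tag (a '<' up to the next '>' outside double quotes) are all assumptions made for illustration. Because every earlier tag has already been consumed, deciding whether a '<' lies inside another tag reduces to a single comparison against the end of the last accepted tag.

```java
import java.util.ArrayList;
import java.util.List;

public class ValidPositionSketch {

    /** Returns the exclusive end offset of a toy tag starting at start, or -1 if it never closes. */
    static int findTagEnd(String text, int start) {
        boolean inQuotes = false;
        for (int i = start + 1; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;          // toggle on every double quote
            } else if (c == '>' && !inQuotes) {
                return i + 1;                  // first unquoted '>' closes the tag
            }
        }
        return -1;
    }

    /** Scans from the beginning and returns the start offsets of tags that are not inside an earlier tag. */
    static List<Integer> sequentialTagStarts(String text) {
        List<Integer> starts = new ArrayList<>();
        int lastTagEnd = 0; // exclusive end of the most recently accepted tag
        for (int pos = text.indexOf('<'); pos != -1; pos = text.indexOf('<', pos + 1)) {
            if (pos < lastTagEnd) {
                continue;                      // this '<' sits inside the previous tag: invalid position
            }
            int end = findTagEnd(text, pos);
            if (end == -1) {
                break;                         // unterminated tag, stop scanning
            }
            starts.add(pos);
            lastTagEnd = end;
        }
        return starts;
    }

    public static void main(String[] args) {
        // The "<b>" inside the attribute value is skipped because it lies inside the <a ...> tag.
        System.out.println(sequentialTagStarts("<a title=\"<b>\">x<i>"));
    }
}
```

A check that starts at an arbitrary offset has no lastTagEnd to compare against, which is why, as described above, parse on demand mode has to fall back on a compromise technique.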
Calling this method a second or subsequent time has no effect.
This method returns the same tags as the Source.getAllTags() method, but as an array instead of a list.
If this method is called after any of the tag search methods are called, the cache (see getCacheDebugInfo()) is cleared of any previously found tags before being restocked via the full sequential parse. This means that if you still have references to tags or elements from before the full sequential parse, they will not be the same objects as those that are returned by tag search methods after the full sequential parse, which can cause confusion if you are assigning user data to tags (see Tag.setUserData(Object)). It is also significant if the Segment.ignoreWhenParsing() method has been called since the tags were first found, as any tags inside the ignored segments will no longer be returned by any of the tag search methods.
See also the Tag class documentation for more general details about how tags are parsed.
Returns: an array of all tags in this source document.
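The return value and the cache reset described above can be explored with a small sketch. It assumes the Jericho HTML Parser jar is on the classpath; the class name CacheResetSketch, its helper methods, and the sample markup are hypothetical and exist only for illustration.

```java
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.Tag;

public class CacheResetSketch {

    /** Checks that the array returned by fullSequentialParse() has as many entries as getAllTags(). */
    static boolean tagCountsAgree(String html) {
        Source source = new Source(html);
        Tag[] parsed = source.fullSequentialParse(); // called right after construction
        return parsed.length == source.getAllTags().size();
    }

    /** Looks up the first <p> start tag before and after a late full sequential parse. */
    static boolean sameObjectAfterFullParse(String html) {
        Source source = new Source(html);
        StartTag before = source.getFirstStartTag(HTMLElementName.P); // found in parse on demand mode
        source.fullSequentialParse();                                  // cache is cleared and restocked
        StartTag after = source.getFirstStartTag(HTMLElementName.P);
        return before == after;                                        // identity, not equality
    }

    public static void main(String[] args) {
        String html = "<p>one</p><p>two</p>"; // hypothetical sample input
        System.out.println("counts agree: " + tagCountsAgree(html));
        System.out.println("same object after late full parse: " + sameObjectAfterFullParse(html));
    }
}
```

Going by the description above, the second lookup should hand back a different object, so user data attached to the earlier tag would not be visible on the later one. Calling fullSequentialParse() immediately after creating the Source, as in tagCountsAgree, sidesteps the problem.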

Examples of net.htmlparser.jericho.Source.fullSequentialParse()


  public static String extractTagMatching(String html, TagOccurrence toGet) {
    log.debug("looking for {} in tags: {}", toGet.getMatching(), toGet.getTag());
    String found = null;
    Source source = new Source(html);
    source.fullSequentialParse();
    log.debug("source = {}", source);
    List<Element> elements = source.getAllElements(HTMLElementName.TABLE);
    for (Element element : elements) {
      log.debug("this element = {}", element);
      String elementText = element.getTextExtractor().toString();


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

  }


  public static String extractSessionId(URL url, String sessionIDName) throws IOException {
    String sessionID = null;
    Source source = new Source(url);
    source.fullSequentialParse();
    List<Element> links = source.getAllElements(HTMLElementName.A);
    for (Element link : links) {
      // log.info("link: {}", link.toString());
      String href = link.getAttributeValue("href");
      if (href != null && href.contains(sessionIDName)) {


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

      throw new IllegalStateException("Manipulator " + this.getClass().getName() + " returned null.");
    }
    if (type == OperationType.Manipulator) {
      log.debug("reassigning source..");
      Source newSource = new Source(result);
      newSource.fullSequentialParse();
      this.source = newSource;
    }
    if (successor != null) {
      successor.execute(this.source);
    }


Examples of net.htmlparser.jericho.Source.fullSequentialParse()


  private String extractTagMatching(String html, TagOccurrence toGet) {
    log.debug("looking for {} in tags: {}", toGet.getMatching(), toGet.getTag());
    String found = null;
    Source source = new Source(html);
    source.fullSequentialParse();
    List<Element> elements = source.getAllElements(HTMLElementName.TABLE);
    for (Element element : elements) {
      String elementText = element.getTextExtractor().toString();
      if (elementText.contains(toGet.getMatching())) {
        found = element.toString();


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

      source = new Source(this.url);
    } catch (FileNotFoundException e) {
      log.info("Error while sourcing URL = {}, error description = {}", this.url, e.toString());
      return new ArrayList<Field>();
    }
    source.fullSequentialParse();


    List<Element> tables = source.getAllElements(HTMLElementName.TABLE);


    for (Element table : tables) {
      extractedFields.addAll(extractFieldsFromTable(table.toString()));


Examples of net.htmlparser.jericho.Source.fullSequentialParse()


  private List<Field> extractFieldsFromTable(String html) {
    // log.debug("extracting fields from table: {}", html);
    List<Field> extractedFields = new ArrayList<Field>();
    Source source = new Source(html);
    source.fullSequentialParse();
    List<Element> cells = source.getAllElements(HTMLElementName.TD);
    int rows = source.getAllElements(HTMLElementName.TR).size();
    log.debug("found {} cells in {} rows", cells.size(), rows);
    if (cells.size() == (rows * 2)) {
      Field lastField = null;


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

  }


  private List<Field> extractFieldsFromDL(String html) {
    List<Field> extractedFields = new ArrayList<Field>();
    Source source = new Source(html);
    source.fullSequentialParse();
    List<Element> labels = source.getAllElements(HTMLElementName.DT);
    List<Element> values = source.getAllElements(HTMLElementName.DD);
    int cellCount = Math.min(labels.size(), values.size());
    for (int i = 0; i < cellCount; i++) {
      String label = labels.get(i).getTextExtractor().toString().trim().replaceAll(":$", "");


Examples of net.htmlparser.jericho.Source.fullSequentialParse()


  private List<Field> extractFieldsFromTable(String html) {
    log.debug("extracting fields from table: {}", html);
    List<Field> extractedFields = new ArrayList<Field>();
    Source source = new Source(html);
    source.fullSequentialParse();
    int cellCount = source.getAllElements(HTMLElementName.TD).size();
    int rowCount = source.getAllElements(HTMLElementName.TR).size();
    log.debug("found {} cells in {} rows", cellCount, rowCount);
    if (cellCount == (rowCount * 2)) {
      Field lastField = null;


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

  }


  private List<Field> extractFieldsFromUL(String html) {
    List<Field> extractedFields = new ArrayList<Field>();
    Source source = new Source(html);
    source.fullSequentialParse();
    List<Element> lis = source.getAllElements(HTMLElementName.LI);
    for (Element li : lis) {
      log.debug("looking at li: {} w/text: {}", li, li.getTextExtractor().toString());
      String[] parts = li.getTextExtractor().toString().split(":");
      if (parts.length == 2) {


