List of fullSequentialParse() Examples


net.htmlparser.jericho.Source.fullSequentialParse()
Parses all of the {@linkplain Tag tags} in this source document sequentially from beginning to end.
Calling this method can greatly improve performance if most or all of the tags in the document need to be parsed.
Calling the {@link #getAllTags()}, {@link #getAllStartTags()}, {@link #getAllElements()}, {@link #getChildElements()}, {@link #iterator()} or {@link #getNodeIterator()} method on the Source object performs a full sequential parse automatically. There are, however, still circumstances where it should be called manually, such as when it is known that most or all of the tags in the document will need to be parsed, but none of the above-mentioned methods are used, or they are called only after calling one or more other tag search methods.
If this method is called manually, it should be called soon after the Source object is created, before any tag search methods are called.
By default, tags are parsed only as needed, which is referred to as parse on demand mode. In this mode, every call to a tag search method that is not returning previously cached tags must perform a relatively complex check to determine whether a potential tag is in a {@linkplain TagType#isValidPosition(Source,int,int[]) valid position}.
Generally speaking, a tag is in a valid position if it does not appear inside any other tag. {@linkplain TagType#isServerTag() Server tags} can appear anywhere in a document, including inside other tags, so this relates only to non-server tags. Theoretically, checking whether a specified position in the document is enclosed in another tag is only possible if every preceding tag has been parsed. Otherwise it is impossible to tell whether one of the delimiters of the enclosing tag was in fact enclosed by some other tag before it, thereby invalidating it.
When this method is called, each tag is parsed in sequence starting from the beginning of the document, making it easy to check whether each potential tag is in a valid position. In parse on demand mode a compromise technique must be used for this check, since the theoretical requirement of having parsed all preceding tags is no longer practical. This compromise involves only checking whether the position is enclosed by other tags with {@linkplain TagType#getTagTypesIgnoringEnclosedMarkup() certain tag types}. The added complexity of this technique makes parsing each tag slower compared to when a full sequential parse is performed, but when only a few tags need parsing this is an extremely beneficial trade-off.
The documentation of the {@link TagType#isValidPosition(Source, int pos, int[] fullSequentialParseData)} method, which is called internally by the parser to perform the valid position check, includes a more detailed explanation of the differences between the two modes of operation.
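The reasoning above, that a position can only be validated reliably once every preceding tag has been seen, can be illustrated with a toy scanner. This is a minimal sketch in plain Java and not Jericho's implementation: the class name SequentialTagScan and the method validTagStarts are hypothetical, and the scanner assumes a simplified notion of a tag (a `<` up to the next unquoted `>`), ignoring comments, server tags and other tag types.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: NOT part of Jericho and not how Jericho parses.
public class SequentialTagScan {

    /**
     * Walks the text once from the beginning and returns the offset of every
     * '<' that opens a tag while the scanner is outside any other tag.
     * A '<' met inside a tag (for example inside a quoted attribute value)
     * is skipped, because the scanner already knows it is enclosed.
     */
    static List<Integer> validTagStarts(String text) {
        List<Integer> starts = new ArrayList<>();
        boolean insideTag = false;
        char quote = 0; // 0 means "not inside a quoted attribute value"
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!insideTag) {
                if (c == '<') {
                    starts.add(i);
                    insideTag = true;
                }
            } else if (quote != 0) {
                if (c == quote) {
                    quote = 0; // closing quote of the attribute value
                }
            } else if (c == '"' || c == '\'') {
                quote = c; // opening quote of an attribute value
            } else if (c == '>') {
                insideTag = false; // end of the current tag
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // The "<b>" at offset 10 sits inside the title attribute of the <a> tag.
        String html = "<a title=\"<b>\">x</a>";
        System.out.println(validTagStarts(html)); // prints [0, 16]
    }
}
```

Looking at offset 10 in isolation, `<b>` resembles a start tag. Only the state carried forward from offset 0 shows that it is enclosed, which is the information a sequential pass has for free and an on-demand lookup has to approximate.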
Calling this method a second or subsequent time has no effect.
This method returns the same list of tags as the {@link Source#getAllTags() Source.getAllTags()} method, but as an array instead of a list.
If this method is called after any of the tag search methods are called, the {@linkplain #getCacheDebugInfo() cache} is cleared of any previously found tags before being restocked via the full sequential parse. This means that if you still have references to tags or elements from before the full sequential parse, they will not be the same objects as those that are returned by tag search methods after the full sequential parse, which can cause confusion if you are allocating {@linkplain Tag#setUserData(Object) user data} to tags. It is also significant if the {@link Segment#ignoreWhenParsing()} method has been called since the tags were first found, as any tags inside the ignored segments will no longer be returned by any of the tag search methods.
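The identity hazard described above can be pictured with a small stand-alone model. This is a sketch under stated assumptions, not Jericho code: ToyTag, ToySource and their methods are hypothetical stand-ins that only mimic the documented behaviour of clearing the cache and restocking it with new objects.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: these classes are hypothetical and are NOT Jericho types.
public class TagCacheIdentityDemo {

    /** Stand-in for a parsed tag that can carry user data. */
    static final class ToyTag {
        final int begin;
        Object userData;

        ToyTag(int begin) {
            this.begin = begin;
        }
    }

    /** Stand-in for a source document with a position-keyed tag cache. */
    static final class ToySource {
        private final Map<Integer, ToyTag> cache = new HashMap<>();

        /** Parse on demand: reuse the cached tag at this position or create one. */
        ToyTag tagAt(int begin) {
            return cache.computeIfAbsent(begin, ToyTag::new);
        }

        /** Mimics the documented behaviour: drop cached tags, then restock. */
        void fullSequentialParse(int... tagPositions) {
            cache.clear();
            for (int begin : tagPositions) {
                cache.put(begin, new ToyTag(begin));
            }
        }
    }

    public static void main(String[] args) {
        ToySource source = new ToySource();
        ToyTag before = source.tagAt(0);   // found by an on-demand search
        before.userData = "visited";

        source.fullSequentialParse(0, 16); // cache cleared and restocked
        ToyTag after = source.tagAt(0);

        System.out.println(before == after); // prints false
        System.out.println(after.userData);  // prints null
    }
}
```

In this model the data attached before the full parse stays on the old object and is invisible through later lookups, which is one more reason to run the full sequential parse before any tag search.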
See also the {@link Tag} class documentation for more general details about how tags are parsed.
@return an array of all {@linkplain Tag tags} in this source document.

Examples of net.htmlparser.jericho.Source.fullSequentialParse()

   * @throws IOException
   */
  public String execute() throws IOException {
    String result = "";
    Source source = new Source(url);
    source.fullSequentialParse();
    // log.debug("parsed source: {}", source.toString());


    if (idToGet != null) {
      result = source.getElementById(idToGet).getTextExtractor().toString();
    } else if (classToGet != null) {


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

   */
  public List<String> getResults() throws IOException {
    log.debug("extracting results from url: {}", url);
    List<String> results = new ArrayList<String>();
    Source source = new Source(url);
    source.fullSequentialParse();
    String content = source.toString();
    List<Element> currentElements = null;
    for (TagOccurrence toGet : tagsToGet) {
      log.debug("toGet = {}", toGet);
      if (toGet.getOccurrence() > 0) {


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

    for (int i = 0; i < afterTagOccurrence.getOccurrence(); i++) {
      sourceHtml = sourceHtml.substring(sourceHtml.indexOf(endAfterTag) + 1);
    }
    String afterSource = sourceHtml;
    Source newSource = new Source(afterSource);
    newSource.fullSequentialParse();
    return newSource;
  }


  public Extractor asText() {
    this.outputFormat = OutputFormats.Text;


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

      extractedFields.addAll(defaultFieldExtractions());
    }


    if (this.classToGet != null) {
      Source source = new Source(url);
      source.fullSequentialParse();
      List<Element> elements = source.getAllElementsByClass(classToGet);
      String text = elements.get(0).toString();
      String[] fields = text.split("<br>");
      log.debug("fields: {}", fields);
      for (String field : fields) {


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

    }


    Source source = new Source(url);
    // Parse once up front; repeating the call on every iteration has no effect.
    source.fullSequentialParse();
    for (TagOccurrence tagOccurrence : this.tagsToGet) {
      // log.debug("extracting fields using tag: {}", tagOccurrence);
      if (!(tagOccurrence.getTag().contains(HTMLElementName.TABLE) || tagOccurrence.getTag().contains(
          HTMLElementName.A))) {
        throw new IllegalStateException(MessageFormat.format(
            "Asked to extract tag: {0}, only know how to extract fields from tables.",
            tagOccurrence.getTag()));


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

        }


      }
    }
    source = new Source(url);
    source.fullSequentialParse();
    if (this.afterTagOccurrence != null) {
      source = pruneFrom(source, afterTagOccurrence);
    }
    for (FieldToGet fieldToGet : fieldsToGet) {
      String value = "";


Examples of net.htmlparser.jericho.Source.fullSequentialParse()


  private List<Field> extractLinksFromList(String html) {
    log.debug("extracting links...");
    List<Field> fields = new ArrayList<Field>();
    Source source = new Source(html);
    source.fullSequentialParse();
    List<Element> links = source.getAllElements(HTMLElementName.A);
    for (Element a : links) {
      String label = a.getTextExtractor().toString();
      String href = a.getAttributeValue("href");
      // getAttributeValue returns null when the anchor has no href attribute
      if (matchingPattern == null || (href != null && href.contains(matchingPattern))) {


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

    return this;
  }


  private boolean fieldHasMultipleValues(String fieldValue) {
    Source source = new Source(fieldValue);
    source.fullSequentialParse();
    return source.getAllElements(HTMLElementName.BR).size() > 1;
  }


  private String delimitFieldValues(String source) {
    Source result = new Source(source.replace("<br>", ";").replace("<br/>", ";"));


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

    }
  }


  public static List<Link> extractLinks(String sourceToParse) {
    Source source = new Source(sourceToParse);
    source.fullSequentialParse();
    List<Link> links = new ArrayList<Link>();
    List<Element> as = source.getAllElements(HTMLElementName.A);
    for (Element linkElement : as) {
      links.add(new Link(linkElement.getTextExtractor().toString(), linkElement.getAttributeValue("href")));
    }


Examples of net.htmlparser.jericho.Source.fullSequentialParse()

  }


  public static String extractUsingIdentifier(String html, TagOccurrence tagOccurrence) {
    String result = null;
    Source source = new Source(html);
    source.fullSequentialParse();
    if (tagOccurrence.getElementIdentifierType() == ElementIdentifierType.ID) {
      log.debug("extracting tag by id: {}", tagOccurrence.getIdentifier());
      Element idElement = source.getElementById(tagOccurrence.getIdentifier());
      if (idElement != null) {
        result = idElement.toString();

