Examples of net.htmlparser.jericho.Element

net.htmlparser.jericho.Element
Represents an element (as defined in section 3.2.1 of the HTML 4.01 specification) in a specific {@linkplain Source source} document, which encompasses a {@linkplain #getStartTag() start tag}, an optional {@linkplain #getEndTag() end tag} and all {@linkplain #getContent() content} in between.
Take the following HTML segment as an example:
<p>This is a sample paragraph.</p>
The whole segment is represented by an Element object. It comprises the {@link StartTag} "<p>", the {@link EndTag} "</p>", as well as the text in between. An element may also contain other elements between its start and end tags.
The term normal element refers to an element having a {@linkplain #getStartTag() start tag} with a {@linkplain StartTag#getStartTagType() type} of {@link StartTagType#NORMAL}. This comprises all {@linkplain HTMLElements HTML elements} and non-HTML elements.
Element instances are obtained using one of the following methods:
- {@link StartTag#getElement()}
- {@link EndTag#getElement()}
- {@link Segment#getAllElements()}
- {@link Segment#getAllElements(String name)}
- {@link Segment#getAllElements(StartTagType)}
See also the {@link HTMLElements} class, and the XML 1.0 specification for elements.
Element Structure

The three possible structures of an element are listed below:
Single Tag Element:
Example:
<img src="mypicture.jpg">
The element consists only of a single {@linkplain #getStartTag() start tag} and has no {@linkplain #getContent() element content} (although the start tag itself may have {@linkplain StartTag#getTagContent() tag content}).
{@link #getEndTag()}==null
{@link #isEmpty()}==true
{@link #getEnd() getEnd()}=={@link #getStartTag()}.{@link #getEnd() getEnd()}
This occurs in the following situations:
- An HTML element for which the {@linkplain HTMLElements#getEndTagForbiddenElementNames() end tag is forbidden}.
- An HTML element for which the {@linkplain HTMLElements#getEndTagRequiredElementNames() end tag is required}, but the end tag is not present in the source document.
- An HTML element for which the {@linkplain HTMLElements#getEndTagOptionalElementNames() end tag is optional}, where the implicitly terminating tag is situated immediately after the element's {@linkplain #getStartTag() start tag}.
- An {@linkplain #isEmptyElementTag() empty element tag}
- A non-HTML element that is not an {@linkplain #isEmptyElementTag() empty element tag} but is missing its end tag.
- An element with a start tag of a {@linkplain StartTag#getStartTagType() type} that does not define a {@linkplain StartTagType#getCorrespondingEndTagType() corresponding end tag type}.
- An element with a start tag of a {@linkplain StartTag#getStartTagType() type} that does define a {@linkplain StartTagType#getCorrespondingEndTagType() corresponding end tag type} but is missing its end tag.
Explicitly Terminated Element:
Example:
<p>This is a sample paragraph.</p>
The element consists of a {@linkplain #getStartTag() start tag}, {@linkplain #getContent() content}, and an {@linkplain #getEndTag() end tag}.
{@link #getEndTag()}!=null.
{@link #isEmpty()}==false (provided the end tag doesn't immediately follow the start tag)
{@link #getEnd() getEnd()}=={@link #getEndTag()}.{@link #getEnd() getEnd()}.
This occurs in the following situations, assuming the start tag's matching end tag is present in the source document:
- An HTML element for which the end tag is either {@linkplain HTMLElements#getEndTagRequiredElementNames() required} or {@linkplain HTMLElements#getEndTagOptionalElementNames() optional}.
- A non-HTML element that is not an {@linkplain #isEmptyElementTag() empty element tag}.
- An element with a start tag of a {@linkplain StartTag#getStartTagType() type} that defines a {@linkplain StartTagType#getCorrespondingEndTagType() corresponding end tag type}.
Implicitly Terminated Element:
Example:
<p>This text is included in the paragraph element even though no end tag is present.
<p>This is the next paragraph.
The element consists of a {@linkplain #getStartTag() start tag} and {@linkplain #getContent() content}, but no {@linkplain #getEndTag() end tag}.
{@link #getEndTag()}==null.
{@link #isEmpty()}==false
{@link #getEnd() getEnd()}!={@link #getStartTag()}.{@link #getEnd() getEnd()}.
This only occurs in an HTML element for which the {@linkplain HTMLElements#getEndTagOptionalElementNames() end tag is optional}.
The element ends at the start of a tag which implies the termination of the element, called the implicitly terminating tag. If the implicitly terminating tag is situated immediately after the element's {@linkplain #getStartTag() start tag}, the element is classed as a single tag element.
See the element parsing rules for HTML elements with optional end tags for details on which tags can implicitly terminate a given element.
See also the documentation of the {@link HTMLElements#getEndTagOptionalElementNames()} method.
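The three structures can be told apart from the invariants listed for each of them. The following is a minimal sketch in plain Java, not part of the library: the class `ElementStructureSketch`, its `Structure` enum and the `classify` method are hypothetical names, and the three parameters are assumed to stand in for `getEndTag() != null`, `getStartTag().getEnd()` and `getEnd()` respectively.

```java
final class ElementStructureSketch {
  /** The three structures described in this section. */
  enum Structure { SINGLE_TAG, EXPLICITLY_TERMINATED, IMPLICITLY_TERMINATED }

  /**
   * Classifies an element from the invariants given in the text.
   *
   * @param hasEndTag   stands in for getEndTag() != null
   * @param startTagEnd stands in for getStartTag().getEnd()
   * @param elementEnd  stands in for getEnd()
   */
  static Structure classify(boolean hasEndTag, int startTagEnd, int elementEnd) {
    if (hasEndTag) {
      // An end tag is present, so the element ends where that tag ends.
      return Structure.EXPLICITLY_TERMINATED;
    }
    // No end tag: either the element stops at its own start tag,
    // or it runs on until an implicitly terminating tag (or end of file).
    return elementEnd == startTagEnd
        ? Structure.SINGLE_TAG
        : Structure.IMPLICITLY_TERMINATED;
  }
}
```

With the library on the classpath, a call would presumably look like `classify(element.getEndTag() != null, element.getStartTag().getEnd(), element.getEnd())`. Note that an element such as `<p></p>` is still explicitly terminated even though `isEmpty()` is true for it.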
Element Parsing Rules
The following rules describe the algorithm used in the {@link StartTag#getElement()} method to construct an element. The detection of the start tag's matching end tag or other terminating tags always takes into account the possible nesting of elements.
- If the start tag has a {@linkplain StartTag#getStartTagType() type} of {@link StartTagType#NORMAL}:
  - If the {@linkplain StartTag#getName() name} of the start tag matches one of the recognised {@linkplain HTMLElementName HTML element names} (indicating an HTML element):
    - If the end tag for an element of this {@linkplain StartTag#getName() name} is {@linkplain HTMLElements#getEndTagForbiddenElementNames() forbidden}, the parser does not conduct any search for an end tag and a single tag element is created.
    - If the end tag for an element of this {@linkplain StartTag#getName() name} is {@linkplain HTMLElements#getEndTagRequiredElementNames() required}, the parser searches for the start tag's matching end tag.
      - If the matching end tag is found, an explicitly terminated element is created.
      - If no matching end tag is found, the source document is not valid HTML and the incident is {@linkplain Source#getLogger() logged} as a missing required end tag. In this situation a single tag element is created.
    - If the end tag for an element of this {@linkplain StartTag#getName() name} is {@linkplain HTMLElements#getEndTagOptionalElementNames() optional}, the parser searches not only for the start tag's matching end tag, but also for any other tag that implicitly terminates the element.
      For each tag (T2) following the start tag (ST1) of this element (E1):
      - If T2 is a start tag:
        - If the {@linkplain StartTag#getName() name} of T2 is in the list of {@linkplain HTMLElements#getNonterminatingElementNames(String) non-terminating element names} for E1, then continue evaluating tags from the {@linkplain Element#getEnd() end} of T2's corresponding {@linkplain StartTag#getElement() element}.
        - If the {@linkplain StartTag#getName() name} of T2 is in the list of {@linkplain HTMLElements#getTerminatingStartTagNames(String) terminating start tag names} for E1, then E1 ends at the {@linkplain StartTag#getBegin() beginning} of T2. If T2 follows immediately after ST1, a single tag element is created, otherwise an implicitly terminated element is created.
      - If T2 is an end tag:
        - If the {@linkplain EndTag#getName() name} of T2 is the same as that of ST1, an explicitly terminated element is created.
        - If the {@linkplain EndTag#getName() name} of T2 is in the list of {@linkplain HTMLElements#getTerminatingEndTagNames(String) terminating end tag names} for E1, then E1 ends at the {@linkplain EndTag#getBegin() beginning} of T2. If T2 follows immediately after ST1, a single tag element is created, otherwise an implicitly terminated element is created.
      If no more tags are present in the source document, then E1 ends at the end of the file, and an implicitly terminated element is created.
    Note that the syntactical indication of an {@linkplain StartTag#isSyntacticalEmptyElementTag() empty-element tag} in the start tag is ignored when determining the end of HTML elements. See the documentation of the {@link #isEmptyElementTag()} method for more information.
  - If the {@linkplain StartTag#getName() name} of the start tag does not match one of the recognised {@linkplain HTMLElementName HTML element names} (indicating a non-HTML element):
    - If the start tag is {@linkplain StartTag#isSyntacticalEmptyElementTag() syntactically an empty-element tag}, the parser does not conduct any search for an end tag and a single tag element is created.
    - Otherwise, section 3.1 of the XML 1.0 specification states that a matching end tag MUST be present, and the parser searches for the start tag's matching end tag.
      - If the matching end tag is found, an explicitly terminated element is created.
      - If no matching end tag is found, the source document is not valid XML and the incident is {@linkplain Source#getLogger() logged} as a missing required end tag. In this situation a single tag element is created.
- If the start tag has any {@linkplain StartTag#getStartTagType() type} other than {@link StartTagType#NORMAL}:
  - If the start tag's type does not define a {@linkplain StartTagType#getCorrespondingEndTagType() corresponding end tag type}, the parser does not conduct any search for an end tag and a single tag element is created.
  - If the start tag's type does define a {@linkplain StartTagType#getCorrespondingEndTagType() corresponding end tag type}, the parser assumes that a matching end tag is required and searches for it.
    - If the matching end tag is found, an explicitly terminated element is created.
    - If no matching end tag is found, the missing required end tag is {@linkplain Source#getLogger() logged} and a single tag element is created.
@see HTMLElements
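
The rule for HTML elements with optional end tags is the most involved of the parsing rules, so a rough model of it may help. The sketch below is plain Java written for illustration only and is not the library's implementation: `OptionalEndTagSketch`, `Tag`, `Rules`, `Result`, `resolve` and `names` are all hypothetical, the name sets passed in are assumptions rather than the actual lists returned by {@link HTMLElements}, and "the end of T2's corresponding element" is approximated by skipping to the matching end tag of the same name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class OptionalEndTagSketch {
  enum Structure { SINGLE_TAG, EXPLICITLY_TERMINATED, IMPLICITLY_TERMINATED }

  /** Hypothetical simplified tag: a name, a start/end flag and character offsets. */
  static final class Tag {
    final String name;
    final boolean endTag;
    final int begin;
    final int end;

    Tag(String name, boolean endTag, int begin, int end) {
      this.name = name;
      this.endTag = endTag;
      this.begin = begin;
      this.end = end;
    }
  }

  /** Hypothetical stand-in for the three per-element name lists used by the rules. */
  static final class Rules {
    final Set<String> nonterminatingElements;
    final Set<String> terminatingStartTags;
    final Set<String> terminatingEndTags;

    Rules(Set<String> nonterminating, Set<String> startTags, Set<String> endTags) {
      this.nonterminatingElements = nonterminating;
      this.terminatingStartTags = startTags;
      this.terminatingEndTags = endTags;
    }
  }

  /** The resulting structure of E1 and the offset at which E1 ends. */
  static final class Result {
    final Structure structure;
    final int end;

    Result(Structure structure, int end) {
      this.structure = structure;
      this.end = end;
    }
  }

  /** Walks the tags after ST1 in document order and applies the rules in the order given in the text. */
  static Result resolve(Tag st1, List<Tag> following, Rules rules, int sourceLength) {
    int i = 0;
    while (i < following.size()) {
      Tag t2 = following.get(i);
      if (!t2.endTag) {
        if (rules.nonterminatingElements.contains(t2.name)) {
          i = indexAfterElement(following, i); // continue after T2's element
          continue;
        }
        if (rules.terminatingStartTags.contains(t2.name)) {
          return implicitEnd(st1, t2);
        }
      } else {
        if (t2.name.equals(st1.name)) {
          return new Result(Structure.EXPLICITLY_TERMINATED, t2.end);
        }
        if (rules.terminatingEndTags.contains(t2.name)) {
          return implicitEnd(st1, t2);
        }
      }
      i++;
    }
    // No more tags: E1 runs to the end of the source.
    return new Result(Structure.IMPLICITLY_TERMINATED, sourceLength);
  }

  /** E1 ends at the beginning of T2; it is a single tag element if T2 directly follows ST1. */
  private static Result implicitEnd(Tag st1, Tag t2) {
    Structure structure = (t2.begin == st1.end)
        ? Structure.SINGLE_TAG
        : Structure.IMPLICITLY_TERMINATED;
    return new Result(structure, t2.begin);
  }

  /** Crude approximation: index just past the matching end tag of the same name. */
  private static int indexAfterElement(List<Tag> tags, int startIndex) {
    String name = tags.get(startIndex).name;
    int depth = 0;
    for (int i = startIndex; i < tags.size(); i++) {
      Tag t = tags.get(i);
      if (t.name.equals(name)) {
        depth += t.endTag ? -1 : 1;
        if (depth == 0) {
          return i + 1;
        }
      }
    }
    return tags.size();
  }

  /** Convenience helper for building a name set. */
  static Set<String> names(String... values) {
    return new HashSet<String>(Arrays.asList(values));
  }
}
```

Under these assumptions, resolving the first start tag of `<p>First<p>Second` with `p` among the terminating start tag names yields an implicitly terminated element ending at offset 8, the beginning of the second `<p>`. Treating `ul` as a non-terminating element for `li` is what keeps a nested list item from ending its parent item early.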

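    // Pair each <dt> label with the <dd> value at the same index.
    List<Element> labels = source.getAllElements(HTMLElementName.DT);
    List<Element> values = source.getAllElements(HTMLElementName.DD);
    // Only walk the shared prefix in case the two lists differ in length.
    int cellCount = Math.min(labels.size(), values.size());
    for (int i = 0; i < cellCount; i++) {
      // Take the label's visible text and drop a trailing colon, e.g. "Name:" becomes "Name".
      String label = labels.get(i).getTextExtractor().toString().trim().replaceAll(":$", "");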
      Element valueElement = values.get(i);
      log.debug("looking at value element: {}", valueElement);
      String value = getValueFieldText(valueElement);
      extractedFields.add(new ScrapedField(label, value));
    }
    return extractedFields;



  @Override
  public String performExtraction() {
    String attributeValue = "";
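    // Fetch the element list once rather than calling getAllElements() twice.
    List<Element> elements = getSource().getAllElements();
    if (!elements.isEmpty()) {
      // Only the first element is inspected; the result is null if it lacks the attribute.
      attributeValue = elements.get(0).getAttributeValue(attributeName);
    }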
    return attributeValue;
  }


  @Override
  public String performExtraction() {
    String extractedSource = "";
    if (tagOccurrence.getIdentifier() != null) {
      log.debug("about to splice: {}", tagOccurrence);
      Element element = ScraperUtil.extract(getSource(), tagOccurrence);
      if (element != null) {
        extractedSource = element.toString();
      }
      log.debug("spliced out: {}", extractedSource);
    } else {
      extractedSource = ScraperUtil.extract(getSource().toString(), tagOccurrence.getTag(),
          tagOccurrence.getOccurrence());


    log.debug("occurrence {} at {} to {}", new Object[] { occurrence, begin, length });
    return tags[occurrence].substring(begin, length);
  }


  public static Element extract(Source source, TagOccurrence tagOccurrence) {
    Element result = null;
    if (tagOccurrence.getElementIdentifierType() == ElementIdentifierType.cssClass) {
      List<Element> elements = source.getAllElementsByClass(tagOccurrence.getIdentifier());
      if (elements != null && !elements.isEmpty()) {
        result = elements.get(0);
      }
    } else if (tagOccurrence.getElementIdentifierType() == ElementIdentifierType.ID) {


    String result = null;
    Source source = new Source(html);
    source.fullSequentialParse();
    if (tagOccurrence.getElementIdentifierType() == ElementIdentifierType.ID) {
      log.debug("extracting tag by id: {}", tagOccurrence.getIdentifier());
      Element idElement = source.getElementById(tagOccurrence.getIdentifier());
      if (idElement != null) {
        result = idElement.toString();
      } else {
        result = "";
      }
    } else if (tagOccurrence.getElementIdentifierType() == ElementIdentifierType.cssClass) {
      log.debug("extracting: {}", tagOccurrence);


    List<Field> extractedFields = new ArrayList<Field>();


    for (DesignatedField designatedField : this.fieldsToGet) {
      log.debug("designated field: {}", designatedField);
      log.debug("tag to get value from: {}", designatedField.getTagToGetValueFrom());
      Element elementWithValue = ScraperUtil.extract(getSource(), designatedField.getTagToGetValueFrom());
      log.debug("element with value: {}", elementWithValue);
      String value = getValueFieldText(elementWithValue);
      log.debug("looking for field: {}, value: {}", designatedField.getLabel(), value);
      extractedFields.add(new ScrapedField(designatedField.getLabel(), value));
    }


    if (cellCount == (rowCount * 2)) {
      Field lastField = null;
      log.debug("cells.size: {}", cellCount);
      List<Element> cells = source.getAllElements(HTMLElementName.TD);
      for (int i = 0; i < cellCount; i++) {
        // Cells alternate label, value: ++i consumes the value cell of the current pair.
        Element labelElement = cells.get(i);
        Element valueElement = cells.get(++i);
        String label = labelElement.getTextExtractor().toString().trim().replaceAll(":$", "");
        String value = getValueFieldText(valueElement);
        log.debug("found field: {}={}", label, value);
        if (StringUtils.isEmpty(label) && lastField != null) {
          lastField.addValue(value);



  }


  private void removeInvalidFields(List<Element> fields) {
    java.util.Iterator<Element> iterator = fields.iterator();
    while (iterator.hasNext()) {
      Element field = iterator.next();
      if (!isAField(field.toString())) {
        log.debug("pruning invalid field: {}", field);
        iterator.remove();
      }
    }
  }


  @Override
  public List<Field> getFields() {
    List<Field> extractedFields = new ArrayList<Field>();


    for (DesignatedField designatedField : this.fieldsToGet) {
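      Element elementWithValue = ScraperUtil.extract(getSource(), designatedField.getTagToGetValueFrom());
      // ScraperUtil.extract can return null when nothing matches, so guard before dereferencing.
      // Falling back to an empty string is an assumption, not confirmed by the source.
      String value = (elementWithValue != null) ? elementWithValue.getTextExtractor().toString() : "";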
      log.debug("looking for field: {}, value: {}", designatedField.getLabel(), value);
      extractedFields.add(new ScrapedField(designatedField.getLabel(), value));
    }


    return extractedFields;



Related Classes of net.htmlparser.jericho.Element

br.com.caelum.tubaina.parser.html.referencereplacer.AbstractReferenceReplacer

br.com.caelum.tubaina.parser.html.referencereplacer.ChapterAndSectionReferenceReplacer

br.com.caelum.tubaina.parser.html.referencereplacer.CodeReferenceReplacer

br.com.caelum.tubaina.parser.html.referencereplacer.ImageReferenceReplacer

br.com.caelum.tubaina.parser.html.referencereplacer.SingleHtmlChapterReferenceReplacer

br.com.caelum.tubaina.parser.html.referencereplacer.SingleHtmlSectionReferenceReplacer

com.alee.extended.style.StyleEditor

com.ontometrics.scraper.extraction.AttributeExtractor

com.ontometrics.scraper.extraction.DefaultFieldExtractor

com.ontometrics.scraper.extraction.DefaultFieldExtractorTest
