Examples of TokenData


Examples of org.apache.stanbol.enhancer.engines.entitylinking.impl.ProcessingState.TokenData

    /**
     * Steps over the sentences, chunks and tokens of the {@link #sentences}
     */
    public void process() throws EntitySearcherException {
        //int debugedIndex = 0;
        while(state.next()) {
            TokenData token = state.getToken();
            if(log.isDebugEnabled()){
                log.debug("--- preocess Token {}: {} (lemma: {} | pos:{}) chunk: {}",
                    new Object[]{token.index,token.token.getSpan(),
                                 token.morpho != null ? token.morpho.getLemma() : "none",
                                 token.token.getAnnotations(POS_ANNOTATION),
                                 token.inChunk != null ?
                                         (token.inChunk.chunk + " "+ token.inChunk.chunk.getSpan()) :
                                             "none"});
            }
            List<String> searchStrings = new ArrayList<String>(linkerConfig.getMaxSearchTokens());
            searchStrings.add(token.getTokenText());
            //Determine the range we are allowed to search for tokens
            final int minIncludeIndex;
            final int maxIncludeIndex;
            //NOTE: testing has shown that using Chunks to restrict the search
            //      for additional matchable tokens has a negative impact on
            //      recall. Because of that, this restriction is deactivated for now.
            boolean restrictContextByChunks = false; //TODO: maybe make configurable
            if(token.inChunk != null && !textProcessingConfig.isIgnoreChunks() &&
                    restrictContextByChunks){
                minIncludeIndex = Math.max(
                    state.getConsumedIndex()+1,
                    token.inChunk.startToken);
                maxIncludeIndex = token.inChunk.endToken;
            } else {
                maxIncludeIndex = state.getTokens().size() - 1;
                minIncludeIndex = state.getConsumedIndex() + 1;
            }
            int prevIndex,pastIndex; //search away from the currently active token
            int distance = 0;
            do {
                distance++;
                prevIndex = token.index-distance;
                pastIndex = token.index+distance;
                if(minIncludeIndex <= prevIndex){
                    TokenData prevToken = state.getTokens().get(prevIndex);
                    if(log.isDebugEnabled()){
                        log.debug("    {} {}:'{}' (lemma: {} | pos:{})",new Object[]{
                            prevToken.isMatchable? '+':'-',prevToken.index,
                            prevToken.token.getSpan(),
                            prevToken.morpho != null ? prevToken.morpho.getLemma() : "none",
                            prevToken.token.getAnnotations(POS_ANNOTATION)
                        });
                    }
                    if(prevToken.isMatchable){
                        searchStrings.add(0,prevToken.getTokenText());
                    }
                }
                if(maxIncludeIndex >= pastIndex){
                    TokenData pastToken = state.getTokens().get(pastIndex);
                    if(log.isDebugEnabled()){
                        log.debug("    {} {}:'{}' (lemma: {} | pos:{})",new Object[]{
                            pastToken.isMatchable? '+':'-',pastToken.index,
                            pastToken.token.getSpan(),
                            pastToken.morpho != null ? pastToken.morpho.getLemma() : "none",
                            pastToken.token.getAnnotations(POS_ANNOTATION)
                        });
                    }
                    if(pastToken.isMatchable){
                        searchStrings.add(pastToken.getTokenText());
                    }
                }
            } while(searchStrings.size() < linkerConfig.getMaxSearchTokens() && distance <
                    linkerConfig.getMaxSearchDistance() &&
                    (prevIndex > minIncludeIndex || pastIndex < maxIncludeIndex));
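
The loop above walks outward from the active token, one position to the left and right per iteration, prepending matchable tokens found before it and appending those found after it, until enough search strings are collected, the maximum distance is reached, or both bounds are exhausted. Below is a minimal, self-contained sketch of the same windowing pattern; the Token class, its matchable flag and the limit parameters are simplified stand-ins for illustration, not the Stanbol API.

    import java.util.ArrayList;
    import java.util.List;

    public class ContextWindowDemo {

        /** Simplified stand-in for Stanbol's TokenData. */
        static class Token {
            final String text;
            final boolean matchable;
            Token(String text, boolean matchable) {
                this.text = text;
                this.matchable = matchable;
            }
        }

        /**
         * Collects search strings around the token at the given index,
         * stepping one position left and right per iteration, until the
         * maximum number of search tokens or the maximum distance is reached.
         */
        static List<String> collectSearchStrings(List<Token> tokens, int index,
                int maxSearchTokens, int maxDistance) {
            List<String> searchStrings = new ArrayList<>(maxSearchTokens);
            searchStrings.add(tokens.get(index).text);
            int min = 0;                 // in Stanbol: state.getConsumedIndex() + 1
            int max = tokens.size() - 1; // in Stanbol: optionally the chunk end
            int prev, past, distance = 0;
            do {
                distance++;
                prev = index - distance;
                past = index + distance;
                if (prev >= min && tokens.get(prev).matchable) {
                    searchStrings.add(0, tokens.get(prev).text); // prepend keeps text order
                }
                if (past <= max && tokens.get(past).matchable) {
                    searchStrings.add(tokens.get(past).text);
                }
            } while (searchStrings.size() < maxSearchTokens && distance < maxDistance
                    && (prev > min || past < max));
            return searchStrings;
        }

        public static void main(String[] args) {
            List<Token> tokens = List.of(
                new Token("the", false), new Token("University", true),
                new Token("of", false), new Token("Munich", true));
            // window around "of" (index 2) picks up both matchable neighbours
            System.out.println(collectSearchStrings(tokens, 2, 3, 2));
            // -> [University, of, Munich]
        }
    }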

Examples of org.apache.stanbol.enhancer.engines.entitylinking.impl.ProcessingState.TokenData

        int firstProcessableFoundIndex = -1;
        int lastFoundIndex = -1;
        int lastProcessableFoundIndex = -1;
        int firstFoundLabelIndex = -1;
        int lastFoundLabelIndex = -1;
        TokenData currentToken;
        String currentTokenText;
        int currentTokenLength;
        int notFound = 0;
        int matchedTokensNotWithinProcessableTokenSpan = 0;
        int foundTokensWithinCoveredProcessableTokens = 0;
        float minTokenMatchFactor = linkerConfig.getMinTokenMatchFactor();
        //search for matches in the correct order
        for(int currentIndex = state.getToken().index;
                currentIndex < state.getTokens().size()
                && search; currentIndex++){
            currentToken = state.getTokens().get(currentIndex);
            if(currentToken.hasAlphaNumeric){
                currentTokenText = currentToken.getTokenText();
                if(!linkerConfig.isCaseSensitiveMatching()){
                    currentTokenText = currentTokenText.toLowerCase();
                }
                currentTokenLength = currentTokenText.length();
                boolean found = false;
                float matchFactor = 0f;
                //iteration starts at the next token after the last matched one
                //so it is OK to skip tokens in the label, but not within the text
                for(int i = lastFoundLabelIndex+1; !found && i < labelTokens.length; i++){
                    String labelTokenText = labelTokens[i];
                    int labelTokenLength = labelTokenText.length();
                    float maxLength = currentTokenLength > labelTokenLength ? currentTokenLength : labelTokenLength;
                    float lengthDif = Math.abs(currentTokenLength - labelTokenLength);
                    if((lengthDif/maxLength)<=(1-minTokenMatchFactor)){ //this prevents unnecessary string comparison
                        int matchCount = compareTokens(currentTokenText, labelTokenText);
                        if(matchCount/maxLength >= minTokenMatchFactor){
                            lastFoundLabelIndex = i; //set the last found index to the current position
                            found = true; //set found to true -> stops iteration
                            matchFactor = matchCount/maxLength; //how good is the match
                            //remove matched labels from the set to disable them for
                            //a later random order search
                            labelTokenSet.remove(labelTokenText);
                        }
                    }
                }
                if(!found){
                    //search for a match in the wrong order
                    //currently only exact matches (for testing)
                    if(found = labelTokenSet.remove(currentTokenText)){
                        matchFactor = 0.7f;
                    }
                }
                //int found = text.indexOf(currentToken.getText().toLowerCase());
                if(found){ //found
                    if(currentToken.isMatchable){
                        foundProcessableTokens++; //only count processable Tokens
                        if(firstProcessableFoundIndex < 0){
                            firstProcessableFoundIndex = currentIndex;
                        }
                        lastProcessableFoundIndex = currentIndex;
                        foundTokensWithinCoveredProcessableTokens++;
                        if(matchedTokensNotWithinProcessableTokenSpan > 0){
                            foundTokensWithinCoveredProcessableTokens = foundTokensWithinCoveredProcessableTokens +
                                    matchedTokensNotWithinProcessableTokenSpan;
                            matchedTokensNotWithinProcessableTokenSpan = 0;
                        }
                    } else {
                        matchedTokensNotWithinProcessableTokenSpan++;
                    }
                    foundTokens++;
                    foundTokenMatch = foundTokenMatch + matchFactor; //sum up the matches
                    if(firstFoundIndex < 0){
                        firstFoundIndex = currentIndex;
                        firstFoundLabelIndex = lastFoundLabelIndex;
                    }
                    lastFoundIndex = currentIndex;
                } else { //not found
                    notFound++;
                    if(currentToken.isMatchable || notFound > linkerConfig.getMaxNotFound()){
                        //stop as soon as a token that needs to be processed is
                        //not found in the label, or more than the configured
                        //maximum number of not processable tokens were not found
                        search = false;
                    }
                }
            } // else tokens without alphanumeric characters are not processed
        }
        //search backwards for label tokens until firstFoundLabelIndex if there
        //are unconsumed Tokens in the sentence before state.getTokenIndex
        int currentIndex = state.getToken().index-1;
        int labelIndex = firstFoundLabelIndex-1;
        notFound = 0;
        matchedTokensNotWithinProcessableTokenSpan = 0;
        search = true;
        while(search && labelIndex >= 0 && currentIndex > state.getConsumedIndex()){
            String labelTokenText = labelTokens[labelIndex];
            if(labelTokenSet.contains(labelTokenText)){ //still not matched
                currentToken = state.getTokens().get(currentIndex);
                currentTokenText = currentToken.getTokenText();
                if(!linkerConfig.isCaseSensitiveMatching()){
                    currentTokenText = currentTokenText.toLowerCase();
                }
                currentTokenLength = currentTokenText.length();
                boolean found = false;
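
The length check before compareTokens in the snippet above is a cheap upper-bound test: the number of matching characters can never exceed the length of the shorter token, so matchCount/maxLength is at most 1 - lengthDif/maxLength, and token pairs whose lengths differ by more than (1 - minTokenMatchFactor) of the longer length can be skipped without a character comparison. A small sketch of that pre-filter, assuming compareTokens counts matching characters as the surrounding code suggests:

    public class TokenMatchFilterDemo {

        /**
         * Returns true if two tokens could still reach the required match
         * factor based on their lengths alone: matchCount is at most the
         * length of the shorter token, hence
         * matchCount/maxLength <= 1 - lengthDif/maxLength.
         */
        static boolean mayReachMatchFactor(String a, String b, float minTokenMatchFactor) {
            float maxLength = Math.max(a.length(), b.length());
            float lengthDif = Math.abs(a.length() - b.length());
            return (lengthDif / maxLength) <= (1 - minTokenMatchFactor);
        }

        public static void main(String[] args) {
            // with minTokenMatchFactor = 0.7 the lengths may differ by at most 30%
            System.out.println(mayReachMatchFactor("paris", "parish", 0.7f));     // true  (1/6 ~ 0.17)
            System.out.println(mayReachMatchFactor("paris", "parliament", 0.7f)); // false (5/10 = 0.5)
        }
    }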

Examples of org.apache.stanbol.enhancer.engines.entitylinking.impl.TokenData

    @Override
    public boolean incrementToken() throws IOException {
        if(input.incrementToken()){
            incrementCount++;
            boolean first = true;
            TokenData token;
            boolean lookup = false;
            int lastMatchable = -1;
            int lastIndex = -1;
            if(log.isTraceEnabled()){
              log.trace("> solr:[{},{}] {}",new Object[]{
                              offset.startOffset(), offset.endOffset(), termAtt});
            }
            while((token = nextToken(first)) != null){
              if(log.isTraceEnabled()) {
                  log.trace("  < [{},{}]:{} (link {}, match; {})",new Object[]{
                          token.token.getStart(), token.token.getEnd(),token.getTokenText(),
                          token.isLinkable, token.isMatchable});
              }
                first = false;
                if(token.isLinkable){
                    lookup = true;
                } else if (token.isMatchable){
                    lastMatchable = token.index;
                    lastIndex = lastMatchable;
                } //else if(token.hasAlphaNumeric){
                //    lastIndex = token.index;
                //}
            }
            //lookahead
            if(!lookup && lastIndex >= 0 && sectionData != null){
                List<TokenData> tokens = sectionData.getTokens();
                int maxLookahead = Math.max(lastIndex, lastMatchable+3);
                for(int i = lastIndex+1;!lookup && i < maxLookahead && i < tokens.size(); i++){
                    token = tokens.get(i);
                    if(token.isLinkable){
                        lookup = true;
                    } else if(token.isMatchable && (i+1) == maxLookahead){
                        maxLookahead++; //increase lookahead for matchable tokens
                    }
                }
            }
            this.taggable.setTaggable(lookup);
            if(lookup){
                if(log.isTraceEnabled()){
                    TokenData t = getToken();
                    log.trace("lookup: token [{},{}]: {} | word [{},{}]:{}", new Object[]{
                            offset.startOffset(), offset.endOffset(), termAtt,
                            t.token.getStart(), t.token.getEnd(),
                            t.getTokenText()});
                }
                lookupCount++;
            }
            return true;
        } else {
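
The lookahead above scans up to three tokens past the last matchable one and extends the window whenever another matchable token appears at its border, stopping as soon as a linkable token is found. A self-contained sketch of that pattern; the Token class is a simplified stand-in, not the Stanbol API:

    import java.util.List;

    public class LookaheadDemo {

        /** Simplified stand-in for Stanbol's TokenData. */
        static class Token {
            final boolean linkable;
            final boolean matchable;
            Token(boolean linkable, boolean matchable) {
                this.linkable = linkable;
                this.matchable = matchable;
            }
        }

        /**
         * Looks ahead up to three tokens past the last matchable one; a
         * matchable token found at the window border extends the window,
         * and the scan stops as soon as a linkable token is reached.
         */
        static boolean lookahead(List<Token> tokens, int lastIndex, int lastMatchable) {
            boolean lookup = false;
            int maxLookahead = Math.max(lastIndex, lastMatchable + 3);
            for (int i = lastIndex + 1; !lookup && i < maxLookahead && i < tokens.size(); i++) {
                Token token = tokens.get(i);
                if (token.linkable) {
                    lookup = true;
                } else if (token.matchable && (i + 1) == maxLookahead) {
                    maxLookahead++; // extend the lookahead for matchable tokens
                }
            }
            return lookup;
        }

        public static void main(String[] args) {
            // a matchable token at the border keeps the window open until
            // a linkable token is finally reached
            List<Token> tokens = List.of(
                new Token(false, true),  // index 0: last matchable token
                new Token(false, false),
                new Token(false, true),  // matchable at the border -> extends window
                new Token(true, false)); // linkable -> lookup
            System.out.println(lookahead(tokens, 0, 0)); // -> true
        }
    }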

Examples of org.apache.stanbol.enhancer.engines.entitylinking.impl.TokenData

        if(tokensCursor >= tokens.size()-1){
            if(!incrementTokenData()){ //adds a new token to the list
                return null; //EoF
            }
        }
        TokenData cursorToken = tokens.get(tokensCursor+1);
        if(cursorToken.token.getStart() < endOffset){
            tokensCursor++; //set the next token as current
            return cursorToken; //and return it
        } else {
            return null;
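
The snippet shows a lazy cursor: a new token is only materialized (via incrementTokenData) once the cursor reaches the end of the list, and the cursor only advances while the next token still starts before endOffset. A minimal sketch of the same pattern, with plain start offsets standing in for TokenData:

    import java.util.List;

    public class TokenCursorDemo {
        private final List<Integer> tokenStarts; // stand-in: start offsets only
        private int cursor = -1;

        TokenCursorDemo(List<Integer> tokenStarts) {
            this.tokenStarts = tokenStarts;
        }

        /**
         * Returns the start offset of the next token if it begins before
         * endOffset, advancing the cursor; otherwise returns null. (In
         * Stanbol, incrementTokenData() would lazily append a token here.)
         */
        Integer nextToken(int endOffset) {
            if (cursor >= tokenStarts.size() - 1) {
                return null; // EoF: no further token available
            }
            int start = tokenStarts.get(cursor + 1);
            if (start < endOffset) {
                cursor++;     // set the next token as current
                return start; // and return it
            }
            return null;      // next token starts outside the current section
        }

        public static void main(String[] args) {
            TokenCursorDemo demo = new TokenCursorDemo(List.of(0, 5, 12, 20));
            System.out.println(demo.nextToken(10)); // 0
            System.out.println(demo.nextToken(10)); // 5
            System.out.println(demo.nextToken(10)); // null (12 >= 10)
        }
    }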

Examples of org.apache.stanbol.enhancer.engines.entitylinking.impl.TokenData

                if(log.isTraceEnabled()){
                    CharSequence tagSequence = at.getText().subSequence(start, end);
                    log.trace(" > reduce tag {} - no overlapp with linkable token", tagSequence);
                }
            } else { //if the tag overlaps a linkable token
                TokenData linkableToken = linkableTokenContext.linkableToken;
                List<TokenData> tokens = linkableTokenContext.context;
                ChunkData cd = linkableToken.inChunk; //check if it matches > 50% of the chunk
                if(!lpc.isIgnoreChunks() && cd != null &&
                        cd.isProcessable){
                    int cstart = cd.getMatchableStartChar() >= 0 ? cd.getMatchableStartChar() :
                        start;
                    int cend = cd.getMatchableEndChar();
                    if(cstart < start || cend > end){ //if the tag does not cover the whole chunk
                        int num = 0;
                        int match = 0;
                        for(int i = cd.getMatchableStart(); i <= cd.getMatchableEnd(); i++){
                            TokenData td = tokens.get(i);
                            if(td.isMatchable){
                                num++;
                                if(match < 1 && td.token.getStart() >= start ||
                                        match > 0 && td.token.getEnd() <= end){
                                    match++;
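
The snippet is truncated before the final comparison, but the comment indicates the intent: count the chunk's matchable tokens and how many of them the tag span covers, then keep the tag only if it covers more than half of them. A hedged sketch of that test, simplified to full containment of each token span (the original distinguishes the first match, which only needs its start inside the tag, from later ones, which only need their end inside it):

    public class ChunkCoverageDemo {

        /**
         * Counts how many matchable token spans lie inside the tag span
         * [start,end] and requires more than half of them to be covered
         * (assumed "> 50%" criterion, per the comment in the snippet).
         */
        static boolean coversMajority(int[][] tokenSpans, int start, int end) {
            int num = tokenSpans.length; // matchable tokens in the chunk
            int match = 0;               // matchable tokens covered by the tag
            for (int[] span : tokenSpans) {
                if (span[0] >= start && span[1] <= end) {
                    match++;
                }
            }
            return num > 0 && match * 2 > num;
        }

        public static void main(String[] args) {
            int[][] chunkTokens = {{0, 4}, {5, 9}, {10, 17}};
            System.out.println(coversMajority(chunkTokens, 0, 9)); // 2 of 3 -> true
            System.out.println(coversMajority(chunkTokens, 0, 4)); // 1 of 3 -> false
        }
    }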