Examples of UnicodeSet

com.ibm.icu.text.UnicodeSet

cu-project.org/userguide/unicodeSet.html"> http://www.icu-project.org/userguide/unicodeSet.html. Actual determination of property data is defined by the underlying Unicode database as implemented by UCharacter.

Patterns specify individual characters, ranges of characters, and Unicode property sets. When elements are concatenated, they specify their union. To complement a set, place a '^' immediately after the opening '['. Property patterns are inverted by modifying their delimiters; "[:^foo]" and "\P{foo}". In any other location, '^' has no special meaning.

Ranges are indicated by placing two a '-' between two characters, as in "a-z". This specifies the range of all characters from the left to the right, in Unicode order. If the left character is greater than or equal to the right character it is a syntax error. If a '-' occurs as the first character after the opening '[' or '[^', or if it occurs as the last character before the closing ']', then it is taken as a literal. Thus "[a\\-b]", "[-ab]", and "[ab-]" all indicate the same set of three characters, 'a', 'b', and '-'.

Sets may be intersected using the '&' operator or the asymmetric set difference may be taken using the '-' operator, for example, "[[:L:]&[\\u0000-\\u0FFF]]" indicates the set of all Unicode letters with values less than 4096. Operators ('&' and '|') have equal precedence and bind left-to-right. Thus "[[:L:]-[a-z]-[\\u0100-\\u01FF]]" is equivalent to "[[[:L:]-[a-z]]-[\\u0100-\\u01FF]]". This only really matters for difference; intersection is commutative.

`[a]`	The set containing 'a'
`[a-z]`	The set containing 'a' through 'z' and all letters in between, in Unicode order
`[^a-z]`	The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF
`[[pat1][pat2]]`	The union of sets specified by pat1 and pat2
`[[pat1]&[pat2]]`	The intersection of sets specified by pat1 and pat2
`[[pat1]-[pat2]]`	The asymmetric difference of sets specified by pat1 and pat2
`[:Lu:] or \p{Lu}`	The set of characters having the specified Unicode property; in this case, Unicode uppercase letters
`[:^Lu:] or \P{Lu}`	The set of characters not having the given Unicode property

Warning: you cannot add an empty string ("") to a UnicodeSet.

Formal syntax

pattern := ('[' '^'? item* ']') | property

item := char | (char '-' char) | pattern-expr

pattern-expr := pattern | pattern-expr pattern | pattern-expr op pattern

op := '&' | '-'

special := '[' | ']' | '-'

char := any character that is notspecial | ('\\'any character) | ('\u' hex hex hex hex)

hex := any character for which Character.digit(c, 16) returns a non-negative result

property := a Unicode property set pattern

Legend:

a := b a may be replaced by b

a? zero or one instance of a

a* one or more instances of a

a | b either a or b

'a' the literal string between the quotes

To iterate over contents of UnicodeSet, use UnicodeSetIterator class. @author Alan Liu @stable ICU 2.0 @see UnicodeSetIterator

Examples of com.ibm.icu.text.UnicodeSet

  }


  // we have to carefully output the possibilities as compact utf-16
  // range expressions, or jflex will OOM!
  static void outputMacro(String name, String pattern) {
    UnicodeSet set = new UnicodeSet(pattern);
    set.removeAll(BMP);
    System.out.println(name + " = (");
    // if the set is empty, we have to do this or jflex will barf
    if (set.isEmpty()) {
      System.out.println("\t  []");
    }


    HashMap<Character,UnicodeSet> utf16ByLead = new HashMap<Character,UnicodeSet>();
    for (UnicodeSetIterator it = new UnicodeSetIterator(set); it.next();) {
      char utf16[] = Character.toChars(it.codepoint);
      UnicodeSet trails = utf16ByLead.get(utf16[0]);
      if (trails == null) {
        trails = new UnicodeSet();
        utf16ByLead.put(utf16[0], trails);
      }
      trails.add(utf16[1]);
    }
    
    Map<String,UnicodeSet> utf16ByTrail = new HashMap<String,UnicodeSet>();
    for (Map.Entry<Character,UnicodeSet> entry : utf16ByLead.entrySet()) {
      String trail = entry.getValue().getRegexEquivalent();
      UnicodeSet leads = utf16ByTrail.get(trail);
      if (leads == null) {
        leads = new UnicodeSet();
        utf16ByTrail.put(trail, leads);
      }
      leads.add(entry.getKey());
    }


    boolean isFirst = true;
    for (Map.Entry<String,UnicodeSet> entry : utf16ByTrail.entrySet()) {
      System.out.print( isFirst ? "\t  " : "\t| ");

View Full Code Here

Examples of com.ibm.icu.text.UnicodeSet

  }
  
  // we have to carefully output the possibilities as compact utf-16
  // range expressions, or jflex will OOM!
  static void outputMacro(String name, String pattern) {
    UnicodeSet set = new UnicodeSet(pattern);
    set.removeAll(BMP);
    System.out.println(name + " = (");
    // if the set is empty, we have to do this or jflex will barf
    if (set.isEmpty()) {
      System.out.println("\t  []");
    }
    
    HashMap<Character,UnicodeSet> utf16ByLead = new HashMap<Character,UnicodeSet>();
    for (UnicodeSetIterator it = new UnicodeSetIterator(set); it.next();) {    
      char utf16[] = Character.toChars(it.codepoint);
      UnicodeSet trails = utf16ByLead.get(utf16[0]);
      if (trails == null) {
        trails = new UnicodeSet();
        utf16ByLead.put(utf16[0], trails);
      }
      trails.add(utf16[1]);
    }
    
    boolean isFirst = true;
    for (Character c : utf16ByLead.keySet()) {
      UnicodeSet trail = utf16ByLead.get(c);
      System.out.print( isFirst ? "\t  " : "\t| ");
      isFirst = false;
      System.out.println("([\\u" + Integer.toHexString(c) + "]" + trail.getRegexEquivalent() + ")");
    }
    System.out.println(")");
  }

View Full Code Here

Examples of com.ibm.icu.text.UnicodeSet

    /* Initialize the two UnicodeSets use for proper Gurmukhi conversion if they have not already been created. */
    private void initializePNJSets() {
        if (PNJ_BINDI_TIPPI_SET != null && PNJ_CONSONANT_SET != null) {
            return;
        }
        PNJ_BINDI_TIPPI_SET = new UnicodeSet();
        PNJ_CONSONANT_SET = new UnicodeSet();
        
        PNJ_CONSONANT_SET.add(0x0a15, 0x0a28);
        PNJ_CONSONANT_SET.add(0x0a2a, 0x0a30);
        PNJ_CONSONANT_SET.add(0x0a35, 0x0a36);
        PNJ_CONSONANT_SET.add(0x0a38, 0x0a39);

View Full Code Here

Examples of com.ibm.icu.text.UnicodeSet

                    PropsVectors.ERROR_VALUE_CP, col, ~0, ~0);
        }


        for (int i = 0; i < encodings.length; ++i) {
            Charset testCharset = CharsetICU.forNameICU(encodings[i]);
            UnicodeSet unicodePointSet = new UnicodeSet(); // empty set
            ((CharsetICU) testCharset).getUnicodeSet(unicodePointSet,
                    mappingTypes);
            int column = i / 32;
            int mask = 1 << (i % 32);
            // now iterate over intervals on set i
            int itemCount = unicodePointSet.getRangeCount();
            for (int j = 0; j < itemCount; ++j) {
                int startChar = unicodePointSet.getRangeStart(j);
                int endChar = unicodePointSet.getRangeEnd(j);
                pvec.setValue(startChar, endChar, column, ~0, mask);
            }
        }


        // handle excluded encodings

View Full Code Here

Examples of com.ibm.icu.text.UnicodeSet

        }


        // generate a list of all caseless characters -- characters whose
        // case closure is themselves.


        UnicodeSet caseless = new UnicodeSet();


        for (int i = 0; i <= 0x10FFFF; ++i) {
            String cp = UTF16.valueOf(i);
            ci.reset(cp);
            int count = 0;
            String fold = null;
            for (String temp = ci.next(); temp != null; temp = ci.next()) {
                fold = temp;
                if (++count > 1) break;
            }
            if (count==1 && fold.equals(cp)) {
                caseless.add(i);
            }
        }


        System.out.println("caseless = " + caseless.toPattern(true));


        UnicodeSet not_lc = new UnicodeSet("[:^lc:]");
        
        UnicodeSet a = new UnicodeSet();
        a.set(not_lc);
        a.removeAll(caseless);
        System.out.println("[:^lc:] - caseless = " + a.toPattern(true));


        a.set(caseless);
        a.removeAll(not_lc);
        System.out.println("caseless - [:^lc:] = " + a.toPattern(true));
    }

View Full Code Here

Examples of com.ibm.icu.text.UnicodeSet

        String[] array = new String[]{"a", "b", "c", "{de}"};
        List list = Arrays.asList(array);
        Set aset = new HashSet(list);
        logln(" *** The source set's size is: " + aset.size());
    //The size reads 4
        UnicodeSet set = new UnicodeSet();
        set.clear();
        set.addAll(aset);
        logln(" *** After addAll, the UnicodeSet size is: " + set.size());
    //The size should also read 4, but 0 is seen instead


    }

View Full Code Here

Examples of com.ibm.icu.text.UnicodeSet

     * @param value
     * @param result
     * @return result
     */
    public UnicodeSet getSet(Object value, UnicodeSet result) {
        if (result == null) result = new UnicodeSet();
        for (int i = 0; i < length - 1; ++i) {
            if (areEqual(value, values[i])) {
                result.add(transitions[i], transitions[i+1]-1);
            } 
        }

View Full Code Here

Examples of com.ibm.icu.text.UnicodeSet

            }
        } else {
            Set set = (Set) getAvailableValues(new TreeSet(collected));
            for (Iterator it = set.iterator(); it.hasNext();) {
                Object value = it.next();
                UnicodeSet s = getSet(value);
                result.append(value)
                .append("\t=> ")
                .append(s.toPattern(true))
                .append("\r\n");
            }
        }
        return result.toString();
    }

View Full Code Here

Examples of com.ibm.icu.text.UnicodeSet


    public void TestTitleRegression() throws java.io.IOException {
        UCaseProps props = new UCaseProps();
        int type = props.getTypeOrIgnorable('\'');
        assertEquals("Case Ignorable check", -1, type); // should be case-ignorable (-1)
        UnicodeSet allCaseIgnorables = new UnicodeSet();
        for (int cp = 0; cp <= 0x10FFFF; ++cp) {
            if (props.getTypeOrIgnorable(cp) < 0) {
                allCaseIgnorables.add(cp);
            }
        }
        logln(allCaseIgnorables.toString());
        assertEquals("Titlecase check",
                "The Quick Brown Fox Can't Jump Over The Lazy Dogs.",
                UCharacter.toTitleCase(ULocale.ENGLISH, "THE QUICK BROWN FOX CAN'T JUMP OVER THE LAZY DOGS.", null));
    }

View Full Code Here

Examples of com.ibm.icu.text.UnicodeSet

                if(locale.toString().indexOf(("in"))<0){
                    errln("UScript.getCode returned null for locale: " + locale); 
                }
                continue;
            }
            UnicodeSet exemplarSets[] = new UnicodeSet[2];
            for (int k=0; k<2; ++k) {   // for casing option in (normal, caseInsensitive)
                int option = (k==0) ? 0 : UnicodeSet.CASE;
                UnicodeSet exemplarSet = LocaleData.getExemplarSet(locale, option);
                exemplarSets[k] = exemplarSet;
                ExemplarGroup exGrp = new ExemplarGroup(exemplarSet, scriptCodes);
                if (!testedExemplars.contains(exGrp)) {
                    testedExemplars.add(exGrp);
                    UnicodeSet[] sets = new UnicodeSet[scriptCodes.length];
                    // create the UnicodeSets for the script
                    for(int j=0; j < scriptCodes.length; j++){
                        sets[j] = new UnicodeSet("[:" + UScript.getShortName(scriptCodes[j]) + ":]");
                    }
                    boolean existsInScript = false;
                    UnicodeSetIterator iter = new UnicodeSetIterator(exemplarSet);
                    // iterate over the 
                    while (!existsInScript && iter.nextRange()) {

View Full Code Here

0 1 2 3 4 5

TOP

All source code are property of their respective owners. Java is a trademark of Sun Microsystems, Inc and owned by ORACLE Inc. Contact coftware#gmail.com.

`pattern :=`	`('[' '^'? item* ']') \| property`
`item :=`	`char \| (char '-' char) \| pattern-expr`
`pattern-expr :=`	`pattern \| pattern-expr pattern \| pattern-expr op pattern`
`op :=`	`'&' \| '-'`
`special :=`	`'[' \| ']' \| '-'`
`char :=`	any character that is not`special \| ('\\'`any character`) \| ('\u' hex hex hex hex)`
`hex :=`	any character for which `Character.digit(c, 16)` returns a non-negative result
`property :=`	a Unicode property set pattern