Examples of UnicodeSet

com.ibm.icu.text.UnicodeSet

See http://www.icu-project.org/userguide/unicodeSet.html. Actual determination of property data is defined by the underlying Unicode database as implemented by UCharacter.

Patterns specify individual characters, ranges of characters, and Unicode property sets. When elements are concatenated, they specify their union. To complement a set, place a '^' immediately after the opening '['. Property patterns are inverted by modifying their delimiters: "[:^foo:]" and "\P{foo}". In any other location, '^' has no special meaning.

Ranges are indicated by placing a '-' between two characters, as in "a-z". This specifies the range of all characters from the left to the right, in Unicode order. If the left character is greater than or equal to the right character, it is a syntax error. If a '-' occurs as the first character after the opening '[' or '[^', or if it occurs as the last character before the closing ']', then it is taken as a literal. Thus "[a\\-b]", "[-ab]", and "[ab-]" all indicate the same set of three characters, 'a', 'b', and '-'.

Sets may be intersected using the '&' operator, and the asymmetric set difference may be taken using the '-' operator. For example, "[[:L:]&[\\u0000-\\u0FFF]]" indicates the set of all Unicode letters with values less than 4096. The operators ('&' and '-') have equal precedence and bind left-to-right. Thus "[[:L:]-[a-z]-[\\u0100-\\u01FF]]" is equivalent to "[[[:L:]-[a-z]]-[\\u0100-\\u01FF]]". The grouping only really matters for difference; intersection is commutative.
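
To see why grouping matters for difference, here is a minimal Java sketch that models the sets with java.util.TreeSet instead of UnicodeSet. The class name, the helper names, and the small letter ranges are assumptions made for illustration; they stand in for the property sets used in the text.

```java
import java.util.TreeSet;

class OperatorOrderSketch {
    // Builds the set of characters from lo to hi, inclusive.
    static TreeSet<Character> range(char lo, char hi) {
        TreeSet<Character> s = new TreeSet<>();
        for (char c = lo; c <= hi; c++) {
            s.add(c);
        }
        return s;
    }

    // Asymmetric difference: the elements of a that are not in b.
    static TreeSet<Character> minus(TreeSet<Character> a, TreeSet<Character> b) {
        TreeSet<Character> r = new TreeSet<>(a);
        r.removeAll(b);
        return r;
    }

    public static void main(String[] args) {
        TreeSet<Character> x = range('a', 'j');
        TreeSet<Character> y = range('c', 'f');
        TreeSet<Character> z = range('e', 'h');

        // Left-to-right grouping, as a pattern like "[[a-j]-[c-f]-[e-h]]" is read.
        System.out.println(minus(minus(x, y), z)); // [a, b, i, j]

        // Grouping from the right instead yields a different set.
        System.out.println(minus(x, minus(y, z))); // [a, b, e, f, g, h, i, j]
    }
}
```

Swapping `minus` for an intersection helper would give the same result under either grouping, which is why the binding order is only visible with '-'.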

`[a]`	The set containing 'a'
`[a-z]`	The set containing 'a' through 'z' and all letters in between, in Unicode order
`[^a-z]`	The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF
`[[pat1][pat2]]`	The union of sets specified by pat1 and pat2
`[[pat1]&[pat2]]`	The intersection of sets specified by pat1 and pat2
`[[pat1]-[pat2]]`	The asymmetric difference of sets specified by pat1 and pat2
`[:Lu:]` or `\p{Lu}`	The set of characters having the specified Unicode property; in this case, Unicode uppercase letters
`[:^Lu:]` or `\P{Lu}`	The set of characters not having the given Unicode property
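
The `[^a-z]` row can be checked with a small model. The sketch below uses java.util.BitSet over the code point range U+0000 through U+10FFFF. It is not how UnicodeSet is implemented, and the class and method names are hypothetical.

```java
import java.util.BitSet;

class ComplementSketch {
    // One past the last Unicode code point, i.e. 0x110000.
    static final int CODE_POINT_LIMIT = Character.MAX_CODE_POINT + 1;

    // Models "[^lo-hi]": every code point except lo through hi.
    static BitSet complementOfRange(int lo, int hi) {
        BitSet set = new BitSet(CODE_POINT_LIMIT);
        set.set(lo, hi + 1);           // the range lo..hi, inclusive
        set.flip(0, CODE_POINT_LIMIT); // complement over U+0000..U+10FFFF
        return set;
    }

    public static void main(String[] args) {
        BitSet notLower = complementOfRange('a', 'z');
        System.out.println(notLower.get('a' - 1));  // true
        System.out.println(notLower.get('m'));      // false
        System.out.println(notLower.get('z' + 1));  // true
        System.out.println(notLower.get(0x10FFFF)); // true
        System.out.println(notLower.cardinality()); // 1114086
    }
}
```

The final count is 0x110000 code points minus the 26 letters in the range.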

Warning: you cannot add an empty string ("") to a UnicodeSet.

Formal syntax

pattern := ('[' '^'? item* ']') | property

item := char | (char '-' char) | pattern-expr

pattern-expr := pattern | pattern-expr pattern | pattern-expr op pattern

op := '&' | '-'

special := '[' | ']' | '-'

char := any character that is not special | ('\\' any character) | ('\u' hex hex hex hex)

hex := any character for which Character.digit(c, 16) returns a non-negative result

property := a Unicode property set pattern

Legend:

`a := b`	`a` may be replaced by `b`
`a?`	zero or one instance of `a`
`a*`	zero or more instances of `a`
`a | b`	either `a` or `b`
`'a'`	the literal string between the quotes
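
The grammar above can be followed with a small recursive-descent parser. This is a sketch under stated assumptions, not ICU's parser: it uses java.util.BitSet, it skips property patterns, and it only handles code points that fit in one Java char. It treats '&' or '-' as an operator only when a nested '[' follows, and it applies the first-or-last literal '-' rule from the text. The class name MiniSetParser and its methods are hypothetical.

```java
import java.util.BitSet;

class MiniSetParser {
    private final String src;
    private int pos = 0;

    private MiniSetParser(String src) { this.src = src; }

    // Parses a whole pattern such as "[[a-z]&[aeiou]]" into a set of code points.
    static BitSet parse(String text) {
        MiniSetParser p = new MiniSetParser(text);
        BitSet set = p.pattern();
        if (p.pos != text.length()) {
            throw new IllegalArgumentException("unexpected text at " + p.pos);
        }
        return set;
    }

    // Returns the char at pos + offset, or -1 past the end of the input.
    private int peek(int offset) {
        int i = pos + offset;
        return i < src.length() ? src.charAt(i) : -1;
    }

    // pattern := '[' '^'? item* ']'   (property patterns are not handled here)
    private BitSet pattern() {
        if (peek(0) != '[') throw new IllegalArgumentException("expected '[' at " + pos);
        pos++;
        boolean negate = peek(0) == '^';
        if (negate) pos++;
        BitSet set = new BitSet();
        while (peek(0) != ']') {
            int c = peek(0);
            if (c == -1) {
                throw new IllegalArgumentException("missing ']'");
            } else if (c == '[') {
                set.or(pattern()); // concatenation means union
            } else if ((c == '&' || c == '-') && peek(1) == '[') {
                pos++;
                BitSet rhs = pattern();
                // The operator applies to everything accumulated so far,
                // which gives the left-to-right binding.
                if (c == '&') set.and(rhs); else set.andNot(rhs);
            } else {
                int lo = character();
                int next = peek(1);
                // A '-' just before ']' (or at the end of input) is a literal, not a range.
                boolean isRange = peek(0) == '-' && next != -1 && next != ']' && next != '[';
                if (isRange) {
                    pos++; // skip '-'
                    int hi = character();
                    if (lo >= hi) throw new IllegalArgumentException("bad range before " + pos);
                    set.set(lo, hi + 1);
                } else {
                    set.set(lo);
                }
            }
        }
        pos++; // consume ']'
        if (negate) set.flip(0, Character.MAX_CODE_POINT + 1);
        return set;
    }

    // char := a plain character, a backslash-escaped character,
    //         or backslash, 'u', and four hex digits
    private int character() {
        int c = peek(0);
        pos++;
        if (c != '\\') return c;
        c = peek(0);
        if (c == -1) throw new IllegalArgumentException("dangling backslash");
        pos++;
        if (c != 'u') return c;
        int value = 0;
        for (int i = 0; i < 4; i++) {
            int h = peek(0);
            int d = (h == -1) ? -1 : Character.digit(h, 16);
            if (d < 0) throw new IllegalArgumentException("bad hex digit at " + pos);
            value = value * 16 + d;
            pos++;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parse("[a\\-b]"));                      // {45, 97, 98}
        System.out.println(parse("[-ab]").equals(parse("[ab-]"))); // true
        System.out.println(parse("[[a-j]-[c-f]-[e-h]]"));          // {97, 98, 105, 106}
    }
}
```

BitSet prints code point values, so 45 is '-', 97 is 'a', and so on. A reversed range such as "[z-a]" is rejected with an IllegalArgumentException, matching the syntax-error rule for ranges.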

To iterate over the contents of a UnicodeSet, use the UnicodeSetIterator class.

@author Alan Liu
@stable ICU 2.0
@see UnicodeSetIterator

Examples of com.ibm.icu.text.UnicodeSet

                if(locale.toString().indexOf(("in"))<0){
                    errln("UScript.getCode returned null for locale: "+ locale); 
                }
                continue;
            }
            UnicodeSet exemplarSets[] = new UnicodeSet[4];


            for (int k=0; k<2; ++k) {  // for casing option in (normal, uncased)
                int option = (k==0) ? 0 : UnicodeSet.CASE;
                for(int h=0; h<2; ++h){  
                    int type = (h==0) ? LocaleData.ES_STANDARD : LocaleData.ES_AUXILIARY;


                    UnicodeSet exemplarSet = ld.getExemplarSet(option, type);
                    exemplarSets[k*2+h] = exemplarSet;


                    ExemplarGroup exGrp = new ExemplarGroup(exemplarSet, scriptCodes);
                    if (!testedExemplars.contains(exGrp)) {
                        testedExemplars.add(exGrp);
                        UnicodeSet[] sets = new UnicodeSet[scriptCodes.length];
                        // create the UnicodeSets for the script
                        for(int j=0; j < scriptCodes.length; j++){
                            sets[j] = new UnicodeSet("[:" + UScript.getShortName(scriptCodes[j]) + ":]");
                        }
                        boolean existsInScript = false;
                        UnicodeSetIterator iter = new UnicodeSetIterator(exemplarSet);
                        // iterate over the 
                        while (!existsInScript && iter.nextRange()) {


Examples of com.ibm.icu.text.UnicodeSet

    
    public RandomCollator() {
        
    }
    protected void init()throws Exception{
        init(1,10, new UnicodeSet("[AZa-z<\\&\\[\\]]"));
    }


Examples of com.ibm.icu.text.UnicodeSet

        }
        return other.equals(UTF16.valueOf(codepoint));
    }
    
    public UnicodeSet getPropertySet(boolean charEqualsValue, UnicodeSet result){
        if (result == null) result = new UnicodeSet();
        matchIterator.reset();
        while (matchIterator.next()) {
            String value = filter.remap(getPropertyValue(matchIterator.codepoint));
            if (equals(matchIterator.codepoint, value) == charEqualsValue) {
                result.add(matchIterator.codepoint);


Examples of com.ibm.icu.text.UnicodeSet

        }
        return result;
    }


    public UnicodeSet getPropertySet(String propertyValue, UnicodeSet result){
        if (result == null) result = new UnicodeSet();
        matchIterator.reset();
        while (matchIterator.next()) {
            String value = filter.remap(getPropertyValue(matchIterator.codepoint));
            if (propertyValue.equals(value)) {
                result.add(matchIterator.codepoint);


Examples of com.ibm.icu.text.UnicodeSet

        }
        return result;
    }


    public UnicodeSet getPropertySet(Matcher matcher, UnicodeSet result) {
        if (result == null) result = new UnicodeSet();
        matchIterator.reset();
        while (matchIterator.next()) {
            String value = filter.remap(getPropertyValue(matchIterator.codepoint));
            if (value == null)
                continue;


Examples of com.ibm.icu.text.UnicodeSet

            }
        }
    }
    
    public UnicodeSet getMatchSet(UnicodeSet result) {
        if (result == null) result = new UnicodeSet();
        addAll(matchIterator, result);
        return result;
    }


Examples of com.ibm.icu.text.UnicodeSet

        output.println("};\n");
    }
    
    public void writeMirroredDataFile(String filename)
    {
        UnicodeSet mirrored = new UnicodeSet("[\\p{Bidi_Mirrored}]");
        int count = mirrored.size();
        int[] chars   = new int[count];
        int[] mirrors = new int[count];
        int total = 0;
        
        System.out.println("There are " + count + " mirrored characters.");
        
        for(int i = 0; i < count; i += 1) {
            int ch = mirrored.charAt(i);
            int m  = UCharacter.getMirror(ch);
            
            if (ch != m) {
                chars[total] = ch & 0xFFFF;
                mirrors[total++] = m & 0xFFFF;


Examples of com.ibm.icu.text.UnicodeSet

    
    private static void buildArabicTables(ScriptList scriptList, FeatureList featureList,
                                                LookupList lookupList, ClassTable classTable) {
        // TODO: Might want to have the ligature table builder explicitly check for ligatures
        // which start with space and tatweel rather than pulling them out here...
        UnicodeSet arabicBlock   = new UnicodeSet("[[\\p{block=Arabic}] & [[:Cf:][:Po:][:So:][:Mn:][:Nd:][:Lm:]]]");
        UnicodeSet oddLigatures  = new UnicodeSet("[\\uFC5E-\\uFC63\\uFCF2-\\uFCF4\\uFE70-\\uFE7F]");
        UnicodeSet arabicLetters = new UnicodeSet("[\\p{Arabic}]");
        ArabicCharacterData arabicData = ArabicCharacterData.factory(arabicLetters.addAll(arabicBlock).removeAll(oddLigatures));


        addArabicGlyphClasses(arabicData, classTable);
        
        ClassTable initClassTable = new ClassTable();
        ClassTable mediClassTable = new ClassTable();


Examples of com.ibm.icu.text.UnicodeSet

     * Hebrew mark order taken from the SBL Hebrew Font manual
     * Arabic mark order per Thomas Milo: hamza < shadda < combining_alef < sukun, vowel_marks < madda < qur'anic_marks
     */
    public static ClassTable buildCombiningClassTable()
    {
        UnicodeSet markSet = new UnicodeSet("[\\P{CanonicalCombiningClass=0}]");
        ClassTable exceptions = new ClassTable();
        ClassTable combiningClasses = new ClassTable();
        int markCount = markSet.size();
        
        exceptions.addMapping(0x05C1,  10); // Point Shin Dot
        exceptions.addMapping(0x05C2,  11); // Point Sin Dot
        exceptions.addMapping(0x05BC,  21); // Point Dagesh or Mapiq
        exceptions.addMapping(0x05BF,  23); // Point Rafe
        exceptions.addMapping(0x05B9,  27); // Point Holam
        exceptions.addMapping(0x0323, 220); // Comb. Dot Below (low punctum)
        exceptions.addMapping(0x0591, 220); // Accent Etnahta
        exceptions.addMapping(0x0596, 220); // Accent Tipeha
        exceptions.addMapping(0x059B, 220); // Accent Tevir
        exceptions.addMapping(0x05A3, 220); // Accent Munah
        exceptions.addMapping(0x05A4, 220); // Accent Mahapakh
        exceptions.addMapping(0x05A5, 220); // Accent Merkha
        exceptions.addMapping(0x05A6, 220); // Accent Merkha Kefula
        exceptions.addMapping(0x05A7, 220); // Accent Darga
        exceptions.addMapping(0x05AA, 220); // Accent Yerah Ben Yomo
        exceptions.addMapping(0x05B0, 220); // Point Sheva
        exceptions.addMapping(0x05B1, 220); // Point Hataf Segol
        exceptions.addMapping(0x05B2, 220); // Point Hataf Patah
        exceptions.addMapping(0x05B3, 220); // Point Hataf Qamats
        exceptions.addMapping(0x05B4, 220); // Point Hiriq
        exceptions.addMapping(0x05B5, 220); // Point Tsere
        exceptions.addMapping(0x05B6, 220); // Point Segol
        exceptions.addMapping(0x05B7, 220); // Point Patah
        exceptions.addMapping(0x05B8, 220); // Point Qamats
        exceptions.addMapping(0x05BB, 220); // Point Qubuts
        exceptions.addMapping(0x05BD, 220); // Point Meteg
        exceptions.addMapping(0x059A, 222); // Accent Yetiv
        exceptions.addMapping(0x05AD, 222); // Accent Dehi
        exceptions.addMapping(0x05C4, 230); // Mark Upper Dot (high punctum)
        exceptions.addMapping(0x0593, 230); // Accent Shalshelet
        exceptions.addMapping(0x0594, 230); // Accent Zaqef Qatan
        exceptions.addMapping(0x0595, 230); // Accent Zaqef Gadol
        exceptions.addMapping(0x0597, 230); // Accent Revia
        exceptions.addMapping(0x0598, 230); // Accent Zarqa
        exceptions.addMapping(0x059F, 230); // Accent Qarney Para
        exceptions.addMapping(0x059E, 230); // Accent Gershayim
        exceptions.addMapping(0x059D, 230); // Accent Geresh Muqdam
        exceptions.addMapping(0x059C, 230); // Accent Geresh
        exceptions.addMapping(0x0592, 230); // Accent Segolta
        exceptions.addMapping(0x05A0, 230); // Accent Telisha Gedola
        exceptions.addMapping(0x05AC, 230); // Accent Iluy
        exceptions.addMapping(0x05A8, 230); // Accent Qadma
        exceptions.addMapping(0x05AB, 230); // Accent Ole
        exceptions.addMapping(0x05AF, 230); // Mark Masora Circle
        exceptions.addMapping(0x05A1, 230); // Accent Pazer
      //exceptions.addMapping(0x0307, 230); // Mark Number/Masora Dot
        exceptions.addMapping(0x05AE, 232); // Accent Zinor
        exceptions.addMapping(0x05A9, 232); // Accent Telisha Qetana
        exceptions.addMapping(0x0599, 232); // Accent Pashta
        
        exceptions.addMapping(0x0655,  27); // ARABIC HAMZA BELOW
        exceptions.addMapping(0x0654,  27); // ARABIC HAMZA ABOVE


        exceptions.addMapping(0x0651,  28); // ARABIC SHADDA


        exceptions.addMapping(0x0656,  29); // ARABIC SUBSCRIPT ALEF
        exceptions.addMapping(0x0670,  29); // ARABIC LETTER SUPERSCRIPT ALEF


        exceptions.addMapping(0x064D,  30); // ARABIC KASRATAN
        exceptions.addMapping(0x0650,  30); // ARABIC KASRA


        exceptions.addMapping(0x0652,  31); // ARABIC SUKUN
        exceptions.addMapping(0x06E1,  31); // ARABIC SMALL HIGH DOTLESS HEAD OF KHAH


        exceptions.addMapping(0x064B,  31); // ARABIC FATHATAN
        exceptions.addMapping(0x064C,  31); // ARABIC DAMMATAN
        exceptions.addMapping(0x064E,  31); // ARABIC FATHA
        exceptions.addMapping(0x064F,  31); // ARABIC DAMMA
        exceptions.addMapping(0x0657,  31); // ARABIC INVERTED DAMMA
        exceptions.addMapping(0x0658,  31); // ARABIC MARK NOON GHUNNA


        exceptions.addMapping(0x0653,  32); // ARABIC MADDAH ABOVE
        
        exceptions.snapshot();
        
        for (int i = 0; i < markCount; i += 1) {
            int mark = markSet.charAt(i);
            int markClass = exceptions.getGlyphClassID(mark);
            
            if (markClass == 0) {
                markClass = UCharacter.getCombiningClass(mark);
            }


Examples of com.ibm.icu.text.UnicodeSet

    
    public static void buildDecompTables(String fileName)
    {
        // F900 - FAFF are compatibility ideographs. They all decompose to a single other character, and can be ignored.
      //UnicodeSet decompSet = new UnicodeSet("[[[\\P{Hangul}] & [\\p{DecompositionType=Canonical}]] - [\uF900-\uFAFF]]");
        UnicodeSet decompSet = new UnicodeSet("[[\\p{DecompositionType=Canonical}] & [\\P{FullCompositionExclusion}] & [\\P{Hangul}]]");
        CanonicalCharacterData data = CanonicalCharacterData.factory(decompSet);
        ClassTable classTable = new ClassTable();
        
        LookupList  lookupList  = new LookupList();
        FeatureList featureList = new FeatureList();


