There are a few small incompatibilities between the two implementations. Java-specific constructs in the regular expression syntax (e.g. [a-z&&[^bc]], (?<=foo), \A, \Q) work only in the pure Java implementation, not the GWT implementation, and neither implementation rejects them. Also, the JavaScript-specific constructs $` and $' in the replacement expression work only in the GWT implementation; the pure Java implementation rejects them.
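The Java-side behaviour described above can be observed with a short sketch. It assumes that java.util.regex is a fair stand-in for the "pure Java implementation" (the text does not say what that implementation is built on), and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class JavaOnlyConstructsDemo {
    // Character-class intersection: any of a-z except b and c.
    static String intersection(String input) {
        return input.replaceAll("[a-z&&[^bc]]", "X");
    }

    // Lookbehind: replace "bar" only where it is preceded by "foo".
    static String lookbehind(String input) {
        return input.replaceAll("(?<=foo)bar", "!");
    }

    // \Q...\E quoting: the dot between a and b is matched literally.
    static boolean quoted(String input) {
        return Pattern.matches("\\Qa.b\\E", input);
    }

    // True if java.util.regex refuses the given replacement expression.
    static boolean rejectsReplacement(String replacement) {
        try {
            "abc".replaceAll("b", replacement);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(intersection("abcd"));
        System.out.println(lookbehind("foobar barfoo"));
        System.out.println(quoted("a.b") + " " + quoted("axb"));
        System.out.println(rejectsReplacement("$`") + " " + rejectsReplacement("$'"));
    }
}
```

In java.util.regex a `$` in the replacement must be followed by a group number or a `{name}` reference, which is why the JavaScript-style `$`` and `$'` are refused with an IllegalArgumentException.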
Automaton. Regular expressions are built from the following abstract syntax:
regexp        ::=  unionexp
unionexp      ::=  interexp | unionexp    (union)
               |   interexp
interexp      ::=  concatexp & interexp    (intersection)  [OPTIONAL]
               |   concatexp
concatexp     ::=  repeatexp concatexp    (concatenation)
               |   repeatexp
repeatexp     ::=  repeatexp ?    (zero or one occurrence)
               |   repeatexp *    (zero or more occurrences)
               |   repeatexp +    (one or more occurrences)
               |   repeatexp {n}    (n occurrences)
               |   repeatexp {n,}    (n or more occurrences)
               |   repeatexp {n,m}    (n to m occurrences, including both)
               |   complexp
complexp      ::=  ~ complexp    (complement)  [OPTIONAL]
               |   charclassexp
charclassexp  ::=  [ charclasses ]    (character class)
               |   [^ charclasses ]    (negated character class)
               |   simpleexp
charclasses   ::=  charclass charclasses
               |   charclass
charclass     ::=  charexp - charexp    (character range, including end-points)
               |   charexp
simpleexp     ::=  charexp
               |   .    (any single character)
               |   #    (the empty language)  [OPTIONAL]
               |   @    (any string)  [OPTIONAL]
               |   " <Unicode string without double-quotes> "    (a string)
               |   ( )    (the empty string)
               |   ( unionexp )    (precedence override)
               |   < <identifier> >    (named automaton)  [OPTIONAL]
               |   <n-m>    (numerical interval)  [OPTIONAL]
charexp       ::=  <Unicode character>    (a single non-reserved character)
               |   \ <Unicode character>    (a single character)
The productions marked [OPTIONAL] are only allowed if specified by the syntax flags passed to the RegExp constructor. The reserved characters used in the (enabled) syntax must be escaped with backslash (\) or double-quotes ("..."). (In contrast to other regexp syntaxes, this is required also in character classes.) Be aware that dash (-) has a special meaning in charclass expressions. An identifier is a string not containing right angle bracket (>) or dash (-). Numerical intervals are specified by non-negative decimal integers and include both end points, and if n and m have the same number of digits, then the conforming strings must have that length (i.e. prefixed by 0's).
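Two of the rules above lend themselves to a small sketch: quoting a literal so that reserved characters lose their meaning, and deciding whether a string conforms to a numerical interval. The helpers below are hypothetical and do not call the RegExp class. The interval check assumes n is not greater than m and, when n and m differ in length, assumes that any decimal string whose value lies in the interval conforms (the text above only fixes the length when n and m have the same number of digits).

```java
import java.math.BigInteger;

final class AutomatonSyntaxSketch {
    // Quotes a literal using only forms from the grammar: "( )" for the empty
    // string, "..." when the text has no double quote, otherwise a backslash
    // before every character.
    static String quoteLiteral(String text) {
        if (text.isEmpty()) {
            return "()";
        }
        if (text.indexOf('"') < 0) {
            return "\"" + text + "\"";
        }
        StringBuilder escaped = new StringBuilder();
        text.codePoints().forEach(cp -> escaped.append('\\').appendCodePoint(cp));
        return escaped.toString();
    }

    // Checks a candidate string against the interval <n-m>, both ends included.
    static boolean inInterval(String n, String m, String candidate) {
        if (!candidate.matches("[0-9]+")) {
            return false;
        }
        // Same number of digits in n and m: the candidate must have that length.
        if (n.length() == m.length() && candidate.length() != n.length()) {
            return false;
        }
        BigInteger value = new BigInteger(candidate);
        return value.compareTo(new BigInteger(n)) >= 0
            && value.compareTo(new BigInteger(m)) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(quoteLiteral("a.b"));
        System.out.println(quoteLiteral("say \"hi\""));
        System.out.println(inInterval("01", "10", "07"));
        System.out.println(inInterval("01", "10", "7"));
    }
}
```

For the interval <01-10>, the candidate "07" conforms while "7" does not, because both bounds have two digits and the candidate must therefore be zero-padded to that length.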
@author Anders Møller <amoeller@cs.au.dk>
@lucene.experimental
A regular expression is zero or more branches, separated by "|". It matches anything that matches one of the branches.
A branch is zero or more pieces, concatenated. It matches a match for the first piece, followed by a match for the second piece, etc.
A piece is an atom, possibly followed by "*", "+", or "?".
An atom is
range (see below)

A range is a sequence of characters enclosed in "[]". The range normally matches any single character from the sequence. If the sequence begins with "^", the range matches any single character not from the rest of the sequence. If two characters in the sequence are separated by "-", this is shorthand for the full list of characters between them (e.g. "[0-9]" matches any decimal digit). To include a literal "]" in the sequence, make it the first character (following a possible "^"). To include a literal "-", make it the first or last character.
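The range rules above can be sketched as a single membership test. This is a hypothetical helper, not part of the Regexp class; it takes the text found between "[" and "]" and assumes the caller has already located the closing bracket.

```java
final class RangeSketch {
    // Tests one character against the sequence found between "[" and "]".
    static boolean rangeMatches(String sequence, char c) {
        int i = 0;
        boolean negated = false;
        // A leading "^" inverts the result for the rest of the sequence.
        if (!sequence.isEmpty() && sequence.charAt(0) == '^') {
            negated = true;
            i = 1;
        }
        boolean found = false;
        while (i < sequence.length()) {
            char low = sequence.charAt(i);
            // "x-y" is a span only when a character follows the dash, so a
            // dash in first or last position is treated as a literal.
            boolean isSpan = i + 2 < sequence.length() && sequence.charAt(i + 1) == '-';
            if (isSpan) {
                char high = sequence.charAt(i + 2);
                found |= low <= c && c <= high;
                i += 3;
            } else {
                found |= low == c;
                i++;
            }
        }
        return found != negated;
    }

    public static void main(String[] args) {
        System.out.println(rangeMatches("0-9", '5'));
        System.out.println(rangeMatches("^0-9", 'a'));
        System.out.println(rangeMatches("a-", '-'));
    }
}
```

Passing only the bracket contents keeps the sketch short: a literal "]" in first position needs no special handling here, since finding the closing bracket is left to the caller.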
In general there may be more than one way to match a regular expression to an input string. For example, consider the command

    String[] match = new String[2];
    Regexp.match("(a*)b*", "aabaaabb", match);

Considering only the rules given so far, match[0] and match[1] could end up with several different values. In the example from above, "(a*)b*" therefore matches exactly "aab"; the "(a*)" portion of the pattern is matched first and it consumes the leading "aa", then the "b*" portion of the pattern consumes the next "b". Or, consider the following example:

    String[] match = new String[3];
    Regexp.match("(ab|a)(b*)c", "abc", match);

After this command, match[0] will be "abc", match[1] will be "ab", and match[2] will be an empty string. Rule 4 specifies that the "(ab|a)" component gets first shot at the input string and Rule 2 specifies that the "ab" sub-expression is checked before the "a" sub-expression. Thus the "b" has already been claimed before the "(b*)" component is checked and therefore "(b*)" must match an empty string.

Regular expression substitution matches a string against a regular expression, transforming the string by replacing the matched region(s) with new substring(s).
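Returning to the two matching examples, the same outcomes can be reproduced with java.util.regex, which resolves these particular patterns the same way. This is a comparison sketch only: it does not use the Regexp class described here, and the helper name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchPriorityDemo {
    // Returns the groups of the first match (index 0 is the whole match),
    // or null when the pattern does not match anywhere in the input.
    static String[] firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] first = firstMatch("(a*)b*", "aabaaabb");
        System.out.println(first[0] + " / " + first[1]);

        String[] second = firstMatch("(ab|a)(b*)c", "abc");
        System.out.println(second[0] + " / " + second[1] + " / [" + second[2] + "]");
    }
}
```

The first call yields "aab" with "aa" as the subexpression, and the second yields "abc", "ab" and an empty string, matching the values stated in the text.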
What gets substituted into the result is controlled by a subspec. The subspec is a formatting string that specifies what portions of the matched region should be substituted into the result.
"\n", where n is a digit from 1 to 9, is replaced with a copy of the nth subexpression. Because the backslash is itself an escape character in Java string literals, write "\\2" to get a backslash and "2", not the Unicode character 0002.

    public static void main(String[] args) throws Exception {
        Regexp re;
        String[] matches;
        String s;

        /*
         * A regular expression to match the first line of a HTTP request.
         *
         * 1. ^               - starting at the beginning of the line
         * 2. ([A-Z]+)        - match and remember some upper case characters
         * 3. [ \t]+          - skip blank space
         * 4. ([^ \t]+)       - match and remember up to the next blank space
         * 5. [ \t]+          - skip more blank space
         * 6. (HTTP/1\\.[01]) - match and remember HTTP/1.0 or HTTP/1.1
         * 7. $               - end of string - no chars left.
         */
        s = "GET http://a.b.com:1234/index.html HTTP/1.1";
        re = new Regexp("^([A-Z]+)[ \t]+([^ \t]+)[ \t]+(HTTP/1\\.[01])$");
        matches = new String[4];
        if (re.match(s, matches)) {
            System.out.println("METHOD  " + matches[1]);
            System.out.println("URL     " + matches[2]);
            System.out.println("VERSION " + matches[3]);
        }

        /*
         * A regular expression to extract some simple comma-separated data,
         * reorder some of the columns, and discard column 2.
         */
        s = "abc,def,ghi,klm,nop,pqr";
        re = new Regexp("^([^,]+),([^,]+),([^,]+),(.*)");
        System.out.println(re.sub(s, "\\3,\\1,\\4"));
    }
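For comparison, the same two tasks can be sketched with java.util.regex, which needs no extra library. This is not the Regexp class documented here: java.util.regex refers to subexpressions in the replacement as $1 to $9 rather than \1 to \9, and the class and method names below are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RequestLineDemo {
    private static final Pattern REQUEST_LINE =
        Pattern.compile("^([A-Z]+)[ \t]+([^ \t]+)[ \t]+(HTTP/1\\.[01])$");

    // Returns {method, url, version}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = REQUEST_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    // Reorders comma-separated columns as 3,1,4... and discards column 2.
    static String reorder(String csv) {
        return csv.replaceFirst("^([^,]+),([^,]+),([^,]+),(.*)", "$3,$1,$4");
    }

    public static void main(String[] args) {
        String[] parts = parse("GET http://a.b.com:1234/index.html HTTP/1.1");
        if (parts != null) {
            System.out.println("METHOD  " + parts[0]);
            System.out.println("URL     " + parts[1]);
            System.out.println("VERSION " + parts[2]);
        }
        System.out.println(reorder("abc,def,ghi,klm,nop,pqr"));
    }
}
```

The fourth group captures everything after the third comma, so the reordered line keeps the trailing columns intact: "ghi,abc,klm,nop,pqr".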
@author Colin Stevens (colin.stevens@sun.com)
@version 2.3
@see Regsub