info.codesaway.util.regex
Class Pattern

java.lang.Object
  extended by info.codesaway.util.regex.Pattern
All Implemented Interfaces:
java.io.Serializable

public final class Pattern
extends java.lang.Object
implements java.io.Serializable

A compiled representation of a regular expression.

This class is an extension of Java's Pattern class. The Javadocs were copied from that class and extended to describe the added functionality.

A regular expression, specified as a string, must first be compiled into an instance of this class. The resulting pattern can then be used to create a Matcher object that can match arbitrary character sequences against the regular expression. All of the state involved in performing a match resides in the matcher, so many matchers can share the same pattern.

A typical invocation sequence is thus

 Pattern p = Pattern.compile("a*b");
 Matcher m = p.matcher("aaaaab");
 boolean b = m.matches();

A matches method is defined by this class as a convenience for when a regular expression is used just once. This method compiles an expression and matches an input sequence against it in a single invocation. The statement

 boolean b = Pattern.matches("a*b", "aaaaab");
is equivalent to the three statements above, though for repeated matches it is less efficient since it does not allow the compiled pattern to be reused.
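
As a rough illustration of the reuse point above, the sketch below compiles one pattern and shares it across several inputs. It uses the standard java.util.regex classes so that it runs without extra dependencies; the assumption is that info.codesaway.util.regex.Pattern and Matcher behave the same way for these calls. The names ReuseSketch and countMatches are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReuseSketch {
    // Compiled once; Pattern instances are immutable and can be shared.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // Counts the inputs that match the whole pattern, using one Matcher per input.
    static int countMatches(String... inputs) {
        int count = 0;
        for (String input : inputs) {
            Matcher m = A_STAR_B.matcher(input);
            if (m.matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("aaaaab", "b", "abc"));
        // One-off convenience form: the expression is compiled on every call.
        System.out.println(Pattern.matches("a*b", "aaaaab"));
    }
}
```

A Matcher holds the match state, so each thread (or each input) should get its own, while the compiled Pattern can be a shared constant.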

Instances of this class are immutable and are safe for use by multiple concurrent threads. Instances of the Matcher class are not safe for such use.

Summary of regular-expression constructs

Construct Matches
 
Characters
x The character x
\\ The backslash character
\0n The character with octal value 0n (0 <= n <= 7)
\0nn The character with octal value 0nn (0 <= n <= 7)
\0mnn The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh The character with hexadecimal value 0xhh
\x{hhh..} The character with hexadecimal value 0xhhh..
\uhhhh The character with hexadecimal value 0xhhhh
\t The tab character ('\u0009')
\n The newline (line feed) character ('\u000A')
\r The carriage-return character ('\u000D')
\f The form-feed character ('\u000C')
\a The alert (bell) character ('\u0007')
\e The escape character ('\u001B')
\cx The control character corresponding to x
 
Character classes
[abc] a, b, or c (simple class)
[^abc] Any character except a, b, or c (negation)
[a-zA-Z] a through z or A through Z, inclusive (range)
[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] d, e, or f (intersection)
[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z] (subtraction)
 
Predefined character classes
. Any character (may or may not match line terminators)
\X Single grapheme - equivalent to (?>\P{M}\p{M}*)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
 
POSIX character classes (US-ASCII only)
\p{Lower} A lower-case alphabetic character: [a-z]
\p{Upper} An upper-case alphabetic character: [A-Z]
\p{ASCII} All ASCII: [\x00-\x7F]
\p{Alpha} An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit} A decimal digit: [0-9]
\p{Alnum} An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct} Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph} A visible character: [\p{Alnum}\p{Punct}]
\p{Print} A printable character: [\p{Graph}\x20]
\p{Blank} A space or a tab: [ \t]
\p{Cntrl} A control character: [\x00-\x1F\x7F]
\p{XDigit} A hexadecimal digit: [0-9a-fA-F]
\p{Space} A whitespace character: [ \t\n\x0B\f\r]
 
POSIX character classes (US-ASCII only)

(equivalent to the above POSIX classes - only allowed in a character class)

[:lower:] A lower-case alphabetic character: [a-z]
[:upper:] An upper-case alphabetic character: [A-Z]
[:ascii:] All ASCII: [\x00-\x7F]
[:alpha:] An alphabetic character: [[:lower:][:upper:]]
[:digit:] A decimal digit: [0-9]
[:alnum:] An alphanumeric character: [[:alpha:][:digit:]]
[:punct:] Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[:graph:] A visible character: [[:alnum:][:punct:]]
[:print:] A printable character: [[:graph:]\x20]
[:blank:] A space or a tab: [ \t]
[:cntrl:] A control character: [\x00-\x1F\x7F]
[:xdigit:] A hexadecimal digit: [0-9a-fA-F]
[:space:] A whitespace character: [ \t\n\x0B\f\r]
[:word:] A word character: [\w]
 
java.lang.Character classes (simple java character type)
\p{javaLowerCase} Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase} Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace} Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored} Equivalent to java.lang.Character.isMirrored()
 
Classes for Unicode blocks and categories
\p{InGreek} A character in the Greek block (simple block)
\p{Lu} An uppercase letter (simple category)
\p{Sc} A currency symbol
\P{InGreek} Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]  Any letter except an uppercase letter (subtraction)
 
Boundary matchers
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input
 
Greedy quantifiers
X? X, once or not at all
X* X, zero or more times
X+ X, one or more times
X{n} X, exactly n times
X{n,} X, at least n times
X{n,m} X, at least n but not more than m times
 
Reluctant quantifiers
X?? X, once or not at all
X*? X, zero or more times
X+? X, one or more times
X{n}? X, exactly n times
X{n,}? X, at least n times
X{n,m}? X, at least n but not more than m times
 
Possessive quantifiers
X?+ X, once or not at all
X*+ X, zero or more times
X++ X, one or more times
X{n}+ X, exactly n times
X{n,}+ X, at least n times
X{n,m}+ X, at least n but not more than m times
 
Logical operators
XY X followed by Y
X|Y Either X or Y
 
Capturing
(X) X, as a capturing group
(?<name>X) X, as a named-capturing group
(?'name'X) X, as a named-capturing group
(?P<name>X) X, as a named-capturing group
 
Back references
\n Whatever the nth capturing group matched
\gn Whatever the nth capturing group matched
\g{n} Whatever the nth capturing group matched
   
\g-n Relative back reference
\g{-n} Relative back reference
   
\k<name> Whatever the named-capturing group "name" matched
\k'name' Whatever the named-capturing group "name" matched
\g{name} Whatever the named-capturing group "name" matched
\k{name} Whatever the named-capturing group "name" matched
(?P=name) Whatever the named-capturing group "name" matched
 
Quotation
\ Nothing, but quotes the following character
\Q Nothing, but quotes all characters until \E
\E Nothing, but ends quoting started by \Q
 
Special constructs (non-capturing)
(?:X) X, as a non-capturing group
(?idmsuxJn-idmsuxJn)  Nothing, but turns match flags i d m s u x J n on - off
(?idmsuxJn-idmsuxJn:X)   X, as a non-capturing group with the given flags i d m s u x J n on - off
(?>X) X, as an independent (atomic), non-capturing group
(?|X) X, as a "branch reset" pattern
 
Assertions (non-capturing)
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
(?<=X) X, via zero-width positive lookbehind
(?<!X) X, via zero-width negative lookbehind
 
Comment (non-capturing)
(?x:#comment\n) comment (cannot contain a line terminator)
(?xd:#comment\n) comment (cannot contain '\n')
(?#comment) comment (cannot contain a close parenthesis)
 
Conditional patterns (non-capturing)
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
 
(?(n)...) absolute reference condition
(?(-n)...) relative reference condition
   
(?(<name>)...) named reference condition
(?('name')...) named reference condition
(?(name)...) named reference condition
 
(?(assert)...) assert condition
 
Numeric ranges (non-capturing)
(?Z[start..end]) matches a numeric range (allowing for leading zeros)
(?Z16[start..end]) matches a numeric range in base 16 (allowing for leading zeros)
 
(?NZ[start..end]) matches a numeric range (not allowing for leading zeros)
(?NZ16[start..end]) matches a numeric range in base 16 (not allowing for leading zeros)

Backslashes, escapes, and quoting

The backslash character ('\') serves to introduce escaped constructs, as defined in the table above, as well as to quote characters that otherwise would be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash and \{ matches a left brace.

It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.

Backslashes within string literals in Java source code are interpreted as required by the Java Language Specification as either Unicode escapes or other character escapes. It is therefore necessary to double backslashes in string literals that represent regular expressions to protect them from interpretation by the Java bytecode compiler. The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and leads to a compile-time error; in order to match the string (hello) the string literal "\\(hello\\)" must be used.
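
The doubling rule can be seen in a minimal sketch like the one below. It relies on java.util.regex as a stand-in, assuming this class treats \b, \( and quote the same way; EscapeSketch and its method names are hypothetical.

```java
import java.util.regex.Pattern;

final class EscapeSketch {
    // "\\b" in Java source is the two-character regex for a word boundary.
    static boolean hasWholeWord(String text, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text).find();
    }

    // "\\(hello\\)" in Java source is the regex that matches the literal text (hello).
    static boolean isParenthesizedHello(String text) {
        return Pattern.matches("\\(hello\\)", text);
    }

    public static void main(String[] args) {
        System.out.println(hasWholeWord("say hello there", "hello"));
        System.out.println(hasWholeWord("othello", "hello"));
        System.out.println(isParenthesizedHello("(hello)"));
    }
}
```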

Character Classes

Character classes may appear within other character classes, and may be composed by the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes.

The precedence of character-class operators is as follows, from highest to lowest:

1     Literal escape     \x
2     Grouping [...]
3     Range a-z
4     Union [a-e][i-u]
5     Intersection [a-z&&[aeiou]]

Note that a different set of metacharacters is in effect inside a character class than outside a character class. For instance, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range-forming metacharacter.
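
The operators described above can be tried out with a small sketch. It uses java.util.regex on the assumption that union, intersection and subtraction behave identically in this class; CharClassSketch and its constants are illustrative names only.

```java
import java.util.regex.Pattern;

final class CharClassSketch {
    static final Pattern UNION = Pattern.compile("[a-d[m-p]]");          // same as [a-dm-p]
    static final Pattern INTERSECTION = Pattern.compile("[a-z&&[def]]"); // d, e, or f
    static final Pattern SUBTRACTION = Pattern.compile("[a-z&&[^bc]]");  // same as [ad-z]
    static final Pattern DOT_IN_CLASS = Pattern.compile("[.a-c]");       // literal dot, or a through c

    // True if the whole input is matched by the given class.
    static boolean matches(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(UNION, "n"));
        System.out.println(matches(INTERSECTION, "g"));
        System.out.println(matches(SUBTRACTION, "b"));
        System.out.println(matches(DOT_IN_CLASS, "x"));
    }
}
```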

Line terminators

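A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

A newline (line feed) character ('\n'),
A carriage-return character followed immediately by a newline character ("\r\n"),
A standalone carriage-return character ('\r'),
A next-line character ('\u0085'),
A line-separator character ('\u2028'), or
A paragraph-separator character ('\u2029').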

If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters.

The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.

By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence.
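
A short sketch of how MULTILINE and DOTALL change ^ and . follows. It is written against java.util.regex, assuming the flags carry the same meaning in this class; LineModeSketch and its helpers are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineModeSketch {
    // Counts matches of a word anchored with ^, so each match marks a place where ^ succeeded.
    static int countLineStarts(String input, int flags) {
        Matcher m = Pattern.compile("^\\w+", flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Tests whether "." can consume the newline between 'a' and 'b'.
    static boolean dotCrossesNewline(int flags) {
        return Pattern.compile("a.b", flags).matcher("a\nb").matches();
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";
        System.out.println(countLineStarts(text, 0));
        System.out.println(countLineStarts(text, Pattern.MULTILINE));
        System.out.println(dotCrossesNewline(0));
        System.out.println(dotCrossesNewline(Pattern.DOTALL));
    }
}
```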

Groups and capturing

Group number

Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:

1     ((A)(B(C)))
2     (A)
3     (B(C))
4     (C)

Group zero always stands for the entire expression.

Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.
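
The numbering of ((A)(B(C))) can be checked with a sketch such as the following. It uses java.util.regex, where groupCount() lives on Matcher, and assumes the default (Java) numbering described above; GroupNumberSketch is an illustrative name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupNumberSketch {
    // Returns group 0 (the whole match) followed by every numbered group,
    // or an empty array when the input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        for (String g : groups("((A)(B(C)))", "ABC")) {
            System.out.println(g);
        }
        // A back reference reuses what group 1 captured.
        System.out.println(Pattern.matches("(a)\\1", "aa"));
    }
}
```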

Note: To use .NET's numbering for capture groups (instead of Java's), specify the DOTNET_NUMBERING flag when compiling a pattern.

Group name

A capturing group can also be assigned a "name", a named-capturing group, and then be back-referenced later by the "name". Group names are composed of the following characters:

A named-capturing group is still numbered as described in Group number.
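
A minimal sketch of named groups that are also reachable by number is shown below. Only the (?<name>X) and \k<name> forms are exercised, because those are the ones java.util.regex (Java 7 and later) also accepts; the other spellings listed in the table are specific to this class and are not tested here. NamedGroupSketch and its methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupSketch {
    private static final Pattern DATE = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})");
    private static final Pattern DOUBLED = Pattern.compile("(?<ch>[a-z])\\k<ch>");

    // Reads the "year" group by name.
    static String yearOf(String input) {
        Matcher m = DATE.matcher(input);
        return m.matches() ? m.group("year") : null;
    }

    // Reads the "month" group by its number; named groups are still numbered.
    static String monthByNumber(String input) {
        Matcher m = DATE.matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    // Uses a named back reference to find a repeated lower-case letter.
    static boolean hasDoubledLetter(String input) {
        return DOUBLED.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(yearOf("2024-05"));
        System.out.println(monthByNumber("2024-05"));
        System.out.println(hasDoubledLetter("hello"));
    }
}
```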

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
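
The "aba" case above can be reproduced with a few lines. This is a sketch against java.util.regex, assuming the same retention rule applies in this class; RetainedCaptureSketch is an illustrative name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RetainedCaptureSketch {
    // Returns { group 1, group 2 } after matching the input against (a(b)?)+.
    static String[] groupsFor(String input) {
        Matcher m = Pattern.compile("(a(b)?)+").matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match: " + input);
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] groups = groupsFor("aba");
        // Group 1 holds the last iteration; group 2 keeps the value from the first iteration.
        System.out.println(groups[0]);
        System.out.println(groups[1]);
    }
}
```

When group two never participates, as with the input "a", it is reported as null rather than as an empty string.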

Groups beginning with (? are either pure, non-capturing groups that do not capture text and do not count towards the group total, or named-capturing groups.

Note: by default, capture group names must be unique, and if multiple groups with the same name exist, a PatternSyntaxException is thrown. By setting the DUPLICATE_NAMES flag, multiple capture groups with the same name are allowed.

Group

A group is either the name of a named-capturing group or a string of the form groupName[occurrence].

Use a positive occurrence (starting with 1) to refer to a specific occurrence of the group. A negative occurrence is a relative occurrence of the group. If the occurrence is omitted, or zero, the reference is to the first matched group with the specified group name. For example, groupName and groupName[0] both refer to the first matched occurrence of "groupName".

This syntax allows referring to any capture group in the pattern, even in the case where multiple groups have the same name (see DUPLICATE_NAMES), or the same number (see "branch reset" pattern).

Using this syntax, to refer to

"Branch reset" pattern

Quoted from the PCRE manual (the DUPLICATE SUBPATTERN NUMBERS section)

Perl 5.10 introduced a feature where each alternative in a subpattern uses the same numbers for its capturing parentheses. Such a subpattern starts with (?| and is itself a non-capturing subpattern. This construct is useful when you want to capture part, but not all, of one of a number of alternatives.

Inside a branch reset pattern, capture groups are numbered as usual, but the number is reset at the start of each branch. The numbers of any capturing buffers that follow the subpattern start after the highest number used in any branch.

The following example is taken from the Perl documentation. The numbers underneath show in which buffer the captured content will be stored.

# before  ---------------branch-reset----------- after
/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1            2         2  3        2     3     4

As a note, nested branch reset patterns are fully supported:

(?| ( 1a ) ( 2a ) | ( 1b ) (?| ( 2b1 ) | ( 2b2 ) ) ) /x
#     1      2        1          2         2

Note: if the DOTNET_NUMBERING flag is set, named capture groups inside of a branch reset pattern will be numbered as if they were unnamed groups. The group remains a named group, and can still be referred to by name.

(?|(?<One>1a)(2a)|(1b)(?<Two>2b))
#  1         2    1   2

Unicode support

This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.

Unicode escape sequences such as \u2014 in Java source code are processed as described in 3.3 of the Java Language Specification. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
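
The difference between the two string literals can be made concrete with a short sketch. It uses java.util.regex and assumes this class parses \uhhhh in the same way; UnicodeEscapeSketch and its constants are made-up names.

```java
import java.util.regex.Pattern;

final class UnicodeEscapeSketch {
    // The Java compiler turns this escape into a single character before the regex parser sees it.
    static final String COMPILER_ESCAPE = "\u2014";
    // Here the regex parser receives six characters and interprets the escape itself.
    static final String REGEX_ESCAPE = "\\u2014";

    // True if the given regex source matches the single character U+2014.
    static boolean matchesDash(String regex) {
        return Pattern.compile(regex).matcher(String.valueOf((char) 0x2014)).matches();
    }

    public static void main(String[] args) {
        System.out.println(COMPILER_ESCAPE.length());
        System.out.println(REGEX_ESCAPE.length());
        System.out.println(matchesDash(COMPILER_ESCAPE));
        System.out.println(matchesDash(REGEX_ESCAPE));
    }
}
```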

Unicode blocks and categories are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property. Blocks are specified with the prefix In, as in InMongolian. Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and categories can be used both inside and outside of a character class.
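
A few of these properties are exercised in the sketch below, written against java.util.regex on the assumption that this class resolves block and category names the same way. UnicodePropertySketch and the helper name are hypothetical.

```java
import java.util.regex.Pattern;

final class UnicodePropertySketch {
    // True if the whole input is matched by the given property expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String alpha = "\u03b1"; // Greek small letter alpha

        System.out.println(matches("\\p{L}", "a"));               // category: letter
        System.out.println(matches("\\p{IsL}", "a"));             // same category with the Is prefix
        System.out.println(matches("\\p{Lu}", "a"));              // category: uppercase letter
        System.out.println(matches("\\p{InGreek}", alpha));       // block: Greek
        System.out.println(matches("\\P{InGreek}", "a"));         // negated block
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "A")); // letter, but not uppercase
    }
}
```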

The supported categories are those of The Unicode Standard in the version specified by the Character class. The category names are those defined in the Standard, both normative and informative. The block names supported by Pattern are the valid block names accepted and defined by UnicodeBlock.forName.

Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname.

Comparison to Perl 5

The Pattern engine performs traditional NFA-based matching with ordered alternation as occurs in Perl 5.

Perl constructs not supported by this class:

Constructs supported by this class but not by Perl:

Notable differences from Perl:

For a more precise description of the behavior of regular expression constructs, please see Mastering Regular Expressions, 3rd Edition, Jeffrey E. F. Friedl, O'Reilly and Associates, 2006.

Numeric range

Regular expressions offer extensive functionality, but they are designed to match text, so matching a numeric range requires some extra work. Because matching numeric ranges is sometimes necessary, the Pattern class has built-in support for them.

To allow leading zeros in a match, use the syntax (?Z[start..end]). In this case, the match's width (the number of digits matched) is between the number of digits in start and the number of digits in end. For example, (?Z[071..9]) matches a number between 9 and 71 with between 1 and 3 digits. As this example shows, you can specify a range as [start..end] or as [end..start]. A range can also have a negative number for either its start or end, and the syntax remains the same.

In the case that one bound is negative, and the other bound is positive, the match's width is as follows. For a negative number, the number of digits in a match must be between 1 and the number of digits in the negative bound. For a positive number, the number of digits is between 1 and the number of digits in the positive bound.

To disallow leading zeros in a match, use the syntax (?NZ[start..end]). In this case, the match will not contain any leading zeros. For example, (?NZ[071..9]) will match the "9" in "09", but it won't match the entire "09", since leading zeros are not part of the match.

For either format, by default, the numbers are decimal numbers (base 10). If you want to match a range in a different base, specify the base number after the "Z" or "NZ". For example, (?Z16[0..ff]) will match a hex number between 0 and 0xFF - for example, "aa".

When working with bases above 10, letters are used as digits, for example, in base 16, 'A' through 'F' are used to represent digits 10 through 15. By default, when matching a number, both upper-case and lower-case digits are allowed. For example, (?Z16[0..ff]) will match both "AA" and "aa". By specifying an 'L' or a 'U' after the base number, you can force only lower or upper-case digits to match. The regex (?Z16U[0..ff]), for example, will match "AA", but not "aa". Note that regardless of this setting, in the pattern, either upper-case or lower-case digits may be used. For bases 10 or less, this setting has no effect, but, for consistency, can be specified - the regex (?Z8U[0..377]) is equivalent to (?Z8[0..377]).
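
To make the width and bound rules for (?Z[start..end]) concrete, the sketch below emulates them with plain java.util.regex plus a numeric comparison in code. This is a swapped-in technique for illustration only, limited to non-negative decimal bounds; it is not how this class implements numeric ranges, and NumericRangeSketch and matchesRange are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericRangeSketch {
    // Emulates the documented behaviour of (?Z[start..end]) for non-negative decimal bounds.
    // The bounds may be given in either order, and leading zeros are allowed.
    static boolean matchesRange(String input, String start, String end) {
        int minDigits = Math.min(start.length(), end.length());
        int maxDigits = Math.max(start.length(), end.length());
        long low = Math.min(Long.parseLong(start), Long.parseLong(end));
        long high = Math.max(Long.parseLong(start), Long.parseLong(end));

        // Step 1: an ordinary regex constrains the width of the match.
        Matcher m = Pattern.compile("\\d{" + minDigits + "," + maxDigits + "}").matcher(input);
        if (!m.matches()) {
            return false;
        }
        // Step 2: the numeric comparison happens in code rather than in the regex.
        long value = Long.parseLong(input);
        return low <= value && value <= high;
    }

    public static void main(String[] args) {
        // Mirrors (?Z[071..9]): values 9 through 71, written with 1 to 3 digits.
        System.out.println(matchesRange("09", "071", "9"));
        System.out.println(matchesRange("071", "071", "9"));
        System.out.println(matchesRange("72", "071", "9"));
        System.out.println(matchesRange("0071", "071", "9"));
    }
}
```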

See Also:
split(CharSequence, String, int), split(CharSequence, String), Serialized Form

Nested Class Summary
static class Pattern.Comparison
           
static class Pattern.Range
          Functions used to match a range of values e.g.
 
Field Summary
static int CANON_EQ
          Enables canonical equivalence.
static int CASE_INSENSITIVE
          Enables case-insensitive matching.
static int COMMENTS
          Permits whitespace and comments in pattern.
static int DOTALL
          Enables dotall mode.
static int DOTNET_NUMBERING
          Use .NET numbering for capture groups (instead of Java's).
static int DUPLICATE_NAMES
          Allows duplicate capture group names in pattern.
static Pattern EMPTY_PATTERN
          A pattern whose regular expression is the empty string.
static int EXPLICIT_CAPTURE
          Enables explicit capture mode.
static int LITERAL
          Enables literal parsing of the pattern.
static int MULTILINE
          Enables multiline mode.
static int PERL_OCTAL
          Use Perl's octal syntax (instead of Java's).
static int UNICODE_CASE
          Enables Unicode-aware case folding.
static int UNIX_LINES
          Enables Unix lines mode.
static int VERIFY_GROUPS
          When compiling a pattern, verifies that all referenced groups exist.
 
Method Summary
 boolean addedGroups()
          Indicates whether additional capture groups were added to the internal pattern when refactoring the compiled regular expression.
 java.util.regex.Pattern bitwiseNegate()
          Alias for getInternalPattern().
static Pattern compile(java.lang.String regex)
          Compiles the given regular expression into a pattern.
static Pattern compile(java.lang.String regex, int flags)
          Compiles the given regular expression into a pattern with the given flags.
static Pattern compile(java.lang.String regex, PatternOptions... flags)
          Compiles the given regular expression into a pattern with the given flags.
 int flags()
          Returns this pattern's match flags.
 PatternFlags getFlags()
           
 java.util.regex.Pattern getInternalPattern()
          Gets the internal pattern
 int groupCount()
          Returns the number of capturing groups in this pattern.
 int groupCount(int group)
          Returns the number of capturing groups (with the given group index) in this pattern.
 int groupCount(java.lang.String groupName)
          Returns the number of capturing groups (with the given group name) in this pattern.
 boolean has(PatternFlag flag)
           
 java.lang.String internalPattern()
          Returns the regular expression from which the internal pattern was compiled.
 boolean isCase(java.lang.Object switchValue)
          'Case' implementation for this class, which allows testing a String against a number of regular expressions (in Groovy only).
 boolean isPartialMatch(java.lang.CharSequence input)
          Indicates whether the given input partially matches this Pattern.
 Matcher matcher(java.lang.CharSequence input)
          Creates a matcher that will match the given input against this pattern.
static boolean matches(java.lang.String regex, java.lang.CharSequence input)
          Compiles the given regular expression and attempts to match the given input against it.
static int naturalCompareTo(java.lang.CharSequence value1, java.lang.CharSequence value2)
          Compares two character sequences lexicographically, except that embedded numbers are treated numerically.
 Pattern normalize()
          Normalizes the pattern by inlining all possible flags.
static Pattern normalize(java.util.regex.Pattern pattern)
           
 Pattern or(java.lang.CharSequence regex)
           
 Pattern or(int flags)
           
 Pattern or(Pattern pattern)
           
 Pattern or(java.util.regex.Pattern pattern)
           
 Pattern or(PatternFlag flag)
           
 Pattern or(java.util.Set<PatternFlag> flags)
           
 java.lang.String pattern()
          Returns the regular expression from which this pattern was compiled.
 Pattern plus(java.lang.CharSequence regex)
           
 Pattern plus(Pattern pattern)
           
 Pattern plus(java.util.regex.Pattern pattern)
           
 Pattern positive()
          Returns this Pattern.
static java.lang.String quote(java.lang.String s)
          Returns a literal pattern String for the specified String.
static java.lang.String range(int start, int end, java.lang.String mode)
          Returns a regular expression that matches the specified numeric range.
static java.lang.String range(java.lang.String start, java.lang.String end, java.lang.String mode)
          Returns a regular expression that matches the specified numeric range.
static java.lang.String replaceAll(java.lang.CharSequence input, java.lang.String regex, java.lang.String replacement)
          Replaces each substring of the given input sequence that matches the given regular expression with the given replacement.
static java.lang.String replaceFirst(java.lang.CharSequence input, java.lang.String regex, java.lang.String replacement)
          Replaces the first substring of the given input sequence that matches the given regular expression with the given replacement.
 java.lang.String[] split(java.lang.CharSequence input)
          Splits the given input sequence around matches of this pattern.
 java.lang.String[] split(java.lang.CharSequence input, int limit)
          Splits the given input sequence around matches of this pattern.
static java.lang.String[] split(java.lang.CharSequence input, java.lang.String regex)
          Splits the given input sequence around matches of the given regular expression.
static java.lang.String[] split(java.lang.CharSequence input, java.lang.String regex, int limit)
          Splits the given input sequence around matches of the given regular expression.
 java.lang.String toString()
          Returns the string representation of this pattern.
static Pattern valueOf(java.util.regex.Pattern pattern)
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

UNIX_LINES

public static final int UNIX_LINES
Enables Unix lines mode.

In this mode, only the '\n' line terminator is recognized in the behavior of ., ^, and $.

Unix lines mode can also be enabled via the embedded flag expression (?d).
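
A small sketch of the effect on . follows, using java.util.regex and assuming UNIX_LINES and (?d) mean the same thing in this class. UnixLinesSketch is an illustrative name.

```java
import java.util.regex.Pattern;

final class UnixLinesSketch {
    // Tests whether "." can consume a carriage return between 'a' and 'b'.
    static boolean dotMatchesCarriageReturn(String regex, int flags) {
        return Pattern.compile(regex, flags).matcher("a\rb").matches();
    }

    public static void main(String[] args) {
        // By default '\r' is a line terminator, so "." refuses to match it.
        System.out.println(dotMatchesCarriageReturn("a.b", 0));
        // In Unix lines mode only '\n' counts as a line terminator.
        System.out.println(dotMatchesCarriageReturn("a.b", Pattern.UNIX_LINES));
        // The embedded flag has the same effect.
        System.out.println(dotMatchesCarriageReturn("(?d)a.b", 0));
    }
}
```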

See Also:
Constant Field Values

CASE_INSENSITIVE

public static final int CASE_INSENSITIVE
Enables case-insensitive matching.

By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag.

Case-insensitive matching can also be enabled via the embedded flag expression (?i).

Specifying this flag may impose a slight performance penalty.
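
The ASCII-only default can be observed with a sketch like this one. It runs against java.util.regex, under the assumption that this class applies CASE_INSENSITIVE and UNICODE_CASE in the same way; CaseSketch is a made-up name.

```java
import java.util.regex.Pattern;

final class CaseSketch {
    // True if the whole input matches the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lowerEAcute = "\u00e9";
        String upperEAcute = "\u00c9";

        // ASCII letters fold with CASE_INSENSITIVE alone.
        System.out.println(matches("abc", Pattern.CASE_INSENSITIVE, "ABC"));
        // Non-ASCII letters do not fold unless UNICODE_CASE is added.
        System.out.println(matches(lowerEAcute, Pattern.CASE_INSENSITIVE, upperEAcute));
        System.out.println(matches(lowerEAcute,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, upperEAcute));
        // The embedded flag form.
        System.out.println(matches("(?i)abc", 0, "AbC"));
    }
}
```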

See Also:
Constant Field Values

COMMENTS

public static final int COMMENTS
Permits whitespace and comments in pattern.

In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line.

Comments mode can also be enabled via the embedded flag expression (?x).
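
As a sketch of what comments mode allows, the pattern below is spread over several lines with trailing comments. It is compiled with java.util.regex, assuming COMMENTS behaves the same in this class; the phone-number pattern and the class name CommentsSketch are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentsSketch {
    // Whitespace is ignored and each # starts a comment that runs to the end of the line.
    static final Pattern PHONE = Pattern.compile(
            "(\\d{3})  # exchange\n"
          + "-         # separator\n"
          + "(\\d{4})  # line number",
            Pattern.COMMENTS);

    // Returns { exchange, line number } or null if the input does not match.
    static String[] parse(String input) {
        Matcher m = PHONE.matcher(input);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = parse("555-1234");
        System.out.println(parts[0] + " / " + parts[1]);
        // The embedded flag gives the same whitespace handling.
        System.out.println(Pattern.matches("(?x) a b c", "abc"));
    }
}
```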

See Also:
Constant Field Values

MULTILINE

public static final int MULTILINE
Enables multiline mode.

In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence.

Multiline mode can also be enabled via the embedded flag expression (?m).

See Also:
Constant Field Values

LITERAL

public static final int LITERAL
Enables literal parsing of the pattern.

When this flag is specified then the input string that specifies the pattern is treated as a sequence of literal characters. Metacharacters or escape sequences in the input sequence will be given no special meaning.

The flags CASE_INSENSITIVE and UNICODE_CASE retain their impact on matching when used in conjunction with this flag. The other flags become superfluous.

There is no embedded flag character for enabling literal parsing.
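
A minimal sketch of literal parsing, and of CASE_INSENSITIVE still applying on top of it, is given below. It uses java.util.regex under the assumption that LITERAL works the same way here; LiteralSketch is a hypothetical name.

```java
import java.util.regex.Pattern;

final class LiteralSketch {
    // True if the whole input matches "a.b" compiled with the given flags.
    static boolean matchesADotB(int flags, String input) {
        return Pattern.compile("a.b", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Without LITERAL the dot is a metacharacter.
        System.out.println(matchesADotB(0, "axb"));
        // With LITERAL the dot only matches a dot.
        System.out.println(matchesADotB(Pattern.LITERAL, "axb"));
        System.out.println(matchesADotB(Pattern.LITERAL, "a.b"));
        // CASE_INSENSITIVE keeps its effect alongside LITERAL.
        System.out.println(matchesADotB(Pattern.LITERAL | Pattern.CASE_INSENSITIVE, "A.B"));
    }
}
```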

See Also:
Constant Field Values

DOTALL

public static final int DOTALL
Enables dotall mode.

In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.

Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)

See Also:
Constant Field Values

UNICODE_CASE

public static final int UNICODE_CASE
Enables Unicode-aware case folding.

When this flag is specified then case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched.

Unicode-aware case folding can also be enabled via the embedded flag expression (?u).

Specifying this flag may impose a performance penalty.

See Also:
Constant Field Values

CANON_EQ

public static final int CANON_EQ
Enables canonical equivalence.

When this flag is specified then two characters will be considered to match if, and only if, their full canonical decompositions match. The expression "a\u030A", for example, will match the string "\u00E5" when this flag is specified. By default, matching does not take canonical equivalence into account.

There is no embedded flag character for enabling canonical equivalence.

Specifying this flag may impose a performance penalty.

See Also:
Constant Field Values

DUPLICATE_NAMES

public static final int DUPLICATE_NAMES
Allows duplicate capture group names in pattern.

If a pattern has this flag set, multiple capture groups with the same name are allowed. By default, capture group names must be unique.

Allowing duplicate names can also be enabled via the embedded flag expression (?J).

See Also:
Constant Field Values

VERIFY_GROUPS

public static final int VERIFY_GROUPS
When compiling a pattern, verifies that all referenced groups exist.

If this flag is set, a PatternSyntaxException will be thrown if the pattern contains a reference to a non-existent group, whereas, by default, no exception would be thrown.

Verification of groups can also be enabled via the embedded flag expression (?v).

See Also:
Constant Field Values

PERL_OCTAL

public static final int PERL_OCTAL
Use Perl's octal syntax (instead of Java's).

That is, \n is a back reference if at least that many groups have occurred at the current point in the pattern. Otherwise, up to the first three (octal) digits are used to form an octal code, and any additional trailing digits will be treated literally.

Using Perl's octal syntax can also be enabled via the embedded flag expression (?o).

See Also:
Constant Field Values

DOTNET_NUMBERING

public static final int DOTNET_NUMBERING
Use .NET numbering for capture groups (instead of Java's).

In .NET, named-capture groups are numbered like unnamed groups, but numbering of named groups starts after all unnamed groups have been counted.

For example, the expression ((?<One>A)B)?(?<Two>C)(D) produces the following capturing groups by number and name.

Number  Name  Pattern
0       none  ((?<One>A)B)?(?<Two>C)(D)
1       none  ((?<One>A)B)
2       none  (D)
3       One   (?<One>A)
4       Two   (?<Two>C)

See Also:
Constant Field Values

EXPLICIT_CAPTURE

public static final int EXPLICIT_CAPTURE
Enables explicit capture mode.

In this mode, unnamed capture groups don't capture - that is, they are treated like non-capture groups. However, named capture groups can still be used for capturing (and they acquire numbers in the usual way).

Explicit capture mode can also be enabled via the embedded flag expression (?n).

Note: this feature is taken from .NET.

See Also:
Constant Field Values

EMPTY_PATTERN

public static final Pattern EMPTY_PATTERN
A pattern whose regular expression is the empty string.

Method Detail

compile

public static Pattern compile(java.lang.String regex)
Compiles the given regular expression into a pattern.

Parameters:
regex - The expression to be compiled
Returns:
The compiled Pattern
Throws:
PatternSyntaxException - If the expression's syntax is invalid

compile

public static Pattern compile(java.lang.String regex,
                              PatternOptions... flags)
Compiles the given regular expression into a pattern with the given flags.

Parameters:
regex - The expression to be compiled
flags - the flags
Returns:
The compiled Pattern
Throws:
PatternSyntaxException - If the expression's syntax is invalid

compile

public static Pattern compile(java.lang.String regex,
                              int flags)
Compiles the given regular expression into a pattern with the given flags.

Parameters:
regex - The expression to be compiled
flags - Match flags, a bit mask that may include CASE_INSENSITIVE, MULTILINE, DOTALL, UNICODE_CASE, CANON_EQ, UNIX_LINES, LITERAL, COMMENTS, DUPLICATE_NAMES, VERIFY_GROUPS, PERL_OCTAL, DOTNET_NUMBERING, and EXPLICIT_CAPTURE
Returns:
The compiled Pattern
Throws:
java.lang.IllegalArgumentException - If bit values other than those corresponding to the defined match flags are set in flags
PatternSyntaxException - If the expression's syntax is invalid
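
A minimal sketch of building and inspecting such a bit mask, shown with java.util.regex.Pattern as a stand-in, since the flags it shares with this class combine the same way. The class and method names are hypothetical; the RegExPlus-only flags such as PERL_OCTAL are not available in java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical demo using java.util.regex as a stand-in.
final class FlagMaskSketch {

    /** Combines two flags into one bit mask with bitwise OR. */
    static Pattern compileCaseInsensitiveMultiline(String regex) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
    }

    /** Tests whether a single flag bit is set in the pattern's match flags. */
    static boolean isSet(Pattern p, int flag) {
        return (p.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        Pattern p = compileCaseInsensitiveMultiline("^abc$");
        System.out.println(isSet(p, Pattern.MULTILINE));
        System.out.println(isSet(p, Pattern.DOTALL));
        System.out.println(p.matcher("x\nABC\ny").find());
    }
}
```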

valueOf

public static Pattern valueOf(java.util.regex.Pattern pattern)
Parameters:
pattern -
Returns:
Since:
0.2

getInternalPattern

public java.util.regex.Pattern getInternalPattern()
Gets the internal pattern

Returns:
The internal Pattern used by this pattern.

internalPattern

public java.lang.String internalPattern()
Returns the regular expression from which the internal pattern was compiled.

Returns:
The source of the internal pattern

pattern

public java.lang.String pattern()
Returns the regular expression from which this pattern was compiled.

Returns:
The source of this pattern

addedGroups

public boolean addedGroups()
Indicates whether additional capture groups were added to the internal pattern when refactoring the compiled regular expression.

Returns:
Since:
0.2

groupCount

public int groupCount()
Returns the number of capturing groups in this pattern.

Group zero denotes the entire pattern by convention. It is not included in this count.

Any non-negative integer smaller than or equal to the value returned by this method is guaranteed to be a valid group index for this pattern.
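
The counting convention can be checked with java.util.regex, where the equivalent query lives on Matcher rather than on Pattern. The sketch below uses that stand-in; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo using java.util.regex as a stand-in.
final class GroupCountSketch {

    /** Counts capturing groups; group zero (the whole match) is not included. */
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(a)(b(c))")); // three capturing groups
        System.out.println(groupCount("(?:a)b"));    // non-capture groups are not counted
    }
}
```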

Returns:
The number of capturing groups in this pattern

groupCount

public int groupCount(int group)
Returns the number of capturing groups (with the given group index) in this pattern.

Note: in most cases, this method returns 1. The only exception is a "branch reset" pattern, where multiple groups may share the same group index.

For example,

// Outputs 2, since there are two groups that have the group index of 1
 System.out.println(Pattern.compile("(?|(1a)|(1b))").groupCount(1));

Group zero denotes the entire pattern by convention. It is not included in this count.

Any non-negative integer smaller than or equal to the value returned by this method is guaranteed to be a valid occurrence (for a group, groupName[occurrence]) for this pattern.

Note: unlike other methods, this method doesn't throw an exception if the specified group doesn't exist. Instead, zero is returned, since the number of groups with the (non-existent) group index is zero.

Parameters:
group - The group index for a capturing group in this pattern
Returns:
The number of capturing groups (with the given group index) in this pattern
Since:
0.2

groupCount

public int groupCount(java.lang.String groupName)
Returns the number of capturing groups (with the given group name) in this pattern.

Group zero denotes the entire pattern by convention. It is not included in this count.

Any non-negative integer smaller than or equal to the value returned by this method is guaranteed to be a valid occurrence (for a group, groupName[occurrence]) for this pattern.

If groupName is the empty string, this method's return is equal to the return from groupCount().

Note: unlike other methods, this method doesn't throw an exception if the specified group doesn't exist. Instead, zero is returned, since the number of groups with the (non-existent) group name is zero.

Parameters:
groupName - The group name for a capturing group in this pattern
Returns:
The number of capturing groups (with the given group name) in this pattern

toString

public java.lang.String toString()
Returns the string representation of this pattern. This is the regular expression from which this pattern was compiled.

Overrides:
toString in class java.lang.Object
Returns:
The string representation of this pattern

matcher

public Matcher matcher(java.lang.CharSequence input)
Creates a matcher that will match the given input against this pattern.

Parameters:
input - The character sequence to be matched
Returns:
A new matcher for this pattern

isPartialMatch

public boolean isPartialMatch(java.lang.CharSequence input)
Indicates whether the given input partially matches this Pattern.

For the given input to be a partial match, it must be the prefix of some valid match. Conversely, if this method returns false, then appending characters to the given input will never yield a match.

For example, given the following pattern to match a decimal number

 Pattern p = Pattern.compile("\\d+\\.\\d+");
The following calls return true
 p.isPartialMatch("");
 p.isPartialMatch("1");
 p.isPartialMatch("2");
 p.isPartialMatch("9");
 p.isPartialMatch("123");
 p.isPartialMatch("123.");
 p.isPartialMatch("123.456");
 // p.matcher("123.456").matches() would also return true (see note below)
Whereas these calls return false
 p.isPartialMatch("a");
 p.isPartialMatch(".");
 p.isPartialMatch(".4");
 p.isPartialMatch(".45");
 p.isPartialMatch(".456");

Note: if the given input would match the pattern, this method returns true. That is, a match is also a partial match.
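
With plain java.util.regex, a similar check can be approximated by combining Matcher.matches() with Matcher.hitEnd(). This is a sketch of one known technique, not necessarily how this class implements isPartialMatch. The class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo using java.util.regex as a stand-in.
final class PartialMatchSketch {

    /**
     * True if the input matches, or if the match attempt ran into the end of
     * the input, meaning that more characters could still change the result.
     */
    static boolean isPartialMatch(Pattern p, CharSequence input) {
        Matcher m = p.matcher(input);
        return m.matches() || m.hitEnd();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\d+\\.\\d+");
        for (String s : new String[] { "", "1", "123.", "123.456", "a", ".", ".45" }) {
            System.out.println("\"" + s + "\" -> " + isPartialMatch(p, s));
        }
    }
}
```

For the decimal-number pattern above, this approximation agrees with the listed results: the first four inputs in main are reported as partial matches and the last three are not.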

Parameters:
input - The character sequence to be matched
Returns:
true if, and only if, the given input partially matches this pattern

flags

public int flags()
Returns this pattern's match flags.

Returns:
The match flags specified when this pattern was compiled

getFlags

public PatternFlags getFlags()
Returns:
Since:
0.2

normalize

public static Pattern normalize(java.util.regex.Pattern pattern)
Parameters:
pattern -
Returns:

normalize

public Pattern normalize()
Normalizes the pattern by inlining all possible flags.

Note: the returned pattern matches the exact same inputs as this pattern.
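
One way to picture "inlining" is to move compile-time flags into an embedded flag expression at the start of the regular expression. The sketch below does this for the flags java.util.regex can express inline. It is an assumption about the general idea, not the output format of normalize(), and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo using java.util.regex as a stand-in.
final class InlineFlagSketch {

    /** Prefixes the regex with an embedded flag expression such as (?is). */
    static String inlineFlags(String regex, int flags) {
        StringBuilder letters = new StringBuilder();
        if ((flags & Pattern.UNIX_LINES) != 0) letters.append('d');
        if ((flags & Pattern.CASE_INSENSITIVE) != 0) letters.append('i');
        if ((flags & Pattern.MULTILINE) != 0) letters.append('m');
        if ((flags & Pattern.DOTALL) != 0) letters.append('s');
        if ((flags & Pattern.UNICODE_CASE) != 0) letters.append('u');
        if ((flags & Pattern.COMMENTS) != 0) letters.append('x');
        // CANON_EQ and LITERAL have no embedded form and are left out here.
        return letters.length() == 0 ? regex : "(?" + letters + ")" + regex;
    }

    public static void main(String[] args) {
        String inlined = inlineFlags("a.c", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        System.out.println(inlined);
        System.out.println(Pattern.compile(inlined).matcher("A\nC").matches());
    }
}
```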

Returns:
the normalized pattern

matches

public static boolean matches(java.lang.String regex,
                              java.lang.CharSequence input)
Compiles the given regular expression and attempts to match the given input against it.

An invocation of this convenience method of the form

 Pattern.matches(regex, input);
behaves in exactly the same way as the expression
 Pattern.compile(regex).matcher(input).matches()

If a pattern is to be used multiple times, compiling it once and reusing it will be more efficient than invoking this method each time.

Parameters:
regex - The expression to be compiled
input - The character sequence to be matched
Returns:
true if, and only if, the entire input sequence matches the given regular expression
Throws:
PatternSyntaxException - If the expression's syntax is invalid

replaceFirst

public static java.lang.String replaceFirst(java.lang.CharSequence input,
                                            java.lang.String regex,
                                            java.lang.String replacement)
Replaces the first substring of the given input sequence that matches the given regular expression with the given replacement.

An invocation of this method of the form  Pattern.replaceFirst(input, regex, replacement) yields exactly the same result as the expression

Pattern.compile(regex).matcher(input).replaceFirst(replacement)

Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string; see Matcher.replaceFirst(java.lang.String). Use Matcher.quoteReplacement(java.lang.String) to suppress the special meaning of these characters, if desired.

Note: this function serves as a substitute for String.replaceFirst(String, String).
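
A short sketch of the equivalent expression, written against java.util.regex as a stand-in. It shows a group reference ($1) in the replacement being expanded for the first match only. The class name is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo using java.util.regex as a stand-in.
final class ReplaceFirstSketch {

    /** Same shape as the equivalence above: compile, match, replace the first hit. */
    static String replaceFirst(CharSequence input, String regex, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
    }

    public static void main(String[] args) {
        // "$1" refers to the text captured by group 1 of the first match.
        System.out.println(replaceFirst("order 12 and 34", "(\\d+)", "<$1>"));
    }
}
```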

Parameters:
input - The character sequence to be matched
regex - The regular expression to which the input sequence is to be matched
replacement - The string to be substituted for the first match
Returns:
The resulting String
Throws:
PatternSyntaxException - If the regular expression's syntax is invalid

replaceAll

public static java.lang.String replaceAll(java.lang.CharSequence input,
                                          java.lang.String regex,
                                          java.lang.String replacement)
Replaces each substring of the given input sequence that matches the given regular expression with the given replacement.

An invocation of this method of the form  Pattern.replaceAll(input, regex, replacement) yields exactly the same result as the expression

Pattern.compile(regex).matcher(input).replaceAll(replacement)

Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string; see Matcher.replaceAll(java.lang.String). Use Matcher.quoteReplacement(java.lang.String) to suppress the special meaning of these characters, if desired.

Note: this function serves as a substitute for String.replaceAll(String, String).
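
When the replacement should be taken literally, quoting it first avoids the special handling of backslashes and dollar signs. The sketch uses java.util.regex as a stand-in; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo using java.util.regex as a stand-in.
final class ReplaceAllSketch {

    /** Replaces every match with the given text, treated as a literal string. */
    static String replaceAllLiteral(CharSequence input, String regex, String literal) {
        // Without quoteReplacement, "$5" would be read as a reference to group 5.
        return Pattern.compile(regex).matcher(input).replaceAll(Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        System.out.println(replaceAllLiteral("price: X, X", "X", "$5"));
    }
}
```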

Parameters:
input - The character sequence to be matched
regex - The regular expression to which the input sequence is to be matched
replacement - The string to be substituted for each match
Returns:
The resulting String
Throws:
PatternSyntaxException - If the regular expression's syntax is invalid

split

public static java.lang.String[] split(java.lang.CharSequence input,
                                       java.lang.String regex,
                                       int limit)
Splits the given input sequence around matches of the given regular expression.

The array returned by this method contains each substring of the input sequence that is terminated by another substring that matches the given expression or is terminated by the end of the input. The substrings in the array are in the order in which they occur in the input. If the expression does not match any part of the input then the resulting array has just one element, namely the input sequence in string form.

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.

The string "boo:and:foo", for example, yields the following results with these parameters:

Regex Limit Result
: 2 { "boo", "and:foo" }
: 5 { "boo", "and", "foo" }
: -2 { "boo", "and", "foo" }
o 5 { "b", "", ":and:f", "", "" }
o -2 { "b", "", ":and:f", "", "" }
o 0 { "b", "", ":and:f" }

An invocation of this method of the form Pattern.split(input, regex, limit) yields the same result as the expression

Pattern.compile(regex).split(input, limit)

Note: this function serves as a substitute for String.split(String, int).
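
The effect of the limit can be reproduced with java.util.regex, whose split has the same documented behavior. The sketch below uses that stand-in with three rows from the table above. The class name is hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo using java.util.regex as a stand-in.
final class SplitLimitSketch {

    /** Same shape as the equivalence above: compile the regex, then split with a limit. */
    static String[] split(CharSequence input, String regex, int limit) {
        return Pattern.compile(regex).split(input, limit);
    }

    public static void main(String[] args) {
        String input = "boo:and:foo";
        System.out.println(Arrays.toString(split(input, ":", 2)));  // at most 2 elements
        System.out.println(Arrays.toString(split(input, "o", -2))); // trailing empty strings kept
        System.out.println(Arrays.toString(split(input, "o", 0)));  // trailing empty strings dropped
    }
}
```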

Parameters:
input - The character sequence to be split
regex - The delimiting regular expression
limit - The result threshold, as described above
Returns:
The array of strings computed by splitting the input sequence around matches of the given regular expression
Throws:
PatternSyntaxException - If the regular expression's syntax is invalid

split

public static java.lang.String[] split(java.lang.CharSequence input,
                                       java.lang.String regex)
Splits the given input sequence around matches of the given regular expression.

This method works as if by invoking the three-argument split method with the given input sequence, expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.

The string "boo:and:foo", for example, yields the following results with these expressions:

Regex Result
: { "boo", "and", "foo" }
o { "b", "", ":and:f" }

Note: this function serves as a substitute for String.split(String).

Parameters:
input - The character sequence to be split
regex - The delimiting regular expression
Returns:
The array of strings computed by splitting the input sequence around matches of the given regular expression
Throws:
PatternSyntaxException - If the regular expression's syntax is invalid

split

public java.lang.String[] split(java.lang.CharSequence input,
                                int limit)
Splits the given input sequence around matches of this pattern.

The array returned by this method contains each substring of the input sequence that is terminated by another subsequence that matches this pattern or is terminated by the end of the input sequence. The substrings in the array are in the order in which they occur in the input. If this pattern does not match any subsequence of the input then the resulting array has just one element, namely the input sequence in string form.

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.

The input "boo:and:foo", for example, yields the following results with these parameters:

Regex Limit Result
: 2 { "boo", "and:foo" }
: 5 { "boo", "and", "foo" }
: -2 { "boo", "and", "foo" }
o 5 { "b", "", ":and:f", "", "" }
o -2 { "b", "", ":and:f", "", "" }
o 0 { "b", "", ":and:f" }

Parameters:
input - The character sequence to be split
limit - The result threshold, as described above
Returns:
The array of strings computed by splitting the input around matches of this pattern

split

public java.lang.String[] split(java.lang.CharSequence input)
Splits the given input sequence around matches of this pattern.

This method works as if by invoking the two-argument split method with the given input sequence and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.

The input "boo:and:foo", for example, yields the following results with these expressions:

Regex Result
: { "boo", "and", "foo" }
o { "b", "", ":and:f" }

Parameters:
input - The character sequence to be split
Returns:
The array of strings computed by splitting the input around matches of this pattern

quote

public static java.lang.String quote(java.lang.String s)
Returns a literal pattern String for the specified String.

This method produces a String that can be used to create a Pattern that would match the string s as if it were a literal pattern.

Metacharacters or escape sequences in the input sequence will be given no special meaning.
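
A small sketch of the difference quoting makes, using java.util.regex.Pattern.quote as a stand-in. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo using java.util.regex as a stand-in.
final class QuoteSketch {

    /** Matches the input against the literal text, with metacharacters disabled. */
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // Unquoted, "1+1=2" means "one or more 1s, then 1=2", so it matches "11=2".
        System.out.println(Pattern.matches("1+1=2", "11=2"));
        // Quoted, only the exact text "1+1=2" matches.
        System.out.println(matchesLiterally("1+1=2", "1+1=2"));
        System.out.println(matchesLiterally("1+1=2", "11=2"));
    }
}
```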

Parameters:
s - The string to be literalized
Returns:
A literal string replacement

has

public boolean has(PatternFlag flag)
Parameters:
flag -
Returns:
Since:
0.2

naturalCompareTo

public static int naturalCompareTo(java.lang.CharSequence value1,
                                   java.lang.CharSequence value2)
Compares two character sequences lexicographically, except that embedded numbers are treated numerically.

For example, when using this method, "1.2.9.1" is less than "1.2.10.5", whereas a lexicographical comparison would yield the opposite.

When comparing, leading zeros are ignored, unless the input sequences are otherwise equivalent. If the two inputs are identical, then 0 is returned. Otherwise, the left-most number where the number of leading zeros differs is used to determine the ordering. In this case, the one with more leading zeros is first.

For example, the below list is sorted in increasing order:

  1. 2009-1-2
  2. 2009-01-05
  3. 2009-01-5
  4. 2009-1-05
  5. 2009-1-5

This function can be used to compare versions, dates, and other numeric based data. Since the comparison is done from left to right, the format must have the most significant part first. For example, in a date format, that would be year, month, and then day to sort in chronological order.

Note that for correct sorting of numeric based data, the formats must be identical. Otherwise, where the formats differ, the sorting is based on the ASCII value of the characters that differ. For example, the date "2009-1-5" is less than "2009.1.2", but is not chronologically before it. This result is due to the ASCII value for '-' ('\u002D') being less than the ASCII value for '.' ('\u002E').

This method can be called in the compare function of a Comparator object to provide sorting.

 Comparator<String> comparator = new Comparator<String>() {
 
   public int compare(String o1, String o2) {
     return naturalCompareTo(o1, o2);
   }
 };
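
The ordering rules described above can be sketched as follows. This is an independent illustration written from the description, not the library's code: the class and method names are hypothetical, and only the ASCII digits 0-9 are treated as numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the ordering rules described in this section.
final class NaturalCompareSketch {

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Returns -1, 0, or 1 following the natural ordering rules described above. */
    static int naturalCompare(CharSequence a, CharSequence b) {
        int i = 0, j = 0;
        int zeroTieBreak = 0; // set by the left-most number whose leading zeros differ
        while (i < a.length() && j < b.length()) {
            char ca = a.charAt(i), cb = b.charAt(j);
            if (isDigit(ca) && isDigit(cb)) {
                int startA = i, startB = j;
                while (i < a.length() && a.charAt(i) == '0') i++;
                while (j < b.length() && b.charAt(j) == '0') j++;
                int zerosA = i - startA, zerosB = j - startB;
                int digitsA = i, digitsB = j;
                while (i < a.length() && isDigit(a.charAt(i))) i++;
                while (j < b.length() && isDigit(b.charAt(j))) j++;
                int lenA = i - digitsA, lenB = j - digitsB;
                // Without leading zeros, the longer digit run is the larger number.
                if (lenA != lenB) return lenA < lenB ? -1 : 1;
                for (int k = 0; k < lenA; k++) {
                    char da = a.charAt(digitsA + k), db = b.charAt(digitsB + k);
                    if (da != db) return da < db ? -1 : 1;
                }
                // Equal values: remember the first difference in leading zeros.
                if (zeroTieBreak == 0 && zerosA != zerosB) {
                    zeroTieBreak = zerosA > zerosB ? -1 : 1; // more zeros sorts first
                }
            } else {
                if (ca != cb) return ca < cb ? -1 : 1;
                i++;
                j++;
            }
        }
        if (i < a.length()) return 1;  // a has extra trailing characters
        if (j < b.length()) return -1; // b has extra trailing characters
        return zeroTieBreak;
    }

    public static void main(String[] args) {
        List<String> dates = new ArrayList<>(Arrays.asList(
                "2009-1-5", "2009-01-5", "2009-1-2", "2009-1-05", "2009-01-05"));
        dates.sort(NaturalCompareSketch::naturalCompare);
        System.out.println(dates);
    }
}
```

Under these rules, sorting the five dates in main reproduces the increasing order listed above.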

Parameters:
value1 - the first character sequence
value2 - the second character sequence
Returns:
0 if the two values are equal, -1 if the first value is "less than" the second, and 1 if the first value is "greater than" the second.
See Also:
String.compareTo(String)

isCase

public boolean isCase(java.lang.Object switchValue)
'Case' implementation for this class, which allows testing a String against a number of regular expressions (in Groovy only). For example:
switch( str ) {
 case +/one/ :
 // the regex 'one' matches the value of str
 }
 

Parameters:
switchValue - the switch value
Returns:
true if the switchValue is deemed to match this Pattern
Since:
0.2

bitwiseNegate

public java.util.regex.Pattern bitwiseNegate()
Alias for getInternalPattern().

Returns:
the regular expression pattern
Since:
0.2

positive

public Pattern positive()
Returns this Pattern.

Added for consistency for use in Groovy, since both +charSequence and +javaPattern are also supported. This method ensures that the 'positive' operator will return a RegExPlus Pattern, for all three cases:

  1. Compiling a CharSequence regex: +charSequence
  2. Promoting a Java Pattern: +javaPattern
  3. When used on an existing RegExPlus Pattern: +regexplusPattern

Returns:
this Pattern.

or

public Pattern or(java.lang.CharSequence regex)
Parameters:
regex -
Returns:
Since:
0.2

or

public Pattern or(Pattern pattern)
Parameters:
pattern -
Returns:
Since:
0.2

or

public Pattern or(java.util.regex.Pattern pattern)
Parameters:
pattern -
Returns:
Since:
0.2

or

public Pattern or(PatternFlag flag)

or

public Pattern or(java.util.Set<PatternFlag> flags)

or

public Pattern or(int flags)

plus

public Pattern plus(java.lang.CharSequence regex)
Parameters:
regex -
Returns:
Since:
0.2

plus

public Pattern plus(Pattern pattern)
Parameters:
pattern -
Returns:
Since:
0.2

plus

public Pattern plus(java.util.regex.Pattern pattern)
Parameters:
pattern -
Returns:
Since:
0.2

range

public static java.lang.String range(int start,
                                     int end,
                                     java.lang.String mode)
Returns a regular expression that matches the specified numeric range. The returned expression is wrapped in a non-capture group to allow easy integration.

The mode parameter has the same form as the leading part of a numeric range. The return from range(start, end, mode) is equivalent to the internal representation of a numeric range.

Format for mode parameter: Mode[Base[BaseMode]]

Descriptions and valid values:

Parameters:
start - the start of the range
end - the end of the range
mode - a string in the format described above that specifies the mode for the numeric range
Returns:
a regular expression that matches the specified numeric range, wrapped in a non-capture group for easy integration
Throws:
java.lang.IllegalArgumentException - If mode is not in the correct form, as described above

range

public static java.lang.String range(java.lang.String start,
                                     java.lang.String end,
                                     java.lang.String mode)
Returns a regular expression that matches the specified numeric range. The returned expression is wrapped in a non-capture group to allow easy integration.

The mode parameter has the same form as the leading part of a numeric range. The return from range(start, end, mode) is equivalent to the internal representation of a numeric range.

Format for mode parameter: Mode[Base[BaseMode]]

Descriptions and valid values:

Parameters:
start - the start of the range
end - the end of the range
mode - a string in the format described above that specifies the mode for the numeric range
Returns:
a regular expression that matches the specified numeric range, wrapped in a non-capture group for easy integration
Throws:
java.lang.NullPointerException - If either start or end is null
java.lang.IllegalArgumentException - If either start or end is the empty string or contains invalid digits for the specified base; also thrown if mode is not in the correct form, as described above