RegExPlus

Unicode

RegExPlus supports the same Unicode syntax as Java regular expressions and adds syntax found in other regex engines.

\xhh

Matches the character with hexadecimal value 0xhh

Note: if the first hexadecimal digit is zero, it can be omitted. For example, \x5 and \x05 are equivalent.
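As a rough illustration of the two-digit form, the sketch below uses the standard java.util.regex engine rather than RegExPlus itself, since the RegExPlus API is not shown in this section. The class and method names (HexEscapeDemo, matches, compiles) are hypothetical. The stock engine is expected to require exactly two digits, so the one-digit shorthand from the note above would be the part RegExPlus adds.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class HexEscapeDemo {
    // True if the whole input matches the regex (standard java.util.regex engine).
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if the standard engine accepts the pattern at all.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two-digit form: 0x41 is the letter A.
        System.out.println(matches("\\x41", "A"));
        // Two-digit form with a leading zero: 0x05 is a control character.
        System.out.println(matches("\\x05", String.valueOf((char) 0x05)));
        // One-digit form: rejected by the standard engine, accepted by RegExPlus per the note.
        System.out.println(compiles("\\x5"));
    }
}
```

Under RegExPlus, the same one-digit pattern should compile and behave like the two-digit one, going by the note above.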

\x{hhh...}

Matches the character with hexadecimal value 0xhhh...

This syntax covers every character you could write with the \xFF or \uFFFF syntax, and it also reaches supplementary characters (code points above U+FFFF), e.g. \x{10000}.
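A minimal sketch of matching a supplementary character with the braced form. It assumes the standard java.util.regex engine, which accepts the same braced syntax, and the names BracedHexDemo and codePoint are hypothetical helpers. It also shows why the braced form is needed: U+10000 takes two UTF-16 code units in a Java string, so a four-digit escape cannot name it.

```java
import java.util.regex.Pattern;

class BracedHexDemo {
    // Builds a string holding one code point; supplementary code points
    // are stored as a surrogate pair (two chars).
    static String codePoint(int cp) {
        return new String(Character.toChars(cp));
    }

    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String supplementary = codePoint(0x10000);
        // Two UTF-16 code units, but a single code point.
        System.out.println(supplementary.length());
        System.out.println(supplementary.codePointCount(0, supplementary.length()));
        // The braced escape matches the whole surrogate pair as one character.
        System.out.println(matches("\\x{10000}", supplementary));
        // The braced form also works for small values.
        System.out.println(matches("\\x{E0}", codePoint(0xE0)));
    }
}
```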

\uhhhh

Matches the character with hexadecimal value 0xhhhh

Note: you can omit any leading zeros, e.g. \uE0 and \u00E0 are both valid - they match à.
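The three escape forms are different spellings of the same character. The sketch below checks this for à (0xE0), assuming the standard java.util.regex engine and writing every escape in its full-length form, which both engines should accept. SameCharacterDemo and countMatching are hypothetical names.

```java
import java.util.regex.Pattern;

class SameCharacterDemo {
    // Counts how many of the given patterns match the whole input.
    static int countMatching(String input, String... regexes) {
        int count = 0;
        for (String regex : regexes) {
            if (Pattern.matches(regex, input)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The single character with value 0xE0 (a with grave accent).
        String aGrave = String.valueOf((char) 0xE0);
        // Two-digit hex, braced hex, and four-digit forms of the same value.
        System.out.println(countMatching(aGrave, "\\xE0", "\\x{E0}", "\\u00E0")); // prints 3
    }
}
```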

\X

\X matches a single grapheme. It is similar to the dot (.), except that \X always matches a newline character.

It's equivalent to the regex (?>\P{M}\p{M}*), that is, one character that is not a combining mark, followed by zero or more combining marks.


In Unicode, some characters can be encoded in multiple ways. For example, à and à look the same, but the first is a single character (U+00E0), while the second is the letter a (U+0061) followed by a combining grave accent (U+0300).


The regex ^\X$ would match both of the above representations, whereas ^.$ would only match the first.
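The difference can be checked with a small sketch. To avoid assuming anything about the RegExPlus API, it uses the standard java.util.regex engine and substitutes the equivalent expression (?>\P{M}\p{M}*) quoted above for \X. GraphemeDemo and its method names are hypothetical.

```java
import java.util.regex.Pattern;

class GraphemeDemo {
    // Stand-in for ^\X$: the equivalent expression from the text, anchored.
    static final Pattern GRAPHEME = Pattern.compile("^(?>\\P{M}\\p{M}*)$");
    // The plain dot, anchored the same way.
    static final Pattern ANY_CHAR = Pattern.compile("^.$");

    static boolean isSingleGrapheme(String s) {
        return GRAPHEME.matcher(s).find();
    }

    static boolean isSingleDotMatch(String s) {
        return ANY_CHAR.matcher(s).find();
    }

    public static void main(String[] args) {
        String precomposed = String.valueOf((char) 0xE0);   // one character
        String decomposed = "a" + (char) 0x0300;            // letter a + combining grave accent

        System.out.println(isSingleGrapheme(precomposed));  // true
        System.out.println(isSingleGrapheme(decomposed));   // true
        System.out.println(isSingleDotMatch(precomposed));  // true
        System.out.println(isSingleDotMatch(decomposed));   // false: two characters

        // Unlike the dot (in its default mode), the grapheme expression matches a newline.
        System.out.println(isSingleGrapheme("\n"));         // true
        System.out.println(isSingleDotMatch("\n"));         // false
    }
}
```

If both spellings should be treated as the same text outside of matching, normalizing the input first (for example with java.text.Normalizer) is a common complementary step.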