/*
Package regex implements regular expression search, using a custom non-backtracking engine with support for lookarounds and numeric ranges.

The engine operates entirely on Unicode code points rather than bytes. As such, it is capable of matching characters
from other languages, emojis and symbols.

The full syntax is specified below.

# Syntax

Single characters:

	.           Match any character. Newline matching is dependent on the RE_SINGLE_LINE flag.
	[abc]       Character class - match a, b or c
	[a-z]       Character range - match any character from a to z
	[^abc]      Negated character class - match any character except a, b and c
	[^a-z]      Negated character range - match any character outside the range a to z
	\[          Match a literal '['. Backslashes can escape any character with special meaning, including another backslash.
	\452        Match the character with the octal value 452 (up to 3 digits)
	\xFF        Match the character with the hex value FF (exactly 2 hex digits)
	\x{0000FF}  Match the character with the hex value 0000FF (exactly 6 hex digits)
	\n          Newline
	\a          Bell character
	\f          Form-feed character
	\r          Carriage return
	\t          Horizontal tab
	\v          Vertical tab

Perl classes:

	\d  Match any digit character ([0-9])
	\D  Match any non-digit character ([^0-9])
	\w  Match any word character ([a-zA-Z0-9_])
	\W  Match any non-word character ([^a-zA-Z0-9_])
	\s  Match any whitespace character ([ \t\n])
	\S  Match any non-whitespace character ([^ \t\n])

POSIX classes (inside normal character classes):

	[:digit:]   All digit characters ([0-9])
	[:upper:]   All upper-case letters ([A-Z])
	[:lower:]   All lower-case letters ([a-z])
	[:alpha:]   All letters ([a-zA-Z])
	[:alnum:]   All alphanumeric characters ([a-zA-Z0-9])
	[:xdigit:]  All hexadecimal characters ([a-fA-F0-9])
	[:blank:]   All blank characters ([ \t])
	[:space:]   All whitespace characters ([ \t\n\r\f\v])
	[:cntrl:]   All control characters ([\x00-\x1F\x7F])
	[:punct:]   All punctuation characters
	[:graph:]   All graphical characters ([\x21-\x7E])
	[:print:]   All graphical characters + space ([\x20-\x7E])
	[:word:]    All word characters (\w)
	[:ascii:]   All ASCII values ([\x00-\x7F])

Composition:

	def   Match d, followed by e, followed by f
	x|y   Match x or y (prefer longer one)
	xy|z  Match xy or z

Repetition (always greedy, preferring more):

	x*      Match x zero or more times
	x+      Match x one or more times
	x?      Match x zero or one time
	x{m,n}  Match x between m and n times (inclusive)
	x{m,}   Match x at least m times
	x{,n}   Match x between 0 and n times (inclusive)
	x{m}    Match x exactly m times

Grouping:

	(expr)    Create a capturing group. The contents of the group can be retrieved with [FindAllMatches]
	x(y|z)    Match x followed by y or z. Given a successful match, group 1 will contain either y or z
	(?:expr)  Create a non-capturing group. The contents of the group aren't saved.
	x(?:y|z)  Match x followed by y or z. No groups are created.

Assertions:

	^   Match at the start of the input string. If RE_MULTILINE is enabled, it also matches at the start of every line.
	$   Match at the end of the input string. If RE_MULTILINE is enabled, it also matches at the end of every line.
	\A  Always match at the start of the string, regardless of RE_MULTILINE
	\z  Always match at the end of the string, regardless of RE_MULTILINE
	\b  Match at a word boundary (a word character followed by a non-word character, or vice-versa)
	\B  Match at a position that is not a word boundary (between two word characters, or between two non-word characters)

Lookarounds:

	x(?=y)   Positive lookahead - Match x if followed by y
	x(?!y)   Negative lookahead - Match x if NOT followed by y
	(?<=x)y  Positive lookbehind - Match y if preceded by x
	(?<!x)y  Negative lookbehind - Match y if NOT preceded by x
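
For contrast, the standard library's [regexp] package rejects all four lookaround forms when compiling a pattern. The check below uses only the standard library; stdlibAccepts is a helper name made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// stdlibAccepts reports whether the standard library's regexp package
// can compile the given pattern. (Hypothetical helper, for illustration only.)
func stdlibAccepts(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// The four lookaround forms from the table above, plus a non-capturing
	// group, which the standard library does support.
	patterns := []string{`x(?=y)`, `x(?!y)`, `(?<=x)y`, `(?<!x)y`, `x(?:y|z)`}
	for _, p := range patterns {
		fmt.Printf("%-10s accepted by regexp: %v\n", p, stdlibAccepts(p))
	}
}
```

Only the last pattern compiles with the standard library.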

Numeric ranges:

	<x-y>  Match any number from x to y, inclusive (x and y must be positive numbers)
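
To illustrate the intended meaning of a numeric range such as <10-25>, the sketch below emulates it with the standard library: it finds every run of digits and keeps those whose value lies within the bounds. The name numbersInRange is hypothetical and is not part of this package. How this engine treats digits adjacent to a match is not specified here, so the sketch simply works on whole digit runs.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// digits matches a maximal run of ASCII digits.
var digits = regexp.MustCompile(`[0-9]+`)

// numbersInRange returns every maximal run of digits in s whose numeric
// value lies between lo and hi (inclusive). Hypothetical helper.
func numbersInRange(s string, lo, hi int) []string {
	var out []string
	for _, tok := range digits.FindAllString(s, -1) {
		n, err := strconv.Atoi(tok)
		if err != nil {
			continue // value does not fit in an int, so it is treated as out of range
		}
		if n >= lo && n <= hi {
			out = append(out, tok)
		}
	}
	return out
}

func main() {
	fmt.Println(numbersInRange("ports 8, 12, 25 and 300", 10, 25)) // [12 25]
}
```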

# Key Differences with regexp

The engine and the API differ from [regexp] in a number of ways, some of them very subtle.
The key differences are described below.

1. Greediness:

This engine does not support non-greedy operators. All operators are always greedy in nature, and will try
to match as much as they can, while still allowing for a successful match. For example, given the regex:

	y*y

The engine will match as many 'y's as it can, while still allowing the trailing 'y' to be matched.

Another, more subtle example is the following regex:

	x|xx

While the stdlib implementation (and most other engines) will prefer matching the first item of the alternation,
this engine will _always_ go for the longest possible match, regardless of the order of the alternation.
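
The standard library can reproduce both behaviours, which makes the difference easy to observe. The sketch below uses only [regexp]: a freshly compiled expression prefers the first alternative, while calling Longest on it switches to leftmost-longest matching, the behaviour described above for this engine. The helper names are made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// leftmostFirst matches the way the standard library does by default:
// the first alternative that succeeds wins.
func leftmostFirst(pattern, input string) string {
	return regexp.MustCompile(pattern).FindString(input)
}

// leftmostLongest asks the standard library to prefer the longest match
// starting at the leftmost position.
func leftmostLongest(pattern, input string) string {
	re := regexp.MustCompile(pattern)
	re.Longest()
	return re.FindString(input)
}

func main() {
	fmt.Printf("%q\n", leftmostFirst(`x|xx`, "xx"))   // "x"
	fmt.Printf("%q\n", leftmostLongest(`x|xx`, "xx")) // "xx"
	fmt.Printf("%q\n", leftmostFirst(`y*y`, "yyyy"))  // "yyyy"
}
```

The greedy y*y example gives the same result in both modes; only the alternation is affected.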

2. Byte-slices and runes:

This engine does not support byte-slices. When a matching function receives a string, it converts it into a
rune-slice to iterate through it. While this has some space overhead, the convenience of built-in Unicode
support made the tradeoff worth it.
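
The trade-off can be seen with plain Go, independent of this package. A string is a sequence of UTF-8 bytes, whereas a rune-slice holds one 4-byte element per code point. The helper name counts is made up for this illustration.

```go
package main

import "fmt"

// counts returns the length of s in bytes and in runes (code points).
func counts(s string) (byteLen, runeLen int) {
	return len(s), len([]rune(s))
}

func main() {
	// "h", e-acute, "llo", a space, and the waving-hand emoji.
	s := "h\u00e9llo \U0001F44B"
	b, r := counts(s)
	fmt.Println(b, r) // 11 7

	// Indexing a rune-slice always yields a whole character;
	// indexing the string yields a single byte.
	fmt.Printf("%c\n", []rune(s)[1]) // prints the e-acute character
	fmt.Println(r * 4)               // 28 bytes of rune storage for 11 bytes of UTF-8
}
```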

3. Return values:

Rather than using primitives for return values, this engine defines two types that are used as return
values: a [Group] represents a capturing group, and a [Match] represents a list of groups.

[regexp] uses a regular expression to list the names of all the matching functions that it supports. The
equivalent expression for this engine is:

	Find(All)?(String)?(Submatch)?

[Reg.Find] returns the index of the leftmost match in the string.

If a function contains 'All', it returns all matches instead of just the leftmost one.

If a function contains 'String', it returns the matched text, rather than the indices.

If a function contains 'Submatch', it returns the match, including all submatches found by
capturing groups.
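
The naming expression above expands to eight names. The sketch below enumerates them and checks each one against the expression using the standard library. It only shows which names the expression describes; whether every one of them is exported should be checked against the package's API listing.

```go
package main

import (
	"fmt"
	"regexp"
)

// namePattern is the naming expression from the documentation, anchored
// so that it must match a whole name.
var namePattern = regexp.MustCompile(`^Find(All)?(String)?(Submatch)?$`)

// matchingFunctionNames expands the expression into every name it describes.
func matchingFunctionNames() []string {
	var names []string
	for _, all := range []string{"", "All"} {
		for _, str := range []string{"", "String"} {
			for _, sub := range []string{"", "Submatch"} {
				names = append(names, "Find"+all+str+sub)
			}
		}
	}
	return names
}

func main() {
	for _, name := range matchingFunctionNames() {
		fmt.Println(name, namePattern.MatchString(name))
	}
}
```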

The term '0-group' is used to refer to the 0th capturing group of a match (which is the entire match).
Given the following regex:

	x(y)

and the input string:

	xyz

The 0th group would contain 'xy' and the 1st group would contain 'y'. Any matching function without 'Submatch' in its name
returns the 0-group.
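
The same example can be traced with the standard library. In the sketch below, Group and Match are hypothetical stand-ins that mirror the description above (a group as a pair of indices, a match as a list of groups). Their field and method names are assumptions and may not correspond to this package's actual definitions. The indices come from [regexp] and are byte offsets.

```go
package main

import (
	"fmt"
	"regexp"
)

// Group is a hypothetical stand-in for a capturing group: a half-open
// index range [Start, End) into the input.
type Group struct {
	Start, End int
}

// Match is a hypothetical stand-in for a match: group 0 is the entire
// match, and group i is the i-th capturing group.
type Match []Group

// Text returns the part of input covered by the group, or "" if the
// group did not participate in the match.
func (g Group) Text(input string) string {
	if g.Start < 0 {
		return ""
	}
	return input[g.Start:g.End]
}

// firstMatch builds a Match from the standard library's submatch indices.
func firstMatch(pattern, input string) Match {
	loc := regexp.MustCompile(pattern).FindStringSubmatchIndex(input)
	if loc == nil {
		return nil
	}
	m := make(Match, 0, len(loc)/2)
	for i := 0; i+1 < len(loc); i += 2 {
		m = append(m, Group{Start: loc[i], End: loc[i+1]})
	}
	return m
}

func main() {
	input := "xyz"
	for i, g := range firstMatch(`x(y)`, input) {
		fmt.Printf("group %d: [%d,%d) %q\n", i, g.Start, g.End, g.Text(input))
	}
	// group 0: [0,2) "xy"
	// group 1: [1,2) "y"
}
```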
*/
package regex