/*
Package regex implements regular expression search, using a custom non-backtracking engine with support for lookarounds and numeric ranges.

The engine operates entirely on Unicode code points, decoded from UTF-8 input. As such, it can match characters from other languages, as well as emoji and symbols.

The full syntax is specified below.

# Syntax

Single characters:

	.             Match any character. Newline matching is dependent on the RE_SINGLE_LINE flag.
	[abc]         Character class - match a, b or c
	[a-z]         Character range - match any character from a to z
	[^abc]        Negated character class - match any character except a, b and c
	[^a-z]        Negated character range - match any character except those from a to z
	\[            Match a literal '['. Backslashes can escape any character with special meaning, including another backslash.
	\452          Match the character with the octal value 452 (up to 3 digits)
	\xFF          Match the character with the hex value FF (exactly 2 hex digits)
	\x{0000FF}    Match the character with the hex value 0000FF (exactly 6 hex digits)
	\n            Newline
	\a            Bell character
	\f            Form-feed character
	\r            Carriage return
	\t            Horizontal tab
	\v            Vertical tab

Perl classes:

	\d            Match any digit character ([0-9])
	\D            Match any non-digit character ([^0-9])
	\w            Match any word character ([a-zA-Z0-9_])
	\W            Match any non-word character ([^a-zA-Z0-9_])
	\s            Match any whitespace character ([ \t\n])
	\S            Match any non-whitespace character ([^ \t\n])

POSIX classes (inside normal character classes):

	[:digit:]     All digit characters ([0-9])
	[:upper:]     All upper-case letters ([A-Z])
	[:lower:]     All lower-case letters ([a-z])
	[:alpha:]     All letters ([a-zA-Z])
	[:alnum:]     All alphanumeric characters ([a-zA-Z0-9])
	[:xdigit:]    All hexadecimal digits ([a-fA-F0-9])
	[:blank:]     All blank characters ([ \t])
	[:space:]     All whitespace characters ([ \t\n\r\f\v])
	[:cntrl:]     All control characters ([\x00-\x1F\x7F])
	[:punct:]     All punctuation characters
	[:graph:]     All graphical characters ([\x21-\x7E])
	[:print:]     All graphical characters + space ([\x20-\x7E])
	[:word:]      All word characters (\w)
	[:ascii:]     All ASCII values ([\x00-\x7F])

Composition:

	def           Match d, followed by e, followed by f
	x|y           Match x or y (prefer longer one)
	xy|z          Match xy or z

Repetition (always greedy, preferring more):

	x*            Match x zero or more times
	x+            Match x one or more times
	x?            Match x zero or one time
	x{m,n}        Match x between m and n times (inclusive)
	x{m,}         Match x at least m times
	x{,n}         Match x between 0 and n times (inclusive)
	x{m}          Match x exactly m times

Grouping:

	(expr)        Create a capturing group. The contents of the group can be retrieved with [FindAllMatches]
	x(y|z)        Match x followed by y or z. Given a successful match, the contents of group 1 will include either y or z
	(?:expr)      Create a non-capturing group. The contents of the group aren't saved.
	x(?:y|z)      Match x followed by y or z. No groups are created.

Assertions:

	^             Match at the start of the input string. If RE_MULTILINE is enabled, it also matches at the start of every line.
	$             Match at the end of the input string. If RE_MULTILINE is enabled, it also matches at the end of every line.
	\A            Always match at the start of the string, regardless of RE_MULTILINE
	\z            Always match at the end of the string, regardless of RE_MULTILINE
	\b            Match at a word boundary (a word character followed by a non-word character, or vice-versa)
	\B            Match at a non-word boundary (a word character followed by a word character, or a non-word character followed by a non-word character)

Lookarounds:

	x(?=y)        Positive lookahead - Match x if followed by y
	x(?!y)        Negative lookahead - Match x if NOT followed by y
	(?<=x)y       Positive lookbehind - Match y if preceded by x
	(?<!x)y       Negative lookbehind - Match y if NOT preceded by x

Numeric ranges:

	<x-y>         Match any number from x to y (inclusive) (x and y must be positive numbers)

# Key Differences with regexp

The engine and the API differ from [regexp] in a number of ways, some of them very subtle.
The key differences are mentioned below.

1. Greediness:

This engine does not support non-greedy operators. All operators are always greedy in nature, and will try to match as much as they can, while still allowing for a successful match.
For example, given the regex:

	y*y

The engine will match as many 'y's as it can, while still allowing the trailing 'y' to be matched.

Another, more subtle example is the following regex:

	x|xx

While the stdlib implementation (and most other engines) will prefer matching the first item of the alternation,
this engine will _always_ go for the longest possible match, regardless of the order of the alternation.

2. Byte-slices and runes:

This engine does not support byte-slices. When a matching function receives a string, it converts it into a
rune-slice to iterate through it. While this has some space overhead, the convenience of built-in Unicode support
made the tradeoff worth it.

3. Return values:

Rather than using primitives for return values, this engine defines two types that are used as return values:
a [Group] represents a capturing group, and a [Match] represents a list of groups.

[regexp] specifies a regular expression that gives a list of all the matching functions that it supports.
The equivalent expression for this engine is:

	Find(All)?(String)?(Submatch)?

[Reg.Find] returns the index of the leftmost match in the string.

If a function contains 'All' it returns all matches instead of just the leftmost one.

If a function contains 'String' it returns the matched text, rather than the indices.

If a function contains 'Submatch' it returns the match, including all submatches found by capturing groups.

The term '0-group' is used to refer to the 0th capturing group of a match (which is the entire match).
Given the following regex:

	x(y)

and the input string:

	xyz

The 0th group would contain 'xy' and the 1st group would contain 'y'. Any matching function without 'Submatch' in its name
returns the 0-group.
*/
package regex