/*
Package regex implements regular expression search, using a custom non-backtracking engine with support for lookarounds and numeric ranges.

The engine relies completely on UTF-8 codepoints. As such, it is capable of matching characters
from other languages, emojis and symbols.

The full syntax is specified below.

# Syntax

Single characters:

	.           Match any character. Newline matching is dependent on the RE_SINGLE_LINE flag.
	[abc]       Character class - match a, b or c
	[a-z]       Character range - match any character from a to z
	[^abc]      Negated character class - match any character except a, b and c
	[^a-z]      Negated character range - match any character except those from a to z
	\[          Match a literal '['. Backslashes can escape any character with special meaning, including another backslash.
	\452        Match the character with the octal value 452 (up to 3 digits)
	\xFF        Match the character with the hex value FF (exactly 2 characters)
	\x{0000FF}  Match the character with the hex value 0000FF (exactly 6 characters)
	\n          Newline
	\a          Bell character
	\f          Form-feed character
	\r          Carriage return
	\t          Horizontal tab
	\v          Vertical tab

Perl classes:

	\d          Match any digit character ([0-9])
	\D          Match any non-digit character ([^0-9])
	\w          Match any word character ([a-zA-Z0-9_])
	\W          Match any non-word character ([^a-zA-Z0-9_])
	\s          Match any whitespace character ([ \t\n])
	\S          Match any non-whitespace character ([^ \t\n])

POSIX classes (inside normal character classes):

	[:digit:]   All digit characters ([0-9])
	[:upper:]   All upper-case letters ([A-Z])
	[:lower:]   All lower-case letters ([a-z])
	[:alpha:]   All letters ([a-zA-Z])
	[:alnum:]   All alphanumeric characters ([a-zA-Z0-9])
	[:xdigit:]  All hexadecimal characters ([a-fA-F0-9])
	[:blank:]   All blank characters ([ \t])
	[:space:]   All whitespace characters ([ \t\n\r\f\v])
	[:cntrl:]   All control characters ([\x00-\x1F\x7F])
	[:punct:]   All punctuation characters
	[:graph:]   All graphical characters ([\x21-\x7E])
	[:print:]   All graphical characters + space ([\x20-\x7E])
	[:word:]    All word characters (\w)
	[:ascii:]   All ASCII values ([\x00-\x7F])

Composition:

	def         Match d, followed by e, followed by f
	x|y         Match x or y (prefer longer one)
	xy|z        Match xy or z

Repetition (always greedy, preferring more):

	x*          Match x zero or more times
	x+          Match x one or more times
	x?          Match x zero or one time
	x{m,n}      Match x between m and n times (inclusive)
	x{m,}       Match x at least m times
	x{,n}       Match x between 0 and n times (inclusive)
	x{m}        Match x exactly m times

Grouping:

	(expr)      Create a capturing group. The contents of the group can be retrieved with [FindAllMatches]
	x(y|z)      Match x followed by y or z. Given a successful match, the contents of group 1 will include either y or z
	(?:expr)    Create a non-capturing group. The contents of the group aren't saved.
	x(?:y|z)    Match x followed by y or z. No groups are created.

Assertions:

	^           Match at the start of the input string. If RE_MULTILINE is enabled, it also matches at the start of every line.
	$           Match at the end of the input string. If RE_MULTILINE is enabled, it also matches at the end of every line.
	\A          Always match at the start of the string, regardless of RE_MULTILINE
	\z          Always match at the end of the string, regardless of RE_MULTILINE
	\b          Match at a word boundary (a word character followed by a non-word character, or vice-versa)
	\B          Match at a non-word boundary (a word character followed by a word character, or a non-word character followed by a non-word character)

Lookarounds:

	x(?=y)      Positive lookahead - Match x if followed by y
	x(?!y)      Negative lookahead - Match x if NOT followed by y
	(?<=x)y     Positive lookbehind - Match y if preceded by x
	(?<!x)y     Negative lookbehind - Match y if NOT preceded by x

Numeric ranges:

	<x-y>       Match any number from x to y (inclusive) (x and y must be positive numbers)

# Key Differences with regexp

The engine and the API differ from [regexp] in a number of ways, some of them very subtle.
The key differences are listed below.

1. Greediness:

This engine does not support non-greedy operators. All operators are always greedy in nature, and will try to match
as much as they can, while still allowing for a successful match.

For example, given the regex:

	y*y

The engine will match as many 'y's as it can, while still allowing the trailing 'y' to be matched.

Another, more subtle example is the following regex:

	x|xx

While the stdlib implementation (and most other engines) will prefer matching the first item of the alternation,
this engine will go for the longest possible match, regardless of the order of the alternation. Although this
strays from the convention, it results in a nice rule-of-thumb - the engine is ALWAYS greedy.

The stdlib implementation has a function [regexp.Regexp.Longest] which makes future searches prefer the longest match.
That is the default (and unchangeable) behavior in this engine.

2. Byte-slices and runes:

This engine does not support byte-slices. When a matching function receives a string, it converts it into a
rune-slice to iterate through it. While this has some space overhead, the convenience of built-in unicode
support made the tradeoff worth it.

3. Return values:

Rather than using primitives for return values, this engine defines two types that are used as return
values: a [Group] represents a capturing group, and a [Match] represents a list of groups.

[regexp] specifies a regular expression that gives a list of all the matching functions that it supports. The
equivalent expression for this engine is:

	Find(All)?(String)?(Submatch)?

[Reg.Find] returns the index of the leftmost match in the string.

If a function contains 'All' it returns all matches instead of just the leftmost one.

If a function contains 'String' it returns the matched text, rather than the indices.

If a function contains 'Submatch' it returns the match, including all submatches found by
capturing groups.

The term '0-group' is used to refer to the 0th capturing group of a match (which is the entire match).
Given the following regex:

	x(y)

and the input string:

	xyz

The 0th group would contain 'xy' and the 1st group would contain 'y'.

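As a rough illustration, the sketch below uses the standard library's [regexp] package rather than this engine,
since the exact signatures of this engine's functions are not shown here. It contrasts leftmost-first matching with
the longest-match behavior for x|xx, and prints the 0-group and 1st group for x(y) on the input xyz. The helper
names firstAndLongest and groups are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstAndLongest reports what the standard library matches for pattern in
// input by default (leftmost-first), and again after Longest has been called
// (leftmost-longest, which the text above describes as this engine's default).
func firstAndLongest(pattern, input string) (string, string) {
	re := regexp.MustCompile(pattern)
	first := re.FindString(input)
	re.Longest()
	longest := re.FindString(input)
	return first, longest
}

// groups returns the 0-group (the entire match) followed by the text of each
// capturing group, as reported by the standard library.
func groups(pattern, input string) []string {
	return regexp.MustCompile(pattern).FindStringSubmatch(input)
}

func main() {
	first, longest := firstAndLongest("x|xx", "xx")
	fmt.Printf("default: %q, longest: %q\n", first, longest) // default: "x", longest: "xx"

	fmt.Printf("%q\n", groups("x(y)", "xyz")) // ["xy" "y"]
}
```

With the standard library, the longer match is only chosen after Longest is called; the engine described here is
stated to behave that way without any opt-in.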
Any matching function without 'Submatch' in its name returns the 0-group.
*/
package regex