kleingrep/regex/doc.go

/*
Package regex implements regular expression search, using a custom non-backtracking engine with support for lookarounds and numeric ranges.

The engine operates entirely on Unicode code points (runes) rather than bytes. As such, it can match
non-English characters, emoji and other symbols.

The API and regex syntax are largely compatible with those of the standard library's [regexp], with a few key differences (see 'Key Differences with regexp').

The full syntax is specified below.

# Syntax

Single characters:

	.				Match any character. Newline matching is dependent on the RE_SINGLE_LINE flag.
	[abc]			Character class - match a, b or c
	[a-z]			Character range - match any character from a to z
	[^abc]			Negated character class - match any character except a, b and c
	[^a-z]			Negated character range - match any character outside the range a to z
	\[				Match a literal '['. Backslashes can escape any character with special meaning, including another backslash.
	\452			Match the character with the octal value 452 (up to 3 digits)
	\xFF			Match the character with the hex value FF (exactly 2 characters)
	\x{0000FF}		Match the character with the hex value 0000FF (exactly 6 characters)
	\n				Newline
	\a				Bell character
	\f				Form-feed character
	\r				Carriage return
	\t				Horizontal tab
	\v				Vertical tab

Perl classes:

	\d				Match any digit character ([0-9])
	\D				Match any non-digit character ([^0-9])
	\w				Match any word character ([a-zA-Z0-9_])
	\W				Match any non-word character ([^a-zA-Z0-9_])
	\s				Match any whitespace character ([ \t\n])
	\S				Match any non-whitespace character ([^ \t\n])

POSIX classes (inside normal character classes):

	[:digit:]		All digit characters ([0-9])
	[:upper:]		All upper-case letters ([A-Z])
	[:lower:]		All lower-case letters ([a-z])
	[:alpha:]		All letters ([a-zA-Z])
	[:alnum:]		All alphanumeric characters ([a-zA-Z0-9])
	[:xdigit:]		All hexadecimal characters ([a-fA-F0-9])
	[:blank:]		All blank characters ([ \t])
	[:space:]		All whitespace characters ([ \t\n\r\f\v])
	[:cntrl:]		All control characters ([\x00-\x1F\x7F])
	[:punct:]		All punctuation characters
	[:graph:]		All graphical characters ([\x21-\x7E])
	[:print:]		All graphical characters + space ([\x20-\x7E])
	[:word:]		All word characters (\w)
	[:ascii:]		All ASCII values ([\x00-\x7F])

Composition:

	def				Match d, followed by e, followed by f
	x|y				Match x or y (prefer x)
	xy|z			Match xy or z (prefer xy)

Repetition (always greedy, preferring more):

	x*				Match x zero or more times
	x+				Match x one or more times
	x?				Match x zero or one time
	x{m,n}			Match x between m and n times (inclusive)
	x{m,}			Match x at least m times
	x{,n}			Match x between 0 and n times (inclusive)
	x{m}			Match x exactly m times

Grouping:

	(expr)			Create a capturing group. The contents of the group can be retrieved with [FindAllMatches]
	x(y|z)			Match x followed by y or z. Given a successful match, the contents of group 1 will include either y or z
	(?:expr)		Create a non-capturing group. The contents of the group aren't saved.
	x(?:y|z)		Match x followed by y or z. No groups are created.

Assertions:

	^				Match at the start of the input string. If RE_MULTILINE is enabled, it also matches at the start of every line.
	$				Match at the end of the input string. If RE_MULTILINE is enabled, it also matches at the end of every line.
	\A				Always match at the start of the string, regardless of RE_MULTILINE
	\z				Always match at the end of the string, regardless of RE_MULTILINE
	\b				Match at a word boundary (a word character followed by a non-word character, or vice-versa)
	\B				Match at a non-word boundary (a word character followed by a word character, or a non-word character followed by a non-word character)

Lookarounds:

	x(?=y)			Positive lookahead - Match x if followed by y
	x(?!y)			Negative lookahead - Match x if NOT followed by y
	(?<=x)y			Positive lookbehind - Match y if preceded by x
	(?<!x)y			Negative lookbehind - Match y if NOT preceded by x

Numeric ranges:

	<x-y>			Match any number from x to y (inclusive) (x and y must be positive numbers)
	\<x				Match a literal '<' followed by x
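
For comparison, the sketch below shows what the range written here as <0-255> looks like
when spelled out by hand in [regexp] syntax. It uses only the standard library, not this package.
The names byteRange and agrees are made up for the example, and the anchoring and the rejection
of leading zeros are assumptions of the sketch, not documented behaviour of this engine.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// byteRange is a hand-written stdlib pattern for the decimal numbers 0 to 255
// without leading zeros. It is anchored, so it must cover the whole input.
var byteRange = regexp.MustCompile(`^(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])$`)

// agrees reports whether byteRange accepts exactly the integers 0..255
// among the integers 0..limit.
func agrees(limit int) bool {
	for n := 0; n <= limit; n++ {
		want := n <= 255
		if byteRange.MatchString(strconv.Itoa(n)) != want {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(agrees(1000)) // true
}
```

The alternation has to be rebuilt for every new pair of bounds, which is the work a numeric range is meant to save.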

# Key Differences with regexp

The engine and the API differ from [regexp] in a few ways, some of them very subtle.
The key differences are mentioned below.

1. Greediness:

This engine currently does not support non-greedy operators.
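
To make the distinction concrete, the sketch below uses the standard library's [regexp] (not this
package) to show a greedy match next to a non-greedy one. The helper firstMatch is a name made up
for the example. Only the greedy form is available in this engine.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatch returns the text of the leftmost match of pattern in input,
// using the standard library's regexp package.
func firstMatch(pattern, input string) string {
	return regexp.MustCompile(pattern).FindString(input)
}

func main() {
	// Greedy: .* takes as much as it can, so the match runs to the last b.
	fmt.Println(firstMatch(`a.*b`, "aXbXb")) // aXbXb

	// Non-greedy: .*? takes as little as it can, so the match stops at the first b.
	fmt.Println(firstMatch(`a.*?b`, "aXbXb")) // aXb
}
```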

2. Byte-slices and runes:

My engine does not support byte-slices. When a matching function receives a string, it converts it into a
rune-slice to iterate through it. While this has some space overhead, the convenience of built-in Unicode
support made the tradeoff worth it.
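
As a rough illustration of that conversion (a sketch, not the engine's own code), the example below
shows how the byte length and rune count of a string diverge once it contains non-ASCII characters.
The helper runeIndexToByteOffset is hypothetical. It is included only to show how a position in the
rune-slice relates to a position in the original string.

```go
package main

import "fmt"

// runeIndexToByteOffset converts an index into []rune(s) to the corresponding
// byte offset in s. An index equal to the rune count maps to len(s).
// It returns -1 if the index is out of range.
func runeIndexToByteOffset(s string, runeIdx int) int {
	if runeIdx < 0 {
		return -1
	}
	count := 0
	// Ranging over a string yields the byte offset at which each rune starts.
	for byteOff := range s {
		if count == runeIdx {
			return byteOff
		}
		count++
	}
	if count == runeIdx {
		return len(s) // one past the last rune
	}
	return -1
}

func main() {
	s := "héllo"
	runes := []rune(s)

	fmt.Println(len(s), len(runes))          // 6 5
	fmt.Println(runeIndexToByteOffset(s, 2)) // 3
}
```

Each rune occupies four bytes in the slice, whatever its encoded size in the string, which is where the space overhead comes from.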

3. Return values:

Rather than using primitives for return values, my engine defines two types that are used as return
values: a [Group] represents a capturing group, and a [Match] represents a list of groups.

[regexp] specifies a regular expression that gives a list of all the matching functions that it supports. The
equivalent expression for this engine is shown below. Unlike [regexp], there is no 'Index' suffix, because returning indices is the default.

	Find(All)?(String)?(Submatch)?

[Reg.Find] returns the index of the leftmost match in the string.

If a function contains 'All' it returns all matches instead of just the leftmost one.

If a function contains 'String' it returns the matched text, rather than the index in the string.

If a function contains 'Submatch' it returns the match, including all submatches found by
capturing groups.
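
The names described by the expression above can be listed mechanically. The sketch below only expands
the expression. The helper expandNames is made up for the example, and the output does not confirm
that every combination is actually exported by this package.

```go
package main

import "fmt"

// expandNames lists every name generated by the expression
// Find(All)?(String)?(Submatch)?, with each optional part absent or present.
func expandNames() []string {
	var names []string
	for _, all := range []string{"", "All"} {
		for _, str := range []string{"", "String"} {
			for _, sub := range []string{"", "Submatch"} {
				names = append(names, "Find"+all+str+sub)
			}
		}
	}
	return names
}

func main() {
	// Prints eight names, from Find through FindAllStringSubmatch.
	for _, name := range expandNames() {
		fmt.Println(name)
	}
}
```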

The term '0-group' is used to refer to the 0th capturing group of a match (which is the entire match).
Given the following regex:

	x(y)

and the input string:

	xyz

The 0th group would contain 'xy' and the 1st group would contain 'y'. Any matching function without 'Submatch' in its name
returns the 0-group.
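
The same numbering can be observed with the standard library's [regexp], which is used in the sketch
below because its group numbering follows the same convention. This package's own [Match] and [Group]
types are not shown, and the helper groups is a name made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// groups returns the text of every group for the leftmost match of pattern
// in input, using the standard library's regexp package.
// Element 0 is the entire match; element 1 is the first capturing group.
func groups(pattern, input string) []string {
	return regexp.MustCompile(pattern).FindStringSubmatch(input)
}

func main() {
	g := groups(`x(y)`, "xyz")
	fmt.Printf("0-group: %q, group 1: %q\n", g[0], g[1])
	// prints: 0-group: "xy", group 1: "y"
}
```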

# Feature Differences

The following features from [regexp] are (currently) NOT supported:
 1. Named capturing groups
 2. Non-greedy operators
 3. Unicode character classes
 4. Embedded flags (flags are instead passed as arguments to [Compile])
 5. Literal text with \Q ... \E

The following features are not available in [regexp], but are supported in my engine:
 1. Lookarounds
 2. Numeric ranges

I hope to shorten the first list, and expand the second.
*/
package regex