Commit Graph

36 Commits

Author SHA1 Message Date
a14ab81697 Updated function names, added new function 'FindString' that returns the _text_ of the match 2025-01-19 21:44:15 -06:00
1da3f7f0e0 Changed API for match-finding functions - take in a Reg instead of start state and numGroups separately 2025-01-06 20:14:19 -06:00
4373d35216 Wrote function to find the 'n'th match of a regex 2025-01-05 21:40:53 -06:00
71cab59a89 Got rid of unnecessary special case to match at end-of-string
Instead, I tweaked the rest of the matching function, so that a special
check isn't necessary. If we are trying to match at the end of a string,
we skip any of the actual matching and proceed straight to finding
0-length matches.

This change was made because, with the special case, capturing groups
weren't getting updated if we had an end-of-string match.
2024-12-12 14:49:45 -05:00
8c8e209587 Removed return values that weren't being used 2024-12-12 14:35:06 -05:00
437ca2ee57 Improved submatch tracking by storing all group indices as a part of the state, which is viewed as a 'thread' 2024-12-11 00:16:24 -05:00
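Note on 437ca2ee57: the idea of a 'thread' that carries its own group indices can be sketched as below. This is only an illustration, not the repository's code: the `group` and `thread` types, the `fork` method, and the use of -1 for "unset" are all assumptions made for the example.

```go
package main

import "fmt"

// group holds the start and end index of one capturing group.
// Using -1 for "not set yet" is an assumption of this sketch.
type group struct{ start, end int }

// thread pairs a (hypothetical) NFA state id with its own group indices,
// so that each path through the NFA tracks its submatches independently.
type thread struct {
	state  int
	groups []group
}

// fork returns a copy of t positioned at a new state. The groups slice is
// copied so that updates in one thread do not leak into another.
func (t thread) fork(state int) thread {
	g := make([]group, len(t.groups))
	copy(g, t.groups)
	return thread{state: state, groups: g}
}

func main() {
	a := thread{state: 0, groups: []group{{-1, -1}, {-1, -1}}}
	b := a.fork(1)
	b.groups[1] = group{2, 4} // only b records this capture
	fmt.Println(a.groups[1], b.groups[1])
}
```

Copying the slice on every fork costs memory proportional to the number of groups, but it keeps threads from aliasing each other's captures.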
00902944f6 Added code to match capturing groups and store into a Group (used to be MatchIndex) 2024-12-09 01:28:18 -05:00
cbd6ea136b If the NFA starts with an assertion, make sure it's true before doing anything else. Also, check for last-state _lookaround_ rather than just last state, before breaking (instead of aborting) when the assertion fails 2024-11-27 11:46:38 -05:00
0de3a94ce3 Fixed bug with lookaheads: f(?=f) would not match anything in 'ffa', because of the 'a' at the end of the string. Fixed by checking if there are other last states when an assertion fails, rather than immediately aborting 2024-11-24 15:04:51 -05:00
051a8551f3 Match zero-length match at end of string, even if the start node is an assertion (end of string, lookarounds, etc.) 2024-11-22 00:10:58 -05:00
2569f52552 Wrote toString function for MatchIndex 2024-11-20 01:04:31 -05:00
8a1f1dc621 Added unicode support
Replaced strings with rune-slices, which capture unicode codepoints more
accurately.
2024-11-18 10:41:50 -05:00
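Note on 8a1f1dc621: a small sketch of why rune-slices handle Unicode more accurately than indexing a string directly. In Go, indexing a string yields bytes, while a `[]rune` conversion yields code points. The helper name `runeAt` is made up for this example.

```go
package main

import "fmt"

// runeAt returns the i-th Unicode code point of s.
// Converting to []rune decodes the UTF-8 bytes into code points first.
func runeAt(s string, i int) rune {
	return []rune(s)[i]
}

func main() {
	s := "héllo"
	fmt.Println(len(s))               // length in bytes
	fmt.Println(len([]rune(s)))       // length in code points
	fmt.Println(string(runeAt(s, 1))) // the second code point, not the second byte
}
```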
137ea3c746 Made findAllMatchesHelper non-recursive, added pruneIndices (improved performance) and more changes
I made findAllMatchesHelper a non-recursive function. It now only
returns the first match it finds in the string (so I should probably
rename it).

These indices are collected by findAllMatches and pruned (to
remove overlaps). The overlap function has also been rewritten, to make
it (I believe) less than O(n^2). I also used the uniq_arr type to make
checking for uniqueness O(1) instead of O(n) (as it was with
unique_append()). This has resulted in massive performance gains.

There have been a lot of changes here, and I probably haven't documented
all of them.
2024-11-07 16:16:50 -05:00
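Note on 137ea3c746: the O(1) uniqueness check mentioned above could look roughly like the following, assuming the usual approach of pairing a slice with a map. This is not the actual `uniq_arr` type from the repository; `uniqSlice`, `newUniqSlice` and `add` are hypothetical names.

```go
package main

import "fmt"

// uniqSlice keeps insertion order in a slice and uses a map for
// membership checks, which are O(1) on average instead of an O(n) scan.
type uniqSlice[T comparable] struct {
	items []T
	seen  map[T]struct{}
}

func newUniqSlice[T comparable]() *uniqSlice[T] {
	return &uniqSlice[T]{seen: make(map[T]struct{})}
}

// add appends v only if it is not already present and reports
// whether the value was added.
func (u *uniqSlice[T]) add(v T) bool {
	if _, ok := u.seen[v]; ok {
		return false
	}
	u.seen[v] = struct{}{}
	u.items = append(u.items, v)
	return true
}

func main() {
	u := newUniqSlice[int]()
	for _, v := range []int{3, 1, 3, 2, 1} {
		u.add(v)
	}
	fmt.Println(u.items) // duplicates are dropped, order is preserved
}
```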
ea17251bf8 Might have made a change to improve performance 2024-11-04 08:42:26 -05:00
1d9d1a5b81 Fixed calculation of overlapping (used to check for subset instead) 2024-11-03 14:36:23 -05:00
dca81c1796 Replaced rune-slice parameters with string parameters in functions; avoids unnecessary conversion from strings to rune-slices 2024-11-01 01:53:50 -04:00
315f68df12 Fixed typo 2024-10-31 17:55:41 -04:00
360bdc8e11 Big rewrite - assertion handling, zero-match fixes, change in recursive calls
I added support for transitions. I wrote a function to determine if
a given state has transitions for a character at a given point in the
string. This helps me check if the current state has an assertion, and
take actions based on that.

I also fixed zero-length matching (almost, see todo.txt). It works for
nearly all cases I could think of, although I still need to write more
tests. I wrote a function to check if zero-length matches are possible
with a given state.

I also changed the way recursive calls work. Rather than passing a
modified string, the function stores the location in the input string.
This location is updated with each call to the function.

Finally, the function now increments the offset by 1 instead of
incrementing by the length of the longest match. This leads to a bit of
overhead, e.g. if a regex matches indices 1-5, then 1-5, 2-5, 3-5, 4-5 are
all stored. To fix this, I wrote (and used) a function to check if
a match overlaps with any matches in a slice.
2024-10-31 17:06:32 -04:00
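Note on 360bdc8e11: a minimal sketch of the overlap check described above, using the 1-5, 2-5, 3-5, 4-5 example. It assumes half-open [start, end) indices, which the commit message does not specify, and the names `span`, `overlaps`, `overlapsAny` and `prune` are invented for illustration.

```go
package main

import "fmt"

// span is a match range. Half-open [start, end) is an assumption here.
type span struct{ start, end int }

// overlaps reports whether a and b share at least one index.
// This is a true overlap test, not a subset test.
func overlaps(a, b span) bool {
	return a.start < b.end && b.start < a.end
}

// overlapsAny reports whether m overlaps any span in kept.
func overlapsAny(m span, kept []span) bool {
	for _, k := range kept {
		if overlaps(m, k) {
			return true
		}
	}
	return false
}

// prune keeps each candidate only if it does not overlap an earlier kept one.
func prune(candidates []span) []span {
	var kept []span
	for _, c := range candidates {
		if !overlapsAny(c, kept) {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	found := []span{{1, 5}, {2, 5}, {3, 5}, {4, 5}, {6, 8}}
	fmt.Println(prune(found)) // only {1 5} and {6 8} survive
}
```

This straightforward version compares each candidate against every kept span, so it is quadratic in the worst case; it is meant to show the check itself, not the faster rewrite mentioned in a later commit.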
8e8e9e133f Fixed matching greediness, e.g. a(a|b)*a would not match 'aaa' in 'aaab' 2024-10-29 20:07:30 -04:00
4f2f14212c Use contains function, since the content may have multiple characters 2024-10-28 17:37:55 -04:00
df6efcd1f0 Unique append to match indices (ensure match indices aren't repeated) 2024-10-28 15:44:37 -04:00
fe5c94b4df Use new unique append to check if unique states have been added to tempStates 2024-10-28 09:40:41 -04:00
13a57a4347 Stricter check for adding zero-length match at end of string 2024-10-28 00:58:10 -04:00
cda0dfb0cc Match empty string if start state is kleene star 2024-10-27 15:11:12 -04:00
95654e3e34 Take all possible 0-states (until no more left to take) before checking if we are in an acceptable position 2024-10-27 14:56:28 -04:00
c9fdf5aa6c Restored old behavior with end-of-string - new one didn't seem to work well 2024-10-27 11:18:42 -04:00
cd2b800b04 Fixed greediness of kleene star 2024-10-26 13:21:00 -04:00
139c88dd58 Started working on '+' operator 2024-10-24 14:39:28 -04:00
c894ee4c0d Renamed match function to 'findAllMatches', to better represent what it does 2024-10-24 12:31:37 -04:00
ce156c4405 Fixed kleene star matching at end of string - failed test a* and ppppppppaaaaaaaa 2024-10-23 14:42:35 -04:00
9d786997df Initial support for multiple matching 2024-10-23 11:18:45 -04:00
60b798d904 Working on multiple matching 2024-10-23 10:37:34 -04:00
11dd6aeb7c More Kleene star fixes 2024-10-23 10:26:50 -04:00
9d3bc2b804 Fixed kleene star behavior, which used to behave like a '+' 2024-10-23 08:51:49 -04:00
bc11777ad5 Fixed Kleene Star matching 2024-10-22 17:07:01 -04:00
d191686168 Rudimentary matching works 2024-10-22 16:25:49 -04:00