We have a few regex patterns which were working fine. But today suddenly a StackOverflowError was thrown when processing a (very) large string (2.5 A4 pages). I created a little program which is a simplification of the actual pattern, but with the same result.
This works fine and results in true as output. But when you add 1 iteration more to the loop you have a StackOverflowError.
I know I can solve this issue by adding the possessive quantifier to the * in the regex ("\\d+(,\\d+)*+"). But the 3 different quantifiers (greedy, reluctant and possessive) are a bit of a black box to me, so I have no clue why possessive quantifier solves the issue (and greedy or reluctant don't).
Roel De Nijs wrote:I know I can solve this issue by adding the possessive quantifier to the * in the regex ("\\d+(,\\d+)*+"). But the 3 different quantifiers (greedy, reluctant and possessive) are a bit of a black box to me, so I have no clue why possessive quantifier solves the issue (and greedy or reluctant don't).
I suspect because you've got nested qualifiers (Henry will correct me if I'm wrong ). This site is a really good reference for all things regex, and it suggests that you may have catastrophic backtracking. Possessive qualifiers short-circuit that.
Bats fly at night, 'cause they aren't we. And if we tried, we'd hit a tree -- Ogden Nash (or should've been).
Articles by Winston can be found here
That's a great resource. I'm pretty sure the nested qualifiers are the problem here, with possible catastrophic backtracking But i have no clue why the possessive qualifier short-circuit this one (and what's the difference with the greedy one), so time for some experimenting.
I downloaded RegexBuddy and gave it a try to see if there is a difference. First I tested the Matching CSV Records example, got the exact same output as mentioned on that website.
Now it's time to test my regex \d+(,\d+)* with input string 9,8,7,6,5,4,3,2,1. Result: match found in 26 steps with just 1 mention of "backtrack" (in the last step). When I do exactly the same with the complete string created by the program from the original post (so 1322,1321,1320,...,3,2,1,0) I get exactly the same results: match found in 3968 steps with just 1 mention of "backtrack" (in the last step).
When I add the possessive quantifier to my regex \d+(,\d+)*+ and execute the tests again I'll have exactly the same results as with the greedy one. So no difference here between greedy and possessive. That makes me even more
As an extra experiment I changed the regex to \d+(,\d+)*A. This change gives a huge difference in the results. With the greedy quantifier you need 5292 steps (and a lot of "backtrack"s), with the possessive one you just need 3969 steps.
Roel De Nijs wrote:But i have no clue why the possessive qualifier short-circuit this one (and what's the difference with the greedy one)...
Well, from what I understand (although I have to say that they're fairly new to me too), possessive qualifiers allow the engine to say "right, I'm done with that" once it finds a substring that can't possibly match.
But more than that...
I suspect you'll know more than me in a couple of weeks.
Today I played an hour with RegexBuddy and regular expressions which gave me better insights in this issue.
I think part of the problem is the (recursive) implementation of Pattern class. Because in my example JustATest the JVM throws a StackOverflowError with 1324 integers added to the string. This is not at all a problem for RegexBuddy. I also did some tests with my actual regular expression (the one which was causing the problem in the 1st place). So the JVM threw the StackOverflowError with a String of 2.5 A4 pages. In the file I needed to validate with this regular expression the biggest string was as much as 22.25 (!) A4 pages. Not a problem for RegexBuddy at all (although some calculation time was needed ). So if RegexBuddy can handle this string, why can the almighty JVM not handle it?
I think the need for backtracking is very well explained in the Matching CSV Records example. I made another little example to play with and finally it cleared my doubts over this issue and gave me a better understanding about how regex matching happens. The pattern \w+(,\w+)*a and string zz,ee,rr,tt,yy,uua,ii,oo,pp show why you have backtracking (and how it could result in a match). In short: first (because of the greedy quantifier) the whole string is used to find a match, but no match (zz,ee,rr,tt,yy,uua,ii,oo,pp doesn't end with an a). Then backtracking kicks in: 1st it gives up ,pp (still no match), then ,oo (no match) and finally ,ii which results in a match (zz,ee,rr,tt,yy,uua).
I'm no expert on this, but I believe this article I've found some time ago is relevant to the issue. I understand there are non-recursive regexp implementations (such as jawk), but I'm just guessing really.