• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Jeanne Boyarsky
  • Ron McLeod
  • Paul Clapham
  • Liutauras Vilda
Sheriffs:
  • paul wheaton
  • Rob Spoor
  • Devaka Cooray
Saloon Keepers:
  • Stephan van Hulst
  • Tim Holloway
  • Carey Brown
  • Frits Walraven
  • Tim Moores
Bartenders:
  • Mikalai Zaikin

Pattern throws a StackOverflowError

 
Sheriff
Posts: 11604
178
Hibernate jQuery Eclipse IDE Spring MySQL Database AngularJS Tomcat Server Chrome Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
We have a few regex patterns which were working fine. But today suddenly a StackOverflowError was thrown when processing a (very) large string (2.5 A4 pages). I created a little program which is a simplification of the actual pattern, but with the same result.



This works fine and results in true as output. But when you add 1 iteration more to the loop you have a StackOverflowError.

I know I can solve this issue by adding the possessive quantifier to the * in the regex ("\\d+(,\\d+)*+"). But the 3 different quantifiers (greedy, reluctant and possessive) are a bit of a black box to me, so I have no clue why possessive quantifier solves the issue (and greedy or reluctant don't).
 
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Roel De Nijs wrote:I know I can solve this issue by adding the possessive quantifier to the * in the regex ("\\d+(,\\d+)*+"). But the 3 different quantifiers (greedy, reluctant and possessive) are a bit of a black box to me, so I have no clue why possessive quantifier solves the issue (and greedy or reluctant don't).


I suspect because you've got nested qualifiers (Henry will correct me if I'm wrong ). This site is a really good reference for all things regex, and it suggests that you may have catastrophic backtracking. Possessive qualifiers short-circuit that.

Winston
 
Roel De Nijs
Sheriff
Posts: 11604
178
Hibernate jQuery Eclipse IDE Spring MySQL Database AngularJS Tomcat Server Chrome Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
That's a great resource. I'm pretty sure the nested qualifiers are the problem here, with possible catastrophic backtracking But i have no clue why the possessive qualifier short-circuit this one (and what's the difference with the greedy one), so time for some experimenting.

I downloaded RegexBuddy and gave it a try to see if there is a difference. First I tested the Matching CSV Records example, got the exact same output as mentioned on that website.

Now it's time to test my regex \d+(,\d+)* with input string 9,8,7,6,5,4,3,2,1. Result: match found in 26 steps with just 1 mention of "backtrack" (in the last step). When I do exactly the same with the complete string created by the program from the original post (so 1322,1321,1320,...,3,2,1,0) I get exactly the same results: match found in 3968 steps with just 1 mention of "backtrack" (in the last step).

When I add the possessive quantifier to my regex \d+(,\d+)*+ and execute the tests again I'll have exactly the same results as with the greedy one. So no difference here between greedy and possessive. That makes me even more

As an extra experiment I changed the regex to \d+(,\d+)*A. This change gives a huge difference in the results. With the greedy quantifier you need 5292 steps (and a lot of "backtrack"s), with the possessive one you just need 3969 steps.
 
Winston Gutkowski
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Roel De Nijs wrote:But i have no clue why the possessive qualifier short-circuit this one (and what's the difference with the greedy one)...


Well, from what I understand (although I have to say that they're fairly new to me too), possessive qualifiers allow the engine to say "right, I'm done with that" once it finds a substring that can't possibly match.

But more than that...

I suspect you'll know more than me in a couple of weeks.

Winston
 
Roel De Nijs
Sheriff
Posts: 11604
178
Hibernate jQuery Eclipse IDE Spring MySQL Database AngularJS Tomcat Server Chrome Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Today I played an hour with RegexBuddy and regular expressions which gave me better insights in this issue.

I think part of the problem is the (recursive) implementation of Pattern class. Because in my example JustATest the JVM throws a StackOverflowError with 1324 integers added to the string. This is not at all a problem for RegexBuddy. I also did some tests with my actual regular expression (the one which was causing the problem in the 1st place). So the JVM threw the StackOverflowError with a String of 2.5 A4 pages. In the file I needed to validate with this regular expression the biggest string was as much as 22.25 (!) A4 pages. Not a problem for RegexBuddy at all (although some calculation time was needed ). So if RegexBuddy can handle this string, why can the almighty JVM not handle it?

I think the need for backtracking is very well explained in the Matching CSV Records example. I made another little example to play with and finally it cleared my doubts over this issue and gave me a better understanding about how regex matching happens. The pattern \w+(,\w+)*a and string zz,ee,rr,tt,yy,uua,ii,oo,pp show why you have backtracking (and how it could result in a match). In short: first (because of the greedy quantifier) the whole string is used to find a match, but no match (zz,ee,rr,tt,yy,uua,ii,oo,pp doesn't end with an a). Then backtracking kicks in: 1st it gives up ,pp (still no match), then ,oo (no match) and finally ,ii which results in a match (zz,ee,rr,tt,yy,uua).
 
Sheriff
Posts: 3837
66
Netbeans IDE Oracle Firefox Browser
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I'm no expert on this, but I believe this article I've found some time ago is relevant to the issue. I understand there are non-recursive regexp implementations (such as jawk), but I'm just guessing really.
 
reply
    Bookmark Topic Watch Topic
  • New Topic