This week's book giveaway is in the Agile and other Processes forum.
We're giving away four copies of The Mikado Method and have Ola Ellnestam and Daniel Brolund on-line!
See this thread for details.
The moose likes Java in General and the fly likes java regular expression giving stack overflow ? Big Moose Saloon
  Search | Java FAQ | Recent Topics
Register / Login


Win a copy of The Mikado Method this week in the Agile and other Processes forum!
JavaRanch » Java Forums » Java » Java in General
Reply Bookmark "java regular expression giving stack overflow ?" Watch "java regular expression giving stack overflow ?" New topic
Author

java regular expression giving stack overflow ?

steve labar
Ranch Hand

Joined: Sep 10, 2008
Posts: 55
I'm having a very difficult time trying to get this regular expression in java to work. I just don't understand why it is not working. It appears to work in a perl script.The idea is to check if this target is landing in html and is breaking out of a double quote in an html tag.

example:
injecting this in a http request: "xss="xss()"

would be true
would be false single ticks started tag

Regex: used to match
<\w('(\\'|[^'])*'|"(\\"|[^"])*"|[^'">])*xss="xss\(\)"

I know it must be using recursion and just keeps looking.but regex appears to be good?

please help!

Error:

Exception in thread "Thread-35"
java.lang.StackOverflowError
at java.lang.Character.codePointAt(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
at java.util.regex.Pattern$Branch.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$BranchConn.match(Unknown Source)
at java.util.regex.Pattern$CharProperty.match(Unknown Source)
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19216

Steve, next time you get a StackOverflowError, can you please shorten it? After a while it is looping, and there is no reason to show all of that. I cut off part of your stack trace; the relevant part is still there.


SCJP 1.4 - SCJP 6 - SCWCD 5
How To Ask Questions How To Answer Questions
steve labar
Ranch Hand

Joined: Sep 10, 2008
Posts: 55
yeah, sorry about that.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16690
    
  19


For something like this, it may be a good idea to provide a sample program that demostrates the error... The way you describe this, we have no idea how you are using the regex.

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
John de Michele
Rancher

Joined: Mar 09, 2009
Posts: 600
Steve,

I agree with Henry: you should provide sample code. That said, just from looking at your regex, are you increasing the stack for the thread? Your regex seems to have a lot of captures and options, which would easily become very complex, and blow past the stack.

John.
steve labar
Ranch Hand

Joined: Sep 10, 2008
Posts: 55
Ok i have written a class to test this example i'm running into.It errors out just as i have mentioned above. Any help would be greatly appreciated!
I have tried attaching the text response to this post with no luck. So, instead this link's source is what im using as my content to match.

http://www.overstock.com/search?keywords='xss='xss()'&taxonomy=sto7&SearchType=Homepage&TID=D:Apparel

steve labar
Ranch Hand

Joined: Sep 10, 2008
Posts: 55
just as a side note the {0,1000} part of regex if reduced to {0,100} fixes the problem. Using * instead breaks it as well. So it seems to keep looking for match and then errors out. There should be no matches in this test. Is my regex bad? i want to blame java on this one.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16690
    
  19

This doesn't answer your questions.... but I am a bit confused with this part of the regex...



Doesn't this say match zero to 1000 characters of something that is -- either something that is not a single quote or is a single quote? Which, of course, is any character? And can be simplified as follows...



and since groups are not used, it can be further simplified as follows...



Henry
steve labar
Ranch Hand

Joined: Sep 10, 2008
Posts: 55
\\' part of the regex is looking for this \'. Because if your inside 'stuf' you need a single tick in order to break out of that string. So, i want to make sure i don't see this 'stuff\'xss because my single tick was escaped so i did not break out of the string.

Does that make sense?
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16690
    
  19

Does that make sense?


Yes, but that is not what your regex does... The backslash has special meaning for both the regex and the java string literal. If the regex was loaded from a file (or something else that is not a string literal) then it would be fine. But since, it is a string literal, you need to escape it for both the regex and Java.

So, if you want a \' as a regex literal, in the way you are using it, it needs to be escaped twice... \\\\' .... four backslashes.

Henry

steve labar
Ranch Hand

Joined: Sep 10, 2008
Posts: 55
thanks i'll try that. I was aware of this issue with java but when i was testing the matches that preivious regex was working? I'll see if that fixes things.

thanks,

Steve
steve labar
Ranch Hand

Joined: Sep 10, 2008
Posts: 55
tried this:

String s = input.replaceAll("'([^']|\\\\'){0,1000}'xss='xss\\(\\)'", "pp");
still overflows.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16690
    
  19

This get really complex with double quotes -- as the backslash and the quote has special meaning to the string literal, but only the backslash has special meaning to the regex...

So, if you want a \" as a regex literal.... first, you must escape the backslash for the regex ... \\" ... then you must escape both the backslash and double quote for the string literal.... \\\\\" ... five backslashes.

Henry
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16690
    
  19

steve labar wrote:
String s = input.replaceAll("'([^']|\\\\'){0,1000}'xss='xss\\(\\)'", "pp");


Side note. Since the left side of the OR is looking for something that is not single quote, it will match the backslash of a backslash single quote pair. This is probably not what you want. You may want to flip the two operands of the OR so that it will check the backslash single quote pair first.

Henry
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16690
    
  19

steve labar wrote:
String s = input.replaceAll("'([^']|\\\\'){0,1000}'xss='xss\\(\\)'", "pp");
still overflows.


Yea, it overflows because it's complex. For every character, it has to take a branch, and recursively handle the next character, because the regex may fail later, and it has to backtrack it.

Now, in your case, there is never a need to backtrack, since it is a linear match to the closing quote. It may be a good idea to tell the regex to *not* prep for a possible backtrack, because it won't. This is done with the possessive operator, like so...



I also flipped the two operands to the OR (alternation).

Henry
steve labar
Ranch Hand

Joined: Sep 10, 2008
Posts: 55
You my friend are magical! This does in fact work. I'm going to have to do some more research as to why the + made all the difference. I was thinking when i was trying to fix it about the greedy vs. lazy. I thought it was just being greedy. I thought if i made it lazy it would just find the minimum and stop. But that was not helping.

Question: why did you alternate the or? is it more efficient this way

Thanks a lot you have saved me much frustration.

Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 16690
    
  19

Question: why did you alternate the or? is it more efficient this way


No. It just won't work the other way -- see my post (about 2 back) for the explanation.

I was thinking when i was trying to fix it about the greedy vs. lazy. I thought it was just being greedy. I thought if i made it lazy it would just find the minimum and stop.


Based on the regex, it definitely wasn't a greedy vs reluctant issue. If you had increased the stack size, so that the regex had more memory, either greedy or reluctant would have worked.

I'm going to have to do some more research as to why the + made all the difference.


Yea. I highly recommend this. The possessive operator can be very efficient -- use less memory -- as you have seen. But in many cases, you actually want the regex to use backtracking. This example happened to work because you didn't need it -- but you need to learn how to recognize this.

Henry
 
I agree. Here's the link: http://zeroturnaround.com/jrebel - it saves me about five hours per week
 
subject: java regular expression giving stack overflow ?
 
Similar Threads
Java regular expression optimization - help needed
Regex: Infinite recurssion when extracting columns from SQL
Java regular expression optimization - help needed
Regexp Headache
Infinite recurssion when using Regex to extract column names from SQL