aspose file tools*
The moose likes Java in General and the fly likes  Java regular expression optimization - help needed Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Spring in Action this week in the Spring forum!
JavaRanch » Java Forums » Java » Java in General
Bookmark " Java regular expression optimization - help needed" Watch " Java regular expression optimization - help needed" New topic
Author

Java regular expression optimization - help needed

dpkcv cv
Greenhorn

Joined: Sep 30, 2011
Posts: 5
Hi All,
I am new to Java Regular expression. We are using a pattern for matching a string. We are using this for validating a text field and it meets our requirements. But there is a performance issue in the matching.
pattern :([a-zA-Z0-9]+[ ]?(([_\-][a-zA-Z0-9 ])*)?[_\-]?)+

1. Input text should start with a-zA-Z0-9.
2. Space(single) is allowed between words
3. "_" and "-" are allowed but cannot be consecutive.

Our problem is, for certain input strings the CPU time goes high and causes hanging the threads. Also we get exceptions. Can anyone please
help me to optimize the Pattern or suggest a new pattern to solve my issue.

Exception details
============================================
Hung thread details, all the same:
[9/28/11 11:40:07:320 CDT] 00000003 ThreadMonitor W WSVR0605W: Thread "WebContainer : 26" (0000004f) has been active for 709755 mi
lliseconds and may be hung. There is/are 1 thread(s) in total in the server that may be hung.
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3938)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3801)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3801)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.match(Pattern.java:4307)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4090)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4239)
at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4006)
at java.util.regex.Pattern$GroupCurly.match(Pattern.java:3928)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Branch.match(Pattern.java:4124)
at java.util.regex.Pattern$Ques.match(Pattern.java:3703)
at java.util.regex.Pattern$Curly.match0(Pattern.java:3794)
at java.util.regex.Pattern$Curly.match(Pattern.java:3756)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4180)
at java.util.regex.Pattern$Loop.matchInit(Pattern.java:4323)
at java.util.regex.Pattern$Prolog.match(Pattern.java:4263)
at java.util.regex.Matcher.match(Matcher.java:1139)
at java.util.regex.Matcher.matches(Matcher.java:514)

Thanks
Deepak
Gupta Tarun
Greenhorn

Joined: Sep 16, 2010
Posts: 22

How about doing a negative check on "_ and - cannot be consecutive" i.e. should not contain '__' and '_-' and '--' and '-_'.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8008
    
  22

Our problem is, for certain input strings the CPU time goes high and causes hanging the threads. Also we get exceptions.

That's not very helpful. When do you get high CPU? It would also be nice to see the relevant piece of code.

Also, as far as I can see, the pattern will allow something like "a_ - _ _ -". Is that what you want?

My suggestion: Write down the syntax rules in full in English; then write a pattern to match. Regexes are hard enough without trying to reverse-engineer them. It will also help anyone else that has to look at your code to understand your intentions.

I have no idea whether it's part of your problem, but the "+[ ]?(([_\-][a-zA-Z0-9 ])*)?[_\-]?)" looks rather messy to me.

From your own definition, I would say that unless a single "_" or "-" is allowed on its own, delimited by spaces, a word has the pattern:
"[-_]?([a-zA-Z0-9]+[-_]?)+"
By inference, the overall pattern would then be:
"[a-zA-Z0-9]([ ]?[-_]?([a-zA-Z0-9]+[-_]?)+)+"

But you need to test it well. Just as an added point, I tend to avoid the '*' qualifier if I can. I also try to avoid '\', because they drive me nuts in Java strings.

Winston

PS: You might also be able to improve the performance by making your groups non-capturing.
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19720
    
  20

If you look at that exception stack trace you can see that there's an infinite loop in there somewhere. Without seeing the input that caused this loop we can't really tell you why this infinite loop is there.


SCJP 1.4 - SCJP 6 - SCWCD 5 - OCEEJBD 6
How To Ask Questions How To Answer Questions
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8008
    
  22

I think I just worked out your problem, and if I'm right it does have to do with the '*' qualifier (among other things).
([_\-][a-zA-Z0-9 ])*)?[_\-]?
is a 0-length pattern, so:
(([_\-][a-zA-Z0-9 ])*)?[_\-]?)+
says "find me 1 or more of that 0-length pattern I just gave you".

Unsurprising, therefore, that it might have problems.

Winston
dpkcv cv
Greenhorn

Joined: Sep 30, 2011
Posts: 5
It would also be nice to see the relevant piece of code.


Thanks for your reply.This is the code used for validating the string.


Also, as far as I can see, the pattern will allow something like "a_ - _ _ -". Is that what you want?


yes this is a valid input.


Maneesh Godbole
Saloon Keeper

Joined: Jul 26, 2007
Posts: 10451
    
    8

dpkcv cv wrote:
Please check your private messages for an important administrative matter.
Maneesh Godbole
Saloon Keeper

Joined: Jul 26, 2007
Posts: 10451
    
    8

In future while posting code, please UseCodeTags. I had added them to your post. As you can the code is much more easier to read and understand.
Also, you can make use of the "quote" button.
dpkcv cv
Greenhorn

Joined: Sep 30, 2011
Posts: 5
Gupta Tarun wrote:How about doing a negative check on "_ and - cannot be consecutive" i.e. should not contain '__' and '_-' and '--' and '-_'.



1. First word should contain only [0-9a-zA-Z]
2. Second word can contain any of these characters[0-9a-zA-Z],space,"_" and "-".But same characters(space,"_" and "-") cannot be consicutive.For examples "__","--"," ","_-" are invalid.But "- _" is valid.
3.
i) Hello world_-(invalid)
ii) Hello world_ -(valid)
iii) Hello world--(invalid)
iv) Hello world- -(valid)
v) Hello world_ _(valid)
vi) Hello world(invalid) - 2 spaces
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19720
    
  20

"dpkcv cv", I strongly urge you to read your private messages as Maneesh has suggested.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8008
    
  22

dpkcv cv wrote:1. First word should contain only [0-9a-zA-Z]
2. Second word can contain any of these characters[0-9a-zA-Z],space,"_" and "-".But same characters(space,"_" and "-") cannot be consicutive.For examples "__","--"," ","_-" are invalid.But "- _" is valid....

Ah. It would have been useful to have these rules at the start.

Regexes are powerful, but they're not omnipotent. It's possible you could create one to cover all your needs, but it's likely to be awfully arcane.
Why not just break down the rules, viz something like:A bit more code than yours, but much clearer than some horrendous regex I reckon.

Winston

Edit: Oops. Keep forgetting that Java matches the entire string by default .
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: Java regular expression optimization - help needed