aspose file tools*
The moose likes Java in General and the fly likes regex problem Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Soft Skills this week in the Jobs Discussion forum!
JavaRanch » Java Forums » Java » Java in General
Bookmark "regex problem" Watch "regex problem" New topic
Author

regex problem

sahar eb
Ranch Hand

Joined: May 06, 2009
Posts: 38
Hi
I have this string: 5,675,510, 5,796,952, 6,115,680, 6,108,637, 6,138,155, 6,643,696, and 6,763,386
and It stuck in this method:


what I was trying to do with the regex was to capture any string which is a mixture of time expression (those like 1 am or 2:30 and ..) and mixtures of numbers with symbols. it perfectly matches every thing that I want but ...
the problem is, when I give the this string:"5,675,510, 5,796,952, 6,115,680, 6,108,637, 6,138,155, 6,643,696, and 6,763,386" as text to this method my whole program goes on halt!!
why? is there any problem in my regex?

thank you so much!
Sahar.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8414
    
  23

sahar eb wrote:why? is there any problem in my regex?

Yes. Simply put: it's unreadable. And if you can't work out whether it's right or not, how on earth is your program supposed to?

A description like "(those like 1 am or 2:30 and ..)" is totally inadequate for a "time expression", which would appear to be what you want to find. You need to describe the rules (ALL the rules) exactly.

Furthermore, the name 'isOnlyNumbers' suggests that it was never created to find a time expression at all.

My advice: start again by the describing the exact logic for finding a time expression in English, and then create a regex to do it.

Winston

Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here
Joe Areeda
Ranch Hand

Joined: Apr 15, 2011
Posts: 318
    
    2

The best advice I got for working with regular expressions was to create a simple program to use as a test harness. It reads a regex and a string then prints out what you get for match, group.

I find it's so much easier to debug regex string separate from the program that uses it.

I've attached the one I use. Please be kind, this is one of the first Java programs I've written and every time I use it I want to spend some time cleaning it up. But it does do what I need and I always end up saying "Maybe next time".

So be kind, or better yet improve it and pass it on.

Joe

Oh well I guess I can't attach a Java or a zip file so I'll find someplace to post it but it's easy enough to write your own.


It's not what your program can do, it's what your users do with the program.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8414
    
  23

Joe Areeda wrote:The best advice I got for working with regular expressions was to create a simple program to use as a test harness. It reads a regex and a string then prints out what you get for match, group.

Not bad advice; but the best advice I ever got about them was: consider whether you really need one...and I say that as an old Sysadmin who's used them a lot.

For straightforward patterns, they're an absolute godsend; but there comes a point where the complexity of any pattern (particularly if it contains a lot of "OR" logic, or backspacing) makes it better to break it up or simply solve it with code.

A classic example is tag parsing - regexes simply aren't suitable for any sort of hierarchical or recursive logic - but it also applies to general patterns too. For example, would you want to have to test this one?

@sahar: It's worth remembering this old chestnut, generally attributed to Jamie Zawinski:
"Some people, when confronted with a problem, think “I know, I'll use regular expressions”. Now they have two problems."
However, if you want a good guide on how to write regexes, have a look at this site.

Winston
Joe Areeda
Ranch Hand

Joined: Apr 15, 2011
Posts: 318
    
    2

As usual Winston has some good advice for you.

My impression of regex is a bit less extreme but generally in agreement.

Regular expressions have their uses but are hardly ever the only solution to a problem.

One of their pitfalls is they can easily get way too complicated and therefore unreadable like the one you have in the original post.

So to address your problem the first thing I would do is to simplify it.

Instead of trying to use one regex to parse all the possible inputs, one possible approach is to use a series of simpler expressions to see what kind of input it is then a separate method to parse each one.

something like



I'm not sure what you want to do with all those special symbols. But if you want to interpret them as an expression you may want to look into a real parser generator like ANTLR (http://www.antlr.org/) which is my current favorite.

Joe
sahar eb
Ranch Hand

Joined: May 06, 2009
Posts: 38
Hi thanks for the replies.
yes, my bad, I didn't need something to catch all kind of time expressions. I needed something to catch only these patterns: 2:30 or 2,30 or 2-5 or 12 AM or 3:40 PM,... if there is no other thing in the text except them.
I went through the tutorial and I come up with this:
boolean isDayTime = text.trim().matches("^(\\d)+((:|,|-)(\\d)*){1,2}(\\s)*((?i)(AM|PM))?$");
I can't find my own problems and I can't risk passing this to the program again. can you tell me if it has any problem?

and the other thing that I don't understand is that this match function is suppose to give me back true or false. even if they are not the right feed backs for my patterns, but I don't understand why my program chokes?

thank you so much!
Joe Areeda
Ranch Hand

Joined: Apr 15, 2011
Posts: 318
    
    2

Sahar,

Your regex is still more complicated for me to debug.

I don't understand how your trying to match what you want.

What does choke mean? Are you saying the match statement never returns? If so I think it's some sort of backtracking issue that either never resolves or takes for eff-ing ever to resolve.

As far as risking passing to the program again, let me repeat my suggestion to write some sort of test harness to work out your regex before you put it into a large program.

If it were me:
a) I'd try to limit the input format complexity. May not be possible if this is reading already existing data.
b) I would first go over the input string and separate them into individual times and then try to figure out which of the gazillion options they are using on each of them.

Joe
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3018
    
  10
Looking back at the original regex, the basic problem seems to be that it has too many greedy quantifiers (* or +), giving it poor performance on longer strings. Since greedy quantifiers allow backtracking every time a match fails, it tries to backtrack from its previous matches and find a new match. When you have several of these, as in the expression given, and they can each match almost any number of things in the input string... well, you get a combinatorial explosion of things it tries to match. And it takes forever.

See the thing is, your program is not actually blocked. It's just taking a really long time to try all the possibilities. Because your regex is inefficient. It's fairly quick for finding a match, when a match exists. But if there's no match (as for your input, which contains "and" which is not a legal number or symbol) it keeps trying other possibilities, hoping one of the others will work. And it takes a long time to do that.

As a simple fix, I replaced the last "+" (the third-to-last character) with "++", and it works much faster. This means replacing greedy matching with possessive matching - no backtracking. At least, no backtracking for that part of the match. There are still several other greedy expressions in there. But removing just one of them seems to have done the trick.

That said, I agree that the regex is unnecessarily complex. It would help a lot to simplify it, if possible, or to replace the whole thing with some non-regex code to do the same thing.
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: regex problem