• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Jeanne Boyarsky
  • Ron McLeod
  • Paul Clapham
  • Liutauras Vilda
Sheriffs:
  • paul wheaton
  • Rob Spoor
  • Devaka Cooray
Saloon Keepers:
  • Stephan van Hulst
  • Tim Holloway
  • Carey Brown
  • Frits Walraven
  • Tim Moores
Bartenders:
  • Mikalai Zaikin

regex problem

 
Ranch Hand
Posts: 38
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Hi
I have this string: 5,675,510, 5,796,952, 6,115,680, 6,108,637, 6,138,155, 6,643,696, and 6,763,386
and It stuck in this method:


what I was trying to do with the regex was to capture any string which is a mixture of time expression (those like 1 am or 2:30 and ..) and mixtures of numbers with symbols. it perfectly matches every thing that I want but ...
the problem is, when I give the this string:"5,675,510, 5,796,952, 6,115,680, 6,108,637, 6,138,155, 6,643,696, and 6,763,386" as text to this method my whole program goes on halt!!
why? is there any problem in my regex?

thank you so much!
Sahar.
 
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

sahar eb wrote:why? is there any problem in my regex?


Yes. Simply put: it's unreadable. And if you can't work out whether it's right or not, how on earth is your program supposed to?

A description like "(those like 1 am or 2:30 and ..)" is totally inadequate for a "time expression", which would appear to be what you want to find. You need to describe the rules (ALL the rules) exactly.

Furthermore, the name 'isOnlyNumbers' suggests that it was never created to find a time expression at all.

My advice: start again by the describing the exact logic for finding a time expression in English, and then create a regex to do it.

Winston
 
Ranch Hand
Posts: 334
2
Netbeans IDE Tomcat Server Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
The best advice I got for working with regular expressions was to create a simple program to use as a test harness. It reads a regex and a string then prints out what you get for match, group.

I find it's so much easier to debug regex string separate from the program that uses it.

I've attached the one I use. Please be kind, this is one of the first Java programs I've written and every time I use it I want to spend some time cleaning it up. But it does do what I need and I always end up saying "Maybe next time".

So be kind, or better yet improve it and pass it on.

Joe

Oh well I guess I can't attach a Java or a zip file so I'll find someplace to post it but it's easy enough to write your own.
 
Winston Gutkowski
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Joe Areeda wrote:The best advice I got for working with regular expressions was to create a simple program to use as a test harness. It reads a regex and a string then prints out what you get for match, group.


Not bad advice; but the best advice I ever got about them was: consider whether you really need one...and I say that as an old Sysadmin who's used them a lot.

For straightforward patterns, they're an absolute godsend; but there comes a point where the complexity of any pattern (particularly if it contains a lot of "OR" logic, or backspacing) makes it better to break it up or simply solve it with code.

A classic example is tag parsing - regexes simply aren't suitable for any sort of hierarchical or recursive logic - but it also applies to general patterns too. For example, would you want to have to test this one?

@sahar: It's worth remembering this old chestnut, generally attributed to Jamie Zawinski:
"Some people, when confronted with a problem, think “I know, I'll use regular expressions”. Now they have two problems."
However, if you want a good guide on how to write regexes, have a look at this site.

Winston
 
Joe Areeda
Ranch Hand
Posts: 334
2
Netbeans IDE Tomcat Server Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
As usual Winston has some good advice for you.

My impression of regex is a bit less extreme but generally in agreement.

Regular expressions have their uses but are hardly ever the only solution to a problem.

One of their pitfalls is they can easily get way too complicated and therefore unreadable like the one you have in the original post.

So to address your problem the first thing I would do is to simplify it.

Instead of trying to use one regex to parse all the possible inputs, one possible approach is to use a series of simpler expressions to see what kind of input it is then a separate method to parse each one.

something like



I'm not sure what you want to do with all those special symbols. But if you want to interpret them as an expression you may want to look into a real parser generator like ANTLR (http://www.antlr.org/) which is my current favorite.

Joe
 
sahar eb
Ranch Hand
Posts: 38
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Hi thanks for the replies.
yes, my bad, I didn't need something to catch all kind of time expressions. I needed something to catch only these patterns: 2:30 or 2,30 or 2-5 or 12 AM or 3:40 PM,... if there is no other thing in the text except them.
I went through the tutorial and I come up with this:
boolean isDayTime = text.trim().matches("^(\\d)+((:|,|-)(\\d)*){1,2}(\\s)*((?i)(AM|PM))?$");
I can't find my own problems and I can't risk passing this to the program again. can you tell me if it has any problem?

and the other thing that I don't understand is that this match function is suppose to give me back true or false. even if they are not the right feed backs for my patterns, but I don't understand why my program chokes?

thank you so much!
 
Joe Areeda
Ranch Hand
Posts: 334
2
Netbeans IDE Tomcat Server Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Sahar,

Your regex is still more complicated for me to debug.

I don't understand how your trying to match what you want.

What does choke mean? Are you saying the match statement never returns? If so I think it's some sort of backtracking issue that either never resolves or takes for eff-ing ever to resolve.

As far as risking passing to the program again, let me repeat my suggestion to write some sort of test harness to work out your regex before you put it into a large program.

If it were me:
a) I'd try to limit the input format complexity. May not be possible if this is reading already existing data.
b) I would first go over the input string and separate them into individual times and then try to figure out which of the gazillion options they are using on each of them.

Joe
 
Master Rancher
Posts: 4806
72
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Looking back at the original regex, the basic problem seems to be that it has too many greedy quantifiers (* or +), giving it poor performance on longer strings. Since greedy quantifiers allow backtracking every time a match fails, it tries to backtrack from its previous matches and find a new match. When you have several of these, as in the expression given, and they can each match almost any number of things in the input string... well, you get a combinatorial explosion of things it tries to match. And it takes forever.

See the thing is, your program is not actually blocked. It's just taking a really long time to try all the possibilities. Because your regex is inefficient. It's fairly quick for finding a match, when a match exists. But if there's no match (as for your input, which contains "and" which is not a legal number or symbol) it keeps trying other possibilities, hoping one of the others will work. And it takes a long time to do that.

As a simple fix, I replaced the last "+" (the third-to-last character) with "++", and it works much faster. This means replacing greedy matching with possessive matching - no backtracking. At least, no backtracking for that part of the match. There are still several other greedy expressions in there. But removing just one of them seems to have done the trick.

That said, I agree that the regex is unnecessarily complex. It would help a lot to simplify it, if possible, or to replace the whole thing with some non-regex code to do the same thing.
 
With a little knowledge, a cast iron skillet is non-stick and lasts a lifetime.
reply
    Bookmark Topic Watch Topic
  • New Topic