I need some help on simple regular expression. Just say I have a string for example: String str = "start start start match1 finish finish finish"; I would like to extract the smallest string that is bounded by "start" and "finish"... ie. in this case I would like to extract the string "match1". My first guess was to use: pattern = "start (.*?) finish"; however this matches the string "start start match1". (This behaviour is identical to using the same string and pattern with Perl5). I know I can use the pattern: pattern = ".*start (.*?) finish"; and get what I want, however there's a performance issue by using the greedy matcher .* at the start of the pattern. (This performance issue is specifically with Jakarta ORO, however I've tried GNU Regexp and has similar performance issues by using multiple greedy and non-greedy matchers in my regexp patterns.) Is there another single regular expression that will extract "match1" from the above $str when you only know it's bounded by "start" and "finish". Any help would be greatly appreciated.
I dunno about another single regex solution. There may be something possible with the lookahead/lookback operators, but I couldn't get an expression that worked. Plus everything I tried was complex enough that it would probably be at least as inefficient as ".*start (.*) finish" was. I'd just forget about regexes here and use nice efficient String methods:
Basically this looks for a start, then the next finish, then looks back to see if there were any more starts between the first start and first subsequent finish. This can find more than one match, e.g. for the input String str = "start start foo finish start bar finish finish"; there are two matches: "foo" and "bar". Note that "start finish" has no matches, but "start finish" (with two spaces) yields a match of an empty string. That was true for your regex solution as well; if you don't want that, the code can be tweaked as necessary. (E.g. leave spaces out of the start and finish strings when you search for them, and then trim() the final result.) Enjoy...
You're close. The missing ingredient is that in your group "(.*)" you do not want any occurrence of the start pattern. The negative zero-width lookahead matcher "(?!pattern)" can be inserted into the group to achieve this: pattern = "start ((??! start ).)*?) finish"; I should add that this is using JDK 1.4 patterns. I don't know if ORO or Gnu regexp support this. - Peter [ March 12, 2003: Message edited by: Peter den Haan ]
Joined: Jan 30, 2000
Peter - did you test that on Adam's sample pattern? For me it matches "start match1" rather than the desired "match1". Furthermore, what about something like this: "start foo start bar finish finish" I'm thinking this should match "bar" only. But in your pattern the negative lookahead only looks for a "start" at the beginning of group 1. (And only one "start", which is why it fails the original test input.) To make this work as I believe it's supposed to, you'd need to add some wild cards in the lookahead. E.g. "start ((??!.*start)+.)*?) finish" This matches "match1" fine. But then what about something like "start foo bar start" Here the final "start" is detected by the lookahead (even though it's after the "finish") and prevents a match being found. I don't see a good way around this. Also I strongly suspect the performance of this pattern will be at least as bad as the original ".*start(.*?) finish" which was apparently not good enough. (Though I suppose the usual caveats about optimization apply.) ;) The problem isn't that difficult if we don't insist on trying to do it with a single regex. Regexes are cool, but too often people try to do too much with them, IMO. [ March 12, 2003: Message edited by: Jim Yingst ]
The point here is that what's in the parens is a full fledged regex and can be quantified by it self as an atomic unit. Here is the second example in some sort of civilized way.
Regex, the most rewarding way of text processing! [ March 12, 2003: Message edited by: Leslie Chaim ]
Normal is in the eye of the beholder
Joined: Mar 11, 2003
Thank you all for your help. I think the question can be simplified by saying: is it possible to find the last occurance of "start (.*?) finish" in any string without using the .* (or .+) greedy matcher? Leslie Chaim, I am impressed by your regexp mastery... and I may be wrong, but I think the use of [a-z]+ equates to using a greedy matcher... but I should have been more specific with my question. Okay... I'll tell you what I'm trying to do... I have hundreds of HTML pages (averaging 100-200K) with hundreds of links and I want to extract links with specific data contained within the anchor tags. For example:
I would like to extract the link in the anchor tag that bounds the string ">Page 2 Info<" without using the .* greedy matcher. All I know is that this string exists only once on the page. Is it possible to do this with one regular expression? The .* greedy matcher causes major performance issues with the ORO regexp package (and a few other popular java regexp packages I've tried). In addition, I am running this on a FreeBSD machine - the highest supported (stable) version of jdk is 1.3.1. No access to the java.util.regex package for me. Your help is greatly appreciated. [ March 12, 2003: Message edited by: Jim Yingst ]
Joined: Jan 30, 2000
Oh - that should be a good deal easier. If ">Page 2 Info</a>" is constant, and we just need to grab the URL in front of it:
This takes advantage of the fact that you can't have any " inside the href. As far as the pattern is concerned it shouldn't really matter much if greedy or non-greedy matching is used - as long as quotes are excluded by the [^"] class, the result should be the same. But if ORO is happier with non-greedy, fine, use non-greedy. Or get a better regex package. Seriously, greedy matching is pretty common, I'm surprised it would be a problem. If performance is really a problem still, then you really can't do better than this for a Java-based solution:
And as Leslie would doubtless point out, I'm sure you can get a perfectly good Perl implementation on FreeBSD which will probably be faster than Java for something like this. [ March 12, 2003: Message edited by: Jim Yingst ]
Joined: May 22, 2002
Leslie Chaim, I am impressed by your regexp mastery
Thanks, but its really not hard � after you read the MRE book 6 times If you're into text processing, you may want to consider this recipe. Refer to the sixth posting of that thread. BTW, is there a way, with UBB or whatever to point to a specific post? Also, use the shorter abbreviation of regex over regexp it will be easier on your lips
Seriously, greedy matching is pretty common, I'm surprised it would be a problem.
If you read chapter 8 on Java in the MRE book there would have been no surprises. One thing that comes to my mind is that Adam's HTML page is a whole file in one string, and with greedy matching the engine marches to the end of the string and then backtracks if and only if there are subsequent elements in the regex which would cause the overall regex to fail. Since there is something in the regex that causes a backtrack and an (assumingly) long HTML string the backtracks could be a performance hog.
And as Leslie would doubtless point out, I'm sure you can get a perfectly good Perl implementation on FreeBSD which will probably be faster than Java for something like this.
Yep, and I happily will, I think that the following Perl one liner will do:
Let me know if it works or if you are having any problems. I am not that familiar with HTML, and regardless, to me its just text, text, text, and something about finding a pattern which I than feed to my regex engine. I have made a couple of assumptions with this one liner:
An HTML page is one single file ending in .html
The text in the HTML file does not have the backslashes as you presented with the String snippet.
I reused the same expression that Jim used the same advantage as Jim explained.
I am not displaying which link came from which file.
[b]And/b] I did not have real test files to check so it may not work
Originally posted by Jim Yingst: Peter - did you test that on Adam's sample pattern? For me it matches "start match1" rather than the desired "match1".
Yes, I did actually. I made just one mistake: I didn't re-test after surrounding "start" in my lookahead match with spaces! The problem is that the first "start" has already consumed the space which causes the second "start" not to match. Oops. Apologies.
Furthermore, what about something like this: "start foo start bar finish finish" I'm thinking this should match "bar" only.
It should, and before I posted I tested that my pattern did match this correctly.
But in your pattern the negative lookahead only looks for a "start" at the beginning of group 1. (And only one "start", which is why it fails the original test input.)
No it doesn't? It looks for "start" anywhere inside the matching group!
To make this work as I believe it's supposed to, you'd need to add some wild cards in the lookahead. E.g. "start ((??!.*start)+.)*?) finish"
No, been there, tried that, it'll never work because it can match beyond the group you're trying to capture.Passes all tests with flying colours. - Peter
Joined: Jan 30, 2000
[LC]: If you read chapter 8 on Java in the MRE book there would have been no surprises. Yeah, yeah - my copy of MRE doesn't have a chapter 8. But there's relevant discussion elsewhere, such as chapter 5. Guess I really should get the new edition. So what's the issue here with ORO and a simple expression like ".*start (.*?) finish"? Is it specific to ORO? Does it affect all greedy matches, or just those with .*? It seems like this should just work back from the end of the string looking for "start ", when when it finds it, look ahead for " finish". Where's the big overhead here? Maybe if there are a lot of starts after the finish, you waste time on each of these partial matches. It seems as though if the HTML is really long, it would be useful to know whether the start and finish are closer to the beginning of the file, or to the end. Seems like greedy matching will be more efficient towards the end, and non-greedy towards the beginning. Also it turns out that for Adam's application the finish is unique in the html, while start may occur in many places. So it's probably more effective to look for the finish first, and then find the start by working back from that. That's what I did in the last non-regex Java version I showed. Is there a regex equivalent to this strategy? Note that if "Page 2 Info" is closer to the start,
will work well. If it's closer to the end, then
is preferable. Since "Page 2 Info" is unique, these should yield the same result, but which one gets there quickest will vary. Peter - sorry, once I saw that the pattern didn't work as required, I was a bit lazy about analyzing exactly what it did and what was wrong with it. Overlooking a * for example. So ignore my comments on what it actually did. And it looks like I need to study how lookahead works a bit more. [PdH]: No, been there, tried that, it'll never work because it can match beyond the group you're trying to capture. Right, as I subsequently acknowledged. But it wasn't clear at this point whether there really were any subsequent starts to worry about. Your fixed version does work better than the regex scemes I had envisioned. Thanks everyone for the discussion. [ March 14, 2003: Message edited by: Jim Yingst ]