
Why doesn't my regex pickup this line

Jack Bush
Ranch Hand

Joined: Oct 20, 2006
Posts: 235
Dear All,

Can anyone help identify why the regex below is not able to pick up the simple string that is also provided?



Your assistance would be greatly appreciated.

Thanks a lot,

Jack
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3647

Because that regex is horrible.

What are you trying to do? There's probably a better way than using an illegible regex.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7892

Jack Bush wrote: Can anyone help identify why the regex below is not able to pick up a simple string...

Apart from agreeing completely with Stephan, I can only say that if that's your idea of 'simple', you're on a completely different plane of existence to me.

There are, however, people out there who have tried even harder (I particularly like the quote at the end).

Winston
Jack Bush
Ranch Hand

Joined: Oct 20, 2006
Posts: 235
Fair enough Gentlemen,

The data that I am trying to match with regex is from http://www.homepriceguide.com.au/saturday_auction_results/Melbourne.pdf. I am a novice at regex and would appreciate your advice on which parts of my horrible regex can be simplified.

Thanks in advance,

Jack
D. Ogranos
Ranch Hand

Joined: Feb 02, 2009
Posts: 214
Do you want to pick up a specific line? Lines that meet certain criteria? Help us by telling us exactly what you are trying to achieve. (Sorry, I'm not going to try to decipher your regex ;)
Jack Bush
Ranch Hand

Joined: Oct 20, 2006
Posts: 235
Hi D. Ogranos,

Thanks for offering your assistance.

I want to extract every line that meets the following criteria, in order from left to right, as in the original example posted earlier:


Yes, the current regex is ugly due to my lack of experience with writing regex, hence it is far from up to scratch. That is why I need your help.

Thanks for your patience & understanding,

Jack
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18874

Jack Bush wrote:Hi D. Ogranos,

Thanks for offering your assistance.

I want to extract every line that meets the following criteria, in order from left to right, as in the original example posted earlier:


Yes, the current regex is ugly due to my lack of experience with writing regex, hence it is far from up to scratch. That is why I need your help.

Thanks for your patience & understanding,

Jack



If you really really want to do this via regex, then I suggest you do this with 12 regexes -- one for each component.

Write test components for each regex, and make sure that all 12 regexes work in isolation, for all possible cases. Document these 12 regexes thoroughly so that you don't forget anything, including why you are doing something with a pattern.

Create the super regex -- but don't do it by hand. Have the program build the super regex programmatically -- and test it as each component is added.

Test, test, and test some more. If you are unsure whether a component works or not -- test it again. It is much easier to test as the regex grows than to deal with a large monster that doesn't work.

And good luck...

Henry
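
Henry's suggestion of letting the program assemble the big regex can be sketched roughly as follows. This is only an illustration: the class name, the three component names and their patterns are assumptions (the real list has 12 components, which are not reproduced here), and the type code is assumed to appear without quotes in the data. Each component is kept as its own small pattern so it can be tested in isolation, then wrapped in a named group and joined with whitespace.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: builds one large regex from small, separately testable parts.
final class SuperRegexBuilder {
    // Illustrative components in left-to-right order (assumed, not the original 12).
    static final Map<String, String> COMPONENTS = new LinkedHashMap<>();
    static {
        COMPONENTS.put("bedrooms", "\\d+ br");                      // e.g. "2 br"
        COMPONENTS.put("type", "[a-z]");                            // e.g. "u" or "h"
        COMPONENTS.put("price", "\\$\\d{1,3}(?:,\\d{3})*|N/A");     // e.g. "$566,263" or "N/A"
    }

    // Tests a single component against a sample, requiring a full match.
    static boolean componentMatches(String name, String sample) {
        return Pattern.matches(COMPONENTS.get(name), sample);
    }

    // Wraps each component in a named group and joins them with whitespace.
    static Pattern build() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : COMPONENTS.entrySet()) {
            if (sb.length() > 0) {
                sb.append("\\s+");
            }
            sb.append("(?<").append(e.getKey()).append(">")
              .append(e.getValue()).append(")");
        }
        return Pattern.compile(sb.toString());
    }
}
```

With this shape, a new component can be added to the map and tested with componentMatches() before the combined pattern is rebuilt, and the matched pieces can be read back with Matcher.group("price") and so on.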

D. Ogranos
Ranch Hand

Joined: Feb 02, 2009
Posts: 214
Agree with Henry Wong, and would suggest the same: split the big regex into small parts. You don't even need to create one big regex out of the parts again; you could also put all the subcomponent regex patterns into an array and use a method similar to this:



It tries to find a match for each pattern in the array, and makes sure to start the search for each part after the end of the previous match. Note, though, that this will not be an exact match like the big regex would give! This method only checks that each component has a match in the line.
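
The code from this post did not survive in the thread, so the following is a reconstruction rather than the original. It is built around the one line that is quoted later in the thread (pos = m.find(pos) ? m.end() : -1;); the class name and everything else around that line are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Reconstruction (not the original post's code) of a sequential component match.
final class SequentialMatch {
    // Returns true if every pattern finds a match, with each search
    // starting at the position where the previous match ended.
    static boolean match(String line, Pattern[] parts) {
        int pos = 0;
        for (Pattern p : parts) {
            Matcher m = p.matcher(line);
            // find(int) resets the matcher and searches from index pos onwards.
            pos = m.find(pos) ? m.end() : -1;
            if (pos < 0) {
                return false; // this component has no match after the previous one
            }
        }
        return true;
    }
}
```

Because find(pos) searches forward from pos, text between two components is simply skipped, which is why this is looser than one big regex.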
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7892

Jack Bush wrote: I want to extract every line that meets the following criteria, in order from left to right, as in the original example posted earlier:

String line = "South Yarra 5/106 Toorak Rd W 2 br \"u\" $566,263 SP HS South Yarra";

( i ) Starts with a district name of up to 3 words, each starting with a capital letter, e.g. South Yarra or Brighton-Le-Sands....
Yes, the current regex is ugly due to my lack of experience with writing regex, hence it is far from up to scratch. That is why I need your help.

Actually, it's got nothing to do with your lack of experience, it's just a horrible thing to have to parse, because you've got absolutely nothing to go on. In fact, I'd say your list of rules shows a highly organized mind - I think I'd have given up long ago.

I suspect also that whatever you come up with will be extremely brittle, and susceptible to lines that contain mistakes.

However, all the above being true, and having looked at the source data, you might be able to come up with some sort of solution by dividing the problem up a bit:
1. Eliminate lines that can't possibly be ones you want. From what I can see, those are ones with:
a. Fewer than six words (and String.split() can help you there).
b. A first word that doesn't start with a capital letter (although that's already opening you up to missing badly coded lines, or indeed a district name that doesn't start with a capital letter - got any Dutch communities in Melbourne?).

2. Anchor on specific codes that are unambiguous: the Type, Prices and Result columns look like a good bet to me. You might also want to include these in your elimination phase.
As an example, a line that contains ' 3 br h ' must have its District and Address in the words before that string; ' h ', on the other hand, may not be so definitive.
' N/A ', or a '$'-sign followed by a number also seem to be good indicators of a price.

3. As others have said: break up the regex. In fact, for something like this, I'd be tempted to use Matcher and Pattern and apply your rules procedurally.

Of course, there's always the alternative: Tell the person who told you to do this to take a long walk off a short pier.
Anything you come up with is likely to be brittle and error-prone, and will be a 90% solution at best (maybe 95 if you're lucky). If that's fine, don't let me stop you.

Winston
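
The elimination and anchoring steps described above might look something like this in Java. It is a sketch under assumptions: the class and method names are made up, and the anchor pattern assumes the bedroom count, "br", a one-letter type code and a price (or N/A) appear together, separated by single spaces and without quotes around the type code. The real PDF text may well differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-filter: rejects lines that cannot be results, then anchors on codes.
final class LineFilter {
    // Assumed anchor: " <n> br <type> <price or N/A> ".
    private static final Pattern ANCHOR =
            Pattern.compile("\\s(\\d+) br ([a-z]) (\\$[\\d,]+|N/A)\\s");

    static boolean isCandidate(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        if (trimmed.split("\\s+").length < 6) {
            return false; // rule 1a: fewer than six words
        }
        if (!Character.isUpperCase(trimmed.charAt(0))) {
            return false; // rule 1b: first word not capitalised
        }
        return ANCHOR.matcher(trimmed).find(); // rule 2: unambiguous codes present
    }

    // Returns the text before the anchor (district and address, still unsplit),
    // or null if the anchor is not found.
    static String districtAndAddress(String line) {
        String trimmed = line.trim();
        Matcher m = ANCHOR.matcher(trimmed);
        return m.find() ? trimmed.substring(0, m.start()) : null;
    }
}
```

Splitting the district from the address in the returned text is still the hard part, since nothing in the line marks where one ends and the other begins.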
Jack Bush
Ranch Hand

Joined: Oct 20, 2006
Posts: 235
Thank you so much to Henry, D. Ogranos & Winston for the excellent strategies and quality code, which I would not have thought of myself. I am already using a few levels of pattern matching, but nothing as complete as this. Let’s summarize the key steps to nailing this ugly beast once and for all:

( i ) Apply the few elimination regexes from the outset to capture as much quality data as possible.
( ii ) Come up with a regex for each of the 12 components, with sufficient testing.
( iii ) Use the match() method, passing all 12 regexes, to find the valid lines, thereby rejecting any remaining irrelevant ones.

There are a few minor clarifications on which I still need your input:

Henry Wong wrote:
Create the super regex -- but don't do it by hand. Have the program build the super regex programmatically


Exactly which program are you referring to, and where can I find it? Are you referring to a regexbuilder tool that is available to try, or an online regex tester? Would you recommend using either of these types of tools?


D. Ogranos wrote on line 5 of match():
5. pos = m.find(pos) ? m.end() : -1;

It tries to find a match for each pattern in the array, and makes sure to start with each part regex after the end of the last one.


I understand what you are trying to do but can’t quite see how it is achieved with this line. As you have correctly mentioned, match() doesn’t ensure the order of all 12 regexes, unlike my original long-winded, erroneous mother of all regexes. If so, I still need to do further checks to ensure that the sequence of the components is correct. Any idea how this limitation could be overcome, apart from going back to my clunky, illegible regex?

I am using RegexBuddy, but it doesn’t appear to generate a regex when supplied with test data. It is also very difficult to understand where and how a regex breaks in debug mode, which is a shame. Interpreting the matching colour codes is not simple either.

It looks like there is light at the end of the tunnel, and I would be ecstatic to achieve a 95% success rate. It is only a matter of time with your help.

Many thanks to all again,
Jack
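
On the ordering question raised above: a loop built on pos = m.find(pos) ? m.end() : -1; does keep the components in left-to-right order, because each search starts where the previous match ended. What it does not enforce is adjacency, since find() may skip over text between components. One possible way to tighten this, sketched below with made-up names, is to restrict each matcher to the remaining region of the line and use lookingAt(), which only succeeds if the match begins exactly at the start of that region. Separation by whitespace is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stricter variant: components must follow one another directly.
final class StrictSequentialMatch {
    private static final Pattern GAP = Pattern.compile("\\s+");

    // Returns true only if the parts match back to back, separated by
    // whitespace, and together consume the whole line.
    static boolean matchesInOrder(String line, Pattern[] parts) {
        int pos = 0;
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                Matcher gap = GAP.matcher(line).region(pos, line.length());
                if (!gap.lookingAt()) {
                    return false; // no whitespace between components
                }
                pos = gap.end();
            }
            Matcher m = parts[i].matcher(line).region(pos, line.length());
            if (!m.lookingAt()) {
                return false; // component does not start exactly at pos
            }
            pos = m.end(); // end() is an index into the whole line, not the region
        }
        return pos == line.length();
    }
}
```

One caveat with this approach: unlike a single combined regex, there is no backtracking between components, so a greedy component (such as a district name of up to 3 words) that consumes too much will make the following components fail.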
 