How to mask string not conforming to a regular expression pattern

Jack Bush
Ranch Hand

Joined: Oct 20, 2006
Posts: 235
Hi All,

I would like your advice on how to mask out / ignore a string that does not match a well established working regular expression pattern in Java. Below is the code snippet; it handles every line correctly except for one found so far:


The difference between correctPropertyDetail and incorrectPropertyDetail is the ‘S’ after Rose St. A sample of a few hundred lines of data was picked up properly, but a few incorrect ones managed to slip through. Neither pattern1 nor pattern2 achieves the desired objective.

I am running JDK 1.6.0_25 and NetBeans 7.0 on Windows XP.

Your assistance would be appreciated.

Thanks,

Jack
James Sabre
Ranch Hand

Joined: Sep 07, 2004
Posts: 781

Jack Bush wrote: Hi All,

I would like your advice on how to mask out / ignore a string that does not match a well established working regular expression pattern in Java.

Working? But it fails your test data!

Below is the code snippet; it handles every line correctly except for one found so far:




There is something very wrong with pattern1! The implication of the regex fragment "St|Rd|Av|Sq|Cl|Pl|Cr|Gr|Dr|La" is that you want to allow only 'St' or 'Rd' or 'Av' etc BUT BUT BUT for the '|' to apply to just those suffixes you would need to enclose them as a group i.e. "(?:St|Rd|Av|Sq|Cl|Pl|Cr|Gr|Dr|La)". Without the group, each '|' splits the whole regex into separate alternatives.

So your "well established working regular expression" is far from working.
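To illustrate the grouping point, here is a minimal sketch. The two patterns are deliberately simplified, hypothetical stand-ins (they are not the regex from the original post), and the class and field names are made up for the example. It only shows how '|' behaves with and without a group; it does not claim to fix the extra 'S' problem.

```java
import java.util.regex.Pattern;

public class AlternationDemo {
    // Hypothetical, simplified patterns; NOT the poster's actual regex.
    // Without a group, '|' splits the WHOLE pattern into three alternatives:
    //   "[A-Za-z]+ St"   OR   "Rd"   OR   "Av"
    static final Pattern UNGROUPED = Pattern.compile("[A-Za-z]+ St|Rd|Av");

    // With a non-capturing group, '|' only chooses between the suffixes.
    static final Pattern GROUPED = Pattern.compile("[A-Za-z]+ (?:St|Rd|Av)");

    public static void main(String[] args) {
        // matches() requires the entire input to match the pattern.
        System.out.println(UNGROUPED.matcher("Rd").matches());      // true: bare "Rd" is a whole alternative
        System.out.println(UNGROUPED.matcher("Rose Rd").matches()); // false: no alternative covers it
        System.out.println(GROUPED.matcher("Rd").matches());        // false: a street name is required
        System.out.println(GROUPED.matcher("Rose Rd").matches());   // true
    }
}
```

Note that the ungrouped version is wrong in both directions: it accepts a bare "Rd" and rejects "Rose Rd".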


The difference between correctPropertyDetail and incorrectPropertyDetail is the ‘S’ after Rose St. A sample of a few hundred lines of data was picked up properly, but a few incorrect ones managed to slip through. Neither pattern1 nor pattern2 achieves the desired objective.

I am running JDK 1.6.0_25 and NetBeans 7.0 on Windows XP.

Your assistance would be appreciated.

Thanks,

Jack


Now I could go through this regex and try to work out what it is meant to do but without a specification I would just be guessing.

In your position I would get the guy who wrote the "well established working regular expression" to explain the missing grouping parentheses!

Edit: Another obvious bug - the regex sub-term " [h|u|t]" looks to want to allow only an 'h' or a 'u' or a 't' but it will also allow '|' since '|' has no special meaning inside a character class. I would guess that the author wanted either "[hut]" or "(?:h|u|t)". Whoever wrote that regex wasn't competent at regular expressions!

In my opinion, the only way you are going to get this regular expression working is to spend time writing a semi-formal specification of what it should accept. A BNF specification is usually best. It is then fairly easy to turn the BNF into a regular expression using the BNF parts as comments.
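A short sketch of the character-class point, using only the three forms quoted above (without the leading space). The class and field names are made up for the example.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Inside [...] the '|' is an ordinary character, so this class
    // contains four members: 'h', '|', 'u' and 't'.
    static final Pattern WITH_PIPES = Pattern.compile("[h|u|t]");

    // The two forms that accept only 'h', 'u' or 't'.
    static final Pattern PLAIN_CLASS = Pattern.compile("[hut]");
    static final Pattern ALTERNATION = Pattern.compile("(?:h|u|t)");

    public static void main(String[] args) {
        System.out.println(WITH_PIPES.matcher("|").matches());  // true: '|' slips through
        System.out.println(PLAIN_CLASS.matcher("|").matches()); // false
        System.out.println(ALTERNATION.matcher("|").matches()); // false
        System.out.println(PLAIN_CLASS.matcher("u").matches()); // true
    }
}
```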
Jack Bush
Ranch Hand

Joined: Oct 20, 2006
Posts: 235
Hi James,

Thank you for offering your valuable advice. I have made the suggested changes but it still picks up the same incorrect string with an extra 'S'. As a result, I have applied the same suggestion throughout the rest of the regular expression, which successfully prevented the incorrect string from coming through (great!), but also stopped many correct ones from being accepted. Below is the newly modified regular expression:



Note that it is the second sub-pattern “(?:[A-Z]?[0-9]{0,4}/?[0-9]{0,4}-?[0-9]{0,4}|[0-9]{0,4}[a-z])” that is responsible for making it work, yet also breaking the regular expression by not accepting the good strings as well. Some of the good strings include the example provided earlier. Can you see what is wrong with it?

I no longer have confidence in myself, since I was the one responsible for coming up with a half-baked, ordinary regular expression, largely due to a lack of familiarity with this subject. Instead, you are the competent person who can help me get just this one pattern working. I am a novice amateur programmer and do not have the resources or time to come up with a BNF specification, even though that would be the ideal situation, especially in a large project.

Again, your advice would be appreciated.

Thanks,

Jack
James Sabre
Ranch Hand

Joined: Sep 07, 2004
Posts: 781

Jack Bush wrote: Hi James,

Thank you for offering your valuable advice. I have made the suggested changes


I did not suggest changes; I pointed out two major flaws in the regex. Fixing these flaws in isolation was never going to solve the problem you are having.


but it still picks up the same incorrect string with an extra 'S'. As a result, I have applied the same suggestion throughout the rest of the regular expression, which successfully prevented the incorrect string from coming through (great!), but also stopped many correct ones from being accepted. Below is the newly modified regular expression:



Note that it is the second sub-pattern “(?:[A-Z]?[0-9]{0,4}/?[0-9]{0,4}-?[0-9]{0,4}|[0-9]{0,4}[a-z])” that is responsible for making it work, yet also breaking the regular expression by not accepting the good strings as well. Some of the good strings include the example provided earlier. Can you see what is wrong with it?


As I said in my first response - since I don't have a specification to work against I would only be guessing.


I no longer have confidence in myself, since I was the one responsible for coming up with a half-baked, ordinary regular expression, largely due to a lack of familiarity with this subject. Instead, you are the competent person who can help me get just this one pattern working.


Flattery will get you most places but without a specification of some sort I am stuffed.


I am a novice amateur programmer and do not have the resources or time to come up with a BNF specification, even though that would be the ideal situation, especially in a large project.


I know you won't like to hear this but you have to make the time to come up with some form of specification. If you are a novice amateur programmer then what is the time constraint that stops you coming up with a specification?


Again, your advice would be appreciated.

Thanks,

Jack


I like regular expressions and before I retired I used them on many, many projects. I still use them on my personal projects but, unless they are trivial, I always write a specification. Once you have a specification you can incorporate it as comments into your regular expression. Build the regex as a concatenation of commented short sub-strings; this will allow you to debug it later if there is a problem. For example, this is a fragment from the Java that generates my email address validator regex -


As one long string without the comments this would be impossible to maintain.
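As a rough illustration of the technique (this is not the email validator fragment mentioned above), here is a sketch that builds a pattern from short commented pieces. The mini-specification, the address format and all names are assumptions invented for the example; a real specification would have to come from the actual data.

```java
import java.util.regex.Pattern;

public class CommentedRegexDemo {
    // Hypothetical mini-specification, assumed purely for illustration:
    //   address ::= number " " name " " suffix
    //   number  ::= 1 to 4 digits, optionally followed by one lower-case letter
    //   name    ::= one or more capitalised words separated by single spaces
    //   suffix  ::= "St" | "Rd" | "Av"
    static final String NUMBER = "[0-9]{1,4}[a-z]?";
    static final String NAME   = "[A-Z][a-z]+(?: [A-Z][a-z]+)*";
    static final String SUFFIX = "(?:St|Rd|Av)";

    static final Pattern ADDRESS = Pattern.compile(
              NUMBER   // number: e.g. 12 or 12a
            + " "      // single separating space
            + NAME     // name:   e.g. Rose or High Street
            + " "      // single separating space
            + SUFFIX); // suffix: grouped so '|' applies only to the suffixes

    public static void main(String[] args) {
        System.out.println(ADDRESS.matcher("12 Rose St").matches());         // true
        System.out.println(ADDRESS.matcher("12a High Street Rd").matches()); // true
        System.out.println(ADDRESS.matcher("12 Rose StS").matches());        // false: trailing 'S'
        System.out.println(ADDRESS.matcher("Rose St").matches());            // false: no number
    }
}
```

Because each piece is named after a rule in the specification, a failing input can be checked against NUMBER, NAME and SUFFIX separately to find which rule is wrong.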

Finally, write a set of unit tests that check a number of good inputs and a number of bad inputs. Make sure you include the edge cases.
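A minimal sketch of such a check in plain Java (no test framework, so it also runs on an old JDK). The pattern under test and the good and bad inputs are hypothetical examples; in practice both lists would come from the specification and from real data, including the strings that slipped through.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RegexAcceptanceCheck {
    // Hypothetical pattern under test; substitute the real one.
    static final Pattern ADDRESS =
            Pattern.compile("[0-9]{1,4}[a-z]? [A-Z][a-z]+ (?:St|Rd|Av)");

    // Inputs that must be accepted, including boundary lengths.
    static final String[] GOOD = { "1 Rose St", "9999 Rose Rd", "12a Rose Av" };

    // Inputs that must be rejected, including edge cases.
    static final String[] BAD = {
        "", "Rose St", "12 Rose", "12 Rose StS",
        "12345 Rose St", "12 rose St", "12 Rose St "
    };

    // Returns a description of every input the pattern handled wrongly.
    static List<String> failures() {
        List<String> failures = new ArrayList<String>();
        for (String s : GOOD) {
            if (!ADDRESS.matcher(s).matches()) {
                failures.add("rejected good input: '" + s + "'");
            }
        }
        for (String s : BAD) {
            if (ADDRESS.matcher(s).matches()) {
                failures.add("accepted bad input: '" + s + "'");
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<String> failures = failures();
        for (String f : failures) {
            System.out.println(f);
        }
        System.out.println(failures.isEmpty() ? "all checks passed" : failures.size() + " check(s) failed");
    }
}
```

Each time a new bad string gets through, add it to the bad list first, confirm the check fails, and only then change the regex.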
 
 