Regular expression

Pradeep Kadambar
Ranch Hand

Joined: Oct 18, 2004
Posts: 148
I have quite an unusual problem parsing text from an HTML document using a regular expression.

I am trying to extract a zip code from the text using the pattern \d{3}\s?\d{3}

Sometimes the character in the middle of the zip code (555 333) is not a plain space but a character with a value greater than 127.

How can I change my regular expression to handle this?

Searching the text for characters with values greater than 127 would be costly.

David Harkness
Ranch Hand

Joined: Aug 07, 2003
Posts: 1646
Can you simplify and change "\s" to "[^0-9]" (any non-digit)? This would accept all these as zip codes as "123456":
What exactly do you want to consider whitespace? That's your key question.
[ April 14, 2005: Message edited by: David Harkness ]
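A minimal sketch of the "any non-digit" idea, for illustration only. The class name LenientZipMatcher, the capture groups, and the sample inputs are assumptions that do not come from the thread. It allows at most one non-digit between the two groups of three digits and joins the groups into a single six-digit string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the "[^0-9]" suggestion; not code from the thread.
class LenientZipMatcher {
    // Three digits, at most one non-digit separator, three digits.
    // The capture groups are an added assumption so the separator can be dropped.
    static final Pattern ZIP = Pattern.compile("(\\d{3})[^0-9]?(\\d{3})");

    // Returns the first zip code found with its separator removed, or null if none.
    static String normalize(String text) {
        Matcher m = ZIP.matcher(text);
        return m.find() ? m.group(1) + m.group(2) : null;
    }

    public static void main(String[] args) {
        // Illustrative inputs: no separator, space, hyphen, non-breaking space.
        String[] samples = {"555333", "555 333", "555-333", "555\u00A0333"};
        for (String s : samples) {
            System.out.println(normalize(s));
        }
    }
}
```

The trade-off is that any single non-digit is accepted as a separator, so "555x333" would match as well. Whether that is acceptable depends on what should count as whitespace.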
Pradeep Kadambar
Ranch Hand

Joined: Oct 18, 2004
Posts: 148
Thanks David for rescuing me again...

Well, the goal is to have a set of regular expression patterns for things like phone numbers, mobile numbers, email addresses, zip codes, and even verification of a name, all of which are present in the document.

When viewed as HTML, the words are clearly separated by spaces. But in reality the separators are sometimes characters with a value greater than 127, so if I use \s in these cases the regex will fail.

Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
If it's HTML, and it looks like a space but has a value greater than 127, it's probably a non-breaking space (value 160, or 0xA0).

Try this: \d{3}[\s\xA0]?\d{3}
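A short sketch comparing the original pattern with the one suggested above, assuming the text is matched with java.util.regex. The class name NbspZipFinder and the sample strings are hypothetical. With Java's default flags, \s covers only space, tab, newline, vertical tab, form feed, and carriage return, so the non-breaking space has to be listed explicitly.

```java
import java.util.regex.Pattern;

// Hypothetical helper comparing the two patterns from the thread.
class NbspZipFinder {
    // Original pattern: optional ASCII whitespace between the digit groups.
    static final Pattern ORIGINAL = Pattern.compile("\\d{3}\\s?\\d{3}");
    // Suggested pattern: also allows a non-breaking space (0xA0, decimal 160).
    static final Pattern WITH_NBSP = Pattern.compile("\\d{3}[\\s\\xA0]?\\d{3}");

    // True if the pattern matches anywhere in the text.
    static boolean found(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String plain = "Zip 555 333";      // ordinary space
        String nbsp = "Zip 555\u00A0333";  // non-breaking space between the groups
        System.out.println(found(ORIGINAL, plain));
        System.out.println(found(ORIGINAL, nbsp));
        System.out.println(found(WITH_NBSP, plain));
        System.out.println(found(WITH_NBSP, nbsp));
    }
}
```

Only the second call should report no match, since the original pattern does not treat 0xA0 as whitespace.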