Regular expression

Pradeep Kadambar
Ranch Hand

Joined: Oct 18, 2004
Posts: 148
I have quite an unusual problem parsing text from an HTML document using regular expressions.

I am trying to extract a zip code from the text using the pattern \d{3}\s?\d{3}

Sometimes the character in the middle of the zip code (555 333) is not a plain space but a character with a value greater than 127.

How can I change my regular expression to handle this?

Searching the text for characters with values greater than 127 would be costly.

David Harkness
Ranch Hand

Joined: Aug 07, 2003
Posts: 1646
Can you simplify and change "\s" to "[^0-9]" (any non-digit)? This would accept any single non-digit character between the two groups, so all such variants would be read as the zip code "123456".

What exactly do you want to consider whitespace? That's your key question.
[ April 14, 2005: Message edited by: David Harkness ]
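A minimal Java sketch of the "[^0-9]" idea above, in case it helps to see it run. The class and method names (NonDigitSeparatorSketch, firstZip) are made up for illustration, and the capture groups are an addition so the six digits can be joined back together.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not from the thread.
class NonDigitSeparatorSketch {
    // Three digits, at most one non-digit character, three digits.
    // Backslashes are doubled because this is a Java string literal.
    private static final Pattern ZIP = Pattern.compile("(\\d{3})[^0-9]?(\\d{3})");

    // Returns the six digits of the first match, or null if there is none.
    static String firstZip(String text) {
        Matcher m = ZIP.matcher(text);
        return m.find() ? m.group(1) + m.group(2) : null;
    }

    public static void main(String[] args) {
        // Plain space, non-breaking space (U+00A0), hyphen, no separator.
        String[] samples = { "123 456", "123\u00A0456", "123-456", "123456" };
        for (String s : samples) {
            System.out.println(firstZip(s)); // prints 123456 each time
        }
    }
}
```

The trade-off is that "[^0-9]" also accepts letters and punctuation, so "123x456" would match as well. Whether that is acceptable depends on what should count as a separator.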
Pradeep Kadambar
Ranch Hand

Joined: Oct 18, 2004
Posts: 148
Thanks David for rescuing me again...

Well, the goal is to have a set of regular expression patterns for things like phone numbers, mobile numbers, email addresses, zip codes, and even verification of names, all of which are present in the document.

When viewed as HTML, the words are clearly separated by spaces. But in reality the separators are sometimes characters with a value greater than 127, so if I use \s in these cases the regex will fail.

Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
If it's HTML, and it looks like a space but has a value greater than 127, it's probably a non-breaking space (value 160, or 0xA0).

Try this: \d{3}[\s\xA0]?\d{3}
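
Here is a rough sketch of that pattern in Java. It assumes the HTML has already been decoded so that the non-breaking space appears in the string as the character U+00A0 rather than as the literal text "&nbsp;". The class and method names (NbspZipSketch, isZip) are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not from the thread.
class NbspZipSketch {
    // The pattern from the post, with backslashes doubled for a Java string literal.
    // By default \s in java.util.regex does not cover U+00A0, so \xA0 is listed too.
    private static final Pattern ZIP = Pattern.compile("\\d{3}[\\s\\xA0]?\\d{3}");

    // True only if the whole candidate string is a zip code in this format.
    static boolean isZip(String candidate) {
        return ZIP.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isZip("555 333"));      // true: ordinary space
        System.out.println(isZip("555\u00A0333")); // true: non-breaking space
        System.out.println(isZip("555333"));       // true: separator is optional
        System.out.println(isZip("555-333"));      // false: hyphen is not in the class
    }
}
```

Unlike "[^0-9]", this stays strict about what counts as a separator, so other characters above 127 would have to be added to the character class one by one.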
 