
Regular expression

 
Pradeep Kadambar
Ranch Hand
Posts: 148
I have quite an unusual problem parsing text from an HTML document using a regular expression.

I am trying to extract a zip code from the text using the pattern \d{3}\s?\d{3}

Sometimes the character in the middle of the zip code (555 333) is not a plain space but a character with a value greater than 127, outside the ASCII range.

How can I change my regular expression to handle this?

Searching the whole text for characters with values greater than 127 would be costly.

 
David Harkness
Ranch Hand
Posts: 1646
Can you simplify and change "\s" to "[^0-9]" (any non-digit)? This would accept any single non-digit character between the two groups of digits, so every such variant would be read as the zip code "123456".

What exactly do you want to consider whitespace? That's your key question.
[ April 14, 2005: Message edited by: David Harkness ]
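
To make the trade-off in this suggestion concrete, here is a minimal Java sketch, assuming the text is already in a Java String. The class name LooseZipMatcher and the helper firstZip are hypothetical, made up for illustration, and are not from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LooseZipMatcher {
    // Two groups of three digits with at most one non-digit character between them.
    static final Pattern ZIP = Pattern.compile("(\\d{3})[^0-9]?(\\d{3})");

    // Returns the six digits of the first match, or null if nothing matches.
    static String firstZip(String text) {
        Matcher m = ZIP.matcher(text);
        return m.find() ? m.group(1) + m.group(2) : null;
    }

    public static void main(String[] args) {
        // Plain, space, hyphen, and non-breaking space (U+00A0) separators.
        String[] samples = { "123456", "123 456", "123-456", "123\u00A0456" };
        for (String s : samples) {
            System.out.println(firstZip(s));
        }
    }
}
```

Note that "[^0-9]" is deliberately permissive: the same pattern also accepts "123x456", which is why deciding what counts as a separator matters.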
 
Pradeep Kadambar
Ranch Hand
Posts: 148
Thanks, David, for rescuing me again...

Well, the goal is to have a set of regular expression patterns for things like phone numbers, mobile numbers, email addresses, zip codes, and even verification of a name, all of which are present in the document.

When viewed as HTML, the words are clearly separated by spaces. But in reality the separators are sometimes characters with values greater than 127, so if I use \s in these cases the regex fails.

 
Alan Moore
Ranch Hand
Posts: 262
If it's HTML, and it looks like a space but has a value greater than 127, it's probably a non-breaking space (value 160, or 0xA0).

Try this: \d{3}[\s\xA0]?\d{3}
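
A small Java sketch of the difference, assuming the HTML has already been decoded into a String so that the non-breaking space appears as the single character U+00A0 (not as the literal entity text). The class name NbspZipMatcher and the helper found are hypothetical. With Java's default flags, \s covers only the ASCII whitespace characters, so it does not match U+00A0.

```java
import java.util.regex.Pattern;

class NbspZipMatcher {
    // Original pattern: optional ASCII whitespace between the digit groups.
    static final Pattern PLAIN = Pattern.compile("\\d{3}\\s?\\d{3}");

    // Suggested pattern: also allow a non-breaking space (0xA0) as the separator.
    static final Pattern WITH_NBSP = Pattern.compile("\\d{3}[\\s\\xA0]?\\d{3}");

    // True if the pattern matches anywhere in the text.
    static boolean found(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String ordinary = "Zip: 555 333";
        String nonBreaking = "Zip: 555\u00A0333";

        System.out.println(found(PLAIN, ordinary));         // true
        System.out.println(found(PLAIN, nonBreaking));      // false
        System.out.println(found(WITH_NBSP, ordinary));     // true
        System.out.println(found(WITH_NBSP, nonBreaking));  // true
    }
}
```

Unlike "[^0-9]", this character class still rejects letters and punctuation between the two groups.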
 