Regular Expression: ignore html

Nits Kulkarni
Greenhorn

Joined: Mar 23, 2006
Posts: 8
Hi,

I am doing a find and replace on the text inside my HTML, and I am looking for a regular expression that can do it for me.


Example HTML string: "Welcome to furniture house. <table>How big is the dining table?</table>. Probably you would want one."

Here, I want to replace the text "table" with "<strong>table</strong>", but only where it appears as plain text, not where it is part of an HTML tag. How can I do this with a regex?


Thanks in advance.
Nitin
Andy Bach
Greenhorn

Joined: Feb 02, 2005
Posts: 4
> Here, I want to replace the text "table" with "<strong>table</strong>", but only where it appears as plain text, not where it is part of an HTML tag. How can I do this with a regex?

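Hmm, a negative lookbehind might work, esp. if you're asking for a specific word like table (perl-ish):
s#(?<!<)(?<!</)\btable\b#<strong>table</strong>#g

That is: "table" not immediately preceded by a left pointy, nor by a left pointy plus a forward slash (to handle both start and end tags), with a word border on each side (so as not to match "tabletennis"). A plain negated character class like [^<] in front of "table" would not do here: it swallows the preceding character into the replacement and it also happily matches the "/" in "</table>". I used "#" as the delimiter to avoid the leaning toothpick syndrome.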
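
For comparison, here is a minimal Python sketch of the same idea, written with negative lookbehinds so that the character in front of "table" is only inspected and never replaced. The names TABLE_WORD and emphasize are made up for illustration, and the pattern only guards against the word being a tag name; it knows nothing about attributes, comments, or scripts.

```python
import re

# "table" as a whole word, not directly preceded by "<" or "</".
# Lookbehinds only look at the previous characters; they do not consume them.
TABLE_WORD = re.compile(r"(?<!<)(?<!</)\btable\b")


def emphasize(html: str) -> str:
    """Wrap bare occurrences of the word "table" in <strong> tags."""
    return TABLE_WORD.sub("<strong>table</strong>", html)


sample = ("Welcome to furniture house. <table>How big is the dining "
          "table?</table>. Probably you would want one.")

print(emphasize(sample))
# Only the "table" in "dining table?" is wrapped; <table> and </table> are left alone.
```

It behaves on the example string, but something like emphasize('<td class="table">') still rewrites the attribute value, which is the kind of case that keeps piling up with this approach.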

But the general answer is - don't try to parse html by hand, get a module/util to do it. It very, very, very quickly becomes very, very hard to cover all the possibilities by hand.
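
As one possible illustration of the "get a module/util" route, here is a rough sketch using Python's standard html.parser module. The class TextEmphasizer and the function emphasize_text are hypothetical names, not part of any library. The idea is to re-emit the markup as it was read and apply the word replacement only to text nodes; it is a sketch under those assumptions and makes no attempt to repair broken HTML.

```python
import re
from html.parser import HTMLParser


class TextEmphasizer(HTMLParser):
    """Re-emit markup unchanged; wrap a word only where it occurs in text."""

    RAW_TEXT_TAGS = ("script", "style")  # contents are code, not prose

    def __init__(self, word: str):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self._pattern = re.compile(r"\b%s\b" % re.escape(word))
        self._in_raw_text = False
        self._out = []

    def handle_starttag(self, tag, attrs):
        self._in_raw_text = tag in self.RAW_TEXT_TAGS
        self._out.append(self.get_starttag_text())  # original tag source

    def handle_startendtag(self, tag, attrs):
        self._out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self._in_raw_text = False
        self._out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._in_raw_text:
            data = self._pattern.sub(
                lambda m: "<strong>%s</strong>" % m.group(0), data)
        self._out.append(data)

    def handle_entityref(self, name):
        self._out.append("&%s;" % name)

    def handle_charref(self, name):
        self._out.append("&#%s;" % name)

    def handle_comment(self, data):
        self._out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._out.append("<!%s>" % decl)

    def result(self) -> str:
        return "".join(self._out)


def emphasize_text(html: str, word: str = "table") -> str:
    parser = TextEmphasizer(word)
    parser.feed(html)
    parser.close()
    return parser.result()


print(emphasize_text('<td class="table">How big is the dining table?</td>'))
```

Because the parser decides what is a tag, an attribute, or a comment, the regex only ever sees text, so the class="table" attribute above is left untouched.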


Hi Mom!
Hauke Ingmar Schmidt
Rancher

Joined: Nov 18, 2008
Posts: 433
Andy Bach wrote: But the general answer is - don't try to parse html by hand, get a module/util to do it. It very, very, very quickly becomes very, very hard to cover all the possibilities by hand.


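And it is logically impossible with regex alone. Look at the Chomsky hierarchy of grammars to see why: tags can nest to any depth, so HTML is not a regular language, and a true regular expression cannot keep track of that nesting.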
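
A small Python sketch of what that means in practice, assuming attribute-less <div> tags purely for illustration (matching_close is a made-up helper): a pattern can find a closing tag, but finding the one that balances a given opening tag takes counting, and that counter is exactly what a finite automaton does not have.

```python
import re

nested = "<div>outer <div>inner</div> tail</div>"

# A non-greedy pattern stops at the first closing tag, so it cuts the
# outer element short instead of finding its real end.
naive = re.search(r"<div>.*?</div>", nested).group(0)


def matching_close(html: str, tag: str = "div") -> int:
    """Return the index just past the close tag balancing the first open tag.

    Only handles tags written without attributes; a sketch, not a parser.
    """
    depth = 0
    for m in re.finditer(r"<(/?)%s>" % re.escape(tag), html):
        depth += -1 if m.group(1) else 1  # the counter a plain regex lacks
        if depth == 0:
            return m.end()
    raise ValueError("unbalanced markup")


print(naive)                             # <div>outer <div>inner</div>
print(nested[:matching_close(nested)])   # the whole outer element
```

Some engines add recursion or balancing extensions that go beyond regular expressions in the strict sense, but at that point you are writing a parser in regex syntax.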
 
 