
Matching words except those in tags with regex

Bob Homes
Greenhorn

Joined: Jun 30, 2009
Posts: 5
Hello,

To begin, I'm not even sure regex is the best tool for this. I want to replace all occurrences of a word, except when it occurs between < and >. I found out that I can use \b for word boundary matches so that only full-word matches are replaced, but finding only the matches outside tags is difficult. I tried to implement some sort of lookahead and look behind scheme but it failed.

So if the string is "<Hello There World>Hello There World", and I want to replace "There" with "bob"; the final string would be "<Hello There World>Hello bob World"

Is there a simple way to do this?
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18977
    

To begin, I'm not even sure regex is the best tool for this.


It is probably not, but it is doable. For one level of tags (no nesting), it should be straightforward. For two levels, it is still doable. For three or more levels, it gets even harder, and is probably not worth it.
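As a point of comparison for the nested case, here is a minimal sketch that is not a pure regex solution: it uses a regex only to find the whole word and keeps a running count of unclosed '<' characters to decide whether a match is inside a tag. The class name DepthAwareReplace and the method replaceAtDepthZero are made up for illustration, and the sketch assumes the word itself contains no '<' or '>'.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DepthAwareReplace {

    // Replaces whole-word matches of `word` only where no '<' is still open.
    // Assumption: `word` contains no angle brackets; stray '>' are ignored.
    static String replaceAtDepthZero(String input, String word, String replacement) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(input);
        StringBuilder out = new StringBuilder();
        int depth = 0; // number of '<' seen that have not been closed yet
        int pos = 0;   // index of the first character not yet copied to `out`
        while (m.find()) {
            // Update the nesting depth for the text between the previous match and this one.
            for (int i = pos; i < m.start(); i++) {
                char c = input.charAt(i);
                if (c == '<') {
                    depth++;
                } else if (c == '>' && depth > 0) {
                    depth--;
                }
            }
            out.append(input, pos, m.start());
            // Outside all tags: substitute. Inside a tag: keep the original word.
            out.append(depth == 0 ? replacement : m.group());
            pos = m.end();
        }
        out.append(input, pos, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAtDepthZero("<a <There> There>There There", "There", "bob"));
        // prints: <a <There> There>bob bob
    }
}
```

Because the depth is counted rather than matched, the same code handles any number of nesting levels.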

I tried to implement some sort of lookahead and look behind scheme but it failed.


For one level, lookahead should work. Just search for the word, followed by a lookahead of anything but the close tag (zero or more), followed by (still part of lookahead) either the open tag, or end of input.
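One way to read that description as a Java pattern is sketched below. It assumes tags are not nested and every '<' has a matching '>'. The class name TagAwareReplace and the method replaceOutsideTags are invented for this example; Pattern.quote and Matcher.quoteReplacement are only there so that special characters in the word or the replacement are taken literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagAwareReplace {

    // Replaces whole-word occurrences of `word` that lie outside <...>.
    // Assumption: one level of tags only, and all tags are properly closed.
    static String replaceOutsideTags(String input, String word, String replacement) {
        Pattern p = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b" // the whole word
                + "(?=[^>]*(?:<|\\z))");            // lookahead: no '>' before the next '<' or end of input
        return p.matcher(input).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(
                replaceOutsideTags("<Hello There World>Hello There World", "There", "bob"));
        // prints: <Hello There World>Hello bob World
    }
}
```

Inside a tag, the lookahead runs into the closing '>' before it can reach a '<' or the end of input, so the match is rejected. Outside a tag, it reaches one of those two first and the word is replaced.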

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Bob Homes
Greenhorn

Joined: Jun 30, 2009
Posts: 5
Here is what I tried, hopefully per your instructions:



Expected output is <hello world lo hel>hello world bob hel
Actual output is <hello world bob hel> hello world bob hel

Most likely it is my regex syntax. I had never heard of look ahead and look behind until earlier today.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18977
    