regex pattern to exclude certain substrings from matches

Robert Kirkpatrick
Greenhorn

Joined: Jul 24, 2005
Posts: 3
Hello

Does anyone know how to write a regex pattern that will match certain strings as long as they don't contain a certain substring?

For example, how can you use a regex to get all the words between commas from the String below, as long as they don't contain the substring "BAD"?



...such that the matches returned will be:


"carBADrot" should not match.

Note that this is a very simplified example of what I need to do.

Thanks in advance!!!

Rob
Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
Use negative lookahead:
But be warned that lookaheads are slippery; you have to make sure they don't look too far ahead. For example, if I had used (?!.*?BAD) in the regex above, it would have failed to match "apple" and "banana" because the lookahead was seeing the BAD in "carBADrot".
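To illustrate the warning above, here is a minimal sketch that compares a bounded lookahead with an unbounded one. The bounded regex is the one quoted later in this thread; the input string, the class name LookaheadDemo and the helper findAll are assumptions made for the example, since the original sample string is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Collects every match of the given regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical input, assumed for illustration only.
        String input = "apple,banana,carBADrot,date";

        // Bounded lookahead: \w cannot cross a comma, so the check
        // stays inside the current word.
        System.out.println(findAll("\\b(?!\\w*?BAD)\\w+\\b", input));
        // prints [apple, banana, date]

        // Unbounded lookahead: .*? can cross commas, so every word that
        // starts before "BAD" is rejected as well.
        System.out.println(findAll("\\b(?!.*?BAD)\\w+\\b", input));
        // prints [date]
    }
}
```

With the unbounded version only the word after "carBADrot" survives, because that is the only word with no "BAD" anywhere to its right.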
Robert Kirkpatrick
Greenhorn

Joined: Jul 24, 2005
Posts: 3
Thanks Alan for such a fast reply!

Unfortunately, as you say, the negative lookahead is "slippery" and will not work. Is there any way to prevent the regex engine from looking, as you put it, "too far ahead"? In other words, is it possible for the engine to consider only the criteria of the pattern and not the entire string being examined?

Imagine if there were "X"s instead of commas



The following pattern (just as you predict) doesn't produce the desired results:


It's hard to believe that there isn't an economical and elegant way to do this via the java.util.regex package. It would be great if you or someone else could prove that there is a way. Thanks again!

Rob
Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
For the record, the regex that I actually used, "\\b(?!\\w*?BAD)\\w+\\b", works with the input in your first example. This is because "\\w" won't match a comma, so the lookahead can't see past the next delimiter. That won't work with your second example, since the delimiter is a word character, but the principle is the same: make sure the lookahead doesn't look past the next delimiter, and that the matched text is preceded and followed by a delimiter. Here's a more general approach that will work for your second example:
This regex should work for any data with a single-character delimiter--just insert the real delimiter in place of each 'X' (and the string you want to exclude in place of "BAD"). In particular cases you may be able to use your knowledge of the data to express it more economically, as I did with your first example, but I'm afraid elegant is out of the question.
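One way to apply the principle described above (keep the lookahead from crossing the next delimiter, and require a delimiter or string boundary on both sides of the match) might look like the sketch below. This is not necessarily the regex from the original post; the pattern, the input string and the class name DelimiterLookaheadDemo are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterLookaheadDemo {
    // Assumed pattern for the single-character delimiter 'X' and the
    // excluded substring "BAD":
    //   (?<=^|X)      the match starts at the string start or right after a delimiter
    //   (?![^X]*BAD)  no "BAD" before the next delimiter
    //   [^X]+         the field itself, one or more non-delimiter characters
    //   (?=X|$)       the match ends at a delimiter or at the end of the string
    static final Pattern FIELD_WITHOUT_BAD =
            Pattern.compile("(?<=^|X)(?![^X]*BAD)[^X]+(?=X|$)");

    // Returns every delimited field that does not contain "BAD".
    static List<String> goodFields(String input) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD_WITHOUT_BAD.matcher(input);
        while (m.find()) {
            fields.add(m.group());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical input with 'X' as the delimiter.
        System.out.println(goodFields("appleXbananaXcarBADrotXdate"));
        // prints [apple, banana, date]
    }
}
```

The negated class [^X] plays the role that \w played in the comma example: it cannot match the delimiter, so neither the lookahead nor the match itself can run past the end of the current field.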
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

Why not just write:


http://home.arcor.de/hirnstrom/bewerbung
Robert Kirkpatrick
Greenhorn

Joined: Jul 24, 2005
Posts: 3
Months on, and having browsed the web extensively and read Mehran Habibi's Java Regular Expressions, I still haven't found a solution.

The problem is this: how to match a Pattern that doesn't contain a specified substring.

If it were one character (x, for example), there would be no problem because you can use a negated character class, but there seems to be no way or workaround to specify a sequence of characters to exclude. Shame!

Rob
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Ah, it really looks like Alan's posts answered your question. For some reason you replaced (?!\\w*?BAD) with (?!.*?BAD), and not surprisingly it didn't work. However, you don't seem to have explained what's wrong with the code Alan actually posted.

And I also like Stefan's response. Very often a problem which is hard or impossible as a single regular expression is much simpler using two or more regular expressions and a bit of Java code to tie them together. If you really must have a single regex for this, use negative lookahead as Alan suggests. But otherwise, something like what Stefan suggests will probably be easier to understand and debug.
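As a rough sketch of the "simple regex plus a bit of Java code" alternative, the version below splits on the delimiter and then filters with String.contains. It is only an assumed illustration of that style, not a reconstruction of the code posted earlier in the thread; the class name SplitAndFilterDemo, the method wordsWithout and the input string are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SplitAndFilterDemo {
    // Splits the input on delimiterRegex and keeps the non-empty pieces
    // that do not contain the banned substring.
    static List<String> wordsWithout(String input, String delimiterRegex, String banned) {
        List<String> kept = new ArrayList<>();
        for (String word : input.split(delimiterRegex)) {
            if (!word.isEmpty() && !word.contains(banned)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical inputs, assumed for illustration only.
        System.out.println(wordsWithout("apple,banana,carBADrot,date", ",", "BAD"));
        // prints [apple, banana, date]
        System.out.println(wordsWithout("appleXbananaXcarBADrotXdate", "X", "BAD"));
        // prints [apple, banana, date]
    }
}
```

The delimiter and the excluded substring are handled in separate steps here, so changing either one does not require reworking a lookahead.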
[ December 14, 2005: Message edited by: Jim Yingst ]

"I'm not back." - Bill Harding, Twister
 