
regex pattern to exclude certain substrings from matches

 
Robert Kirkpatrick
Greenhorn
Posts: 3
Hello

Does anyone know how to write a regex pattern that will match certain strings as long as they don't contain a certain substring?

For example, how can you use a regex to get all the comma-separated words from the String below, as long as they don't contain the substring "BAD"?



...such that the matches returned will be:


"carBADrot" should not match.

Note that this is a very simplified example of what I need to do.

Thanks in advance!!!

Rob
 
Alan Moore
Ranch Hand
Posts: 262
Use negative lookahead:
But be warned that lookaheads are slippery; you have to make sure they don't look too far ahead. For example, if I had used (?!.*?BAD) in the regex above, it would have failed to match "apple" and "banana" because the lookahead was seeing the BAD in "carBADrot".
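A minimal sketch of the difference, for anyone reading along. The regex "\\b(?!\\w*?BAD)\\w+\\b" is the one quoted later in this thread; the sample input, the class name ExcludeBadWords and the helper findMatches are assumptions made up for illustration, since the original example string is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExcludeBadWords {
    // Hypothetical helper: collects every match of the given regex in the input.
    static List<String> findMatches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input; the string from the original post is not available.
        String input = "apple,banana,carBADrot,grape";

        // The lookahead scans word characters only, so it stops at the next comma.
        System.out.println(findMatches("\\b(?!\\w*?BAD)\\w+\\b", input));

        // The lookahead scans any characters, so it sees BAD in a later word.
        System.out.println(findMatches("\\b(?!.*?BAD)\\w+\\b", input));
    }
}
```

With this input the first pattern should return apple, banana and grape, while the second returns only grape, because every word before "carBADrot" is rejected by the lookahead that runs too far.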
 
Robert Kirkpatrick
Greenhorn
Posts: 3
Thanks Alan for such a fast reply!

Unfortunately, as you say, the negative lookahead is "slippery" and will not work. Is there any way to prevent the regex engine looking, as you put it, "too far ahead"? In other words, is it possible for the engine only to consider just the criteria of the pattern and not the entire string being examined?

Imagine if there were "X"s instead of commas



The following pattern (just as you predict) doesn't produce the desired results :


It's hard to believe that there isn't an economical and elegant way to do this with the java.util.regex package. It would be great if you or someone else could prove that there is a way. Thanks again!

Rob
 
Alan Moore
Ranch Hand
Posts: 262
For the record, the regex that I actually used, "\\b(?!\\w*?BAD)\\w+\\b", works with the input in your first example. This is because "\\w" won't match a comma, so the lookahead can't see past the next delimiter. That won't work with your second example, since the delimiter is a word character, but the principle is the same: make sure the lookahead doesn't look past the next delimiter, and that the matched text is preceded and followed by a delimiter. Here's a more general approach that will work for your second example:
This regex should work for any data with a single-character delimiter--just insert the real delimiter in place of each 'X' (and the string you want to exclude in place of "BAD"). In particular cases you may be able to use your knowledge of the data to express it more economically, as I did with your first example, but I'm afraid elegant is out of the question.
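The regex from this post did not survive in the thread as shown here, so the following is only a sketch of the same principle, not necessarily the pattern that was posted. It assumes 'X' as the delimiter and "BAD" as the excluded string; the sample input and the names ExcludeBadFields and findFields are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExcludeBadFields {
    // (?:^|(?<=X))    the match must start at the beginning or right after a delimiter
    // (?:(?!BAD)[^X])+ consume non-delimiter characters, refusing to step onto "BAD"
    // (?=X|$)         the match must end right before a delimiter or at the end
    static final Pattern FIELD_WITHOUT_BAD =
            Pattern.compile("(?:^|(?<=X))(?:(?!BAD)[^X])+(?=X|$)");

    // Hypothetical helper: returns every delimited field that does not contain "BAD".
    static List<String> findFields(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD_WITHOUT_BAD.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input with 'X' as the delimiter.
        System.out.println(findFields("appleXbananaXcarBADrotXgrape"));
    }
}
```

Because the lookahead is checked one character at a time and [^X] can never cross a delimiter, the exclusion test stays inside the current field. To adapt it, replace each X with the real delimiter and BAD with the real excluded string, quoting them with Pattern.quote if they contain regex metacharacters.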
 
Stefan Wagner
Ranch Hand
Posts: 1923
Why not just write:
 
Robert Kirkpatrick
Greenhorn
Posts: 3
Months on, having browsed the web extensively and read Mehran Habibi's Java Regular Expressions, I still haven't found a solution.

The problem is this: how to match a Pattern that doesn't contain a specified substring.

If it were a single character (x, for example), there would be no problem, because you can use a negated character class such as [^x], but there seems to be no way or workaround to specify a sequence of characters to exclude. Shame!

Rob
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Ah, it really looks like Alan's posts answered your question. For some reason you replaced (?!\\w*?BAD) with (?!.*?BAD), and not surprisingly it didn't work. However, you don't seem to have explained what's wrong with the code Alan actually posted.

And I also like Stefan's response. Very often a problem which is hard or impossible as a single regular expression is much simpler using two or more regular expressions and a bit of Java code to tie them together. If you really must have a single regex for this, use negative lookahead as Alan suggests. But otherwise, something like what Stefan suggests will probably be easier to understand and debug.
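As a rough illustration of the split-then-filter idea: the code from Stefan's post is not shown in this thread, so this is an assumed sketch rather than what was actually posted. The names SplitAndFilter and fieldsWithout, and the sample input, are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitAndFilter {
    // Hypothetical helper: split on a delimiter regex, then keep the non-empty
    // fields that do not contain the excluded substring.
    static List<String> fieldsWithout(String input, String delimiterRegex, String excluded) {
        List<String> kept = new ArrayList<>();
        for (String field : input.split(delimiterRegex)) {
            if (!field.isEmpty() && !field.contains(excluded)) {
                kept.add(field);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed sample inputs, one per delimiter discussed in the thread.
        System.out.println(fieldsWithout("apple,banana,carBADrot,grape", ",", "BAD"));
        System.out.println(fieldsWithout("appleXbananaXcarBADrotXgrape", "X", "BAD"));
    }
}
```

The delimiter and the exclusion are handled in two separate steps here, so changing either one does not require reworking a lookahead.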
[ December 14, 2005: Message edited by: Jim Yingst ]
 