RegEx to negate a set of words

Bucsie Dusca
Ranch Hand

Joined: Oct 18, 2004
Posts: 31
Hi
I'm trying to concoct a regular expression that does the following:
given a data set, I want it to retrieve the words that are not in another given set of words:
For example, if I have the input:
all
great
minds
think
alike
and the constraint: (great|think)
the output would be:
all
minds
alike


So, roughly: give me * but not if * = great or think

thanks
Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262

All the "\\b"s are there to make sure you're only comparing whole words to whole words; lookaheads are slippery that way.
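For anyone who wants to try it, here is a minimal sketch of how the pattern quoted in the next reply could be applied to the sample input from the first post. The class name NegatedWords and the method allowedWords are made up for illustration; only java.util.regex is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedWords {
    // A whole word that is not exactly "great" or "think".
    // The negative lookahead (?!...) rejects a position where one of the
    // excluded words starts and is immediately followed by a word boundary.
    static final Pattern ALLOWED =
            Pattern.compile("\\b(?!(?:great|think)\\b)\\w+\\b");

    // Collects every word in the input that the pattern accepts, in order.
    static List<String> allowedWords(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = ALLOWED.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [all, minds, alike]
        System.out.println(allowedWords("all\ngreat\nminds\nthink\nalike"));
    }
}
```

Because of the "\\b" inside the lookahead, a longer word such as "greatness" is still accepted; only the exact words "great" and "think" are dropped.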
Tad Dicks
Ranch Hand

Joined: Nov 16, 2004
Posts: 264
"\\b(?!(?:great|think)\\b)\\w+\\b"


what does the "?:" mean?

Does that mean remove?
If the pipe were changed to a comma between great and think, what effect would that have on the regex?
Alan Moore
Ranch Hand

Joined: May 06, 2004
Posts: 262
The "?:" means the enclosing parentheses form a non-capturing group. It's good practice to always use non-capturing parentheses for grouping if you don't actually need to capture that part of the match.

If you replaced the pipe with a comma, the lookahead would match the literal sequence "great,think" instead of "great" OR "think"; the only place a comma has special meaning is in the "{m,n}" quantifier.
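A small sketch that makes both points checkable. GroupingDemo and its helper methods are hypothetical names; the three patterns are the one from this thread plus two variants written only for comparison.

```java
import java.util.regex.Pattern;

public class GroupingDemo {
    static final String NON_CAPTURING = "\\b(?!(?:great|think)\\b)\\w+\\b";
    static final String CAPTURING     = "\\b(?!(great|think)\\b)\\w+\\b";
    static final String WITH_COMMA    = "\\b(?!(?:great,think)\\b)\\w+\\b";

    // Number of capturing groups the regex defines.
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // True if the pattern accepts the entire word.
    static boolean accepts(String regex, String word) {
        return Pattern.compile(regex).matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(groups(NON_CAPTURING));           // 0: "?:" groups without capturing
        System.out.println(groups(CAPTURING));               // 1: plain parentheses capture
        System.out.println(accepts(NON_CAPTURING, "great")); // false: "great" is excluded
        System.out.println(accepts(WITH_COMMA, "great"));    // true: only the literal "great,think" is excluded
    }
}
```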


http://www.regular-expressions.info/tutorial.html
Tad Dicks
Ranch Hand

Joined: Nov 16, 2004
Posts: 264
If you replaced the pipe with a comma, the lookahead would match the literal sequence "great,think" instead of "great" OR "think"; the only place a comma has special meaning is in the "{m,n}" quantifier.


I've been staring at too many DTDs, which use a lot of regex-like syntax, and was thinking the comma might be akin to an "and" (like a sequence in an element declaration, versus the pipes being "or" in a choice).

-Tad
Layne Lund
Ranch Hand

Joined: Dec 06, 2001
Posts: 3061
This might be easier to do without regexes, depending on what the purpose is. In particular, every class that implements the Collection interface from the Collections framework has removeAll() and retainAll(), which act like the mathematical set difference and set intersection operations. You could add each "word" to a Collection of your choice (perhaps a Set?) and use these operations to get a Collection with the words you want.
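As a rough sketch of that idea, using the sample words from the first post. SetDifferenceDemo, without() and common() are names invented here; LinkedHashSet is just one possible choice, picked because it keeps insertion order (note that a Set also collapses duplicate words).

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

public class SetDifferenceDemo {
    // Set difference: the words that are NOT in the excluded collection.
    static Set<String> without(Collection<String> words, Collection<String> excluded) {
        Set<String> result = new LinkedHashSet<>(words); // copy, so the input is untouched
        result.removeAll(excluded);
        return result;
    }

    // Set intersection: the words that ARE in the other collection.
    static Set<String> common(Collection<String> words, Collection<String> other) {
        Set<String> result = new LinkedHashSet<>(words);
        result.retainAll(other);
        return result;
    }

    public static void main(String[] args) {
        Collection<String> words = Arrays.asList("all", "great", "minds", "think", "alike");
        Collection<String> excluded = Arrays.asList("great", "think");
        System.out.println(without(words, excluded)); // [all, minds, alike]
        System.out.println(common(words, excluded));  // [great, think]
    }
}
```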

Let me know what you think.

Layne


Akshay Kiran
Ranch Hand

Joined: Aug 18, 2005
Posts: 220
Wow, that's such a refreshingly simple solution.
Spectacular, I must say, but doesn't it trade off efficiency?
Wouldn't Collections hog more memory than plain long String arrays?


"It's not enough that we do our best; sometimes we have to do what's required."
-- Sir Winston Churchill
Layne Lund
Ranch Hand

Joined: Dec 06, 2001
Posts: 3061
I think you will have to implement both approaches and measure how much memory overhead there is for using a Collection over an array of Strings. I think the overhead will be negligible. If implemented correctly, I think my idea will be much easier to understand and maintain, which outweighs the costs for the extra memory overhead.
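A very rough starting point for such a comparison might look like the sketch below. Everything in it is an assumption made for illustration: the class and method names, the generated words "w0" to "w999", and excluding every other word. It only checks that both approaches give the same answer and prints elapsed time; it is not a rigorous benchmark, and measuring memory would need a profiler.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApproachComparison {
    // Regex approach: one alternation built from the excluded words.
    // Assumes a non-empty list of words made only of \w characters,
    // so no quoting is needed.
    static List<String> viaRegex(String text, List<String> excluded) {
        Pattern p = Pattern.compile(
                "\\b(?!(?:" + String.join("|", excluded) + ")\\b)\\w+\\b");
        List<String> result = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // Collections approach: copy the words, then remove the excluded ones.
    static List<String> viaCollections(List<String> words, List<String> excluded) {
        List<String> result = new ArrayList<>(words);
        result.removeAll(new HashSet<>(excluded)); // HashSet gives fast contains()
        return result;
    }

    // Generates the hypothetical words w0, w1, ..., w(n-1).
    static List<String> sampleWords(int n) {
        List<String> words = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            words.add("w" + i);
        }
        return words;
    }

    // Picks the words at even indices as the excluded set.
    static List<String> everyOther(List<String> words) {
        List<String> picked = new ArrayList<>();
        for (int i = 0; i < words.size(); i += 2) {
            picked.add(words.get(i));
        }
        return picked;
    }

    public static void main(String[] args) {
        List<String> words = sampleWords(1000);
        List<String> excluded = everyOther(words);
        String text = String.join(" ", words);

        long t0 = System.nanoTime();
        List<String> a = viaRegex(text, excluded);
        long t1 = System.nanoTime();
        List<String> b = viaCollections(words, excluded);
        long t2 = System.nanoTime();

        System.out.println("same result: " + a.equals(b));
        System.out.println("regex ns: " + (t1 - t0) + ", collections ns: " + (t2 - t1));
    }
}
```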

In addition, you need to consider what the purpose for this is. At least, I assume that this is a small part of a larger project. Which approach will provide a data structure that other code can interface with more easily? Since you haven't provided much in the way of context, I can't even provide a suggestion along these lines. Even if I did, this boils down to a design decision on your part.

Layne
Akshay Kiran
Ranch Hand

Joined: Aug 18, 2005
Posts: 220
The problem at hand wasn't mine, so I can't say much about it either.
But imagine: if there were 1000 words in a list and a String of 1000 words under consideration, how would the compiler go about implementing the two approaches?
On grounds of readability, certainly yes, your approach would be far better than regex.
The only moot point may be: will there be an objectionable memory overhead, and if so, is it worth the trade-off?
I think these questions will be best answered by those who dirty their hands with such stuff. I can't keep my hands clean here and provide answers!
[ October 24, 2005: Message edited by: Akshay Kiran ]
 
 