Hi, I'm trying to concoct a regular expression that does the following: given some input text, retrieve the words that are not in another given set of words. For example, given the input "all great minds think alike" and the constraint (great|think), the output would be: all minds alike
So, sort of "give me *, but not if * = great or think".
Thanks
Alan Moore
Ranch Hand
Joined: May 06, 2004
Posts: 262
"\\b(?!(?:great|think)\\b)\\w+\\b"
All the "\\b"s are there to make sure you're only comparing whole words to whole words; lookaheads are slippery that way.
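For anyone who wants to try it, here is a minimal sketch of how the pattern quoted in this thread could be applied with java.util.regex. The class and method names (NegativeLookaheadDemo, wordsNotExcluded) are made up for illustration and are not from the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself comes from the thread.
class NegativeLookaheadDemo {
    // A whole word that is not exactly "great" or "think".
    private static final Pattern ALLOWED_WORD =
            Pattern.compile("\\b(?!(?:great|think)\\b)\\w+\\b");

    // Collects every word in the input that the lookahead does not reject.
    static List<String> wordsNotExcluded(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = ALLOWED_WORD.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [all, minds, alike]
        System.out.println(wordsNotExcluded("all great minds think alike"));
        // prints [greatness, thinks]: the \b inside the lookahead means
        // only the whole words "great" and "think" are rejected
        System.out.println(wordsNotExcluded("greatness thinks"));
    }
}
```

The second call is the case the word boundaries guard against: without the "\\b" inside the lookahead, "greatness" would be rejected too, because it merely starts with "great".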
Tad Dicks
Ranch Hand
Joined: Nov 16, 2004
Posts: 264
"\\b(?!(?:great|think)\\b)\\w+\\b"
What does the "?:" mean?
Does that mean remove? If the pipe were changed to a comma between great and think, what effect would that have on the regex?
Alan Moore
Ranch Hand
Joined: May 06, 2004
Posts: 262
The "?:" means the enclosing parentheses form a non-capturing group. It's good practice to always use non-capturing parentheses for grouping if you don't actually need to capture that part of the match.
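A small sketch of the difference, assuming nothing beyond java.util.regex: Matcher.groupCount() reports how many capturing groups a pattern defines, so it shows that "(?:...)" groups without capturing. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for comparing capturing and non-capturing groups.
class GroupingDemo {
    // Number of capturing groups the regex defines.
    static int capturingGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Plain parentheses capture: prints 1
        System.out.println(capturingGroups("(great|think)"));
        // "?:" turns the same parentheses into grouping only: prints 0
        System.out.println(capturingGroups("(?:great|think)"));
        // Both forms accept the same text: prints true, then true
        System.out.println(Pattern.matches("(great|think)", "think"));
        System.out.println(Pattern.matches("(?:great|think)", "think"));
    }
}
```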
If you replaced the pipe with a comma, the lookahead would match the literal sequence "great,think" instead of "great" OR "think"; the only place a comma has special meaning is in the "{m,n}" quantifier.
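To see the comma behaviour for yourself, a quick sketch along these lines should do. Pattern.matches() tests the whole input against the regex; the class and helper names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class contrasting "|" with "," in a regex.
class CommaVersusPipe {
    // True when the entire input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The pipe is alternation: prints true
        System.out.println(matchesWhole("great|think", "great"));
        // The comma is a literal character: prints false, then true
        System.out.println(matchesWhole("great,think", "great"));
        System.out.println(matchesWhole("great,think", "great,think"));
        // Inside {m,n} the comma separates the bounds: prints true, then false
        System.out.println(matchesWhole("a{2,3}", "aaa"));
        System.out.println(matchesWhole("a{2,3}", "aaaa"));
    }
}
```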
I've been staring at too many DTDs, which use a lot of regex-like syntax, and was thinking the comma might be akin to an "and" (like a sequence in an element declaration, versus the pipes being "or" in a choice).
-Tad
Layne Lund
Ranch Hand
Joined: Dec 06, 2001
Posts: 3061
This might be easier to do without regexes, depending on what the purpose is. In particular, every class that implements the Collection interface from the Collections framework has removeAll() and retainAll() methods that act like the mathematical set difference and set intersection operations. You could add each "word" to a Collection of your choice (perhaps a Set?) and use these operations to get a Collection with the words you want.
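A minimal sketch of that idea, assuming the input is non-empty and its words are separated by whitespace. The class and method names are made up for illustration. A LinkedHashSet is used so the surviving words keep their original order; note that any Set also collapses repeated words.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical demo class for the Collection-based approach.
class SetDifferenceDemo {
    // Splits the input on whitespace into an insertion-ordered set.
    private static Set<String> toWordSet(String input) {
        return new LinkedHashSet<>(Arrays.asList(input.trim().split("\\s+")));
    }

    // removeAll: keeps the words that are NOT in the excluded set.
    static Set<String> wordsNotIn(String input, Set<String> excluded) {
        Set<String> words = toWordSet(input);
        words.removeAll(excluded);
        return words;
    }

    // retainAll: keeps only the words that ARE in the given set.
    static Set<String> wordsAlsoIn(String input, Set<String> wanted) {
        Set<String> words = toWordSet(input);
        words.retainAll(wanted);
        return words;
    }

    public static void main(String[] args) {
        Set<String> constraint = new HashSet<>(Arrays.asList("great", "think"));
        // prints [all, minds, alike]
        System.out.println(wordsNotIn("all great minds think alike", constraint));
        // prints [great, think]
        System.out.println(wordsAlsoIn("all great minds think alike", constraint));
    }
}
```

If repeated words in the input need to survive, a List would be the Collection to reach for instead of a Set.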
Wow, that's such a refreshingly simple solution. Spectacular, I must say, but doesn't it trade off efficiency? Wouldn't Collections hog more memory than plain String arrays?
"It's not enough that we do our best; sometimes we have to do what's required."
-- Sir Winston Churchill
Layne Lund
Ranch Hand
Joined: Dec 06, 2001
Posts: 3061
I think you will have to implement both approaches and measure how much memory overhead there is for using a Collection over an array of Strings. I think the overhead will be negligible. If implemented correctly, I think my idea will be much easier to understand and maintain, which outweighs the costs for the extra memory overhead.
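As a starting point for that kind of comparison, here is a rough sketch that runs both approaches on generated data and checks that they agree. It only takes crude System.nanoTime() timings, which is not a rigorous benchmark, and it does not measure memory at all. The class name, method names and the "w0", "w1", ... test words are all made up for illustration, and the regex builder assumes a non-empty exclusion list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical harness comparing the regex and Collection approaches.
class ApproachComparison {
    // Regex approach: one negative lookahead built from the excluded words.
    static List<String> filterWithRegex(String input, List<String> excluded) {
        StringBuilder alternation = new StringBuilder();
        for (String word : excluded) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(word)); // treat each word literally
        }
        Pattern pattern =
                Pattern.compile("\\b(?!(?:" + alternation + ")\\b)\\w+\\b");
        List<String> kept = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            kept.add(m.group());
        }
        return kept;
    }

    // Collection approach: split on whitespace, then removeAll().
    static List<String> filterWithCollection(String input, List<String> excluded) {
        List<String> kept =
                new ArrayList<>(Arrays.asList(input.trim().split("\\s+")));
        kept.removeAll(new HashSet<>(excluded)); // HashSet gives fast lookups
        return kept;
    }

    // Generates "w<from>" up to "w<to - 1>" as made-up test words.
    static List<String> numberedWords(int from, int to) {
        List<String> words = new ArrayList<>();
        for (int i = from; i < to; i++) {
            words.add("w" + i);
        }
        return words;
    }

    public static void main(String[] args) {
        String input = String.join(" ", numberedWords(0, 1000));
        List<String> excluded = numberedWords(500, 1500);

        long t0 = System.nanoTime();
        List<String> viaRegex = filterWithRegex(input, excluded);
        long t1 = System.nanoTime();
        List<String> viaCollection = filterWithCollection(input, excluded);
        long t2 = System.nanoTime();

        System.out.println("same result: " + viaRegex.equals(viaCollection));
        System.out.println("regex, ns: " + (t1 - t0));
        System.out.println("collection, ns: " + (t2 - t1));
    }
}
```

For numbers you can trust, the runs would need warm-up iterations and repetition, or a proper benchmarking harness.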
In addition, you need to consider what the purpose for this is. At least, I assume that this is a small part of a larger project. Which approach will provide a data structure that other code can interface with more easily? Since you haven't provided much in the way of context, I can't even provide a suggestion along these lines. Even if I did, this boils down to a design decision on your part.
Layne
Akshay Kiran
Ranch Hand
Joined: Aug 18, 2005
Posts: 220
The problem at hand wasn't mine, so I can't say much about the context either. But imagine a list of 1000 words and a String of 1000 words under consideration: how would the JVM go about executing the two approaches? On grounds of readability, certainly yes, your approach would be far better than regex. The only moot point may be: will there be an objectionable memory overhead, and if so, is it worth the trade-off? I think those questions are best answered by those who get their hands dirty with such stuff. I can't keep my hands clean here and provide answers!
[ October 24, 2005: Message edited by: Akshay Kiran ]