Regex: Keep Single dashes Negative Lookahead

Bill Hogsett
Greenhorn

Joined: Oct 17, 2011
Posts: 9
I am breaking a text file into words for processing. But I want words that contain a single dash to be treated as a single word.

For example "Oh-wow" should be one word and "But--not--this" should be three words. The double dashes could be at the start, middle or end of a word.

I think I need to use negative look ahead, but am not sure of that and am not sure how to do it.

My current pattern is:



But it does not work.

My normal test file is the Gutenberg Project's Moby Dick.txt.

Any suggestions?

Thanks.

Bill Hogsett
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3616
    

I haven't tested this, and I don't use regexes that often, but maybe this will give you an idea:
This pattern essentially says: Match anything that starts with at least one letter, followed by zero or more groups that start with a dash and at least one letter.
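The listing for this pattern is missing from the post. A minimal sketch that fits the description, assuming ASCII letters only and `Matcher.find()` to walk the text (the class and method names are hypothetical, not Stephan's):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DashWordDemo {
    // One or more letters, then zero or more groups of "-" plus one or more letters.
    // Assumption: ASCII letters only.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+(?:-[A-Za-z]+)*");

    // Collects every match of WORD, in order of appearance.
    static List<String> words(String text) {
        List<String> result = new ArrayList<String>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Oh-wow"));          // [Oh-wow]
        System.out.println(words("But--not--this"));  // [But, not, this]
    }
}
```

A dash is only consumed when a letter follows it, so a double dash ends the word without any lookahead.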
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7716
    

Bill Hogsett wrote:Any suggestions?

Yes. Don't try to do it all with regexes (or at least not all at once).

I agree with Harsha that String.split() is probably what you want initially, although I think I'd probably go with sentence.split("\\s+") myself.

That splits your text into whitespace-delimited "words". Once you have those, then decide what a word really is.

You might even want to return the words as a List, so that you can split up existing ones if need be. For example:
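The example that followed is missing from the post. A minimal sketch of the idea, assuming a hypothetical helper `toWords` that splits on whitespace first and then breaks each token on runs of two or more dashes:

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceWords {
    // Step 1: whitespace-delimited tokens.
    // Step 2: break each token on runs of two or more dashes, dropping empty pieces.
    static List<String> toWords(String sentence) {
        List<String> words = new ArrayList<String>();
        for (String token : sentence.trim().split("\\s+")) {
            for (String part : token.split("-{2,}")) {
                if (!part.isEmpty()) {
                    words.add(part);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(toWords("Oh-wow But--not--this")); // [Oh-wow, But, not, this]
    }
}
```

Because the result is a mutable list, later passes can replace or split individual entries without touching the first split.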
Winston

Isn't it funny how there's always time and money enough to do it WRONG?
Harsha Smith
Ranch Hand

Joined: Jul 18, 2011
Posts: 287
That was my 100th post. Hope OP and others find it useful.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7716
    

Harsha Smith wrote:That was my 100th post.

Congrats.

Winston
Bill Hogsett
Greenhorn

Joined: Oct 17, 2011
Posts: 9
Thanks Harsha! 100 posts. That is great. Having communities like this really helps.

Here is what I am using now:



and then I call:




}

Using your \\s+ for the original split didn't strip out punctuation.

Is there a better way to do the first split?

Thanks.

Bill
Bill Hogsett
Greenhorn

Joined: Oct 17, 2011
Posts: 9
Harsha, I now have one (I hope) problem.

In the following from Moby Dick, I end up with a word "-Westers".

" So that Monsoons, Pampas, Nor'-Westers,
Harmattans, Trades; any wind but the Levanter and Simoon, might
blow Moby Dick into the devious zig-zag world-circle of the Pequod's
circumnavigating wake."

And here I get "-wester":

"Here comes another with a sou'-wester and a bombazine cloak."

While westers and wester are not common words, I would like to treat them as words and get rid of the leading dash. But since I may be handling large documents I don't want to slow the split down much.

Bill
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7716
    

Bill Hogsett wrote:While westers and wester are not common words, I would like to treat them as words and get rid of the leading dash. But since I may be handling large documents I don't want to slow the split down much.

I honestly wouldn't worry about it. What is Moby Dick: 100,000 words? Any loop will process that in a split-second. It's far more likely that your delay will be with I/O.

Winston
Harsha Smith
Ranch Hand

Joined: Jul 18, 2011
Posts: 287
single regex to answer all your questions
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38513
    
And that even keeps “zig-zag” as one word.
Bill Hogsett
Greenhorn

Joined: Oct 17, 2011
Posts: 9
Harsha Smith wrote:single regex to answer all your questions


Thanks Harsha, that got me closer, but missed a few characters (e.g., ' "_). I am now using:

"([\\[_\"()*#.,?!:;]|\\s|'\\-|\\-\\-|'|'\\-\\-)"

My program reports two words that I cannot understand. They are:

-and 2
-when 1

The numbers are the number of usages in Moby Dick. Looking at the document and searching for --and|when I don't see any pattern that would get those results. Melville used "--" preceded by a character (e.g., '-- :-- ;-- !--) but none of them seem to show me a pattern to use or to get those results.

Any suggestions? (I can live with what I have, but ...)
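One possible explanation, offered as an assumption rather than something confirmed in the thread: Java tries the alternatives left to right, and `'\\-` is listed before `'\\-\\-`, so a `'--` in the text is consumed only as `'-` and the second dash stays attached to the next word. A small sketch (the sample string and the class name are made up for illustration):

```java
import java.util.Arrays;

class AlternationOrderDemo {
    // The delimiter from the post, unchanged.
    static final String ORIGINAL = "([\\[_\"()*#.,?!:;]|\\s|'\\-|\\-\\-|'|'\\-\\-)";
    // Same alternatives, with the longer '-- moved ahead of '- and '.
    static final String REORDERED = "([\\[_\"()*#.,?!:;]|\\s|'\\-\\-|'\\-|\\-\\-|')";

    public static void main(String[] args) {
        String sample = "whale'--and"; // hypothetical input, not a quote from the book
        System.out.println(Arrays.toString(sample.split(ORIGINAL)));   // [whale, -and]
        System.out.println(Arrays.toString(sample.split(REORDERED)));  // [whale, and]
    }
}
```

If that is the cause, putting the longest alternative first should make the stray leading dash go away.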

One last question. Your code did not output anything when I ran it in NetBeans. I had to do String[] wordarr = words.split(regex) before the loop and then use wordarr in the loop. Does that make sense to you? I haven't tested outside of NetBeans. Running 1.6 in NetBeans.

Thanks again.

Bill
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7716
    

Bill Hogsett wrote:Thanks Harsha, that got me closer, but missed a few characters (e.g., ' "_). I am now using:

"([\\[_\"()*#.,?!:;]|\\s|'\\-|\\-\\-|'|'\\-\\-)"

My program reports two words that I cannot understand. They are:

-and 2
-when 1

I'll say it one more time, just in case you missed it earlier: trying to do all this with a single regex is likely to be:
(a) time-consuming,
(b) error-prone, and
(c) hard for anyone else to decipher and/or change if they need to (the code, or at least the expression).
And I say this as a 15-year Unix System Administrator, so I love regex.

If you do as was suggested earlier and break the problem down into 2 parts:
1. Get your whitespace-delimited words.
2. Check each word for a valid pattern.
I suspect you'll have a far more flexible solution.

Just one of the things you would then be able to do is to print out the actual word (or words) that contains "-and", along with some indication of where it was found.
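A rough sketch of that two-part approach, assuming a "valid" word is letters joined by single dashes or apostrophes (the pattern, class name and sample lines are illustrative, not taken from Bill's program):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class TokenChecker {
    // Assumption: a valid word is ASCII letters joined by single dashes or apostrophes.
    private static final Pattern VALID = Pattern.compile("[A-Za-z]+(?:['-][A-Za-z]+)*");

    // Returns "line N: token" for every whitespace-delimited token that fails the check.
    static List<String> suspects(String[] lines) {
        List<String> report = new ArrayList<String>();
        for (int i = 0; i < lines.length; i++) {
            for (String token : lines[i].trim().split("\\s+")) {
                if (!token.isEmpty() && !VALID.matcher(token).matches()) {
                    report.add("line " + (i + 1) + ": " + token);
                }
            }
        }
        return report;
    }

    public static void main(String[] args) {
        String[] lines = { "a sou'-wester and", "the whale'--and then" };
        for (String s : suspects(lines)) {
            System.out.println(s);
        }
    }
}
```

Printing the whole offending token with its line number shows what sits in front of "-and" in the source text, which a word count alone cannot.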

Winston
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3616
    

I agree with Winston about using single regexes, but here is how you could do it using a scanner (yes, using a single regex, sorry):
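The listing itself is missing here. The delimiter can be recovered from the error message quoted further down the thread, so what follows is a reconstruction under that assumption, not Stephan's exact code. It uses `\p{Alpha}` (the variant reported to work later in the thread) because `\p{IsAlphabetic}` needs Java 7 or later:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerWords {
    // Delimiter: two or more dashes, or a run of anything that is not
    // a letter, an apostrophe or a dash.
    static final String DELIMITER = "-{2,}|[^\\p{Alpha}'-]+";

    static List<String> words(String text) {
        List<String> result = new ArrayList<String>();
        Scanner scanner = new Scanner(text).useDelimiter(DELIMITER);
        while (scanner.hasNext()) {
            String token = scanner.next();
            // Adjacent delimiters such as "!--" can produce empty tokens; skip them.
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        scanner.close();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("He--looking away--said stop")); // [He, looking, away, said, stop]
        System.out.println(words("Oh-my-gosh, don't!"));          // [Oh-my-gosh, don't]
    }
}
```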
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7716
    

Stephan van Hulst wrote:...but here is how you could do it using a scanner (yes, using a single regex, sorry)

Hey, no worries about a single regex, providing it's not too arcane. I quite like yours actually.

However, another thought struck me: so far most posts have been concentrating on "getting the delimiter pattern right". If you simply eliminate the whitespace, you could instead concentrate on getting the "word pattern" right. I have no idea whether it's any easier, but it appeals, simply because you're looking for something that's 'correct', rather than trying to eliminate something that's incorrect.
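To make the idea concrete, here is a sketch that describes the word rather than the delimiter. The word pattern is an assumption (letters, optionally joined by a single dash or apostrophe), and the names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPatternDemo {
    // Assumption: a word is letters, optionally joined by one dash or one apostrophe at a time.
    private static final Pattern WORD = Pattern.compile("\\p{Alpha}+(?:['-]\\p{Alpha}+)*");

    // Everything that does not match WORD is simply skipped over by find().
    static List<String> words(String text) {
        List<String> result = new ArrayList<String>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("killed!--a big whale--:Moby Dick")); // [killed, a, big, whale, Moby, Dick]
        System.out.println(words("don't zig-zag Nor'-Westers"));       // [don't, zig-zag, Nor, Westers]
    }
}
```

With this pattern, leading and trailing apostrophes never become part of a word, so there is no separate cleanup step.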

Winston
Harsha Smith
Ranch Hand

Joined: Jul 18, 2011
Posts: 287
Can you specify all the requirements and explain in detail, with examples, how you want the words to be split? One of us will definitely provide you with a very good regex pattern based on the spec.

Please include a big sample text.

Harsha Smith
Ranch Hand

Joined: Jul 18, 2011
Posts: 287
Tell us if this helps
[edit]Add newlines to make post easier to read. Please avoid long lines in code tags.[/edit]
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7716
    

Harsha Smith wrote:Can you specify all the requirements and explain in detail, with examples, how you want the words to be split?

Another wrinkle for you (assuming this is English): possessives can sometimes end with an apostrophe, eg "The farmers' fields".

Winston
Bill Hogsett
Greenhorn

Joined: Oct 17, 2011
Posts: 9
Stephan van Hulst wrote:I agree with Winston about using single regexes, but here is how you could do it using a scanner (yes, using a single regex, sorry):


Thanks, but I cannot get this to run. The error is:

Exception in thread "main" java.util.regex.PatternSyntaxException: Unknown character property name {Alphabetic} near index 23
-{2,}|[^\p{IsAlphabetic}'-]+

Bill
Bill Hogsett
Greenhorn

Joined: Oct 17, 2011
Posts: 9
Harsha Smith wrote:Can you specify all the requirements and explain in detail, with examples, how you want the words to be split? One of us will definitely provide you with a very good regex pattern based on the spec.

Please include a big sample text.



Specification? I don't need no specification! In the words of a U.S. Supreme Court Justice on another topic, "I know it when I see it."

Seriously, I want to parse a text file and return words that English speakers would normally identify as words. So here are some examples:

"afterwards, he smoked." as three words with no punctuation. So remove punctuation.
"don't" and other contractions are maintained. (I do not think this is handled currently.)
"Oh-my-gosh" as one word.
"He--looking away--said stop" as five words.
"killed!--a big whale--:Moby Dick" as six words with no punctuation.
"Nor'--Wester": not sure here. Certainly "Wester" as a word, but let's go with "Nor" and not "Nor'" as a word.

You asked for a big test file. I can't figure out uploading here. Both .txt and .zip files are rejected. So, get Moby Dick from Project Gutenberg.

Thanks to everyone who has made suggestions. I have not overlooked the suggestion to simplify the regex and do this in steps.

Bill

Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3616
    

See, what English speakers would normally identify as words, that doesn't really compute, unless you incorporate a dictionary and some pretty complex code.

The code I gave you should handle most of your cases, except for words ending with an apostrophe. You will have to discard the apostrophe after you have scanned a token.

It's a pity the IsAlphabetic class doesn't work. Try with \\p{Alpha} instead.
Bill Hogsett
Greenhorn

Joined: Oct 17, 2011
Posts: 9
Stephan van Hulst wrote:See, what English speakers would normally identify as words, that doesn't really compute, unless you incorporate a dictionary and some pretty complex code.

The code I gave you should handle most of your cases, except for words ending with an apostrophe. You will have to discard the apostrophe after you have scanned a token.

It's a pity the IsAlphabetic class doesn't work. Try with \\p{Alpha} instead.


\\p{Alpha} works. Your pattern handles everything except apostrophes (both at the beginning and end of words). It nicely handles contractions. I can live with the apostrophe at the end. Can you suggest how to get the apostrophe from the beginning of words?

Thanks.

Bill

ps. My first uses suggest that using scanner for each test is slower than using scanner for each line and then using split with a compiled regex pattern.
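Bill's code for this is not shown, so the following is only a guess at the shape he describes: a Scanner that reads line by line, and a delimiter pattern compiled once and reused for every line. The delimiter is the `\p{Alpha}` variant from earlier in the thread; the class name is hypothetical, and no timing claim is made here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class LineSplitter {
    // Compiled once and reused for every line.
    private static final Pattern DELIMITER = Pattern.compile("-{2,}|[^\\p{Alpha}'-]+");

    static List<String> words(String text) {
        List<String> result = new ArrayList<String>();
        Scanner lines = new Scanner(text);
        while (lines.hasNextLine()) {
            for (String token : DELIMITER.split(lines.nextLine())) {
                // split() can return empty strings at the start or between adjacent delimiters.
                if (!token.isEmpty()) {
                    result.add(token);
                }
            }
        }
        lines.close();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("He--looking away--said stop\nOh-my-gosh, don't!"));
    }
}
```

For a real file, the `Scanner` would wrap a `File` or a reader instead of a `String`.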
Harsha Smith
Ranch Hand

Joined: Jul 18, 2011
Posts: 287
My suggestion is do the basic splitting using a simple regex. Then remove punctuation as shown in my code.

And Bill don't be angry with us


Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3616
    

Don't worry, I don't think he is :P

Bill, you can easily remove the apostrophes with simple code. Just check if the char at index 0 is an apostrophe, and if it is, take the substring at index 1. I'm sure you can handle the case where there's an apostrophe at the end too.
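A minimal sketch of that cleanup, covering both ends of the word (the helper name `strip` is hypothetical; it loops so that repeated apostrophes are also removed):

```java
class ApostropheTrimmer {
    // Removes apostrophes from the start and end of a word; inner ones are kept.
    static String strip(String word) {
        int start = 0;
        int end = word.length();
        while (start < end && word.charAt(start) == '\'') {
            start++;
        }
        while (end > start && word.charAt(end - 1) == '\'') {
            end--;
        }
        return word.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(strip("'em"));      // em
        System.out.println(strip("farmers'")); // farmers
        System.out.println(strip("don't"));    // don't
    }
}
```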
Bill Hogsett
Greenhorn

Joined: Oct 17, 2011
Posts: 9
Stephan van Hulst wrote:Don't worry, I don't think he is :P

Bill, you can easily remove the apostrophes with simple code. Just check if the char at index 0 is an apostrophe, and if it is, take the substring at index 1. I'm sure you can handle the case where there's an apostrophe at the end too.


Thanks to everyone for the help. I have what I need and certainly can handle the apostrophe myself.

Harsha, I am not angry with you or anyone here. The forum has provided superlative assistance, code and advice to me.

I consider this closed, but will follow any future posts.

Bill
Harsha Smith
Ranch Hand

Joined: Jul 18, 2011
Posts: 287
Bill, please do come up with more challenging issues often. Won't you come to see us tomorrow? Have a nice day!
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7716
    

Stephan van Hulst wrote:Bill, you can easily remove the apostrophes with simple code...

And don't forget that those stupid MS 'smart quotes' aren't apostrophes, even though they look like 'em (there's a good word with an apostrophe in front for you). I suspect Stephan's regex'll handle them though.
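If the text does contain smart quotes, one option is to normalize them before splitting. This is a sketch under the assumption that only the left and right single quotes (U+2018 and U+2019) matter; the class name is hypothetical:

```java
class QuoteNormalizer {
    // Maps the left and right single "smart" quotes to the plain ASCII apostrophe.
    static String normalize(String text) {
        return text.replace('\u2018', '\'').replace('\u2019', '\'');
    }

    public static void main(String[] args) {
        System.out.println(normalize("don\u2019t")); // don't
    }
}
```

A delimiter built on `[^\p{Alpha}'-]` treats a smart quote as a separator, so without this step a contraction written with one would be split in two.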

And then there's always stuff like fo'c'sle (actually, more properly: fo'c's'le)...

Winston
 