regex positive lookbehind

Mack Wilmot
Ranch Hand

Joined: Jul 27, 2011
Posts: 88

I don't know what is going on here. I have an application with this code to count the number of occurrences of "the" in a text file. I read the whole text file into an ArrayList of Strings and check for a space or the beginning of a line before each match to identify a "the" or "The".

Here is the code:



I feed it this String:


The skunk sat on the stump.
I just hit a mother load!
Some mothers sell handmade quilts and others sell chandeliers.
Go down to the beach and build a sand castle!
You said the man in the field was Anderson didn't you?
How did you like the play?
You can be a theologian if you study hard.
Theocracy is a word.
Have you seen Thelma and Louise?
I am the great and powerful The!
How hard could the ball be thrown?
What is the time for all men to come to realize that they need a good woman?
Is Raytheon the aircraft company of the future?


It finds 16 instances of "the" and finds it in the middle of words (which it shouldn't) and on the 6th line, it completely misses a "the".

I attached a screenshot of the words it finds highlighted in yellow. I don't know why it behaves like this.





[Thumbnail for words.png]

Darryl Burke
Bartender

Joined: May 03, 2008
Posts: 4658

That's not what I get using your regex. Of course, words like "theologian" and "they" are also matched; to prevent that, you need a look-ahead for a space or end of input/line.

Prints:


luck, db
There are no new questions, but there may be new answers.
Mack Wilmot
Ranch Hand

Joined: Jul 27, 2011
Posts: 88

Darryl, what version of JDK are you using? I am using v7u10 64bit.

Thanks!

EDIT: Well, never mind. I just used your code and it works. Maybe it has something to do with me running it in a new thread or something.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8196

Mack Wilmot wrote: Darryl, what version of JDK are you using? I am using v7u10 64bit.

The fact is, it shouldn't matter.

I suspect, however, that the main problem is that you're overthinking this: you don't need all those complex look-behinds; just a boundary matcher, viz:
String regex = "\\b([Tt]he)\\b";
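Here is a minimal sketch of that idea, assuming the sample text from the first post is joined into one String with newlines. The class name TheCounter and the count helper are made up for illustration, and the second pattern is only a guess at what a lookbehind-only approach might look like, not the code from the original post. Note that in Java source the backslash has to be doubled, because "\b" on its own is a backspace character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TheCounter {
    // Sample text from the first post, joined with newlines (an assumption
    // about how the lines are fed to the matcher).
    static final String TEXT = String.join("\n",
        "The skunk sat on the stump.",
        "I just hit a mother load!",
        "Some mothers sell handmade quilts and others sell chandeliers.",
        "Go down to the beach and build a sand castle!",
        "You said the man in the field was Anderson didn't you?",
        "How did you like the play?",
        "You can be a theologian if you study hard.",
        "Theocracy is a word.",
        "Have you seen Thelma and Louise?",
        "I am the great and powerful The!",
        "How hard could the ball be thrown?",
        "What is the time for all men to come to realize that they need a good woman?",
        "Is Raytheon the aircraft company of the future?");

    // Counts how many times the regex matches in the text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Word boundaries on both sides: only whole words "the"/"The".
        System.out.println(count("\\b[Tt]he\\b", TEXT)); // 12

        // Hypothetical lookbehind-only pattern (start of input or preceding
        // whitespace). It also matches the start of "theologian",
        // "Theocracy", "Thelma" and "they".
        System.out.println(count("(?:^|(?<=\\s))[Tt]he", TEXT)); // 16
    }
}
```

With nothing constraining what follows the match, the lookbehind-only variant counts four extra word prefixes, which is one way to arrive at 16 on this text.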

Winston


Isn't it funny how there's always time and money enough to do it WRONG?
Mack Wilmot
Ranch Hand

Joined: Jul 27, 2011
Posts: 88

Winston Gutkowski wrote:
The fact is, it shouldn't matter.

I suspect, however, that the main problem is that you're overthinking this: you don't need all those complex look-behinds; just a boundary matcher, viz:
String regex = "\\b([Tt]he)\\b";

Winston


Well, it still should be matching all the "the" and not skipping 2 in the middle of the string. Also, your regex gives me similarly flawed results. If I just try something simple like matching "\\s[Tt]he\\s", it works and doesn't miss anything. Very perplexing.

Thanks!
Tony Docherty
Bartender

Joined: Aug 07, 2007
Posts: 2364
Well, it still should be matching all the "the" and not skipping 2 in the middle of the string.

The regex works for me, which two is it skipping?

If I just try something simple like matching "\\s[Tt]he\\s", it works

Are you sure? It doesn't work for me: as expected for that pattern, it fails to match the very first "The" (nothing precedes it) and the one followed by a '!'.
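A small sketch that shows the difference, using two lines from the sample text. The class name WhitespaceVsBoundary and the count helper are hypothetical and only there for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceVsBoundary {
    // Counts how many times the regex matches in the text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "The skunk sat on the stump.\n"
                    + "I am the great and powerful The!";

        // \s needs a real whitespace character on both sides, so the leading
        // "The" (nothing before it) and "The!" (followed by '!') are missed.
        System.out.println(count("\\s[Tt]he\\s", text)); // 2

        // \b is zero-width and also matches at the start of input and
        // before punctuation, so all four whole words are found.
        System.out.println(count("\\b[Tt]he\\b", text)); // 4
    }
}
```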
Mack Wilmot
Ranch Hand

Joined: Jul 27, 2011
Posts: 88

I found the problem. The regex was fine; the highlighting was not. I was using a separate find for the highlight, and it was of course finding the next "the" regardless of where it was, including inside other words. The regex itself matches every "the" no matter where it is. I have not worked on my coding skills in a very long time, was working on old code, and forgot what the code I had written before did. lol
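The highlighting code itself is not shown in this thread, so the following is only a sketch of the general idea: take the highlight offsets straight from Matcher.start() and Matcher.end() instead of running a second substring search. The class name HighlightSpans, the spans helper, and the two-line sample are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HighlightSpans {
    // Returns {start, end} offsets for every whole-word "the"/"The",
    // taken directly from the matcher so they always line up with the count.
    static List<int[]> spans(String text) {
        List<int[]> result = new ArrayList<>();
        Matcher m = Pattern.compile("\\b[Tt]he\\b").matcher(text);
        while (m.find()) {
            result.add(new int[] { m.start(), m.end() });
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "I just hit a mother load!\nHow did you like the play?";

        // A plain substring search lands inside "mother" first.
        System.out.println("indexOf: " + text.indexOf("the"));

        // The matcher offsets point at the standalone word only.
        for (int[] s : spans(text)) {
            System.out.println(s[0] + "-" + s[1] + ": " + text.substring(s[0], s[1]));
        }
    }
}
```

The returned offsets can then be handed to whatever highlighter the UI uses, so the highlighted ranges and the match count come from the same search.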

EDIT: Thanks Tony!
 