I don't know what is going on here. I have an application with this code to count the occurrences of "the" in a text file. I bring in the whole text file as a StringArrayList and check for a space or the beginning of a line to identify a "the" or "The".
Here is the code:
I feed it this String:
The skunk sat on the stump.
I just hit a mother load!
Some mothers sell handmade quilts and others sell chandeliers.
Go down to the beach and build a sand castle!
You said the man in the field was Anderson didn't you?
How did you like the play?
You can be a theologian if you study hard.
Theocracy is a word.
Have you seen Thelma and Louise?
I am the great and powerful The!
How hard could the ball be thrown?
What is the time for all men to come to realize that they need a good woman?
Is Raytheon the aircraft company of the future?
It finds 16 instances of "the" and finds it in the middle of words (which it shouldn't) and on the 6th line, it completely misses a "the".
I attached a screenshot of the words it finds highlighted in yellow. I don't know why it behaves like this.
Winston Gutkowski wrote:
The fact is, it shouldn't matter.
I suspect, however, that the main problem is that you're overthinking this: you don't need all those complex look-behinds; just a boundary matcher, viz:
String regex = "\\b([Tt]he)\\b";
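For anyone following along, here is a minimal sketch of counting with a boundary matcher. The class and method names (TheCounter, countThe) are made up for illustration and are not from the original application. Note that in Java source the backslash has to be doubled, because a single "\b" inside a string literal is a backspace character rather than a regex word boundary.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the original poster's code.
class TheCounter {
    // \b is a word boundary, so "the" inside "mother" or "Theocracy" is not matched.
    private static final Pattern THE = Pattern.compile("\\b[Tt]he\\b");

    // Counts standalone "the"/"The" in one piece of text.
    static int countThe(String text) {
        Matcher m = THE.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countThe("The skunk sat on the stump."));                // 2
        System.out.println(countThe("You can be a theologian if you study hard.")); // 0
        System.out.println(countThe("I am the great and powerful The!"));           // 2
    }
}
```

Unlike "\\s[Tt]he\\s", the boundary version also catches a "the" at the very start or end of the text and one followed by punctuation.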
Well, it still should be matching all the "the" and not skipping 2 in the middle of the string. Also, your regex gives me similar flawed results. If I just try something simple like matching "\\s[Tt]he\\s", it works and doesn't miss anything. Very perplexing.
I found the problem. The regex was fine; the highlighting was wrong. I was using a separate find for the highlight, and it was of course finding the next "the" in the text regardless of where it was, including in the middle of words. The regex itself matches every "the" no matter where it is. I have not worked on my coding skills in a very long time, I was working on old code, and I forgot what the code I had written before actually did. lol
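In case it helps someone else, here is a rough sketch of the difference. The highlight code is not shown in this thread, so this assumes the "find" was a plain substring search along the lines of indexOf; the names HighlightSpans and matchSpans are made up for the example. The idea is to highlight using the positions the Matcher reports instead of searching the text a second time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the original poster's code.
class HighlightSpans {
    private static final Pattern THE = Pattern.compile("\\b[Tt]he\\b");

    // Returns {start, end} offsets for each regex match; end is exclusive.
    static List<int[]> matchSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = THE.matcher(text);
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "I just hit a mother load at the beach.";

        // Offsets taken from the matcher point at the standalone word.
        for (int[] span : matchSpans(text)) {
            System.out.println(span[0] + "-" + span[1]); // 28-31
        }

        // A separate substring search lands inside "mother" instead.
        System.out.println(text.indexOf("the")); // 15
    }
}
```

Feeding those start and end offsets straight to whatever does the highlighting keeps the yellow marks in step with what the regex actually counted.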