Counting exact matches of substring.

Michael Boehm
Ranch Hand

Joined: Jun 02, 2010
Posts: 51
I have a string that contains words, numbers, line breaks, punctuation and so on: all sorts of characters.
I want to count the number of exact occurrences of some words in the string.

I am experimenting using the following code


I am trying to work out how the regular expression should look when I want exact matches, e.g. given the text "foobar" and the substring "foo", the count should be 0.
The regular expression

almost works for counting occurrences of "foo", but not quite.
Anton Shaykin
Ranch Hand

Joined: Dec 13, 2009
Posts: 57

First of all, there is a nice overview of regular expressions in the Java API documentation for the Pattern class (here). I even use it for reference when working with regexps in other languages.
Also, you may want to have a look at a pretty good tutorial on regular expressions from Sun here.
So, before asking such questions, you could try to figure it out by first learning the basics of regular expressions.
Anyway, the correct pattern in your case would be:
^foo$
As you can find in the documentation for the Pattern class, ^ stands for the beginning of a line, and $ for the end.
jishnu dasgupta
Ranch Hand

Joined: Mar 11, 2011
Posts: 103

Hi Michael,

As Anton suggested, you probably need to look into your regular expression. As your expression stands, I believe it would match 1foo98, which I guess is not what you want.

On a personal note, if all you want is to count the number of occurrences, you might just as well use the Scanner class.
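
The post does not say how Scanner would be used, so the following is only a sketch of one possibility. The class name, the method and the delimiter pattern are all made up for illustration. It splits the text into tokens on a caller-supplied delimiter and counts the tokens that equal the word.

```java
import java.util.Scanner;

class ScannerWordCount {
    // Hypothetical helper: counts the tokens that are exactly equal to `word`.
    // `delimiterRegex` decides what separates tokens; which characters count
    // as separators is an assumption the caller has to make.
    static int count(String text, String word, String delimiterRegex) {
        int n = 0;
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter(delimiterRegex);
            while (sc.hasNext()) {
                if (sc.next().equals(word)) {
                    n++;
                }
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Whitespace and ASCII punctuation separate tokens; digits do not.
        String delimiters = "[\\s\\p{Punct}]+";
        System.out.println(count("foo bar, foo! foobar", "foo", delimiters)); // prints 2
        System.out.println(count("1foo98", "foo", delimiters));               // prints 0
    }
}
```

With this delimiter, 1foo98 is a single token and is not counted. Passing "[^\\p{Alpha}]+" instead would treat digits as separators too, and 1foo98 would then count once.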


If debugging is the process of removing bugs, then programming must be the process of putting them in. -- Edsger Dijkstra

Michael Boehm
Ranch Hand

Joined: Jun 02, 2010
Posts: 51
Anton Shaykin wrote: First of all, there is a nice overview of regular expressions in the Java API documentation for the Pattern class (here). I even use it for reference when working with regexps in other languages.
Also, you may want to have a look at a pretty good tutorial on regular expressions from Sun here.
So, before asking such questions, you could try to figure it out by first learning the basics of regular expressions.
Anyway, the correct pattern in your case would be:
^foo$
As you can find in the documentation for the Pattern class, ^ stands for the beginning of a line, and $ for the end.


I am familiar with the basics of regular expressions.
Have a look at the question again and you will see that ^foo$ is not the correct pattern in my case, as I want to count "foo" every time it appears as a word in a string.
The string might for instance be "baz23! foos23foo bar foobar barfoo!foo" and the count should be 2.
Michael Boehm
Ranch Hand

Joined: Jun 02, 2010
Posts: 51
jishnu dasgupta wrote: Hi Michael,

As Anton suggested, you probably need to look into your regular expression. As your expression stands, I believe it would match 1foo98, which I guess is not what you want.

On a personal note, if all you want is to count the number of occurrences, you might just as well use the Scanner class.


I would want to count that as an occurrence. Seems like \bfoo\b should work [EDIT: Absolutely not]
jishnu dasgupta
Ranch Hand

Joined: Mar 11, 2011
Posts: 103

Michael Boehm wrote:
The string might for instance be "baz23! foos23foo bar foobar barfoo!foo" and the count should be 2.


Michael, isn't the word "foo" actually appearing 5 times in this String?
Michael Boehm
Ranch Hand

Joined: Jun 02, 2010
Posts: 51
jishnu dasgupta wrote:
Michael, isn't the word "foo" actually appearing 5 times in this String?


Not the way I want to count it. I only want to count exact matches, so for me "foo" only appears twice, since it isn't counted in e.g. "foos" and "foobar".
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19543
    

So what you want is foo, preceded by nothing, whitespace or punctuation, and followed by nothing, whitespace or punctuation. That looks like a job for positive lookahead / lookbehind:
(?<=^|\s|\p{Punct})foo(?=$|\s|\p{Punct})

That will only result in one match:
- foos23foo does not match since this is one word containing foo, not the word foo itself
- foobar does not match since this is one word containing foo, not the word foo itself
- barfoo does not match since this is one word containing foo, not the word foo itself
- foo matches since it's preceded by only a punctuation character
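
For anyone who wants to try the lookaround pattern above, here is a minimal sketch. The class and method names are made up, and Pattern.quote is added only so that the searched word is treated literally. It counts matches with Matcher.find() on the example string from earlier in the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundCount {
    // Hypothetical helper: counts `word` where it is preceded by start of input,
    // whitespace or punctuation, and followed by end of input, whitespace or
    // punctuation (the lookbehind/lookahead pattern from the post above).
    static int count(String text, String word) {
        Pattern p = Pattern.compile(
                "(?<=^|\\s|\\p{Punct})" + Pattern.quote(word) + "(?=$|\\s|\\p{Punct})");
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "baz23! foos23foo bar foobar barfoo!foo";
        System.out.println(count(text, "foo")); // prints 1
    }
}
```

Digits are neither whitespace nor punctuation here, so the foo at the end of foos23foo is not counted.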


Anton Shaykin
Ranch Hand

Joined: Dec 13, 2009
Posts: 57

That looks like a job for positive lookahead / lookbehind

Exactly, and that goes far beyond "Beginning Java". Regular expressions are all about formalizing your requirements, so first you have to define what you mean by "word". In common regexp vocabulary, a word character is described by the pattern [a-zA-Z_0-9]. As far as I can see, in your case you mean something different.
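
To make the point about word characters concrete, here is a small sketch with made-up class and method names. It counts matches of \bfoo\b. Digits and the underscore are word characters, so there is no word boundary between a digit and a letter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryCount {
    // Hypothetical helper: counts `word` between \b word boundaries.
    // Letters, digits and underscore all count as word characters here.
    static int count(String text, String word) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("1foo98", "foo"));  // prints 0: no boundary next to a digit
        System.out.println(count("foo bar", "foo")); // prints 1
        System.out.println(count("baz23! foos23foo bar foobar barfoo!foo", "foo")); // prints 1
    }
}
```

This gives 1 for the example string rather than the wanted 2, because the foo in 23foo sits directly after a digit.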
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 36508
    
Anton Shaykin wrote: . . . that goes far beyond the "Beginning Java". . . ..
Agree. Moving thread.
Michael Boehm
Ranch Hand

Joined: Jun 02, 2010
Posts: 51
I managed to do what I wanted. I used an appropriate Pattern and then counted by calling split on the string containing the text. However, this is quite slow.
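
The pattern actually used is not shown in the thread, so the following is only a sketch under an assumption: "exact match" means foo with no ASCII letter directly before or after it, which agrees with the count of 2 and with counting 1foo98. The class and method names are made up. It counts with Matcher.find() instead of split, so no array of substrings has to be built.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterBoundaryCount {
    // Hypothetical helper. Assumption: a match is "exact" when the word is not
    // directly preceded or followed by an ASCII letter (digits, punctuation,
    // whitespace and the ends of the input all act as separators).
    static int count(String text, String word) {
        Pattern p = Pattern.compile(
                "(?<!\\p{Alpha})" + Pattern.quote(word) + "(?!\\p{Alpha})");
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) { // each find() call moves on to the next match
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "baz23! foos23foo bar foobar barfoo!foo";
        System.out.println(count(text, "foo"));     // prints 2
        System.out.println(count("1foo98", "foo")); // prints 1
    }
}
```

If the same word is counted in many texts, the compiled Pattern could be kept in a field and reused rather than compiled on every call.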
Luigi Plinge
Ranch Hand

Joined: Jan 06, 2011
Posts: 441

This works for what you described in your example, although it may be what you have already:
 