Finding Most Common Phrase Occurance In String?

Justin Filmer
Greenhorn

Joined: Jul 04, 2011
Posts: 27
Hey guys, this is an interesting problem.
Let's say I have a string like this:


I want to be able to pick out the two- or three-word phrases that occur most often and second most often in the string, while ignoring common words such as "I", "and", "is", etc. In the example string I provided, the most common phrase returned by the method should be "coding algorithms" and the second most common phrase returned should be "love writing code".

Any ideas / code samples on how to do this? I'm thinking first, remove the common words, then use some type of dictionary that keeps track of relative percentages for all consecutive phrases. Then pick the highest two percentages from the dictionary. Now, how can we actually turn that into Java code?
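
One way the idea above might be sketched in Java: drop the common words, count every run of two or three consecutive remaining words in a Map, then sort by count. Plain counts are used instead of percentages, since they rank the phrases the same way. The class name PhraseCounter, the method rankPhrases and the stop-word list are made up for illustration, and the sketch ignores sentence boundaries, so treat it as a starting point rather than a finished solution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class PhraseCounter {

    // Hypothetical stop-word list (lower-case); extend it to suit your text.
    static final Set<String> COMMON_WORDS =
            new HashSet<>(Arrays.asList("i", "and", "is", "the", "a", "to"));

    /** Returns all two- and three-word phrases, most frequent first. */
    static List<Map.Entry<String, Integer>> rankPhrases(String text) {
        // Lower-case the text, split on anything that is not a letter or
        // apostrophe, and drop the common words.
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!w.isEmpty() && !COMMON_WORDS.contains(w)) {
                words.add(w);
            }
        }

        // Count every window of 2 and of 3 consecutive remaining words.
        Map<String, Integer> counts = new HashMap<>();
        for (int len = 2; len <= 3; len++) {
            for (int i = 0; i + len <= words.size(); i++) {
                String phrase = String.join(" ", words.subList(i, i + len));
                counts.merge(phrase, 1, Integer::sum);
            }
        }

        // Sort by count (descending); break ties alphabetically so the
        // result does not depend on HashMap iteration order.
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return ranked;
    }
}
```

The most common and second most common phrases are then ranked.get(0) and ranked.get(1), provided the text contains at least two phrases. Note that removing the common words first makes words adjacent that were not adjacent in the original text, which may or may not be what you want.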
Justin Filmer
Greenhorn

Joined: Jul 04, 2011
Posts: 27
Thank you, Ulf Dittmer, for fixing my String! Does anyone have any ideas for the algorithm/coding aspect of the problem?
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19232

If you'd SearchFirst you'd find a few similar threads. In the last one I encountered I suggested separating the problem into two sub-problems. In your case that would be three:
1) get a count of how often each word occurs
2) filter out some words (I, is, etc)
3) sort the remainder

1) is usually done by using a Map<String,Integer>, where the keys are the words and the values are the number of occurrences. Use a TreeMap with String.CASE_INSENSITIVE_ORDER as its Comparator to ignore the case of the words.
2) can be done by having a Collection<String> (or Set<String>) with too-common words, then removing those from the map (map.keySet().removeAll(commonWords)).
3) can be done by adding all the Map.Entry objects into a List that you then sort using Collections.sort and a custom Comparator.

After those steps you can use the List to access the entries in the right order.
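
A minimal sketch of the three steps for single words, assuming a made-up class name WordCounter, method countWords and stop-word list (none of these come from the thread). It counts words rather than phrases, so it would still need adapting for the original question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class WordCounter {

    // Hypothetical list of too-common words. It is a case-insensitive TreeSet
    // because keySet().removeAll(...) may test membership against either
    // collection, depending on their sizes.
    static final Set<String> COMMON_WORDS = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
    static {
        COMMON_WORDS.addAll(Arrays.asList("i", "and", "is", "the", "a"));
    }

    /** Returns the word counts, most frequent word first. */
    static List<Map.Entry<String, Integer>> countWords(String text) {
        // 1) Count the words; the comparator makes "Code" and "code" one key.
        Map<String, Integer> counts = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (String word : text.split("[^A-Za-z']+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }

        // 2) Filter out the too-common words.
        counts.keySet().removeAll(COMMON_WORDS);

        // 3) Copy the entries into a List and sort it by count, descending.
        //    Collections.sort is stable, so words with equal counts keep the
        //    TreeMap's alphabetical order.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        Collections.sort(entries, (a, b) -> b.getValue().compareTo(a.getValue()));
        return entries;
    }
}
```

For the input "I love code and I love Java. Code is fun" the first two entries are code=2 and love=2, followed by fun=1 and Java=1.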


SCJP 1.4 - SCJP 6 - SCWCD 5