Hey guys, this is an interesting problem.
Let's say I have a string like this:
I want to be able to pick out the two- or three-word phrases that occur most often and second-most often in the string, while ignoring common words such as "I", "and", "is", etc. In the example string I provided, the most common phrase returned by the method should be "coding algorithms" and the second-most common phrase returned should be "love writing code".
Any ideas / code samples on how to do this? I'm thinking first, remove the common words, then use some type of dictionary that keeps track of relative percentages for all consecutive phrases. Then pick the highest two percentages from the dictionary. Now, how can we actually turn that into Java code?
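For what it's worth, here is a rough sketch of that plan, not a definitive implementation. The class name PhraseCounter, the method topPhrases, the stop-word list, and the sample text are all made up for illustration. It assumes that dropping common words first is acceptable, which means words that were separated only by a common word end up counted as adjacent, and it counts raw occurrences rather than percentages. Note that every three-word phrase also contributes its two-word sub-phrases, so ties and overlaps may need extra handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class PhraseCounter {

    // Hypothetical stop-word list; extend it to suit your text.
    static final Set<String> COMMON_WORDS =
            new HashSet<>(Arrays.asList("i", "and", "is", "the", "a", "to"));

    // Returns up to howMany two- or three-word phrases, most frequent first.
    static List<String> topPhrases(String text, int howMany) {
        // Lower-case, split on anything that is not a letter or apostrophe,
        // and drop the common words.
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!w.isEmpty() && !COMMON_WORDS.contains(w)) {
                words.add(w);
            }
        }

        // Count every run of 2 and of 3 consecutive remaining words.
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 2; n <= 3; n++) {
            for (int i = 0; i + n <= words.size(); i++) {
                String phrase = String.join(" ", words.subList(i, i + n));
                counts.merge(phrase, 1, Integer::sum);
            }
        }

        // Sort by count, highest first; break ties alphabetically.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(howMany, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up sample text, not the string from the post.
        String text = "I love coding algorithms and coding algorithms is fun";
        System.out.println(topPhrases(text, 2));
    }
}
```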
If you'd SearchFirst you'd find a few similar threads. In the last one I encountered I suggested separating the problem into two sub-problems. In your case that would be three:
1) get a count of how often each word occurs
2) filter out some words (I, is, etc)
3) sort the remainder
1) is usually done by using a Map<String,Integer>, where the keys are the words and the values are the occurrences. Use a TreeMap created with String.CASE_INSENSITIVE_ORDER as its Comparator to ignore the case of the words.
2) can be done by having a Collection<String> (or Set<String>) with too-common words, then removing those from the map (map.keySet().removeAll(commonWords)).
3) can be done by adding all the Map.Entry objects into a List that you then sort using Collections.sort and a custom Comparator.
After those steps you can use the List to access the entries in the right order.
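Put together, the three steps might look something like the sketch below. The class name WordFrequency, the method rankWords, the stop-word list, and the sample text are only placeholders for illustration. It counts single words; for phrases you would count runs of consecutive words instead, but the map / filter / sort structure stays the same.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class WordFrequency {

    // Hypothetical stop-word list. A case-insensitive TreeSet is used so that
    // removeAll below matches keys the same way the map itself compares them.
    static final Set<String> COMMON_WORDS = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
    static {
        COMMON_WORDS.addAll(Arrays.asList("i", "and", "is", "the", "a", "to"));
    }

    static List<Map.Entry<String, Integer>> rankWords(String text) {
        // 1) Count occurrences; the comparator makes "Code" and "code" one key.
        Map<String, Integer> counts = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (String word : text.split("[^A-Za-z']+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }

        // 2) Drop the too-common words.
        counts.keySet().removeAll(COMMON_WORDS);

        // 3) Copy the entries into a List and sort by count, highest first.
        //    Collections.sort is stable, so ties stay in the map's key order.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });
        return entries;
    }

    public static void main(String[] args) {
        // Made-up sample text.
        String text = "I love code and I write code; code is fun and fun is good";
        for (Map.Entry<String, Integer> e : rankWords(text)) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

With the sample text above, the first two lines printed are code = 3 and fun = 2, so entries.get(0) and entries.get(1) give you the most common and second-most common words.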