Finding Most Common Phrase Occurrence In String?

 
Greenhorn
Posts: 27
Hey guys, this is an interesting problem.
Let's say I have a string like this:


I want to pick out the two- or three-word phrases that occur most often and second most often in the string, while ignoring common words such as "I", "and", "is", etc. In the example string I provided, the most common phrase returned by the method should be "coding algorithms" and the second most common should be "love writing code".

Any ideas / code samples on how to do this? I'm thinking first, remove the common words, then use some type of dictionary that keeps track of relative percentages for all consecutive phrases. Then pick the highest two percentages from the dictionary. Now, how can we actually turn that into Java code?
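
A rough sketch of the approach described above (drop the common words, count every run of two or three consecutive remaining words, then take the highest counts) might look like this. The class name PhraseCounter, the method topPhrases, the sample sentence, and the common-word list are all made up for illustration, and it ranks by raw counts rather than percentages. Note that dropping common words first makes words adjacent that were not adjacent in the original text, and that a two-word phrase always occurs at least as often as any three-word phrase containing it, so you may want a different tie-breaking rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class PhraseCounter {

    // Hypothetical helper: returns up to `limit` of the most frequent 2- and 3-word phrases.
    // commonWords is assumed to contain lower-case words.
    static List<String> topPhrases(String text, Set<String> commonWords, int limit) {
        // Lower-case the text, split on anything that is not a letter or apostrophe,
        // and drop the common words.
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!w.isEmpty() && !commonWords.contains(w)) {
                words.add(w);
            }
        }

        // Count every run of 2 or 3 consecutive remaining words.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int size = 2; size <= 3; size++) {
            for (int i = 0; i + size <= words.size(); i++) {
                String phrase = String.join(" ", words.subList(i, i + size));
                counts.merge(phrase, 1, Integer::sum);
            }
        }

        // Sort by count, highest first. The sort is stable, so ties keep insertion order.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(limit, entries.size()))) {
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data, not the string from the original post.
        Set<String> common = new HashSet<>(Arrays.asList("i", "and", "is", "the"));
        String text = "I like green tea and green tea is cheap. Hot green tea is nice.";
        System.out.println(topPhrases(text, common, 2));
    }
}
```

With this sample, "green tea" comes out first because it occurs three times; every other phrase occurs once, so the second entry is decided only by insertion order.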
 
Justin Filmer
Greenhorn
Posts: 27
Thank you, Ulf Dittmer, for fixing my String! Does anyone have any ideas for the algorithm/coding aspect of the problem?
 
Sheriff
Posts: 21972
If you'd SearchFirst you'd find a few similar threads. In the last one I encountered I suggested separating the problem into two sub-problems. In your case that would be three:
1) get a count of the occurrences of each word
2) filter out some words (I, is, etc)
3) sort the remainder

1) is usually done by using a Map<String,Integer>, where the keys are the words and the values are the number of occurrences. Use a TreeMap with String.CASE_INSENSITIVE_ORDER as its comparator to ignore the case of the words.
2) can be done by having a Collection<String> (or Set<String>) with too-common words, then removing those from the map (map.keySet().removeAll(commonWords)).
3) can be done by adding all the Map.Entry objects into a List that you then sort using Collections.sort and a custom Comparator.

After those steps you can use the List to access the entries in the right order.
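
Put together, the three steps could be sketched roughly like this. The class name WordRanker, the method rankWords, the sample sentence, and the common-word list are made up for illustration. One deliberate variation: step 2 calls counts.remove(word) for each common word instead of map.keySet().removeAll(commonWords), so the lookup always goes through the map's case-insensitive comparator no matter what kind of collection holds the common words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordRanker {

    // Hypothetical helper: counts words, drops the common ones, sorts by count (highest first).
    static List<Map.Entry<String, Integer>> rankWords(String text, Collection<String> commonWords) {
        // Step 1: count occurrences; the comparator makes "The" and "the" the same key.
        Map<String, Integer> counts = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (String word : text.split("[^A-Za-z']+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }

        // Step 2: filter out the too-common words (case-insensitive via the map's comparator).
        for (String common : commonWords) {
            counts.remove(common);
        }

        // Step 3: copy the entries into a List and sort it with a custom Comparator.
        // Collections.sort is stable, so words with equal counts stay in alphabetical order.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        Collections.sort(entries, (a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return entries;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<String> common = Arrays.asList("the", "and", "a");
        String text = "The cat and the dog and the bird saw a cat";
        System.out.println(rankWords(text, common)); // prints [cat=2, bird=1, dog=1, saw=1]
    }
}
```

This ranks single words, as in the steps above; for phrases, the keys of the map would be the joined two- or three-word runs instead.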