• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Jeanne Boyarsky
  • Ron McLeod
  • Paul Clapham
  • Liutauras Vilda
Sheriffs:
  • paul wheaton
  • Rob Spoor
  • Devaka Cooray
Saloon Keepers:
  • Stephan van Hulst
  • Tim Holloway
  • Carey Brown
  • Frits Walraven
  • Tim Moores
Bartenders:
  • Mikalai Zaikin

word comparison in two text lists

 
Greenhorn
Posts: 27
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
What's the fastest method of storing 2 word list (both about 20000 words each) then comparing them to find which words occur in the first file that do not appear in the second file?
Cheers
 
author
Posts: 3252
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Originally posted by leon matthew:
What's the fastest method of storing 2 word list (both about 20000 words each) then comparing them to find which words occur in the first file that do not appear in the second file?

Assuming you have these two lists in-memory, store them both in a HashSet. Then wordSet1.removeAll(wordSet2) will leave only those words in the first list that do not occur in the second list. Scalability is O(m)+O(n), i.e. linear with the number of words, which is as good as it gets. You can shave off some time by rolling your own implementation but that won't change scalability.
If they are in files on disk it's just as simple; read wordFile1 into a HashSet, then go through wordFile2 word by word (e.g. using a StringTokenizer) and call wordSet.remove(word) for each. Performance is again O(n) but you save a bit on object churn and memory footprint.
- Peter
 
leon matthew
Greenhorn
Posts: 27
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Thanks for the advice, i was heading in that general direction after several other attempts (mostly using arrays).
There's a bit of a snag though, i cannot remove words from the list, i have to output a list to the screen, if a word in wordfile2 exists in wordfile2 its outputted with a 'y' beside it, if its not in the file it outputs with an 'n' beside it.
 
Peter den Haan
author
Posts: 3252
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Doesn't make a lot of difference...

Performance of the fragment above is O(n), where n is the number of words in set 1; your code as a whole will be O(n) + O(m).
- Peter
 
leon matthew
Greenhorn
Posts: 27
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Thanks a million!
 
leon matthew
Greenhorn
Posts: 27
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
that's working well, but is there a way to get the algorithim into n*log n?
 
Peter den Haan
author
Posts: 3252
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Originally posted by leon matthew:
that's working well, but is there a way to get the algorithim into n*log n?

Well, sure, if you want a slower(!) algorithm, just replace the HashSet by a TreeSet. I may be stating the obvious, but O(m) + O(n) is better than O(n log n).
- Peter
 
Ranch Hand
Posts: 1953
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Peter,
You need to understand the why.

Originally posted by leon matthew:
that's working well, but is there a way to get the algorithim into n*log n?



[This message has been edited by Roseanne Zhang (edited October 17, 2001).]
 
It's a pleasure to see superheros taking such an interest in science. And this tiny ad:
a bit of art, as a gift, that will fit in a stocking
https://gardener-gift.com
reply
    Bookmark Topic Watch Topic
  • New Topic