This week's giveaway is in the Spring forum.
We're giving away four copies of Learn Spring Security (video course) and have Eugen Paraschiv on-line!
See this thread for details.
Win a copy of Learn Spring Security (video course) this week in the Spring forum!
  • Post Reply
  • Bookmark Topic Watch Topic
  • New Topic

word comparison in two text lists

 
leon matthew
Greenhorn
Posts: 27
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
What's the fastest method of storing 2 word list (both about 20000 words each) then comparing them to find which words occur in the first file that do not appear in the second file?
Cheers
 
Peter den Haan
author
Ranch Hand
Posts: 3252
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Originally posted by leon matthew:
What's the fastest method of storing 2 word list (both about 20000 words each) then comparing them to find which words occur in the first file that do not appear in the second file?
Assuming you have these two lists in-memory, store them both in a HashSet. Then wordSet1.removeAll(wordSet2) will leave only those words in the first list that do not occur in the second list. Scalability is O(m)+O(n), i.e. linear with the number of words, which is as good as it gets. You can shave off some time by rolling your own implementation but that won't change scalability.
If they are in files on disk it's just as simple; read wordFile1 into a HashSet, then go through wordFile2 word by word (e.g. using a StringTokenizer) and call wordSet.remove(word) for each. Performance is again O(n) but you save a bit on object churn and memory footprint.
- Peter
 
leon matthew
Greenhorn
Posts: 27
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thanks for the advice, i was heading in that general direction after several other attempts (mostly using arrays).
There's a bit of a snag though, i cannot remove words from the list, i have to output a list to the screen, if a word in wordfile2 exists in wordfile2 its outputted with a 'y' beside it, if its not in the file it outputs with an 'n' beside it.
 
Peter den Haan
author
Ranch Hand
Posts: 3252
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Doesn't make a lot of difference...

Performance of the fragment above is O(n), where n is the number of words in set 1; your code as a whole will be O(n) + O(m).
- Peter
 
leon matthew
Greenhorn
Posts: 27
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Thanks a million!
 
leon matthew
Greenhorn
Posts: 27
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
that's working well, but is there a way to get the algorithim into n*log n?
 
Peter den Haan
author
Ranch Hand
Posts: 3252
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Originally posted by leon matthew:
that's working well, but is there a way to get the algorithim into n*log n?
Well, sure, if you want a slower(!) algorithm, just replace the HashSet by a TreeSet. I may be stating the obvious, but O(m) + O(n) is better than O(n log n).
- Peter
 
Roseanne Zhang
Ranch Hand
Posts: 1953
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Peter,
You need to understand the why.
Originally posted by leon matthew:
that's working well, but is there a way to get the algorithim into n*log n?


[This message has been edited by Roseanne Zhang (edited October 17, 2001).]
 
  • Post Reply
  • Bookmark Topic Watch Topic
  • New Topic