wood burning stoves 2.0*
The moose likes Java in General and the fly likes word comparison in two text lists Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of OCA/OCP Java SE 7 Programmer I & II Study Guide this week in the OCPJP forum!
JavaRanch » Java Forums » Java » Java in General
Bookmark "word comparison in two text lists" Watch "word comparison in two text lists" New topic
Author

word comparison in two text lists

leon matthew
Greenhorn

Joined: Oct 10, 2001
Posts: 27
What's the fastest method of storing 2 word list (both about 20000 words each) then comparing them to find which words occur in the first file that do not appear in the second file?
Cheers
Peter den Haan
author
Ranch Hand

Joined: Apr 20, 2000
Posts: 3252
Originally posted by leon matthew:
What's the fastest method of storing 2 word list (both about 20000 words each) then comparing them to find which words occur in the first file that do not appear in the second file?
Assuming you have these two lists in-memory, store them both in a HashSet. Then wordSet1.removeAll(wordSet2) will leave only those words in the first list that do not occur in the second list. Scalability is O(m)+O(n), i.e. linear with the number of words, which is as good as it gets. You can shave off some time by rolling your own implementation but that won't change scalability.
If they are in files on disk it's just as simple; read wordFile1 into a HashSet, then go through wordFile2 word by word (e.g. using a StringTokenizer) and call wordSet.remove(word) for each. Performance is again O(n) but you save a bit on object churn and memory footprint.
- Peter
leon matthew
Greenhorn

Joined: Oct 10, 2001
Posts: 27
Thanks for the advice, i was heading in that general direction after several other attempts (mostly using arrays).
There's a bit of a snag though, i cannot remove words from the list, i have to output a list to the screen, if a word in wordfile2 exists in wordfile2 its outputted with a 'y' beside it, if its not in the file it outputs with an 'n' beside it.
Peter den Haan
author
Ranch Hand

Joined: Apr 20, 2000
Posts: 3252
Doesn't make a lot of difference...

Performance of the fragment above is O(n), where n is the number of words in set 1; your code as a whole will be O(n) + O(m).
- Peter
leon matthew
Greenhorn

Joined: Oct 10, 2001
Posts: 27
Thanks a million!
leon matthew
Greenhorn

Joined: Oct 10, 2001
Posts: 27
that's working well, but is there a way to get the algorithim into n*log n?
Peter den Haan
author
Ranch Hand

Joined: Apr 20, 2000
Posts: 3252
Originally posted by leon matthew:
that's working well, but is there a way to get the algorithim into n*log n?
Well, sure, if you want a slower(!) algorithm, just replace the HashSet by a TreeSet. I may be stating the obvious, but O(m) + O(n) is better than O(n log n).
- Peter
Roseanne Zhang
Ranch Hand

Joined: Nov 14, 2000
Posts: 1953
Peter,
You need to understand the why.
Originally posted by leon matthew:
that's working well, but is there a way to get the algorithim into n*log n?


[This message has been edited by Roseanne Zhang (edited October 17, 2001).]
 
 
subject: word comparison in two text lists