

rahul Delhi

Joined: Oct 06, 2006
Posts: 2
Hi friends,
I have a file containing millions of words (several per line), and I have to find the duplicate words in it along with their number of occurrences.
I am using a TreeSet, but after storing around 5,00,000 words it fails with an error: running out of heap space.

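import java.util.List;
import java.util.TreeSet;

class WordType implements Comparable<WordType> {
    String word = null;
    int no_Of_Occur = 0;   // number of times this word has been seen
    List list = null;

    // Case-insensitive ordering, so "Word" and "word" compare as equal.
    public int compareTo(WordType obj) {
        return word.compareToIgnoreCase(obj.word);
    }
}

class DuplicateWords {

    TreeSet<WordType> wordTreeSet = new TreeSet<WordType>();
    WordType obj = new WordType();

    // ... (the code that reads the file is missing from the post; it contains:)
    System.out.println("Word read:" + obj.word);

    public static void main(String[] args) throws Exception {
        DuplicateWords dwObj = new DuplicateWords();
        // ... (the rest of main is missing from the post)
    }
}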
Please help with this.
Barry Gaunt
Ranch Hand

Joined: Aug 03, 2002
Posts: 7729
"rahul kk k kkk", please read our Java Ranch Naming Policy and change your displayed name to comply with it.

This is not a topic specific to SCJP, so I am moving it to our Java In General (Intermediate) forum...

Ask a Meaningful Question and HowToAskQuestionsOnJavaRanch
Getting someone to think and try something out is much more useful than just telling them the answer.
Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
I'd cheat and use several programs, some of which you probably already have with the OS or your favorite toolkit. Split the file into a new file or pipe stream with one word per line, sort them into another file or stream, then count dupes as they come through together.

input file | word splitter | sort | dupe counter

This loses the location information unless your splitter can put that on the line with each word.
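
The last stage of that pipeline can be sketched in Java roughly as follows. This is only an illustration, not code from the thread: the class name DupeCounter and the method countDupes are made up here, and it assumes the earlier stages have already produced one word per line, sorted case-insensitively so that equal words arrive next to each other. Only the current word and its running count are held in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical "dupe counter" stage: expects sorted input, one word per line.
class DupeCounter {

    // Writes "word count" for every word that occurs more than once.
    // Relies on equal words being adjacent, which the sort stage guarantees.
    static void countDupes(BufferedReader sortedWords, Writer out) throws IOException {
        String previous = null; // first spelling seen in the current run
        int count = 0;          // length of the current run
        String line;
        while ((line = sortedWords.readLine()) != null) {
            if (previous != null && line.equalsIgnoreCase(previous)) {
                count++;
            } else {
                if (count > 1) {
                    out.write(previous + " " + count + "\n");
                }
                previous = line;
                count = 1;
            }
        }
        // Flush the final run, which the loop never gets to report.
        if (count > 1) {
            out.write(previous + " " + count + "\n");
        }
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        Writer out = new BufferedWriter(new OutputStreamWriter(System.out));
        countDupes(in, out);
        out.flush();
    }
}
```

The splitting and sorting are left to whatever tools the OS provides, so the Java side never needs the whole word list on the heap.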


Editing to add a reference to an old favorite of mine. A Ternary Search Tree is a very fast way to store words, and only stores the differences between them. That is, PART and PARTICLE would share the PART. That just might save enough memory to run your first test file, but still blow up on a bigger one later. Or it might take more memory from the get go for all those one-letter nodes.
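
For anyone curious, a minimal sketch of a ternary search tree that keeps a count per word might look like this. The class and method names (TernaryWordCounter, add, count) are invented for the example, and lower-casing the words is an assumption made to match the case-insensitive compareTo in the original post. It does not record locations, and each node is still a full Java object, so whether it saves memory over the TreeSet would have to be measured.

```java
import java.util.Locale;

// Hypothetical word counter backed by a ternary search tree.
class TernaryWordCounter {

    private static final class Node {
        final char ch;
        Node lo, eq, hi; // smaller char, next char of the word, larger char
        int count;       // greater than 0 only where a word ends
        Node(char ch) { this.ch = ch; }
    }

    private Node root;

    // Adds one occurrence of the word and returns its new total.
    int add(String word) {
        if (word.isEmpty()) {
            throw new IllegalArgumentException("empty word");
        }
        String key = word.toLowerCase(Locale.ROOT);
        if (root == null) {
            root = new Node(key.charAt(0));
        }
        Node node = root;
        int i = 0;
        while (true) {
            char c = key.charAt(i);
            if (c < node.ch) {
                if (node.lo == null) node.lo = new Node(c);
                node = node.lo;
            } else if (c > node.ch) {
                if (node.hi == null) node.hi = new Node(c);
                node = node.hi;
            } else if (i == key.length() - 1) {
                return ++node.count; // last character: the word ends here
            } else {
                i++;                 // matched this character, move to the next
                if (node.eq == null) node.eq = new Node(key.charAt(i));
                node = node.eq;
            }
        }
    }

    // Returns how many times the word has been added, or 0 if never.
    int count(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        Node node = root;
        int i = 0;
        while (node != null && i < key.length()) {
            char c = key.charAt(i);
            if (c < node.ch) node = node.lo;
            else if (c > node.ch) node = node.hi;
            else if (i == key.length() - 1) return node.count;
            else { i++; node = node.eq; }
        }
        return 0;
    }

    public static void main(String[] args) {
        TernaryWordCounter counter = new TernaryWordCounter();
        for (String w : new String[] {"PART", "PARTICLE", "part"}) {
            counter.add(w);
        }
        System.out.println("part: " + counter.count("part"));
    }
}
```

Here PART and PARTICLE share the four nodes for P, A, R, T, and a word is a duplicate as soon as add returns a value greater than 1.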
[ October 06, 2006: Message edited by: Stan James ]

A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
subject: collections