File APIs for Java Developers
Manipulate DOC, XLS, PPT, PDF and many others from your application.
The moose likes Java in General and the fly likes fast searching in large text files Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login
JavaRanch » Java Forums » Java » Java in General
Bookmark "fast searching in large text files" Watch "fast searching in large text files" New topic

fast searching in large text files

Rohan Baweja
Ranch Hand

Joined: Aug 19, 2001
Posts: 31
I am trying to search huge text files in java. I tried to load a dictionary file ( 2.5 mb ) which contained words alphabetically one in a line, into a BinarySearchTree but it gave me a stackoverflow. The only thing that worked for me was LinearSearch, which was extremely slow.
Any suggestions for a solution(s)...
Frank Carver

Joined: Jan 07, 1999
Posts: 6920
I guess you would probably benefit from enlarging your stack space. Check out the arguments to the "java" command.

Read about me at ~ Raspberry Alpha Omega ~ Frank's Punchbarrel Blog
John Dale
Ranch Hand

Joined: Feb 22, 2001
Posts: 399
I wonder if the tree implementation you are using is letting the tree get so out of balance that the usual recursive tree navigation algorithm requires an unreasonably deep stack. The tree implementation used for java.util.TreeMap and java.util.TreeSet should not have this problem, because they keep the tree balanced.
In case you are not familiar with the issue of unbalanced tree...
Suppose you had a simple binary tree implementation that always added entry on the right of the last node if its key was greater than the key of that node (as might be the case for reading a list that was already in alphabetical order). Then each node of the tree would have only one child, and the tree would effectively be a long linked list. A simple recursive tree walking algorithm (likely used in the insert method) generate a very deep stack as the list got long.
There are a number of designs that are used to keep the tree in some sort of balance, such as the red-black algorithm used in TreeMap and TreeSet.
(Perhaps you will get a better response on this from someone who has taken a data structures course.)
Jim Yingst

Joined: Jan 30, 2000
Posts: 18671
[This was written as a follow-up to Frank, before John Dale's response, which gets to the key point much more concisely. But the posting of it was delayed and I'm too lazy to re-edit now...]
...though if you're using Windows I don't believe the standard JDKs have an option for this. Solaris and Linux versions do though. In any event, the fact that you're getting a stack overflow seems suspicious to me - typically this indicates a programming error, such as a recursive method call that doesn't eventually terminate. You can usually tell if this is the case by looking at the stack trace to see if the same method (or set of methods) appears repeatedly.
If you're not able to see the stack trace (maybe because the class cauing it is available only as compiled bytecode and you don't have access to the source) then another approach is to test your program using a substantially smaller input file. (E.g. just make a copy and then delete everthing after the letter C.) If there's an infinite-loop bug in the code causing the stack overflow, there's a good chance it will show up regardless of how big the input file is, or how much memory is allocated.
By the way, where does this BinarySearchTree come from? It's not a standard JDK class, but there seem to be a number of different third party libraries with a class of this name. Maybe the one you're using is just buggy. More likely though, unless you've got a somewhat advanced variation of a binary tree, you problem arises from the fact that by its very nature, a binary tree performs quite horribly when you insert data into it which has already been sorted. (As seems to be the case from your description.) What happens is that the "tree" just turns into a long linked list, and each additional insert requires a linear-time search (accessing every node in the tree/list) before the new element is added to the end. If the put/get operations are defined recursively rather than iteratively, it's not difficult to imagine that the stack would fill up quite quickly as each new node involves a new method call. And you're loading this with 2.5 MB of data? Urgh.
Chances are, your best bet is to use a different data structure than a simple binary tree. Have you tried a TreeMap? Or look through whatever library you got the BinarySearchTree from to find alternatives, like a RedBlackTree or BalancedTree or TreeThatDoesntSuck. Good luck...
[ August 24, 2002: Message edited by: Jim Yingst ]

"I'm not back." - Bill Harding, Twister
David Weitzman
Ranch Hand

Joined: Jul 27, 2001
Posts: 1365
You could just put the stuff into an ArrayList and do a binary search the old fashioned way. It might be slow to load if the ArrayList is initialized way to small and has to be resized several times, but if you know the dictionary size in advance and hard-code it you can actually make this method take less memory than the search tree.
But basically, HashSet and TreeSet just work without further thought.
Rohan Baweja
Ranch Hand

Joined: Aug 19, 2001
Posts: 31
Hey, Hi everyone, I just wanted to thank everyone for the response.
I think, the best solution for the above problem would be to divide the list into sublists containing which contain same starting letters, and then search through them seperately. For example sublists of A's B's C's and so on. It may not work with a 2.5mb file but smaller dictionary file would be managable.
I have read about all sorts of trees and data structures but I didnt read about balanced trees or trees based on red-black algo. So thank you all for enlightening me, and all the help.
David O'Meara

Joined: Mar 06, 2001
Posts: 13459

(OK, so maybe the conversation is over, but I still don't get it.)
There is a large text file and you want to search on it. I'm wondering how the trees and dictionaries go into it.
String searches are not the same as binary or tree searches, and rather than explain my self I'm just gonna paste a link.
(these are a bunch of string-seraching algorthims)
Frank Carver

Joined: Jan 07, 1999
Posts: 6920
I think the original poster wanted to "load" his work list into something which would make repeated searching quicker and simpler. That's why he chose a tree.
To find a particular entry in a balanced tree is an O(logN) operation. A linear search is typically O(N), and in practice much slower if you have to add the overhead of file access every time.
So, if you decide you want lo load a sorted word list into a tree, there's one thing you have to be aware of :- you must somehow keep the tree "balanced" (each "branch" has about the same weight of "twigs" on it).
The problem in this case was probably that the incoming list was sorted already, and so the tree got very unbalanced. Imagine you are building a tree using a standard tree insertion algorithm:
  • at start tree is empty.
  • add "aardvark" :- tree contains "aardvark"
  • add "anteater" :- it's after "aardvark", so put it on the right, tree contains "aardvark" - anteater"
  • add "apple" :- it's after "aardvark", so look right, it's after "anteater", so put it on the right. tree contains "aardvark" - "anteater" - "apple"

  • As you can see, all the entries get added to the right of each preceding node. None of the "left" branches are ever used, as no incoming word ever sorts before any of the others. What we end up with is not really a tree, just an overcomplicated list.
    There are several ways to overcome this, but a fairly simple one to understand is to load all the words into a list first, then build the tree using a binary chop:
  • assume a list of length n
  • fetch element 1/2 n, insert into tree
  • fetch element 1/4 n, insert into tree
  • fetch element 1/8 n, insert into tree
  • fetch element 3/8 n, insert into tree
  • fetch element 3/4 n, insert into tree
  • fetch element 5/8 n, insert into tree
  • fetch element 7/8 n, insert into tree
  • etc.

  • I leave the coding of a suitable and correct recursive algorithm as an excercise.
    Has that helped?
    [ August 26, 2002: Message edited by: Frank Carver ]
    I agree. Here's the link:
    subject: fast searching in large text files
    It's not a secret anymore!