
Sort very large data without using database

 
shalini raj
Greenhorn
Posts: 1
How do you sort a large data set (say, 100 million integer numbers) on a machine with 64 MB of RAM, without using a database?
 
Ernest Friedman-Hill
author and iconoclast
Marshal
Posts: 24208
Hi,

Welcome to JavaRanch!

The classic, most general solution is an N-way merge sort. Divide the data into chunks that fit in memory, sort each one in-core, and write each out as an individual sorted run file. Now open all the files and read one number from each. Write the smallest one to the final file, and replace it with the next number from the same file. Repeat. Continue until all the files are empty.

But there are other things you can do if you know more about the data. For example, what if the numbers are all unique and have a known maximum number of digits (for example, a large collection of US phone numbers)? Then you can create an output array that contains just one bit to represent each possible number, and just run through the 100 million numbers on disk, setting the bit in the array corresponding to each number you find. Then you run through the bit array, writing the full number for each "on" bit to a final file. This last algorithm is linear in the size of the data -- quite a feat! The idea comes from Jon Bentley's "Programming Pearls", by the way -- a marvelous little book.
 
Chetan Parekh
Ranch Hand
Posts: 3640
Originally posted by Ernest Friedman-Hill:
The idea comes from Jon Bentley's "Programming Pearls", by the way -- a marvelous little book.


Thanks for suggesting the book.
 
Srinivas Kalvala
Ranch Hand
Posts: 257
Hello,

There is always a trade-off between memory and performance. Would we really run into this kind of situation in the real world? And if we did, would we hold the line at 64 MB and go hunting for a clever algorithm? Probably not; we would accept the trade-offs instead.

More inputs are welcome.
 
Stan James
(instanceof Sidekick)
Ranch Hand
Posts: 8791
Originally posted by Ernest Friedman-Hill:
The classic, most general solution is an N-way merge sort. Divide the data into chunks that fit in memory, sort each one in-core, and write each out as an individual sorted run file. Now open all the files and read one number from each. Write the smallest one to the final file, and replace it with the next number from the same file. Repeat. Continue until all the files are empty.


Brings back memories ... on 70's vintage mainframes we could see the tape drives write a few records to each tape, spin them all back, merge them together, and, as you say, repeat more or less forever.
 