Sorting huge files by using an index

Prashant Sehgal
Ranch Hand

Joined: Jun 20, 2003
Posts: 56
Hi,

How does one index a flat (CSV) file for the purposes of sorting? If I know exactly which columns I have to sort on (or generate the index for), what techniques are available for doing it? I searched the web, but all the indexing tutorials I found were aimed at DBMSes. Does anyone know of a tutorial for indexing flat CSV files? Or has anyone reading this post tried it before?

My records are not fixed width, so using a RandomAccessFile (RAF) is not recommended.

I know of internal and external sorting algorithms, but have never tried indexing a flat file before. How does it work?

Thanks,
Prashant.
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Can you please explain what the purpose of indexing the file is? How will the index be used?


The soul is dyed the color of its thoughts. Think only on those things that are in line with your principles and can bear the light of day. The content of your character is your choice. Day by day, what you do is who you become. Your integrity is your destiny - it is the light that guides your way. - Heraclitus
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
Since your record length is not fixed, you are going to end up using a RandomAccessFile to make use of the index - it is the only way to jump to a record start.
Off the top of my head, I would say you will have to scan the file looking for the starts of lines. For each line start, record the file position and grab the content you are going to sort on. That sounds like a job for a custom object containing two variables:

long fposition ;
String key ;

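Store those guys in a collection, maybe a TreeMap. When your scan is done, the TreeMap will have the sorted order, and each fposition will point to a line start so you can do a RAF seek to it. Bear in mind that a TreeMap holds only one value per key, so if sort keys can repeat you will need a list of positions per key, or a sorted List of the objects instead.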

Bill
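
The scan-and-seek idea described above can be sketched roughly as follows. This is only an illustration under some loud assumptions: the class and method names (CsvLineIndex, buildIndex, writeSorted) are made up for this example, fields are split naively on commas (no quoted commas or embedded newlines), the file is in a single-byte encoding, and all keys fit in memory. A sorted List is used in place of the suggested TreeMap so that duplicate keys are kept.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CsvLineIndex {

    /** One index entry: where a line starts and the key it sorts by. */
    public static final class Entry {
        public final long fposition;
        public final String key;

        Entry(long fposition, String key) {
            this.fposition = fposition;
            this.key = key;
        }
    }

    /** Scans the file once and returns the entries sorted by the given column. */
    public static List<Entry> buildIndex(String path, int column) throws IOException {
        List<Entry> index = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long pos = raf.getFilePointer();          // start of the line about to be read
            String line;
            // Note: RandomAccessFile.readLine is unbuffered, so this is slow on huge files.
            while ((line = raf.readLine()) != null) {
                if (!line.isEmpty()) {
                    // Assumption: naive split, no quoted fields.
                    String[] fields = line.split(",", -1);
                    String key = column < fields.length ? fields[column] : "";
                    index.add(new Entry(pos, key));
                }
                pos = raf.getFilePointer();           // start of the next line
            }
        }
        // List.sort is stable, so lines with equal keys keep their file order.
        index.sort(Comparator.comparing((Entry e) -> e.key));
        return index;
    }

    /** Writes the lines of the input file to the output file in index order. */
    public static void writeSorted(String in, List<Entry> index, String out) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(in, "r");
             BufferedWriter w = Files.newBufferedWriter(Paths.get(out), StandardCharsets.ISO_8859_1)) {
            for (Entry e : index) {
                raf.seek(e.fposition);                // jump straight to the line start
                w.write(raf.readLine());
                w.newLine();
            }
        }
    }
}
```

Calling buildIndex(path, 1) followed by writeSorted(path, index, outPath) would produce a copy of the file ordered by its second column. Only the keys and offsets are held in memory, not the full lines, which is the point of indexing rather than sorting the records directly.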
Prashant Sehgal
Ranch Hand

Joined: Jun 20, 2003
Posts: 56
Has anyone heard of Lucene?

It's an indexing API from jakarta.org
Richard Rodger
Greenhorn

Joined: Jan 07, 2005
Posts: 17
How big is the file? The simplest solution is just to read it all in, sort it in memory, and dump it out again.

Otherwise, go with a simple in-process database like http://hsqldb.sourceforge.net
and load the CSV file into a table.


Richard Rodger
http://www.ricebridge.com
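
For a file that fits in memory, the read-everything, sort, and write-back approach suggested above might look something like this. It is a minimal sketch, not a definitive implementation: the names InMemoryCsvSort, keyOf, and sortFile are hypothetical, the file is assumed to be UTF-8 with no header row, and fields are split naively on commas.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class InMemoryCsvSort {

    /** Returns the value of column col, or "" if the line has too few fields. */
    static String keyOf(String line, int col) {
        // Assumption: naive split, no quoted fields.
        String[] fields = line.split(",", -1);
        return col < fields.length ? fields[col] : "";
    }

    /** Loads every line, sorts by the given column, and writes the result. */
    public static void sortFile(String in, String out, int col) throws IOException {
        // Copy into an ArrayList so the list is certain to be mutable.
        List<String> lines = new ArrayList<>(
                Files.readAllLines(Paths.get(in), StandardCharsets.UTF_8));
        lines.sort(Comparator.comparing((String line) -> keyOf(line, col)));
        Files.write(Paths.get(out), lines, StandardCharsets.UTF_8);
    }
}
```

This compares keys as strings, so numeric columns would sort lexicographically ("10" before "9") unless the comparator parses them first.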
 