JavaRanch » Java Forums » Java » I/O and Streams

Read million rows from .csv

Suvojyoty Saha
Greenhorn

Joined: Apr 19, 2011
Posts: 25

I have always used



the above code for reading the file, and then reader.readLine() for reading each line of the file.
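The snippet referred to above did not survive in this copy of the thread; as a sketch, the standard idiom presumably meant is a BufferedReader wrapping a FileReader, with readLine() in a loop:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineReader {
    // Read a text file line by line, returning the number of lines read.
    static long countLines(String path) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                count++;   // process the line here
            }
        }
        return count;
    }
}
```

The try-with-resources block ensures the file is closed even if reading fails partway through.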

But many interviewers have asked how to tackle a scenario where we had to read millions of records.

What should the answer to that question be? It would be very kind if anybody could provide some links where I can increase my knowledge base.

thanks,
Suvojyoty
Matthew Brown
Bartender

Joined: Apr 06, 2010
Posts: 4490
    

I don't know what they're looking for, but my opinion would be that the excerpt you've shown there wouldn't necessarily be different for a huge file. The differences would be in what you do with it. For example, you might not be able to read the entire file into memory and then process the results - you'd probably be looking to completely handle one record at a time.
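As a sketch of the record-at-a-time approach described above - the Consumer handler is a stand-in for whatever per-record work (parsing, inserting) is actually needed, so memory use stays constant regardless of file size:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.Consumer;

public class StreamProcessor {
    // Hand each record to the handler as soon as it is read;
    // no list of all lines is ever built, so memory stays flat.
    static long forEachLine(Reader source, Consumer<String> handler) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                handler.accept(line);
                count++;
            }
        }
        return count;
    }
}
```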
Suvojyoty Saha
Greenhorn

Joined: Apr 19, 2011
Posts: 25

I know, Matthew. I guess the interviewer wanted to retrieve all the info as quickly as possible.

One way I could think of to increase the speed slightly was by introducing threads:

1. one thread to read each line and insert it into a collection
2. a second thread using the collection to insert the values into the database.

Apart from this I am not sure how to increase the speed.
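The two-thread idea above is a classic producer-consumer pipeline; a minimal sketch using a bounded BlockingQueue as the shared collection. The counter stands in for the database insert, and the sentinel value is an assumption used to signal end of input:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class TwoStagePipeline {
    // Sentinel marking end of input; assumes no real CSV line equals it.
    private static final String POISON = "\u0000EOF";

    // Thread 1 puts lines on the queue; thread 2 drains them (here it just
    // counts, where a real pipeline would batch rows into the database).
    static long run(List<String> lines, int queueCapacity) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);
        AtomicLong consumed = new AtomicLong();

        Thread producer = new Thread(() -> {
            try {
                for (String line : lines) queue.put(line);
                queue.put(POISON);                    // tell the consumer we are done
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread consumer = new Thread(() -> {
            try {
                String line;
                while (!(line = queue.take()).equals(POISON)) {
                    consumed.incrementAndGet();       // stand-in for a DB insert
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        return consumed.get();
    }
}
```

The bounded queue also acts as back-pressure: if the database stage falls behind, the reader thread blocks instead of filling memory with unprocessed lines.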
Jesus Angeles
Ranch Hand

Joined: Feb 26, 2005
Posts: 2061
The data format is important. If it is ordered, you may not be able to use threads.

Depending on what data you have (e.g. keyed, sorted, has unneeded rows, has huge data in each row, has ONLY 1 byte in each row, etc.), you may first need to pre-process it, if possible, for faster processing.

Second, process the data in the best way possible, e.g. with threads.

Another thing, on rows with huge data: in my experience, your process can take 3 months instead of 3 hours if you load huge rows into memory unnecessarily. Try to keep huge data out of memory until it is needed.
Suvojyoty Saha
Greenhorn

Joined: Apr 19, 2011
Posts: 25

I understand, Jesus. But the data format can come into the picture only after reading a line through BufferedReader.

But how to speed up the reading process itself?
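For what it's worth, one of the few knobs on the raw read itself is the buffer size (BufferedReader's default is 8 KB); a larger buffer reduces the number of underlying read() calls on a big sequential file. A sketch, with the size left to the caller:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FastRead {
    // Open a file for line-by-line reading with an explicit buffer size,
    // e.g. 1 << 16 (64 KB), instead of BufferedReader's 8 KB default.
    static BufferedReader open(Path file, int bufferSize) throws IOException {
        return new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8),
                bufferSize);
    }
}
```

Whether this helps at all depends on the disk and OS caching; for most setups the read is I/O-bound and the gain is small, which is why the rest of this thread focuses on what happens after the read.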
Jesus Angeles
Ranch Hand

Joined: Feb 26, 2005
Posts: 2061
No one will know. The interviewer is testing your approach to problem solving.
Suvojyoty Saha
Greenhorn

Joined: Apr 19, 2011
Posts: 25

hehe. Thanks Jesus for the support.

But seriously, is there any better way of increasing the performance - for example, by loading chunks of data instead of reading each line and inserting it into the database one at a time?
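Chunked inserts are usually done with JDBC batching; a sketch, where the two-column table records(a, b) and the Connection supplied by the caller are hypothetical placeholders:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class BatchInsert {
    // Split rows into chunks so each executeBatch() sends batchSize rows at once.
    static <T> List<List<T>> chunks(List<T> rows, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += batchSize) {
            out.add(rows.subList(i, Math.min(i + batchSize, rows.size())));
        }
        return out;
    }

    // Hypothetical table "records(a, b)"; conn comes from the caller's DataSource.
    static void insert(Connection conn, List<String[]> rows, int batchSize) throws SQLException {
        String sql = "INSERT INTO records (a, b) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (List<String[]> chunk : chunks(rows, batchSize)) {
                for (String[] row : chunk) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();
                }
                ps.executeBatch();   // one round trip per chunk, not per row
            }
        }
    }
}
```

One network round trip per chunk instead of per row is typically where the big win is when the bottleneck is the database rather than the file read.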
Jesus Angeles
Ranch Hand

Joined: Feb 26, 2005
Posts: 2061
For such requirements, you can research ETL techniques (extract, transform, load).

For example, if you need to process 10 million records per hour, you really need to optimize your ETL process. The data format is important; your optimization will be based on it. For example, you might do preprocessing in one of the stages - E, T, or L.

Of course, better hardware will help.

In less extreme cases, you can consider using other languages and software if your current process is too slow.
Suvojyoty Saha
Greenhorn

Joined: Apr 19, 2011
Posts: 25

Thanks Jesus

I understand what you are trying to say. Informatica is an excellent ETL tool. From a similar perspective, we can also use SQL Developer or TOAD to read the files and insert them into the database. The data format is extremely important for both.

But my question remains the same: if I have to work with only Java instead of using these tools, what is the best way to do so?
Jesus Angeles
Ranch Hand

Joined: Feb 26, 2005
Posts: 2061
I have used only Java in my ETL systems. You can implement your own ETL, and sometimes that is the best way.
Suvojyoty Saha
Greenhorn

Joined: Apr 19, 2011
Posts: 25

Sorry for the late reply Jesus. I guess we have returned to the beginning of our discussion.

Thanks
 