Hi I have to read a csv file and an excel file (these are the 2 input files to me). The csv file contains data about individuals and the excel file contains mapping info. The csv file is expected to have about 300 rows of 100 cols each and the excel file about 50 rows of 10 cols each. The csv file will contain a header row as the first row. I have to pick the values from the csv file,see if a matching mapping exists in the excel mapping file and do some calculations. Not all the 100 cols of the csv file will be used, it depends on the mapping file (which can have diff mappings).i have to find all those values in the csv file, which have a mapping in the excel file and do caluculations for those values. This has to be done for each row of the csv file. Now the question is, do i read each line of the csv, use Stringtokenizer to split and do the processing or use some of the api like ostermiller etc. speed is of utmost importance to me.(Of course if i use Stringtokenizer, after reading the forst data row, i will save the details of which useful col(to be used in calculations) is at which index, so that i need not do the search for col in the subsequent rows) i read a comparative study about opencsv and it said that using StringTokenizer may be faster. I am also ok with reading the entire file in 1 shot(saving the i/o) and do the processing, but a bit wary about the size of the input csv file which may be around 1 MB.
Expert comments are needed.
If I become filthy rich, I'll sponsor research for painless dental treatment at Harvard Medical School. Thats why,I'm learning Java.I have 32 teeth, 22 are man made.
1 MB isn't much. I'd start with using one of the CSV libraries and see if that isn't fast enough for your purposes. If it isn't, you can always code up something else, although I'd guess that these libraries aren't particularly slow. (Note that reading CSVs isn't all that trivial once you consider newlines characters, cells containing commas or semicolons, and quoted content.)
For reading XLS files, check out the Apache POI library.
Ping & DNS - updated with new look and Ping home screen widget
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." (Knuth, Donald. Structured Programming with go to Statements, ACM Journal Computing Surveys, Vol 6, No. 4, Dec. 1974. p.268.) When to optimize
Hi Thanks Ulf and Joe for the really prompt replies. Well,i've been assured that the input csv file will be mostly numeric values,without comma or other spl chars. So i don't need the extra functionality provided by the CSV libraries. And reading using a BufferedReader won't be a problem, atleast. What worries me the most is speed. Correct me if i'm wrong, won;t these libraries also be builidng on BufferedReader to read the csv? If yes, won't using BufferedReader be a faster approach, if i don't need the extra functionalities?
[ December 08, 2008: Message edited by: Jhakda Velu ] [ December 09, 2008: Message edited by: Jhakda Velu ]
Joined: Mar 22, 2005
Since it's so easy to test both ways, it hardly seems worthwhile spending a lot of thought on this.
But heed Joe's advice, and only start thinking about this once you've made sure that you need to worry about performance.
Originally posted by Jhakda Velu: Correct me if i'm wrong, won;t these libraries also be builidng on BufferedReader to read the csv?
The library I looked at once for CSV processing didn't use that, or anything else. It took Strings that I passed it and parsed them into tokens as per the CSV rules. Which is as it should be. CSV parsers shouldn't be involved in reading from disks or URLs or anything else.
But I agree with Ulf. It's time to stop speculating now. You have plenty of information to work with.
Joined: Feb 26, 2008
Hi All Thanks for the advice. I will try out one of the ways today and let you all know whether i meet the performance criteria,which happens to be 9 secs. This includes the time for uploading the 2 files,processing it at the server, producing the output file and sending it back to client. So roughly i have 2-3 secs to do the processing. And yes, i'll be reading from a stream at the server. I'm likely to do the read line--> split string approach.
Joined: Feb 26, 2008
Hi All I have adopted the read each line of file approac, instead of using the libraries like OpenCSV and others. Major reason being i didn't need any of the fnacy additional functionalities provided by them I am able to upload the input csv file and the mapping xls file, read them at the server and do the processing and return the output file in 6-7 secs. I used jxcel api to read write/excel file . My inout file had data about users, each row representing data for 1 user.For each user, there were about 100 cols of data, all of which were are not used to generate teh output file. from the mapping file, i get the meaning of the cols of the input file,and the cols in the input file are not in the same order as the mapping file.For each user date, i end up generating around 70 rows in the output file.This could change if the mapping file is updated. for these reasons, i did not use the VO approach, as that would mean updating the code everytime they update the mapping or input file. The approach i used was to read the first row of the input file, which contained header infor. this info is used to map witht the mapping file. So as soon as i read the first row, i added the col heading and their respective position as key value pair in a HashMap. For subsequent rows, i need not loop to find the column position.