Suitable CSV Library for Java

Ankit Gohil
Greenhorn

Joined: Nov 28, 2013
Posts: 7

Hi Guys,
I'm looking for a good open-source CSV library for Java. After some research I found Commons CSV, OpenCSV, and SuperCSV. I have to choose based on a few criteria; I was able to find information on some of them but not all, so I hope someone here can help me with the rest.

Problem:
The CSV file consists of student records. One student can have more than one record, and the records of a particular student are always together in the file. Example:

Id, name, grade, subject, marks
S1, abc, 5th, English, 88
S1, abc, 5th, Maths, 80
S1, abc, 5th, History, 85
S1, abc, 5th, English, 82
S2, xyz, 5th, English, 78
S2, xyz, 5th, Maths, 80
S3, pqr, 6th, Maths, 89

Some unanswered questions are:

1. Which library has built-in validation for detecting formatting errors in the CSV file?
2. Which library supports multi-threading? I want to process the records of different students in different threads.

Jeanne Boyarsky
internet detective
Marshal

Joined: May 26, 2003
Posts: 30123
    

Ankit,
Welcome to CodeRanch!

How many students are in the file? Unless it is an extremely large number, your best bet might be to load the file into a ConcurrentMap (using one of those libraries) and do your parallel processing in Java.
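A minimal sketch of the ConcurrentMap idea, assuming Java 8 or later and assuming the rows have already been parsed into String[] by one of the libraries. The class and method names (StudentMapSketch, groupById, totalMarks) are made up for illustration, and summing the marks is only a stand-in for the real per-student work:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class StudentMapSketch {

    // Groups already-parsed rows by column 0 (the student Id).
    // The grouping itself runs on a single thread.
    static ConcurrentMap<String, List<String[]>> groupById(List<String[]> rows) {
        ConcurrentMap<String, List<String[]>> byStudent = new ConcurrentHashMap<>();
        for (String[] row : rows) {
            byStudent.computeIfAbsent(row[0].trim(), id -> new ArrayList<>()).add(row);
        }
        return byStudent;
    }

    // Placeholder work: sums the marks column (index 4) per student.
    // Each student's rows are handled by whichever worker thread picks up that entry.
    static Map<String, Integer> totalMarks(ConcurrentMap<String, List<String[]>> byStudent) {
        ConcurrentMap<String, Integer> totals = new ConcurrentHashMap<>();
        byStudent.entrySet().parallelStream().forEach(entry -> {
            int sum = 0;
            for (String[] row : entry.getValue()) {
                sum += Integer.parseInt(row[4].trim());
            }
            totals.put(entry.getKey(), sum);
        });
        return totals;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"S1", "abc", "5th", "English", "88"},
                new String[] {"S1", "abc", "5th", "Maths", "80"},
                new String[] {"S2", "xyz", "5th", "English", "78"});
        System.out.println(totalMarks(groupById(rows)));
    }
}
```

This only works while the whole file fits in memory, which is the condition attached to the suggestion above.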


Philip Thamaravelil
Ranch Hand

Joined: Feb 09, 2006
Posts: 99
 
Ankit Gohil
Greenhorn

Joined: Nov 28, 2013
Posts: 7

Jeanne Boyarsky wrote:Ankit,
Welcome to CodeRanch!

How many students are in the file? Unless it is an extremely large number, your best bet might be to load the file into a ConcurrentMap (using one of those libraries) and do your parallel processing in Java.


Thanks, Jeanne. The file size could run into gigabytes, so I don't want to get an OutOfMemoryError by loading the whole file into memory at once. Also, my research last night showed me that none of the libraries supports multi-threading, so I'll have to handle it manually, which is not an issue.

Now my major concern is: how do these libraries actually read the file?
I have learnt that Commons CSV reads the complete file at once and stores it in memory, while the others (OpenCSV, SuperCSV) read line by line.
Do they actually read one line at a time from the file and store it in some buffer area? How is it actually processed? Can you give me in-depth info on this?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41124
    
Quote: Don't need a framework, just read the file line by line and split each line. For validation, check the values in the split array.

No, don't do that. There's a reason these libraries exist: CSV is not quite as simple as it appears at first. By the time you're done coding all the special cases (which you need to handle in case somebody uses them), you might as well have used a library.
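To illustrate the kind of special case meant here, the sketch below compares a plain String.split with a small quote-aware splitter. The class NaiveSplitSketch and its methods are hypothetical, and the quote-aware version is deliberately incomplete (for example, it does not handle quoted fields that span several lines), which is part of the argument for using a library:

```java
import java.util.ArrayList;
import java.util.List;

class NaiveSplitSketch {

    // The "just split each line" approach.
    static String[] naiveSplit(String line) {
        return line.split(",");
    }

    // Quote-aware splitter for ONE physical line: keeps commas that sit inside
    // double quotes and treats a doubled quote ("") as an escaped quote.
    static List<String> quoteAwareSplit(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote inside a quoted field
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field, possibly empty
        return fields;
    }

    public static void main(String[] args) {
        String line = "S1,\"Doe, Jane\",5th,English,88";
        System.out.println(naiveSplit(line).length);      // 6: the quoted name is cut in two
        System.out.println(quoteAwareSplit(line).size()); // 5
    }
}
```

A second trap: String.split(",") silently drops trailing empty fields, so a line ending in empty columns comes back with fewer fields than it really has.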


Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41124
    
Ankit Gohil wrote:Do they actually read one line at a time from the file and store it in some buffer area? How is it actually processed? Can you give me in-depth info on this?

The libraries are all open source, and not big in size - might as well just look at the source if you're interested. I have always just used them, and never worried about how they work under the hood. I'd be surprised if any of them read the entire file into memory instead of line by line, as it would be an obvious and needless inefficiency.
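As a rough picture of what "line by line" means, here is a sketch using only the JDK. It is not the code of any of the libraries mentioned, and the class name StreamingReadSketch is made up. A BufferedReader fills a fixed-size character buffer from the underlying stream and hands out one line at a time, so memory use does not grow with the file size:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class StreamingReadSketch {

    // Counts data records while holding only the current line in memory
    // (plus the reader's internal buffer).
    static int countRecords(Reader source) throws IOException {
        int count = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            in.readLine(); // assumption: the first line is a header and is skipped
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    count++; // a real program would parse and process the record here
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        String csv = "Id, name, grade, subject, marks\n"
                + "S1, abc, 5th, English, 88\n"
                + "S1, abc, 5th, Maths, 80\n";
        System.out.println(countRecords(new StringReader(csv))); // 2
    }
}
```

For a real file, the same method could be given a java.io.FileReader or the result of java.nio.file.Files.newBufferedReader instead of the StringReader.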
Ankit Gohil
Greenhorn

Joined: Nov 28, 2013
Posts: 7

Ulf Dittmer wrote:
Ankit Gohil wrote:Do they actually read one line at a time from the file and store it in some buffer area? How is it actually processed? Can you give me in-depth info on this?

The libraries are all open source, and not big in size - might as well just look at the source if you're interested. I have always just used them, and never worried about how they work under the hood. I'd be surprised if any of them read the entire file into memory instead of line by line, as it would be an obvious and needless inefficiency.


Also, if you have any info on the points below, please let me know.

  • Does any of these libraries provide built-in format validation? If yes, is it customizable?
  • Is multi-threading supported in any form by any library? For my code I'm actually aiming for thread pooling. Example: say my CSV consists of 100 records or lines (assuming one record per line). I want to create thread T1 and make it read records 1-10, create T2 and make it read records 11-20, T3 for 21-30, and so on. Or, if I have to do it manually, then how?
    Ulf Dittmer
    Marshal

    Joined: Mar 22, 2005
    Posts: 41124
        
    No idea. I would imagine that the documentation of the libraries talks about that. I am almost certain that none supports your second point, simply because only one process can open a file at any given time. But concurrent I/O should not be necessary - just read the CSV into memory, and then create multiple threads that work with the in-memory representation.
    K. Tsang
    Bartender

    Joined: Sep 13, 2007
    Posts: 2242
        

    Hello Ankit

    I have used OpenCSV a few times. Regarding whether the library reads line by line or all at once: from my observation it's line by line. Take OpenCSV as an example. Its CSVReader class has two read methods:

    public String[] readNext() throws IOException;
    public List<String[]> readAll() throws IOException;

    You may have guessed that the method returning List<String[]> reads the entire file into memory. This option has its good and bad points. The good point is that you can easily get to a footer at the end of the file instead of reading line by line until the end. The bad point is that large files will cause an OutOfMemoryError.

    As for threading, I'm not aware of any such library that has what you want. Normally, if there are many files to process, you use one file per thread. What you do in the thread (e.g. spawning new sub-threads) is up to you.

    Ankit Gohil
    Greenhorn

    Joined: Nov 28, 2013
    Posts: 7

    K. Tsang wrote:Hello Ankit

    I have used OpenCSV a few times. Regarding whether the library reads line by line or all at once: from my observation it's line by line. Take OpenCSV as an example. Its CSVReader class has two read methods:

    public String[] readNext() throws IOException;
    public List<String[]> readAll() throws IOException;

    You may have guessed that the method returning List<String[]> reads the entire file into memory. This option has its good and bad points. The good point is that you can easily get to a footer at the end of the file instead of reading line by line until the end. The bad point is that large files will cause an OutOfMemoryError.

    As for threading, I'm not aware of any such library that has what you want. Normally, if there are many files to process, you use one file per thread. What you do in the thread (e.g. spawning new sub-threads) is up to you.


    Thanks for your reply. I would also like to know whether OpenCSV or any other library has a method that can check whether the file is genuine (i.e. whether it's actually a CSV). It could check either the file's extension or its actual contents.
    Ulf Dittmer
    Marshal

    Joined: Mar 22, 2005
    Posts: 41124
        
    I would imagine that the code throws an exception, or some error code, if the file is not actually properly formatted CSV.
    K. Tsang
    Bartender

    Joined: Sep 13, 2007
    Posts: 2242
        

    Nope, at least not for OpenCSV.

    Checking the file extension is only an indicator. The actual content (whether or not it's separable by some delimiter) is what you are looking for.

    Ankit Gohil
    Greenhorn

    Joined: Nov 28, 2013
    Posts: 7

    @Ulf, @Tsang: So this means I need to write my own code for checking it. I hope some day these libraries include this functionality, or maybe I can contribute it myself.
    Anyway, thanks guys for your support!
    Stuart A. Burkett
    Ranch Hand

    Joined: May 30, 2012
    Posts: 679
    Ankit Gohil wrote:So this means I need to write my own code for checking it. I hope some day these libraries include this functionality, or maybe I can contribute it myself.

    I think Ulf's reply indicated the exact opposite of that
    Ulf Dittmer wrote:I would imagine that the code throws an exception, or some error code, if the file is not actually properly formatted CSV.
    Joe Harry
    Ranch Hand

    Joined: Sep 26, 2006
    Posts: 9345
        

    I would go with the option of writing my own CSV parser. Maybe give camel-csv / camel-bindy a try!


    Ankit Gohil
    Greenhorn

    Joined: Nov 28, 2013
    Posts: 7

    Stuart A. Burkett wrote:
    Ankit Gohil wrote:So this means I need to write my own code for checking it. I hope some day these libraries include this functionality, or maybe I can contribute it myself.

    I think Ulf's reply indicated the exact opposite of that
    Ulf Dittmer wrote:I would imagine that the code throws an exception, or some error code, if the file is not actually properly formatted CSV.


    @Stuart: My bad. What Ulf stated was the complete opposite, and I understood it a bit later. I have also tried and tested it using SuperCSV: the code throws an exception if the file contents are not in CSV format, because it's not able to convert the record into a bean.
    Thanks, man!
    Joe Areeda
    Ranch Hand

    Joined: Apr 15, 2011
    Posts: 307
        

    I've used OpenCSV for a few projects and find it efficient and easy to use. I'll add a few random observations:

    1) I'm not sure what errors a general CSV parser could possibly detect. I suppose you could pass it a binary file but every text file is technically a csv file even if it has no commas.
    The checks I use are a) number of fields is in a valid range and b) numbers are where they are expected and in the allowed range.

    2) I don't really see any efficiency to opening the same file multiple times in different threads, although most OSs will allow a file to be opened multiple times for reading. I can see opening separate files in separate threads, or having the file reading thread pass individual records to a thread pool to process.

    3) Depending on exactly what you are going to do with this data, it might be worth importing the CSV into a relational database table and doing your processing with SQL queries. The benefit would be more obvious if the data is fairly stable and you plan to access it multiple times before re-importing.
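Point 2 (one reading thread passing records to a thread pool) could look roughly like the sketch below. It relies on the statement in the original question that all records of a student are adjacent in the file, so the reader can hand over one complete group per student. The names ReaderToPoolSketch, run, and process are invented for illustration, summing the marks stands in for the real work, and the plain split is only good enough for the unquoted sample data; in real code the fields would come from a CSV library:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ReaderToPoolSketch {

    // Stand-in for the real per-student work: sums the marks column (index 4).
    static int process(List<String[]> studentRows) {
        int total = 0;
        for (String[] row : studentRows) {
            total += Integer.parseInt(row[4].trim());
        }
        return total;
    }

    // Reads on the calling thread and hands each student's rows to the pool.
    static Map<String, Integer> run(Reader source, int threads)
            throws IOException, InterruptedException {
        Map<String, Integer> totals = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (BufferedReader in = new BufferedReader(source)) {
            in.readLine(); // assumption: the first line is a header
            String currentId = null;
            List<String[]> group = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) continue;
                String[] row = line.split(","); // sample data only, see note above
                String id = row[0].trim();
                if (currentId != null && !id.equals(currentId)) {
                    submit(pool, totals, currentId, group); // previous student is complete
                    group = new ArrayList<>();
                }
                currentId = id;
                group.add(row);
            }
            if (currentId != null) submit(pool, totals, currentId, group);
        } finally {
            pool.shutdown(); // no new tasks; already queued ones still run
        }
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return totals;
    }

    private static void submit(ExecutorService pool, Map<String, Integer> totals,
                               String id, List<String[]> rows) {
        pool.execute(() -> { totals.put(id, process(rows)); });
    }

    public static void main(String[] args) throws Exception {
        String csv = "Id, name, grade, subject, marks\n"
                + "S1, abc, 5th, English, 88\nS1, abc, 5th, Maths, 80\n"
                + "S2, xyz, 5th, English, 78\n";
        System.out.println(run(new StringReader(csv), 4));
    }
}
```

One thing to watch with very large files: Executors.newFixedThreadPool uses an unbounded queue, so a fast reader can pile up pending groups in memory. A bounded queue or some other form of back-pressure would be worth considering.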

    Joe


    Ankit Gohil
    Greenhorn

    Joined: Nov 28, 2013
    Posts: 7

    @Joe: Currently I'm evaluating SuperCSV against the functionality that I require. It has built-in validation for number ranges, not-null checks, and string matching (using regex). I know opening the same file in multiple threads isn't a good idea, so I have already dropped that.
    A database is ruled out; I can't use a DB in any form.
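For comparison, the checks discussed in this thread (field count, not-null, a regex match, and a number range) can be written by hand against an already-parsed row. This is a plain-Java sketch and not SuperCSV's API; the class RowValidationSketch, the Id pattern S<number>, and the 0-100 marks range are assumptions based on the sample data:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RowValidationSketch {

    private static final int EXPECTED_FIELDS = 5; // Id, name, grade, subject, marks
    private static final Pattern ID_PATTERN = Pattern.compile("S\\d+"); // assumed Id format

    // Returns the problems found in one parsed row; an empty list means the row is valid.
    static List<String> validate(String[] row) {
        List<String> problems = new ArrayList<>();
        if (row.length != EXPECTED_FIELDS) {
            problems.add("expected " + EXPECTED_FIELDS + " fields but found " + row.length);
            return problems; // the remaining checks rely on field positions
        }
        String[] f = new String[row.length];
        for (int i = 0; i < row.length; i++) {
            f[i] = row[i] == null ? "" : row[i].trim();
            if (f[i].isEmpty()) problems.add("field " + i + " is empty");
        }
        if (!f[0].isEmpty() && !ID_PATTERN.matcher(f[0]).matches()) {
            problems.add("Id '" + f[0] + "' does not match S<number>");
        }
        if (!f[4].isEmpty()) {
            try {
                int marks = Integer.parseInt(f[4]);
                if (marks < 0 || marks > 100) { // assumed valid range
                    problems.add("marks out of range 0-100: " + marks);
                }
            } catch (NumberFormatException e) {
                problems.add("marks is not a number: " + f[4]);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(validate(new String[] {"S1", "abc", "5th", "English", "88"}));
        System.out.println(validate(new String[] {"X1", "abc", "5th", "Maths", "eighty"}));
    }
}
```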

    Once I'm done with my evaluation of SuperCSV, I'll evaluate OpenCSV against the same set of criteria and will post the results here.
    Thanks, buddy!