Parsing CSV files with quotes

Scott Selikoff
Saloon Keeper

Joined: Oct 23, 2005
Posts: 3704
    

Hi All,

Normally, if I want to parse a string of values separated by commas, I use something like:

String[] values = line.split(",");

The problem is that I have a file in which some values are text strings enclosed in quotes, such as:

1,2,"My dog is white, and the zoo is far away",cat

Notice the , inside the quotes: it is part of the text and does not start a new field.

Anyway... I know there are a lot of CSV readers out there, but they seem a little overly complicated to me. I'm not great with regular expressions, but is there a simple way to modify the original split() call to handle this more complex structure? The only catch is that the quotes are optional: some fields may have them and some may not.
[ June 19, 2007: Message edited by: Scott Selikoff ]

Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41820
    
I don't think that can be achieved with a single regexp. But there are other complications to consider as well (e.g., there may be newline characters in the strings which would throw off a simple line-by-line reading algorithm).
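
To illustrate the newline complication: one way to cope with it is to keep appending physical lines to the current record for as long as a quoted field is still open. The following is only a rough sketch under stated assumptions. The class and method names (CsvRecordJoiner, joinRecords) are made up for this example, the lines are assumed to have been read already (for instance with BufferedReader.readLine()), and quotes inside fields are assumed to be escaped by doubling, so counting quote characters is enough to tell whether a field is open.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvRecordJoiner {

    /**
     * Joins physical lines into logical CSV records.
     * A record continues onto the next line while a quoted field is open.
     * Line breaks inside a field are normalized to '\n'.
     */
    public static List<String> joinRecords(List<String> physicalLines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;

        for (String line : physicalLines) {
            if (insideQuotes) {
                current.append('\n'); // restore the line break that split the field
            }
            current.append(line);

            // Every quote toggles the state; a doubled quote ("") toggles twice,
            // so it leaves the state unchanged.
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == '"') {
                    insideQuotes = !insideQuotes;
                }
            }

            if (!insideQuotes) {
                records.add(current.toString());
                current.setLength(0);
            }
        }

        // Unterminated quote at end of input: return what was collected.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }
}
```

Each returned record still has to be split into fields afterwards; this step only makes sure a record is never cut in the middle of a quoted value.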

Do yourself a favor and use one of the ready-made CSV parsers; they're worth the bit of extra effort.


Gavin Tranter
Ranch Hand

Joined: Jan 01, 2007
Posts: 333
I don't see why you shouldn't be able to use regex to do this. It might take a couple of passes, but my regex isn't great either.

If we assume one line equals one record, then we can assume that any text between "" is one field (equivalent to the text between ,,). We can split it out manually, replacing it with a "token" and recording its position in the field list; then, once the split has been performed, replace the "token" in the array with the string.

Or you could tell a regex that if it finds a , between quotes it is to replace it with a "token", use split as normal, then replace the "token" with ,
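
A rough sketch of the second idea (mask the commas inside quotes, split, then restore them) might look like the following. Note that it uses a plain character scan rather than a regular expression to find the quoted commas. The class name TokenSplit and the placeholder character are assumptions made for this example; the placeholder must never occur in the real data, and one line is assumed to equal one record.

```java
public class TokenSplit {

    // Hypothetical placeholder character, assumed never to appear in the data.
    private static final char TOKEN = '\u0001';

    /** Splits one line on commas, ignoring commas that sit inside quotes. */
    public static String[] split(String line) {
        StringBuilder masked = new StringBuilder(line.length());
        boolean insideQuotes = false;

        // Pass 1: replace every comma inside quotes with the placeholder.
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                insideQuotes = !insideQuotes;
            }
            masked.append(insideQuotes && c == ',' ? TOKEN : c);
        }

        // Pass 2: ordinary split; the limit -1 keeps trailing empty fields.
        String[] fields = masked.toString().split(",", -1);

        // Pass 3: put the original commas back.
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].replace(TOKEN, ',');
        }
        return fields;
    }
}
```

The surrounding quotes are left on the field in this version; stripping them would be a separate step.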

You could always just replace ," with , before splitting the string.

Of course, those are just some ideas off the top of my head. I hope they help, but I can't say they are useful or any good.

G

PS: You're a certified programmer? Not sure why this would present an issue. *shrugs*

[ June 20, 2007: Message edited by: Gavin Tranter ]
Bill Cruise
Ranch Hand

Joined: Jun 01, 2007
Posts: 148
I'm not sure I follow you. Do you still want the line to split at the commas?

So,

1,2,"My dog is white, and the zoo is far away",cat

becomes

1
2
My dog is white, and the zoo is far away
cat

Is this correct, or should I leave the quotes around the phrase?
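
For reference, the output listed above (quotes removed) can be produced with a split on a lookahead pattern, as long as every record sits on a single line and the quotes are balanced. This is only a sketch: the class name QuotedSplit is made up, and the pattern splits on a comma only when an even number of quote characters follows it up to the end of the line.

```java
import java.util.regex.Pattern;

public class QuotedSplit {

    // A comma followed by an even number of quotes up to the end of the line,
    // i.e. a comma that is not inside a quoted field.
    private static final Pattern COMMA_OUTSIDE_QUOTES =
            Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    /** Splits a single-line record and strips the optional surrounding quotes. */
    public static String[] split(String line) {
        String[] fields = COMMA_OUTSIDE_QUOTES.split(line, -1);
        for (int i = 0; i < fields.length; i++) {
            String f = fields[i];
            if (f.length() >= 2 && f.startsWith("\"") && f.endsWith("\"")) {
                // Drop the enclosing quotes and collapse doubled quotes.
                f = f.substring(1, f.length() - 1).replace("\"\"", "\"");
            }
            fields[i] = f;
        }
        return fields;
    }
}
```

The lookahead rescans the rest of the line for every comma, so this is fine for short lines but not a good fit for very long ones, and it does not handle records that span several lines.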
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41820
    
if we assume one line equals one record

As I pointed out, not a correct assumption for CSV in general (although it may be correct for the files Scott is working with).

then we can assume that any text between "" is one field

Nope. The text might contain double quotes (which would be escaped by doubling them), so the text foo "42" bar would become "foo ""42"" bar" in the CSV file.

Not to mention the fact that CSV files are sometimes SCSV files - semicolon separated values.
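
To show how much there is to handle, here is a minimal hand-written sketch that deals with optional quotes, doubled quotes, and a configurable separator for a single record. It is an illustration under stated assumptions, not a replacement for a tested CSV library: the class name SimpleCsvLine and its parse method are invented for this example, and whitespace handling, malformed input, and character encodings are all ignored.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleCsvLine {

    /** Parses one logical record into fields using the given separator. */
    public static List<String> parse(String record, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean insideQuotes = false;

        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (insideQuotes) {
                if (c == '"') {
                    if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote means one literal quote
                        i++;               // skip the second quote
                    } else {
                        insideQuotes = false; // closing quote
                    }
                } else {
                    field.append(c); // separators and newlines are plain text here
                }
            } else if (c == '"') {
                insideQuotes = true; // opening quote
            } else if (c == delimiter) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString()); // last field has no trailing separator
        return fields;
    }
}
```

Calling parse with ';' as the delimiter covers the semicolon-separated variant; the same method with ',' handles the example line from the original question.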

Before you've got all that coded up (regexps or no regexps) you will have gotten an existing CSV class working.
Gavin Tranter
Ranch Hand

Joined: Jan 01, 2007
Posts: 333
Like I said, they are assumptions.
Depending on the use of the CSV it should be possible to define its syntax fairly strictly.

I do agree with what you are saying about double quotes etc., but the original poster doesn't seem to want to use a 3rd party parser, so I was offering up some options based on the assumptions I made.
Sometimes it's not worth a full-blown parser if it's a simple one-off need.

Perhaps there is call for a CSV parser in Jakarta commons?

Of course, if you had control of the application creating the CSV, you could always change the separator to something not used in the data.

G
 