Regular Expression To Parse CSV

Dave Hewy
Ranch Hand

Joined: Aug 21, 2003
Posts: 93
Hi - I need some help creating a regular expression that will parse a line of comma-separated values. The problem is that some of the values have embedded commas that I want to ignore. Here's an example...

100,to_date('18-Jan-2001','dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'

Look at the value starting with to_date: I don't want the embedded comma to act as a value separator.

Can anyone help with a Regex I can use on String.split to get an array of the values?

Thanks

Dave.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42648
Unless the point is specifically to do this with regexps, why not use a library that reads CSV files, like this one? CSV has a number of edge cases you need to consider, and by the time you have implemented all of them, you would already have been done had you used ready-made code.


Ping & DNS - my free Android networking tools app
Dave Hewy
Ranch Hand

Joined: Aug 21, 2003
Posts: 93
Yes, I would like to do this with regex if possible, before I investigate other methods.
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
I'm quite sure that it's not possible.


The soul is dyed the color of its thoughts. Think only on those things that are in line with your principles and can bear the light of day. The content of your character is your choice. Day by day, what you do is who you become. Your integrity is your destiny - it is the light that guides your way. - Heraclitus
Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
I think Ilja can say that with confidence because the input you provided is not a "regular language" and cannot be parsed by regular expressions. This distinction gets way over my head in language theory, but the shortest and most applicable tip I could find is: "... a language that allows parenthesized expressions, but requires the parentheses to balance, cannot be a regular language, and so the language cannot be generated by a regular grammar ..." from Wikipedia.

I don't think there is a true standard for CSV, but your to_date expression should probably be inside quotes. CSV generators and parsers I've used (e.g. Excel) use the quotes to know they should ignore the comma in the middle.

"to_date('18-Jan-2001','dd-mon-yyyy')"

Does that give you some ideas on how to parse this stuff? Be sure to consider strings with quotes inside them, too!


A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 19004

I agree. Regular expressions aren't able to match expressions where you have to keep track of matching closing braces to unlimited depth. Heck, even if you limit the depth, it can get ridiculously complicated.

For example, if I limit the depth to only one set of "()" pairs, the regex becomes...



This should work for your example string, but will fail if the "to_date" function contains another function as one of its parameters.
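
The pattern itself is not preserved in the post above. As a minimal sketch of the same idea, assuming every value is either free of parentheses or contains only unnested "(...)" groups, something like the following could be used with find(). The class name, method name, and the pattern are illustrative assumptions, not the original regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DepthOneFields {
    // Assumed pattern: a field is a run of ordinary characters
    // and/or unnested "(...)" groups. Commas inside such a group
    // are consumed as part of the field.
    private static final Pattern FIELD =
            Pattern.compile("(?:[^,()]|\\([^()]*\\))+");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "100,to_date('18-Jan-2001','dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'";
        for (String field : fields(line)) {
            System.out.println(field);
        }
    }
}
```

Note that this sketch skips empty fields (two commas in a row yield nothing between them) and still splits on commas inside quoted strings that are not wrapped in parentheses.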

Henry
[ May 18, 2006: Message edited by: Henry Wong ]

Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42648
Also be aware that, despite their name, CSV files can use semicolons instead of commas, and that strings can include newline characters and still be perfectly valid CSV (thus you can't just process the file line by line). Considering all this, go with a ready-made solution.
Dave Hewy
Ranch Hand

Joined: Aug 21, 2003
Posts: 93
Mmmm - I'm not sure what the link between languages and expressions is?

I thought most things that fit into some kind of pattern (and this does) could, or should be able to be parsed with regular expressions.

Having said that, I'm not a regex expert, hence my original post!

I can quite easily do this with Java, but I have a lot of these to parse and thought regex would probably be faster - but in reality, it probably won't make that much difference.

Thanks anyway for your replies.

Dave.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 19004

Can anyone help with a Regex I can use on String.split to get an array of the values?


Oh... the question is for the split() method (and not for the find() method).

For the split() method, it is not possible. The size of the parameters is not even fixed, so you can't even use a combination of zero-width negative look-aheads and look-behinds to limit the scope of the commas.
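
One narrower case does seem workable with split(): Java requires look-behinds to have a bounded length, but look-aheads have no such restriction. So if parentheses are never nested and never appear inside quoted strings, a negative look-ahead can skip commas that sit inside a "(...)" group. The following is a sketch under exactly those assumptions; the class name is hypothetical.

```java
import java.util.Arrays;

class LookaheadSplit {
    // Split on a comma unless a ')' follows with no '(' or ')' in between,
    // i.e. unless the comma sits inside an unnested "(...)" group.
    static String[] split(String line) {
        return line.split(",(?![^()]*\\))");
    }

    public static void main(String[] args) {
        String line = "100,to_date('18-Jan-2001','dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'";
        System.out.println(Arrays.toString(split(line)));
    }
}
```

As soon as one function call contains another, the look-ahead misjudges commas in the outer call, so this does not contradict the general point about nesting.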

Henry
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 19004

I can quite easily do this with Java, but I have a lot of these to parse and thought regex would probably be faster - but in reality, it probably won't make that much difference.


Well, regular expressions do match most of the time, which makes parsing a breeze. For other cases, regular expressions can be used to match tokens during parsing, making it very easy to write a parser.

Just because the regex engine couldn't match with a single expression doesn't mean you have to write a parser without it.
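
As a rough illustration of using a regex only for the tokens, the sketch below lets the pattern find the three interesting characters and leaves the nesting bookkeeping to plain Java. It assumes that parentheses and commas inside quoted strings do not need special treatment; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedSplit {
    // The regex only locates tokens; the depth counter does the parsing.
    private static final Pattern TOKEN = Pattern.compile("[(),]");

    static List<String> split(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        int depth = 0;   // number of currently open parentheses
        int start = 0;   // index where the current field begins
        while (m.find()) {
            char c = line.charAt(m.start());
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (depth > 0) depth--;   // tolerate a stray ')'
            } else if (depth == 0) {
                // A comma outside all parentheses ends the field.
                out.add(line.substring(start, m.start()));
                start = m.end();
            }
        }
        out.add(line.substring(start));   // last field
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("a,f(g(x,y),z),b"));
    }
}
```

Because the depth is an ordinary counter, this handles any nesting depth, which a single regular expression cannot.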

Henry
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42648
Mmmm - I'm not sure what the link between languages and expressions is?

I thought most things that fit into some kind of pattern (and this does) could, or should be able to be parsed with regular expressions.


Mathematical expressions can't be parsed purely by using regular expressions, precisely because of the nesting problems described earlier. But regexps can still be helpful in conjunction with other language constructs.

If you're really interested in the theory behind this, you can read up on the Chomsky Hierarchy, and you'll see why regular expressions represent a less powerful language than general mathematical expressions like the one mentioned above.
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

I don't know what the real-world problem behind the question is.
Perhaps you can solve it in two steps.

to produce:

100,to_date('18-Jan-2001'#'dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'

and then split that.
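
The code for the first step is not preserved above. A sketch of the two-step idea might look like the following, assuming parentheses are not nested and that '#' never occurs in the real data. The names are hypothetical, and restoring the commas afterwards is an extra step added here for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepSplit {
    // Matches one unnested "(...)" group.
    private static final Pattern GROUP = Pattern.compile("\\([^()]*\\)");

    // Step 1: replace commas inside each "(...)" group with '#'.
    static String maskCommas(String line) {
        Matcher m = GROUP.matcher(line);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String masked = m.group().replace(',', '#');
            m.appendReplacement(sb, Matcher.quoteReplacement(masked));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Step 2: split on the remaining commas, then put the masked ones back.
    static String[] split(String line) {
        String[] parts = maskCommas(line).split(",");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replace('#', ',');
        }
        return parts;
    }

    public static void main(String[] args) {
        String line = "100,to_date('18-Jan-2001','dd-mon-yyyy'),-6,0,1,0,1,'M','Male',10,'2','M'";
        System.out.println(maskCommas(line));
        System.out.println(Arrays.toString(split(line)));
    }
}
```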


http://home.arcor.de/hirnstrom/bewerbung
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Originally posted by Dave Hewy:
Mmmm - I'm not sure what the link between languages and expressions is?

I thought most things that fit into some kind of pattern (and this does) could, or should be able to be parsed with regular expressions.


Regular expressions are a way to describe regular languages. Regular expression APIs use those descriptions to parse "sentences" in the described language.


Having said that, I'm not a regex expert, hence my original post!


It's a fascinating topic.
Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
I ran up against this with a little macro language. Where I got lucky is that RegEx can easily find the balanced braces for the innermost nested macro. The macro processor replaces that with something else - plain text or maybe more macros. I keep finding and replacing the innermost until there ain't no mo.

You could do that here ... replace the parens and commas with some escape sequence. But just suggesting that makes me feel dirty.
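
For the curious, the innermost-first trick can be sketched roughly as below. The "{name}" macro syntax, the lookup map, and the iteration cap are all assumptions made up for this example and say nothing about the actual macro language.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InnermostMacros {
    // A brace pair with no braces inside it is, by definition, innermost.
    private static final Pattern INNERMOST = Pattern.compile("\\{([^{}]*)\\}");

    static String expand(String text, Map<String, String> macros) {
        // Cap the number of passes so a self-referential macro cannot loop forever.
        for (int pass = 0; pass < 1000; pass++) {
            Matcher m = INNERMOST.matcher(text);
            if (!m.find()) {
                return text;   // no macros left
            }
            // Unknown macro names expand to the empty string in this sketch.
            String replacement = macros.getOrDefault(m.group(1), "");
            text = text.substring(0, m.start()) + replacement + text.substring(m.end());
        }
        throw new IllegalStateException("macro expansion did not terminate");
    }

    public static void main(String[] args) {
        Map<String, String> macros = new HashMap<>();
        macros.put("greet", "Hello {name}");
        macros.put("name", "World");
        macros.put("x", "eet");
        System.out.println(expand("{gr{x}}!", macros));
    }
}
```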

BTW: If your data file quotes strings that have commas in them, you can do this with regex. Unescaped quotes must match to a depth of exactly one; nesting is not allowed. Look at the beginning of the first/next field. If it starts with a quote, take up to the next unescaped quote & comma or eol, otherwise take up to the next comma or eol. Rinse, repeat.
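
A rough sketch of that field-by-field loop follows. It assumes double quotes around fields and a doubled quote ("") as the escape, which is how Excel writes them; the class name and pattern are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedCsvLine {
    // Group 1: contents of a quoted field ("" is an escaped quote).
    // Group 2: an unquoted field. Group 3: the comma, or empty at end of line.
    private static final Pattern FIELD =
            Pattern.compile("(?:\"((?:[^\"]|\"\")*)\"|([^,\"]*))(,|$)");

    static List<String> parse(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        int pos = 0;
        while (true) {
            m.region(pos, line.length());
            if (!m.lookingAt()) {   // the field must start exactly at pos
                throw new IllegalArgumentException("Malformed field at index " + pos);
            }
            String quoted = m.group(1);
            out.add(quoted != null ? quoted.replace("\"\"", "\"") : m.group(2));
            if (m.group(3).isEmpty()) {
                break;   // reached end of line instead of a comma
            }
            pos = m.end();   // rinse, repeat
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "100,\"to_date('18-Jan-2001','dd-mon-yyyy')\",-6,\"say \"\"hi\"\"\"";
        for (String field : parse(line)) {
            System.out.println(field);
        }
    }
}
```

This treats the whole string as one record, so a quoted field containing a newline only works if the caller passes the complete record rather than a single physical line.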
[ May 19, 2006: Message edited by: Stan James ]
 