Hi - I need some help creating a regular expression that will parse a line of comma-separated values. The problem is that some of the values have embedded commas that I want to ignore. Here's an example...
Look at the value starting with to_date: I don't want the embedded comma to act as a value separator.
Can anyone help with a Regex I can use on String.split to get an array of the values?
Thanks
Dave.
Ulf Dittmer
Marshal
Joined: Mar 22, 2005
Posts: 35257
Unless the point is specifically to do this with regexps, why not use a library that reads CSV files, like this one? CSV has a number of edge cases you need to consider, and before you have implemented all those, you're probably done using ready-made code.
Yes, I would like to do this with regex if possible, before I investigate other methods.
Ilja Preuss
author
Sheriff
Joined: Jul 11, 2001
Posts: 14112
I'm quite sure that it's not possible.
Stan James
(instanceof Sidekick)
Ranch Hand
Joined: Jan 29, 2003
Posts: 8791
I think Ilja can say that with confidence because the input you provided is not a "regular language" and cannot be parsed by regular expressions. This distinction gets way over my head in language theory but the shortest and most applicable tip I could find is: "... a language that allows parenthesized expressions, but requires the parentheses to balance, cannot be a regular language, and so the language cannot be generated by a regular grammar ..." from Wikipedia.
I don't think there is a true standard for CSV, but your to_date expression should probably be inside quotes. CSV generators and parsers I've used (e.g. Excel) use the quotes to know they should ignore the comma in the middle.
"to_date('18-Jan-2001','dd-mon-yyyy')"
Does that give you some ideas on how to parse this stuff? Be sure to consider strings with quotes inside them, too!
I agree. Regular expressions aren't able to match expressions where you have to keep track of matching closing parentheses to unlimited depth. Heck, even if you limit the depth, it can get ridiculously complicated.
For example, if I limit the depth to only one set of "()" pairs, the regex becomes...
This should work for your example string, but will fail if the "to_date" function contains another function as one of its parameters.
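A minimal sketch of such a depth-one expression in Java, not necessarily the one originally posted. The class name DepthOneFields, the method fields, and the sample line are hypothetical; the sample reuses the to_date value quoted earlier in the thread. A field is taken to be a run of characters that are neither commas nor parentheses, optionally mixed with "(...)" groups that contain no further parentheses. The pattern is applied field by field with lookingAt(), not with String.split().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DepthOneFields {
    // Group 1: the field (plain characters and one level of "(...)").
    // Group 2: the delimiter, "," or empty at the end of the input.
    private static final Pattern FIELD =
            Pattern.compile("((?:[^,()]|\\([^()]*\\))*)(,|$)");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        int pos = 0;
        while (true) {
            // Anchor the next match exactly where the previous one ended.
            m.region(pos, line.length());
            if (!m.lookingAt()) {
                // Unbalanced or nested parentheses end up here.
                throw new IllegalArgumentException("cannot parse at index " + pos);
            }
            out.add(m.group(1));
            if (m.group(2).isEmpty()) {
                return out; // matched the end of the input, not a comma
            }
            pos = m.end();
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample line, assumed for illustration only.
        String line = "1,'abc',to_date('18-Jan-2001','dd-mon-yyyy'),42";
        for (String field : fields(line)) {
            System.out.println(field);
        }
    }
}
```

In this sketch a nested call such as f(g(1,2),3) is rejected with an exception instead of being split in the wrong place, which is the depth limit described above.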
Henry [ May 18, 2006: Message edited by: Henry Wong ]
Also be aware that, despite their name, CSV files can have semicolons instead of commas, and that strings can include newline characters and still be perfectly valid CSV (thus you can't just process the file line by line). Considering all this, go with a ready-made solution.
Dave Hewy
Ranch Hand
Joined: Aug 21, 2003
Posts: 93
Mmmm - I'm not sure what the link between languages and expressions is?
I thought most things that fit into some kind of pattern (and this does) could, or should be able to be parsed with regular expressions.
Having said that, I'm not a regex expert, hence my original post!
I can quite easily do this with Java, but I have a lot of these to parse and thought regex would probably be faster - but in reality, it probably won't make that much difference.
Originally posted by Dave Hewy: Can anyone help with a Regex I can use on String.split to get an array of the values?
Oh... the question is for the split() method (and not for the find() method).
For the split() method, it is not possible. The size of the parameters is not even fixed, so you can't even use a combination of zero-width negative look-aheads and look-behinds, to limit the scope of the commas.
Originally posted by Dave Hewy: I can quite easily do this with Java, but I have a lot of these to parse and thought regex would probably be faster - but in reality, it probably won't make that much difference.
Well, regular expressions do match most of the time, which makes parsing a breeze. For other cases, regular expressions can be used to match tokens during parsing, making it very easy to write a parser.
Just because the regex engine can't match with a single expression doesn't mean you have to write a parser without it.
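One way to picture this: let a regex do the tokenizing and let a few lines of Java keep track of the nesting depth. The sketch below is an illustration under assumptions, not code from this thread. NestedFieldSplitter and split are made-up names, and parentheses or commas inside quoted strings get no special treatment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NestedFieldSplitter {
    // A token is either one structural character or a run of anything else,
    // so the tokens together cover every character of the input.
    private static final Pattern TOKEN = Pattern.compile("[(),]|[^(),]+");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // number of currently open parentheses
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            String tok = m.group();
            if (tok.equals(",") && depth == 0) {
                // Only a comma outside all parentheses separates fields.
                fields.add(current.toString());
                current.setLength(0);
                continue;
            }
            if (tok.equals("(")) {
                depth++;
            } else if (tok.equals(")")) {
                if (depth == 0) {
                    throw new IllegalArgumentException("unbalanced ')'");
                }
                depth--;
            }
            current.append(tok);
        }
        if (depth != 0) {
            throw new IllegalArgumentException("unbalanced '('");
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical input with two levels of nesting.
        System.out.println(split("a,f(g(1,2),3),b"));
    }
}
```

The depth counter is the part a single regular expression cannot express for unlimited nesting; the regex only has to recognize the individual tokens.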
Henry
Ulf Dittmer
Originally posted by Dave Hewy: Mmmm - I'm not sure what the link between languages and expressions is?
I thought most things that fit into some kind of pattern (and this does) could, or should be able to be parsed with regular expressions.
Mathematical expressions can't be parsed purely by using regular expressions, precisely because of the nesting problems described earlier. But regexps can still be helpful in conjunction with other language constructs.
If you're really interested in the theory behind this, you can read up on the Chomsky Hierarchy, and you'll see why regular expressions represent a less powerful language than general mathematical expressions like the one mentioned above.
Originally posted by Dave Hewy: Mmmm - I'm not sure what the link between languages and expressions is?
I thought most things that fit into some kind of pattern (and this does) could, or should be able to be parsed with regular expressions.
Regular expressions are a way to describe regular languages. Regular expression APIs use those descriptions to parse "sentences" in the described language.
Having said that, I'm not a regex expert, hence my original post!
It's a fascinating topic.
Stan James
I ran up against this with a little macro language. Where I got lucky is that RegEx can easily find the balanced braces for the innermost nested macro. The macro processor replaces that with something else - plain text or maybe more macros. I keep finding and replacing the innermost until there ain't no mo.
You could do that here ... replace the parens and commas with some escape sequence. But just suggesting that makes me feel dirty.
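A rough sketch of that find-and-replace-the-innermost idea in Java, offered as an illustration and not as the macro processor described above. The names InnermostEscaper and split are hypothetical, and the code assumes the control characters U+0001 to U+0003 never occur in the data, since they serve as the escape sequences. It uses Matcher.appendReplacement with a StringBuilder, which needs Java 9 or later.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InnermostEscaper {
    // Matches a "(...)" group that contains no further parentheses.
    private static final Pattern INNERMOST = Pattern.compile("\\(([^()]*)\\)");

    // Stand-ins for "(", ")" and "," (assumed absent from the input).
    private static final char OPEN = 1;
    private static final char CLOSE = 2;
    private static final char COMMA = 3;

    static String[] split(String line) {
        String work = line;
        while (true) {
            Matcher m = INNERMOST.matcher(work);
            if (!m.find()) {
                break; // no balanced pair of real parentheses is left
            }
            StringBuilder sb = new StringBuilder();
            do {
                // Hide this innermost group, including the commas inside it.
                String hidden = OPEN + m.group(1).replace(',', COMMA) + CLOSE;
                m.appendReplacement(sb, Matcher.quoteReplacement(hidden));
            } while (m.find());
            m.appendTail(sb);
            work = sb.toString();
        }
        // Any comma still visible is outside all balanced parentheses.
        String[] fields = work.split(",", -1);
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].replace(OPEN, '(').replace(CLOSE, ')').replace(COMMA, ',');
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical input with two levels of nesting.
        System.out.println(Arrays.toString(split("a,f(g(1,2),3),b")));
    }
}
```

Each pass hides at least one pair of parentheses, so the loop ends once no balanced pair remains. Unbalanced parentheses are left as they are, and commas near them still split.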
BTW: If your data file quotes strings that have commas in them, you can do this with regex. Unescaped quotes must match to a depth of exactly one, nesting is not allowed. Look at the beginning of the first/next field. If it starts with a quote, take up to the next unescaped quote & comma or eol, otherwise take up to the next comma or eol. Rinse, repeat. [ May 19, 2006: Message edited by: Stan James ]
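The field-by-field procedure for quoted data could be sketched in Java roughly as follows. This is an assumption-laden example, not a full CSV parser: QuotedCsvLine and parse are hypothetical names, a quote inside a quoted field is assumed to be escaped by doubling it (""), and the input is assumed to be a single record with its line terminator already removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedCsvLine {
    // Group 1: contents of a quoted field (may hold commas and "" pairs).
    // Group 2: an unquoted field, everything up to the next comma.
    // Group 3: the delimiter, "," or empty at the end of the input.
    private static final Pattern FIELD =
            Pattern.compile("(?:\"((?:[^\"]|\"\")*)\"|([^,]*))(,|$)");

    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                // Quoted field: drop the outer quotes, undouble inner ones.
                fields.add(m.group(1).replace("\"\"", "\""));
            } else {
                fields.add(m.group(2));
            }
            if (m.group(3).isEmpty()) {
                break; // reached the end of the line
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical line using the quoted to_date value from this thread.
        String line = "1,\"to_date('18-Jan-2001','dd-mon-yyyy')\",42";
        for (String field : parse(line)) {
            System.out.println(field);
        }
    }
}
```

A field that opens with a quote but is not followed by a comma or the end of the line after its closing quote falls back to the unquoted branch in this sketch, so malformed input is passed through instead of being reported.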