Regex split text around quotes

Sugantha Jeevankumar
Ranch Hand

Joined: Jun 06, 2007
Posts: 93
Hi All.. I need to do a regex split on a string like "aaa bbb 'ccc ddd' eee 'fff' ggg" and split this into the following,



i.e., I need to split the text around the single-quoted sections. I attempted this with (zero-width positive look-behind for ') OR (zero-width positive look-ahead for '), i.e., "(?<=')|(?=')", but the result it gives me is this:



Previously, I had posted a similar question, http://www.coderanch.com/t/592655/java/java/Regex-split-characters-return-delimiters where I am splitting around an open brace and a closed brace. When similar logic is applied here, I understand that the same single quote is being matched by both the look-behind and the look-ahead, making it appear as a separate token. Can you please point me to how to get the output I want? Thanks in advance.


SCJP 5.0
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1053
    

Are you insistent on using split() ? This is relatively easy to do with Matcher.find() but I can't create a regex that will do it with split(). Others may do better.
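A minimal sketch of what a Matcher.find() approach might look like. This is an assumption for illustration, not code that was posted in this thread: the pattern and the names QuoteTokens and tokens are made up. It matches either a complete quoted run (quotes kept) or a run of non-quote characters, then trims each piece.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteTokens {
    // Either a complete quoted run (quotes kept) or a run of non-quote characters.
    private static final Pattern TOKEN = Pattern.compile("'[^']*'|[^']+");

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Unquoted runs carry the spaces next to the quotes, so trim them.
            String token = m.group().trim();
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [aaa bbb, 'ccc ddd', eee, 'fff', ggg]
        System.out.println(tokens("aaa bbb 'ccc ddd' eee 'fff' ggg"));
    }
}
```

Note that an unmatched quote is silently skipped by this pattern rather than reported, so it only suits input that is known to be well formed.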
Sugantha Jeevankumar
Ranch Hand

Joined: Jun 06, 2007
Posts: 93
@Richard
Thanks for the idea.. I tried the following code to match the text between single quotes,



For this, I get,



"aaa bbb", "eee" and "ggg " do not get returned. Is there a way to return the non-matched string via the Matcher class in any way?
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1053
    

Your regex is similar to mine but you need more terms to look for :-


Edit : please ignore this, it does not do what you specified.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18825
    

Richard Tookey wrote:Are you insistent on using split() ? This is relatively easy to do with Matcher.find() but I can't create a regex that will do it with split(). Others may do better.


To do it with split(), try ...



Henry
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7779
    

Sugantha Jeevankumar wrote:Previously, I had posted a similar question, http://www.coderanch.com/t/592655/java/java/Regex-split-characters-return-delimiters where I am splitting around an open brace and a closed brace.

If you're planning on combining these things at some point, then it sounds to me like you're writing a parser; and regex is not really suited for something like that (except perhaps for individual searches).

Also: have you considered the possibility of:
String str ="aaa bbb 'ccc 'ddd eee' fff' ggg";
(ie, 'embedded' quotes)?

Winston


Isn't it funny how there's always time and money enough to do it WRONG?
Sugantha Jeevankumar
Ranch Hand

Joined: Jun 06, 2007
Posts: 93
@Henry
Thanks a lot. That works. Although I would like to know how the output differs between


I see that only the ordering of the lookbehind and lookahead around the '|' differs. Can you please explain the regex.

@Winston
Yes, Though it begins to look a little complex, the complexity ends there. So I think so far, regex does nicely. And there is no possibility of nested quotes in the input.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7779
    

Sugantha Jeevankumar wrote:Yes, Though it begins to look a little complex, the complexity ends there. So I think so far, regex does nicely.

I presume that 'yes' means that you are planning on combining them and, if that's the case, I fear you're in for a nasty surprise.

Regex is a pattern-matcher, and it's very good at that. In fact, it's possibly too good, because people start thinking that it can do all sorts of things that it was never designed for; and parsing - particularly if it involves levels or conditional logic - is one of those things.

The classic case is using regexes to search HTML or XML: It seems like a great idea at first, but you'll soon run into problems. In your case, your "text" appears to be some sort of rudimentary language or expression, and I suspect that you will run into difficulties if you rely solely on regexes to do the work.

And there is no possibility of nested quotes in the input.

OK, but what about input errors? What if somebody forgets to close a set of quotes? Programs that assume that data is correct can exhibit very nasty behaviour when it isn't.
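To make the unclosed-quote case concrete, here is a rough sketch of a hand-written scanner that rejects such input instead of guessing. The names QuoteScanner and scan are hypothetical, and the choice to throw IllegalArgumentException is just one possible policy.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteScanner {
    // Splits input into quoted tokens (quotes kept) and the trimmed text between them.
    // Throws if a quote is opened but never closed.
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int open = input.indexOf('\'', pos);
            // Plain text up to the next quote, or to the end if there is none.
            String plain = input.substring(pos, open < 0 ? input.length() : open).trim();
            if (!plain.isEmpty()) {
                tokens.add(plain);
            }
            if (open < 0) {
                break;
            }
            int close = input.indexOf('\'', open + 1);
            if (close < 0) {
                throw new IllegalArgumentException("Unclosed quote at index " + open);
            }
            tokens.add(input.substring(open, close + 1));
            pos = close + 1;
        }
        return tokens;
    }
}
```

With well-formed input this yields the same five tokens as the split() approach, while something like "aaa 'bbb" fails loudly with the position of the offending quote.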

HIH

Winston
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38765
    
Regexes are useful for regular grammars (in Noam Chomsky’s four-category classification). For context-free grammars, you can try automatic parsing tools like lex, yacc and their descendants. A regex cannot parse a grammar which is not regular.
For context‑sensitive grammars, you can try hand‑crafted parsers (but by this time it is getting bl**d* difficult), or ANTLR …


… and for free grammars …

Well, look at Google Translate. Look at this line, which I sang back in the Summer, from la Traviata by Giuseppe Verdi, and translate it from Italian to English. I am sure “peni” is Italian for pains griefs or sufferings, and what I think it means follows.
Quanto peni, fa cor!
Such sorrows, take heart!
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18825
    

Sugantha Jeevankumar wrote:@Henry
Thanks a lot. That works. Although I would like to know how the output differs between


I see that only the ordering of the lookbehind and lookahead around the '|' differs. Can you please explain the regex.



The ordering actually doesn't matter in this case. The main reason that the ordering is different is because I didn't read your post closely. If you like, you can change the ordering to "(?<=') | (?=')", and it would still work.


So, what is the difference?

With my regex, I am using a space as the delimiter. However, it is not just any space: the space must have a single quote adjacent to it, either before or after.

With your regex, you are using an EMPTY STRING as the delimiter. The only requirement is that this zero-length delimiter must have a single quote adjacent to it. So... in your case, the delimiters are before and after the single quotes. This is why you have single quotes as elements. This is also why (although you probably didn't notice) you have an extra space before and after some of your other elements.
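A small sketch that runs both patterns on the sample string from the first post, so the difference can be seen side by side. The class and method names are made up for illustration; the two regexes are the ones quoted in this thread.

```java
import java.util.Arrays;

class SplitComparison {
    static final String INPUT = "aaa bbb 'ccc ddd' eee 'fff' ggg";

    // Delimiter is a space that touches a single quote on one side.
    static String[] splitOnSpace(String s) {
        return s.split("(?<=') | (?=')");
    }

    // Delimiter is the empty string on either side of every single quote.
    static String[] splitOnEmpty(String s) {
        return s.split("(?<=')|(?=')");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnSpace(INPUT)));
        System.out.println(Arrays.toString(splitOnEmpty(INPUT)));
    }
}
```

The first call gives five elements: aaa bbb, 'ccc ddd', eee, 'fff' and ggg. The second gives nine: each quote becomes an element of its own, and the neighbouring pieces keep their spaces ("aaa bbb " with a trailing space, " eee " with a space on both sides, " ggg" with a leading space).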

Henry
dennis deems
Ranch Hand

Joined: Mar 12, 2011
Posts: 808
Campbell Ritchie wrote:Look at this line, which I sang back in the Summer, from la Traviata by Giuseppe Verdi, and translate it from Italian to English.

I'm impressed (and bitterly envious)!
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3014
    
Campbell Ritchie wrote:Well, look at Google Translate. Look at this line, which I sang back in the Summer, from la Traviata by Giuseppe Verdi, and translate it from Italian to English. I am sure “peni” is Italian for pains griefs or sufferings, and what I think it means follows.
Quanto peni, fa cor!
Such sorrows, take heart!

I wouldn't blame Google Translate there, so much as changes in Italian and variations among dialects. In modern Italian "pena" would be pain or sorrow, and it wouldn't normally have a plural, but if it did it would be "pene". Further, the "quanto" should be changed to match - either "quanta pena" or "quante pene". Whereas "peni" means, well, exactly what Google said, but again, the "quanto" should be changed to match - either "quanto peno" or "quanti peni", singular or plural. I'm not sure you want to sing in an opera containing this line, however.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38765
    
I quoted that as an example of how hard it is to parse free grammars, as used in natural languages. I am indebted to MS for pointing out that there have probably been changes since Joe Green wrote that line. Also it was (I think) an Augener’s copy, so it may have suffered by being printed by Germans, who don’t speak fluent Italian.
Sugantha Jeevankumar
Ranch Hand

Joined: Jun 06, 2007
Posts: 93
Thanks all. Though I don't speak Italian at all, I did get your explanation. I have already decided to ditch the 'use-regex-entirely' approach, and am working towards writing a manual parser that takes the help of a little regex now and then.

@Winston
Like you said, I am trying to parse an expression with a lot of braces, quotes and other characters. One positive is that this is not free-flowing text. These are pre-defined expressions (written by the developer) that are read in from XML files. And I will be trying to handle misbehaving expressions with nested quotes etc. in the parser.


Thanks all for your help.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38765
    
As I said, it is an example of how difficult natural languages are to parse. But we all do it in our heads almost automatically.
dennis deems
Ranch Hand

Joined: Mar 12, 2011
Posts: 808
Campbell Ritchie wrote:I quoted that as an example of how hard it is to parse free grammars, as used in natural languages. I am indebted to MS for pointing out that there have probably been changes since Joe Green wrote that line. Also it was (I think) an Augener’s copy, so it may have suffered by being printed by Germans, who don’t speak fluent Italian.

Surely the credit goes to Piave? Joe was particular about the words he set, but not, I don't think, a wordsmith himself.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3014
    
I'm thinking that both Joe and Frank (P) were native speakers of Italian, and whether wordsmiths or not, "quanto peni" would stand out as blatantly wrong to both of them. A transcription error by a German seems much more likely to me. But it's also possible that the language has changed, or a local dialect was different, or there's some other subtlety of spelling or grammar at work here which I, as a non-native, am unfamiliar with.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38765
    
And all those possible explanations just go to underline my point, how difficult it is to translate natural languages with a “free” grammar.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3014
    
As you said, yes, got that point the first time. Though your example is one that is hard for humans too, because it's either a transcription error, or archaic. Machine translation is inherently hard, even for things we consider simple. This isn't one of them.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38765
    
The translation wasn’t at all hard when we sang it; it was an interlinear version with English and Italian printed together.
 