Using string.split with any delimiter

Pat Short
Greenhorn

Joined: Mar 21, 2008
Posts: 22
Hi,

I need to use string.split() to tokenize a string. The problem is that the delimiter can be any character or sequence of characters. I've noticed that some characters, such as | or ., do not work correctly as the delimiter. I understand that this is because they have a special meaning in regular expressions. It is easy to overcome this by escaping these characters with \\; however, I do not control which delimiter is submitted. I guess I have two questions:

1) Is there some method of string parsing that doesn't care which character I use as the delimiter? I.e., if I use a | it will work out of the box without having to escape it.

2) Is there a comprehensive set of characters that need to be escaped, so that I can check for them?

Thanks
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 19060
    

1) Is there some method of string parsing that doesn't care which character I use as the delimiter? I.e., if I use a | it will work out of the box without having to escape it.


Take a look at the java.util.regex.Pattern.quote() method. It takes care of any characters with a special regex meaning and gives you a new string that, used as a pattern, matches the original string as a literal.

2) Is there a comprehensive set of characters that need to be escaped, so that I can check for them?


With the quote method, you don't need to know the comprehensive set -- but you should learn regex regardless. Once you know regex, you'll know the set.
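A minimal sketch of the Pattern.quote() suggestion, for anyone following along. The class name QuoteSplitDemo and the helper splitLiteral are made up for illustration; only String.split() and Pattern.quote() come from the JDK.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuoteSplitDemo {
    // Splits input on the delimiter taken literally, whatever characters it contains.
    // (Hypothetical helper name, not part of the JDK.)
    static String[] splitLiteral(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // Quoted, "|" and "." are treated as plain characters.
        System.out.println(Arrays.toString(splitLiteral("a|b|c", "|"))); // [a, b, c]
        System.out.println(Arrays.toString(splitLiteral("a.b.c", "."))); // [a, b, c]

        // Unquoted, "." matches any character, so every piece comes back empty
        // and split() drops the trailing empty strings.
        System.out.println("a.b.c".split(".").length); // 0
    }
}
```

The helper keeps the quoting in one place, so callers can pass whatever delimiter they were given without inspecting it first.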

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Garrett Rowe
Ranch Hand

Joined: Jan 17, 2006
Posts: 1296
1) Is there some method of string parsing that doesn't care which character I use as the delimiter? I.e., if I use a | it will work out of the box without having to escape it.


Instead of using:


You can use


Pat Short
Greenhorn

Joined: Mar 21, 2008
Posts: 22
Nice one, thanks all
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Another option would be

myString.split(Pattern.quote(myDelimiter));


Piet Verdriet
Ranch Hand

Joined: Feb 25, 2006
Posts: 266
One more option:



The "\\Q" tells the regex engine to treat 'myDelimiter' as a normal String.
This way, you can combine your normal text with regex meta characters in one String by adding "\\E" after your 'myDelimiter':



In the example above, the '+' will be the regex meta character "one or more times" and 'myDelimiter' is "quoted".
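The snippets from this post are not preserved, so here is a sketch of the "\\Q" ... "\\E" idea with a trailing '+'. The names QuoteRunsDemo and splitOnRuns are made up. The non-capturing group (?: ... ) is my own addition, not something from the post: as far as I can tell, without it the '+' would only apply to the last character of a multi-character delimiter. It also assumes the delimiter does not itself contain "\\E".

```java
import java.util.Arrays;

class QuoteRunsDemo {
    // Splits on one or more consecutive occurrences of the literal delimiter.
    // Assumes the delimiter contains no "\E" sequence (hypothetical helper).
    static String[] splitOnRuns(String input, String delimiter) {
        // \Q ... \E quotes the delimiter; the group makes '+' repeat all of it.
        return input.split("(?:\\Q" + delimiter + "\\E)+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnRuns("a..b.c", ".")));     // [a, b, c]
        System.out.println(Arrays.toString(splitOnRuns("xababyabz", "ab"))); // [x, y, z]
    }
}
```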
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 19060
    

In the example above, the '+' will be the regex meta character "one or more times" and 'myDelimiter' is "quoted".


Be careful with using "\\Q" and "\\E" quoting. These do *not* nest. So... if your original delimiter itself contains "\\Q" or "\\E", it won't work properly.

If you use Pattern.quote(), it will take care of the "\\Q" and "\\E" in your regex too. So, it is probably a better choice.
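A sketch of the difference, using a delimiter that is itself the two characters backslash and E. NestedQuoteDemo, splitQuoted and splitManual are illustrative names only. I have deliberately not written down what the hand-wrapped version prints, because the stray "\\E" may be rejected by the regex compiler or simply fail to match the intended literal; either way it does not behave as a literal delimiter.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class NestedQuoteDemo {
    // Lets Pattern.quote() do the quoting; it copes with "\E" inside the delimiter.
    static String[] splitQuoted(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    // Wraps the delimiter by hand; unsafe when the delimiter contains "\E".
    static String[] splitManual(String input, String delimiter) {
        return input.split("\\Q" + delimiter + "\\E");
    }

    public static void main(String[] args) {
        String delimiter = "\\E";   // two characters: backslash, E
        String input = "x\\Ey\\Ez"; // the text x\Ey\Ez

        System.out.println(Arrays.toString(splitQuoted(input, delimiter))); // [x, y, z]

        try {
            // The delimiter's own \E ends the quoted section early.
            System.out.println(Arrays.toString(splitManual(input, delimiter)));
        } catch (RuntimeException e) {
            System.out.println("manual quoting failed: " + e.getMessage());
        }
    }
}
```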

Henry
Piet Verdriet
Ranch Hand

Joined: Feb 25, 2006
Posts: 266
Originally posted by Henry Wong:


Be careful with using "\\Q" and "\\E" quoting. These do *not* nest. So...


I was not aware of that: good to know.

Thanks.
[ September 29, 2008: Message edited by: Piet Verdriet ]
Pat Short
Greenhorn

Joined: Mar 21, 2008
Posts: 22
Thanks for all your help, it's really useful. However, I've run into another issue when I use a tab delimiter, "\t". This was working with the previous string.split(del) method. Now with Pattern.quote it fails and the tab delimiter is not picked up. So I fixed the problems with | and . but broke the tab delimiter. Any ideas? Help greatly appreciated.

Thanks
Piet Verdriet
Ranch Hand

Joined: Feb 25, 2006
Posts: 266
Originally posted by Pat Short:
Thanks for all your help, it's really useful. However, I've run into another issue when I use a tab delimiter, "\t". This was working with the previous string.split(del) method. Now with Pattern.quote it fails and the tab delimiter is not picked up. So I fixed the problems with | and . but broke the tab delimiter. Any ideas? Help greatly appreciated.

Thanks


Well, the best I can do is say "you did something wrong", since you didn't provide an example of what you mean exactly.

"It" works:
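The snippet posted here is not preserved. The following is only a sketch of the kind of check that would show it working, assuming the delimiter is a real tab character written as a Java string literal; TabQuoteDemo and splitLiteral are made-up names.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class TabQuoteDemo {
    // Splits on the delimiter taken literally (hypothetical helper).
    static String[] splitLiteral(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String tab = "\t"; // inside a Java literal this is a single tab character
        System.out.println(Arrays.toString(splitLiteral("one\ttwo\tthree", tab))); // [one, two, three]
    }
}
```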

Pat Short
Greenhorn

Joined: Mar 21, 2008
Posts: 22
Yes and no!

Try this



where args[0] is \t passed in from the command line. The difference in behavior here is what is causing my problem. I want it to work with Pattern.quote so I can use other delimiters, but it's not so simple. Any ideas?

thanks
Piet Verdriet
Ranch Hand

Joined: Feb 25, 2006
Posts: 266
Originally posted by Pat Short:
Yes and no!


Yes and yes.

Originally posted by Pat Short:
...
where args[0] is \t passed in from the command line
...


Your shell (command prompt) will let the two-character String "\t" (a backslash followed by a t) through, not the tab character; "\t" is only a tab inside a String literal.
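To make the difference concrete, here is a sketch that simulates args[0] with a hard-coded value. The unquoted split appears to work because the regex engine itself reads backslash-t as a tab, while the quoted split looks for a literal backslash followed by t. The helper unescapeTab is hypothetical and only handles the tab case; which escapes to translate is a choice for your application, not something the JDK decides for you.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class ShellTabDemo {
    // Hypothetical helper: turns the two-character text \t into a real tab.
    static String unescapeTab(String raw) {
        return raw.replace("\\t", "\t");
    }

    // Splits on the delimiter taken literally (hypothetical helper).
    static String[] splitLiteral(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String fromShell = "\\t"; // simulates args[0]: backslash followed by t
        String line = "a\tb\tc";  // fields separated by real tabs

        // Unquoted: the regex \t matches a tab, so this splits.
        System.out.println(Arrays.toString(line.split(fromShell))); // [a, b, c]

        // Quoted: looks for a literal backslash-t, finds none, returns the whole line.
        System.out.println(splitLiteral(line, fromShell).length); // 1

        // Translate first, then quote.
        System.out.println(Arrays.toString(splitLiteral(line, unescapeTab(fromShell)))); // [a, b, c]
    }
}
```

Translating the escape before quoting keeps Pattern.quote() for delimiters like | and . while still accepting \t typed on the command line.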
[ October 01, 2008: Message edited by: Piet Verdriet ]
 