Using string.split with any delimiter

 
Pat Short
Hi,

I need to use String.split() to tokenize a string. The problem is that the delimiter can be any character or sequence of characters. I've noticed that some characters, such as | or ., do not work correctly as the delimiter. I understand that this is because they have a special meaning in regular expressions. It is easy to overcome this by escaping these characters with \\; however, I do not control which delimiter is submitted. I guess I have two questions:

1) Is there some method of string parsing that ignores which character I use as the delimiter? I.e., if I use a |, it will work out of the box without my having to escape it.

2) Is there a comprehensive set of characters that need to be escaped, so that I can check for them?

Thanks
 
Henry Wong
Originally posted by Pat Short:
1) Is there some method of string parsing that ignores which character I use as the delimiter? I.e., if I use a |, it will work out of the box without my having to escape it.


Take a look at the java.util.regex.Pattern.quote() method. It neutralizes anything in the string that has a special meaning in a regex and gives you a new pattern string that matches the original string literally.
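
As a minimal sketch of this approach (the class name, method name and sample strings are illustrative, not from the thread), a caller-supplied delimiter can be passed through Pattern.quote() before splitting:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuoteSplitDemo {
    // Splits text on the delimiter taken literally, whatever characters it contains.
    static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // '|' and '.' are regex metacharacters, but quoting makes them plain text.
        System.out.println(Arrays.toString(splitLiteral("a|b|c", "|")));    // [a, b, c]
        System.out.println(Arrays.toString(splitLiteral("1.2.3", ".")));    // [1, 2, 3]
        System.out.println(Arrays.toString(splitLiteral("x::y::z", "::"))); // [x, y, z]
    }
}
```

The caller never needs to know which characters are special, because the whole delimiter is treated as literal text.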

Originally posted by Pat Short:
2) Is there a comprehensive set of characters that need to be escaped, so that I can check for them?


With the quote method, you don't need to know the comprehensive set -- but you should learn regex regardless. Once you know regex, you'll know the set.

Henry
 
Garrett Rowe
Originally posted by Pat Short:
1) Is there some method of string parsing that ignores which character I use as the delimiter? I.e., if I use a |, it will work out of the box without my having to escape it.


Instead of using:


You can use
 
Pat Short
Nice one, thanks all
 
Ilja Preuss
Another option would be

myString.split(Pattern.quote(myDelimiter));
 
Piet Verdriet
One more option:



The "\\Q" tells the regex engine to treat everything that follows it, here 'myDelimiter', as literal text.
You can also combine literal text with regex meta characters in one String by adding "\\E" after 'myDelimiter' to end the quoted section:



In the example above, the '+' is the regex meta character meaning "one or more times", while 'myDelimiter' is "quoted".
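
A minimal sketch of the two forms described above, assuming the delimiter is held in a variable named myDelimiter and contains no "\\E" sequence of its own. The class name, method names and sample strings are illustrative. The non-capturing group in the second method is an addition in this sketch, so that the '+' applies to the whole delimiter rather than only to its last character:

```java
import java.util.Arrays;

class ManualQuoteDemo {
    // "\\Q" with no closing "\\E" quotes everything up to the end of the pattern.
    static String[] splitQuoted(String text, String myDelimiter) {
        return text.split("\\Q" + myDelimiter);
    }

    // "\\E" ends the quoted section, so the '+' after the group is a real quantifier.
    static String[] splitOnRuns(String text, String myDelimiter) {
        return text.split("(?:\\Q" + myDelimiter + "\\E)+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitQuoted("a|b.c", "|")));     // [a, b.c]
        System.out.println(Arrays.toString(splitOnRuns("a||b|c", "|")));    // [a, b, c]
        System.out.println(Arrays.toString(splitOnRuns("a::::b::c", "::"))); // [a, b, c]
    }
}
```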
 
Henry Wong
Originally posted by Piet Verdriet:
In the example above, the '+' is the regex meta character meaning "one or more times", while 'myDelimiter' is "quoted".


Be careful when quoting with "\\Q" and "\\E". These do *not* nest. So... if the original delimiter itself contains "\\Q" or "\\E", it won't work properly: an embedded "\\E" ends the quoted section early.

If you use Pattern.quote(), it will take care of any "\\Q" and "\\E" in the delimiter too. So, it is probably the better choice.
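
To illustrate the difference, here is a small sketch with a delimiter that happens to contain "\\E". The class name, method names and sample strings are made up for the demonstration:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class NestedQuoteDemo {
    // Manual quoting: an "\\E" inside the delimiter ends the quoted section early.
    static String[] splitManual(String text, String delimiter) {
        return text.split("\\Q" + delimiter);
    }

    // Pattern.quote() copes with an "\\E" embedded in the delimiter.
    static String[] splitWithQuote(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String delimiter = "a\\E.";    // the four characters: a, backslash, E, dot
        String text = "1a\\E.2aX3";    // contains the delimiter once, plus an unrelated "aX"

        // Splits only on the literal delimiter: [1, 2aX3]
        System.out.println(Arrays.toString(splitWithQuote(text, delimiter)));

        // The pattern degrades to "a" followed by any character, so the
        // text is cut in the wrong places.
        System.out.println(Arrays.toString(splitManual(text, delimiter)));
    }
}
```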

Henry
 
Piet Verdriet
Originally posted by Henry Wong:


Be careful with using "\\Q" and "\\E" quoting. These do *not* nest. So...


I was not aware of that: good to know.

Thanks.
[ September 29, 2008: Message edited by: Piet Verdriet ]
 
Pat Short
Thanks for all your help, it has been really useful. However, I've run into another issue when I use a tab delimiter, "\t". This worked with the previous string.split(del) approach, but with Pattern.quote it fails and the tab delimiter is not picked up. So I have fixed the problems with | and ., but now the tab delimiter is broken. Any ideas? Help greatly appreciated.

Thanks
 
Piet Verdriet
Originally posted by Pat Short:
Thanks for all your help, it has been really useful. However, I've run into another issue when I use a tab delimiter, "\t". This worked with the previous string.split(del) approach, but with Pattern.quote it fails and the tab delimiter is not picked up. So I have fixed the problems with | and ., but now the tab delimiter is broken. Any ideas? Help greatly appreciated.

Thanks


Well, the best I can do is say "you did something wrong", since you didn't provide an example of what you mean exactly.

"It" works:
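
A minimal sketch of such a check, assuming the tab is written as a Java string literal (the class name, method name and sample line are illustrative):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class TabQuoteDemo {
    static String[] splitOnTab(String line) {
        // "\t" in a Java string literal is a single tab character.
        return line.split(Pattern.quote("\t"));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnTab("one\ttwo\tthree"))); // [one, two, three]
    }
}
```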

 
Pat Short
Yes and no!

Try this



where args[0] is \t passed in from the command line. The difference in behavior here is what is causing my problem. I want it to work with Pattern.quote so that I can use other delimiters, but it's not so simple. Any ideas?

thanks
 
Piet Verdriet
Originally posted by Pat Short:
Yes and no!


Yes and yes.

Originally posted by Pat Short:
...
where args[0] is \t passed in from the command line
...


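Your shell (command prompt) will let the two-character String "\t" (a backslash followed by a t) through, not the tab character. "\t" is only a tab inside a String literal. That is also why string.split(args[0]) seemed to work: the regex engine itself reads \t as "match a tab", whereas Pattern.quote() turns it into a literal backslash followed by a t.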
[ October 01, 2008: Message edited by: Piet Verdriet ]
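
The difference can be reproduced without a command line. In the sketch below the variable fromShell stands in for args[0], and unescapeTab is a hypothetical helper (not from the thread) that converts the two characters backslash + t into a real tab before quoting. It handles only that one escape, and would also alter a delimiter that is genuinely meant to contain a backslash followed by a t:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class CommandLineDelimiterDemo {
    // Hypothetical helper: turns the two characters backslash + t into a real tab.
    static String unescapeTab(String delimiter) {
        return delimiter.replace("\\t", "\t");
    }

    // Quotes the delimiter after translating the tab escape.
    static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(unescapeTab(delimiter)));
    }

    public static void main(String[] args) {
        String fromShell = "\\t"; // two characters, backslash and t, as args[0] would hold them
        String line = "a\tb";     // contains one real tab

        // The regex engine reads \t as "tab", so this splits: 2 pieces.
        System.out.println(line.split(fromShell).length);

        // Quoted, the pattern looks for a literal backslash + t: 1 piece.
        System.out.println(line.split(Pattern.quote(fromShell)).length);

        // Translating the escape first makes quoting work for tab and for '|'.
        System.out.println(Arrays.toString(splitLiteral(line, fromShell))); // [a, b]
        System.out.println(Arrays.toString(splitLiteral("a|b", "|")));      // [a, b]
    }
}
```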
 