JavaRanch » Java Forums » Java » Java in General

Splitting the String to get all characters

amit punekar
Ranch Hand

Joined: May 14, 2004
Posts: 512
Hello,
I need to split the String e.g. "AMIT" to get a String array that would contain all the characters (A,M,I,T).
I tried writing the regular expression to be used with String.split() method.
Here is the code that I tried


It works but with a small hitch, the first String in the array is EMPTY String.
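The code from this post did not survive in this copy of the thread. As a hedged reconstruction only: one call that behaves as described on Java 7 and earlier is `split("")`, where the empty regex matches at every position, including position 0. Since Java 8, `String.split` no longer produces a leading empty string for a zero-width match at the start of the input, so the hitch only appears on older runtimes. The class and method names below are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class; this is not the original poster's code.
class SplitHitchDemo {

    // Splits on the empty regex, which matches between every pair of characters.
    static String[] splitOnEmptyRegex(String s) {
        return s.split("");
    }

    public static void main(String[] args) {
        // Java 7 and earlier print a leading empty element: [, A, M, I, T]
        // Java 8 and later print: [A, M, I, T]
        System.out.println(Arrays.toString(splitOnEmptyRegex("AMIT")));
    }
}
```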

Alternatively I can use the toCharArray() to get all the characters, but I want to work with the Strings instead of characters.

Thanks in advance,
Amit

James Sabre
Ranch Hand

Joined: Sep 07, 2004
Posts: 781


amit punekar
Ranch Hand

Joined: May 14, 2004
Posts: 512
Hi James,
Thanks a lot.
Can you please help me understand it, or point me to documentation that I can refer to?
I have gone through the various regex constructs mentioned in the Java documentation but could not understand them.

Thanks once again,
Amit
James Sabre
Ranch Hand

Joined: Sep 07, 2004
Posts: 781

amit punekar wrote:Hi James,
Thanks a lot.
Can you please help me understand it, or point me to documentation that I can refer to?
I have gone through the various regex constructs mentioned in the Java documentation but could not understand them.

Thanks once again,
Amit


Look at the Javadoc for Pattern and the section on 'negative look behind'. If you are in the early stages of working with regex then a good reference is here. If you are serious about learning regular expressions then buy the book "Mastering Regular Expressions" by Jeffrey Friedl, published by O'Reilly. I can't give you the ISBN since my copy is way out of date (1997), but Google will find you the latest edition.
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19684

You can use a negative lookbehind; check out the Javadoc of java.util.regex.Pattern: This is the same as yours except I explicitly said to ignore the position just after the start (^).
edit: I posted a lookahead, not lookbehind. Fixed
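The snippet from this post was lost, so the following is only a sketch of what a negative-lookbehind split could look like. The pattern `(?<!^)` is inferred from the description ("ignore the position just after the start") and is an assumption, as are the class and method names.

```java
import java.util.Arrays;

// Hypothetical demo class; the regex is inferred from the thread, not copied from it.
class LookbehindSplitDemo {

    // "(?<!^)" matches the empty string at every position that is not
    // the very start of the input, so no split happens before the first character.
    static String[] splitIntoCharacters(String s) {
        return s.split("(?<!^)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitIntoCharacters("AMIT"))); // prints [A, M, I, T]
    }
}
```

The match at the very end of the input would yield a trailing empty string, but `String.split` with the default limit discards trailing empty strings.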

However, I suggest you still use toCharArray(), then convert it to String[]. That is simply more efficient. Just try the following code: On my system, split1 is easily 10 times faster than split2. That's because with split2, using String.split, you create a java.util.regex.Pattern and java.util.regex.Matcher object every single time. It uses a List<String> to store the intermediate results (using String.substring to create new String objects), then converts that List<String> into a String[].
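The benchmark code itself is missing from this copy of the thread. A rough sketch of such a comparison might look like the following; only the names `split1` and `split2` come from the post, while the method bodies, the regex, the loop count and the class name are assumptions.

```java
// Hypothetical reconstruction; only the names split1 and split2 come from the thread.
class SplitBenchmarkSketch {

    // split1: copy the characters out, then wrap each one in its own String.
    static String[] split1(String s) {
        char[] chars = s.toCharArray();
        String[] result = new String[chars.length];
        for (int i = 0; i < chars.length; i++) {
            result[i] = String.valueOf(chars[i]);
        }
        return result;
    }

    // split2: let String.split do the work with a regex (assumed pattern).
    static String[] split2(String s) {
        return s.split("(?<!^)");
    }

    public static void main(String[] args) {
        final int iterations = 1000000; // assumed loop count
        long sink = 0; // accumulated and printed to make it harder for the JIT to discard the calls

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += split1("AMIT").length;
        }
        long time1 = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += split2("AMIT").length;
        }
        long time2 = System.nanoTime() - start;

        System.out.println("split1: " + time1 / 1000000 + " ms");
        System.out.println("split2: " + time2 / 1000000 + " ms");
        System.out.println("(checksum " + sink + ")");
    }
}
```

A hand-rolled loop like this gives only a rough indication; the ratio will vary with the JVM, its version and the hardware.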


James Sabre
Ranch Hand

Joined: Sep 07, 2004
Posts: 781

Rob Prime wrote:On my system, split1 is easily 10 times faster than split2. That's because with split2, using String.split, you create a java.util.regex.Pattern and java.util.regex.Matcher object every single time. It uses a List<String> to store the intermediate results (using String.substring to create new String objects), then converts that List<String> into a String[].


So what happens to your benchmark result if you pre-compile the regex? And what happens if you make sure that the JIT has done its job before doing the timing?

Adding code to your benchmark to cover both of these reduces the advantage on my machine to about a factor of 4. Still not good but "premature optimisation etc etc etc".
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19684

James Sabre wrote:So what happens to your benchmark result if you pre-compile the regex?

Using Pattern.split takes off about 33% of the time but it's still 8 times slower.
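For reference, pre-compiling means building the `java.util.regex.Pattern` once and reusing it through `Pattern.split`, rather than letting `String.split` compile the regex on every call. A minimal sketch, assuming the same `(?<!^)` pattern; the class, field and method names are invented for illustration:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class showing a pattern compiled once and reused.
class PrecompiledSplitDemo {

    // Compiled a single time when the class is loaded (assumed pattern).
    private static final Pattern NOT_AT_START = Pattern.compile("(?<!^)");

    // Reuses the compiled pattern; only the matching work is repeated per call.
    static String[] splitIntoCharacters(String s) {
        return NOT_AT_START.split(s);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitIntoCharacters("AMIT"))); // prints [A, M, I, T]
    }
}
```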

And what happens if you make sure that the JIT has done its job before doing the timing?

How would I do that? I've just re-run the tests with the same long loops after the first loops, so that's 10 million iterations after already having run 10 million iterations, and the results are similar.

"premature optimisation etc etc etc".

I agree but if I can replace using a regex with two simple loops (one for toCharArray internally, one for the copying) I'll definitely do that.
James Sabre
Ranch Hand

Joined: Sep 07, 2004
Posts: 781

Rob Prime wrote:
James Sabre wrote:So what happens to your benchmark result if you pre-compile the regex?

Using Pattern.split takes off about 33% of the time but it's still 8 times slower.

And what happens if you make sure that the JIT has done its job before doing the timing?

How would I do that? I've just re-run the tests with the same long loops after the first loops, so that's 10 million iterations after already having run 10 million iterations, and the results are similar.

"premature optimisation etc etc etc".

I agree but if I can replace using a regex with two simple loops (one for toCharArray internally, one for the copying) I'll definitely do that.


To make sure the JIT has done its job you just run the loops for a bit without actually timing the result. I typically use about 10% so each of your loops then starts with something like
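The warm-up snippet was not preserved. A sketch of the idea as described, running roughly 10% of the iterations untimed before starting the clock, could look like this; the regex and every name in it are assumptions.

```java
// Hypothetical sketch of an untimed warm-up phase before a timed loop.
class WarmUpSketch {

    // Accumulates result lengths so the calls have an observable effect.
    static long sink;

    static String[] splitWithRegex(String s) {
        return s.split("(?<!^)"); // assumed pattern
    }

    // Returns the elapsed nanoseconds of the timed loop only.
    static long timeRegexSplit(int iterations) {
        // Warm-up: about 10% of the measured iterations, not timed.
        for (int i = 0; i < iterations / 10; i++) {
            sink += splitWithRegex("AMIT").length;
        }

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += splitWithRegex("AMIT").length;
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long nanos = timeRegexSplit(1000000);
        System.out.println("regex split: " + nanos / 1000000 + " ms (checksum " + sink + ")");
    }
}
```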


As far as replacing a regex with two simple loops is concerned, there we have a different approach. With such a simple regex as this, unless time was critical, I would always prefer the one-line solution.
Ireneusz Kordal
Ranch Hand

Joined: Jun 21, 2008
Posts: 423
James Sabre wrote:
Still not good but "premature optimisation etc etc etc".

Imagine how long it will take someone to understand a magic formula like (?!^) when they maintain your code in the future
and one day must quickly fix a serious bug without being an expert in regular expressions.

Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19684

Especially considering you've copied my error that uses a lookahead instead of lookbehind
James Sabre
Ranch Hand

Joined: Sep 07, 2004
Posts: 781

Ireneusz Kordal wrote:
James Sabre wrote:
Still not good but "premature optimisation etc etc etc".

Imagine how long it will take someone to understand a magic formula like (?!^) when they maintain your code in the future
and one day must quickly fix a serious bug without being an expert in regular expressions.



That is a simple regular expression so this does not wash as an argument. Using your argument one would never ever ever use anything except the most trivial algorithms. One would use crude DFT rather than FFT. One would use brute force rather than Dijkstra when looking for shortest paths. One would use simple linear search rather than KMP.

I expect programmers to understand basic tools and I regard regex as a basic tool.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541

James Sabre wrote:That is a simple regular expression so this does not wash as an argument. Using your argument one would never ever ever use anything except the most trivial algorithms. One would use crude DFT rather than FFT. One would use brute force rather than Dijkstra when looking for shortest paths. One would use simple linear search rather than KMP.

I expect programmers to understand basic tools and I regard regex as a basic tool.


If two programmers who know regex quite well have to discuss un-simple topics like negative look-behind and go through several versions before coming up with a correct regex, then I wouldn't classify the regex as "simple".

And given the choice between a non-simple regex and calling the toCharArray() method of String, I would choose the latter regardless of what developers I expected to be maintaining the code in the future. This is one case when the most trivial algorithm is also the most appropriate, since it does what has to be done faster and more transparently than the more complex algorithm.
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19684

James Sabre wrote:I regard regex as a basic tool.

I think that's where we disagree most. Regexes are very useful, true, but definitely not "basic". I've bought and read "Mastering Regular Expressions", and its author too agrees that regular expressions are far from a simple topic. There's a reason there are many books on regexes.
James Sabre
Ranch Hand

Joined: Sep 07, 2004
Posts: 781

Rob Prime wrote:
James Sabre wrote:I regard regex as a basic tool.

I think that's where we disagree most. Regexes are very useful, true, but definitely not "basic". I've bought and read "Mastering Regular Expressions", and its author too agrees that regular expressions are far from a simple topic. There's a reason there are many books on regexes.


Is counting published books on a topic a good metric for the complexity of a topic? How many books are there on Java basics? I have just 3 and two of them are rubbish but I know that there are dozens out there. If there are more published books on elementary Java than on regex does that make elementary Java more complex than regex?

This is turning into a religious argument so I will bow out now.
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19684

I think that is a very good idea. Let's just agree that both solutions work so it's up to the developer to choose the one he wants.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541

James Sabre wrote:Is counting published books on a topic a good metric for the complexity of a topic?


Far from it. It's a metric for the level of interest in a topic.

This is turning into a religious argument so I will bow out now.


I don't find it particularly religious but I do agree with Rob Prime's last post. We're done with answering Amit's question.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3012
Ignoring the other issues raised, here's an even faster solution, similar to the toCharArray() approach but without the unnecessary copying of data:
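The code from this post is missing. One plausible form, offered purely as a guess at the idea, reads each character straight from the String and so never creates the intermediate char[] that toCharArray() would allocate. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch; not the code originally posted in the thread.
class DirectSplitDemo {

    // Builds one single-character String per position, with no intermediate char[].
    static String[] splitIntoCharacters(String s) {
        String[] result = new String[s.length()];
        for (int i = 0; i < s.length(); i++) {
            result[i] = s.substring(i, i + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitIntoCharacters("AMIT"))); // prints [A, M, I, T]
    }
}
```

Like toCharArray(), this works on UTF-16 code units, so a character outside the Basic Multilingual Plane would be split into two Strings.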
amit punekar
Ranch Hand

Joined: May 14, 2004
Posts: 512
Hello,
Thank you James and Rob for your valuable inputs.
I would certainly say that whichever path anyone chooses for this task, they will be enlightened by this discussion thread.
Thank you very much once again. I appreciate your time in showing me other facets of the problem as well.

Thanks,
Amit
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

Could just skip the first array entry :/
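Skipping the first entry could be sketched as below. The helper name `dropLeadingEmpty` and the class name are made up; the check for an empty first element is an added assumption, so that the same helper stays harmless on runtimes where `split` does not produce the leading empty String.

```java
import java.util.Arrays;

// Hypothetical helper illustrating "skip the first array entry".
class SkipLeadingEmptyDemo {

    // Returns a copy without the first element if that element is an empty String;
    // otherwise returns the array unchanged.
    static String[] dropLeadingEmpty(String[] parts) {
        if (parts.length > 0 && parts[0].isEmpty()) {
            return Arrays.copyOfRange(parts, 1, parts.length);
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] withHitch = {"", "A", "M", "I", "T"};
        System.out.println(Arrays.toString(dropLeadingEmpty(withHitch))); // prints [A, M, I, T]
    }
}
```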
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3012
Mmm, I'm not following you there David. Why would the first entry be skipped?
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

Mike Simmons wrote:Mmm, I'm not following you there David. Why would the first entry be skipped?

From the original post:
It works but with a small hitch, the first String in the array is EMPTY String.

It's not all about you ;)
amit punekar
Ranch Hand

Joined: May 14, 2004
Posts: 512
Hi David,
I did skip the first token earlier, but I was trying to find a more elegant way of handling this.

Thanks for the reply,
Amit
 