GeeCON Prague 2014*



String.split Vs StringTokenizer

Karthik Veeramani
Ranch Hand

Joined: Dec 22, 2002
Posts: 132
Any idea whether JDK 1.4's String.split() method is faster than the traditional
StringTokenizer? I'm skeptical about using the split() and replaceAll()
methods, as I have a feeling they might compile the regular expression every time, which is an expensive operation.

Please advise.


Thanks
Karthik
SCJP 1.4, CCNA.

"Success is relative. More the success, more the relatives."
sander hautvast
Ranch Hand

Joined: Oct 18, 2002
Posts: 71
I guess you're right about the compiling.
The source for String (JDK 1.4.1) says:

public String[] split(String regex, int limit) {
return Pattern.compile(regex).split(this, limit);
}
Karthik Veeramani
Ranch Hand

Joined: Dec 22, 2002
Posts: 132
StringTokenizer, from what I've heard, is very inefficient. I want to know how it compares with this split method... Even if split compiles the regex every time, I'm OK with that if it's faster than the tokenizer.
Blake Minghelli
Ranch Hand

Joined: Sep 13, 2002
Posts: 331
Why don't you try some performance tests on the 2 options?
Personally, I hate that with StringTokenizer, if you have 2 delimiters back-to-back (e.g. "1,2,,3") then the empty element gets completely ignored. I believe String.split() does not have that problem, but I've never actually used it.
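
A minimal sketch of the behaviour Blake describes, using his "1,2,,3" input. The class and method names are made up for illustration; only StringTokenizer and String.split() come from the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; names are illustrative only.
class EmptyFieldDemo {
    // Default StringTokenizer: adjacent delimiters are collapsed, so no empty token.
    static List<String> withTokenizer(String s) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, ",");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // String.split(): the empty field between the two commas is kept.
    static List<String> withSplit(String s) {
        return Arrays.asList(s.split(","));
    }

    public static void main(String[] args) {
        System.out.println(withTokenizer("1,2,,3").size()); // 3 tokens
        System.out.println(withSplit("1,2,,3").size());     // 4 tokens, one of them ""
    }
}
```

Note that split() with no limit argument still drops trailing empty strings, so "1,2,," yields only two elements.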


Blake Minghelli
SCWCD

"I'd put a quote here but I'm a non-conformist"
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12791
    
What is this "from what I've heard" stuff? Where do people hear things like this, and why do you believe it?

Look at the source code for StringTokenizer - it looks pretty simple to me, given the flexibility it provides.

Seems to me that if you REALLY want to know which is faster you could write a little test program using data similar to your usual data and set of separators, and run it. Be sure to do some "warmup" loops so that JIT has had time to optimize.
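
A rough sketch of the kind of test program suggested above, with warm-up passes before the timed runs. Everything here (class name, data shape, iteration counts) is an assumption chosen for illustration; the numbers it prints depend heavily on the JDK version, the delimiter and the data, so treat them as indicative only.

```java
import java.util.StringTokenizer;
import java.util.function.ToIntFunction;
import java.util.regex.Pattern;

// Hypothetical timing harness; not a rigorous benchmark.
class SplitTimingSketch {
    static final Pattern COMMA = Pattern.compile(",");

    static int viaTokenizer(String s) {
        int n = 0;
        StringTokenizer st = new StringTokenizer(s, ",");
        while (st.hasMoreTokens()) {
            st.nextToken();
            n++;
        }
        return n;
    }

    static int viaSplit(String s) {
        return s.split(",").length;
    }

    static int viaPattern(String s) {
        return COMMA.split(s).length;
    }

    // Runs f repeatedly and returns the elapsed time in nanoseconds.
    static long time(ToIntFunction<String> f, String data, int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += f.applyAsInt(data);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.println(); // consume the result so the loop has an observable effect
        return elapsed;
    }

    public static void main(String[] args) {
        // Made-up test data: 1000 comma-separated fields.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("field").append(i).append(',');
        }
        String data = sb.toString();

        // Warm-up passes so the JIT has had a chance to compile all three paths.
        for (int i = 0; i < 5; i++) {
            time(SplitTimingSketch::viaTokenizer, data, 2000);
            time(SplitTimingSketch::viaSplit, data, 2000);
            time(SplitTimingSketch::viaPattern, data, 2000);
        }

        System.out.println("StringTokenizer: " + time(SplitTimingSketch::viaTokenizer, data, 2000) / 1_000_000 + " ms");
        System.out.println("String.split:    " + time(SplitTimingSketch::viaSplit, data, 2000) / 1_000_000 + " ms");
        System.out.println("Pattern.split:   " + time(SplitTimingSketch::viaPattern, data, 2000) / 1_000_000 + " ms");
    }
}
```

For anything you intend to act on, substitute data and separators that resemble your real input, as suggested above.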
Bill
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

Blake:

public StringTokenizer(String str,
String delim,
boolean returnDelims)

Constructs a string tokenizer for the specified string. All characters in the delim argument are the delimiters for separating tokens.

If the returnDelims flag is true, then the delimiter characters are also returned as tokens. ...

src: javadocs.
I still don't know why the docs discourage usage of StringTokenizer.

(well - I could google, and so I will do...)
[ May 27, 2004: Message edited by: Stefan Wagner ]

http://home.arcor.de/hirnstrom/bewerbung
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12791
    
I'm looking at the JavaDocs for java.util.StringTokenizer right now and I don't see anything discouraging the use.
Bill
Tim West
Ranch Hand

Joined: Mar 15, 2004
Posts: 539
(Gah, I just realised my entire post is redundant. Still, I'll leave it here)

I think Stefan's referring to this, from the StringTokenizer JavaDoc:


StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.


I have no idea why really, aside from redundancy - I don't think StringTokenizer does anything String.split() doesn't do...


--Tim
[ May 27, 2004: Message edited by: Tim West ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
The split() method can do anything a StringTokenizer can, and more. No need for both tools really, so you might as well use the more powerful one. There are a couple other reasons to discourage StringTokenizer. One is that StringTokenizer is not very good for detecting empty fields, e.g. to interpret

    one|two||four

as

    { "one", "two", "", "four" }

With split() this is easy. With StringTokenizer it's possible, thanks to the returnDelims parameter (as pointed out by Stefan) - but it's still a bit difficult. You need several more lines of logic to say that two successive delims translate into an empty field. E.g.:


The split() method seems a lot simpler, to me. Though it does force users to learn about regex escape sequences. And note that I was able to specify that ten fields were expected, total, so the tenth empty field was reported as "" rather than a null - which can be convenient. The StringTokenizer code would require some extra logic if you need to try to access the tenth field.
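
A minimal sketch of the comparison described above. It is not the poster's original listing: the class name, the helper name and the sample input with six trailing empty fields are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; names and sample data are illustrative only.
class EmptyFieldSketch {
    // StringTokenizer with returnDelims = true: two delimiters in a row
    // (or a delimiter at the start or end) are translated into an empty field.
    static List<String> fieldsViaTokenizer(String line, String delims) {
        List<String> fields = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, delims, true);
        boolean lastWasDelim = true; // the start of the line counts as "after a delimiter"
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            boolean isDelim = tok.length() == 1 && delims.indexOf(tok.charAt(0)) >= 0;
            if (isDelim) {
                if (lastWasDelim) {
                    fields.add(""); // nothing between this delimiter and the previous one
                }
                lastWasDelim = true;
            } else {
                fields.add(tok);
                lastWasDelim = false;
            }
        }
        if (lastWasDelim) {
            fields.add(""); // the line ended with a delimiter (or was empty)
        }
        return fields;
    }

    // split() with an explicit limit keeps trailing empty fields, up to 10 in total.
    static String[] fieldsViaSplit(String line) {
        return line.split("\\|", 10);
    }

    public static void main(String[] args) {
        String line = "one|two||four||||||"; // ten fields, the last six empty
        System.out.println(fieldsViaTokenizer(line, "|").size()); // 10
        System.out.println(fieldsViaSplit(line).length);          // 10
        System.out.println(line.split("\\|").length);             // 4: no limit, trailing empties dropped
    }
}
```

The limit argument only caps the number of fields and preserves trailing empty ones; it does not pad a shorter line out to ten entries.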

Another common problem people have is that they want to use a delimiter of more than one character. E.g. they might see something like

    foo and bar and baz

and then try to use a StringTokenizer with delimiter string " and ". Except this doesn't work, because " and " means that space or a or n or d will be considered a delimiter, and the results will be:

    "foo", "b", "r", "b", "z"

rather than the intended

    "foo", "bar", "baz"

We can say this is the user's fault for failing to read the documentation for StringTokenizer before using it. But still, the way " and " is interpreted as a delimiter string is counterintuitive for most of us. And it would be nice if there were a way to handle multi-character words as delimiters. Again, the split() method handles this sort of thing easily.
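
A short sketch reproducing the " and " example above. The class and method names are made up; the input string is the one from the post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; names are illustrative only.
class WordDelimiterDemo {
    // Each character of " and " (space, a, n, d) is treated as a separate delimiter.
    static List<String> withTokenizer(String s) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, " and ");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // split() treats " and " as one regular expression, i.e. the whole word with its spaces.
    static List<String> withSplit(String s) {
        return Arrays.asList(s.split(" and "));
    }

    public static void main(String[] args) {
        System.out.println(withTokenizer("foo and bar and baz")); // [foo, b, r, b, z]
        System.out.println(withSplit("foo and bar and baz"));     // [foo, bar, baz]
    }
}
```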

For what it's worth, JDK 1.5 also offers the Scanner class, which offers the same basic functionality with a few more improvements. A Scanner makes it very easy to read from a file or other IO stream, and its API does not force you to load all the results into memory at once (which can be a problem if you're reading from a really big file). Plus it adds some methods giving you access to any groups matched in the regex you used as a delimiter, which gives you many more flexible options in text processing. For those who lament how many lines of code it takes them to process a simple text file in Java (as opposed to, say, Perl) - Scanner does a nice job of simplifying things.
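
As a rough illustration of the Scanner usage described above, the sketch below reads tokens one at a time from a Reader. A StringReader stands in for a file here, and the class name, method name and delimiter pattern are assumptions for the example. It uses try-with-resources, so it needs Java 7 or later rather than JDK 1.5.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class; names are illustrative only.
class ScannerSketch {
    // Pulls tokens lazily from the reader; only the current buffer is held by the Scanner.
    static List<String> tokens(Reader source) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(source).useDelimiter("\\s+and\\s+")) {
            while (sc.hasNext()) {
                out.add(sc.next()); // in real code, process each token here instead of collecting
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens(new StringReader("foo and bar and baz"))); // [foo, bar, baz]
    }
}
```

Collecting into a list is only for the demonstration; the point made above is that you can handle each token as it arrives.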

BTW, for those of you familiar with the new for loop: check out this RFE. Basically, this enhancement would allow us to write

rather than

Iterable is the new interface that allows us to use the new for loop syntax with a given construct. Yeah, it's a minor point. But what was the point of making Scanner implement Iterator if it isn't going to be Iterable? Seems like an oversight; easy to fix at this point. Please vote for this bug if you agree. Assuming you haven't already used your 3 votes on more important things.
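
For illustration, the sketch below shows the explicit hasNext()/next() loop that Scanner requires, next to a workaround that makes the for-each syntax usable. The workaround wraps the Scanner in a lambda and so assumes Java 8 or later; it is not the change the RFE proposes, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class; names are illustrative only.
class ScannerLoopSketch {
    // The explicit loop: Scanner is an Iterator<String>, not an Iterable<String>.
    static List<String> withWhile(String text) {
        List<String> out = new ArrayList<>();
        Scanner sc = new Scanner(text);
        while (sc.hasNext()) {
            out.add(sc.next());
        }
        return out;
    }

    // Workaround: Iterable has a single abstract method, iterator(), so a lambda
    // returning the Scanner itself can stand in for it (single pass only).
    static List<String> withForEach(String text) {
        List<String> out = new ArrayList<>();
        Scanner sc = new Scanner(text);
        Iterable<String> tokens = () -> sc;
        for (String token : tokens) {
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(withWhile("foo bar baz"));   // [foo, bar, baz]
        System.out.println(withForEach("foo bar baz")); // [foo, bar, baz]
    }
}
```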


"I'm not back." - Bill Harding, Twister
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

Thanks, Jim, for that deep information.

Of course I would say 'user fault'.
And of course I made the same mistake, when I was new to StringTokenizer.

But doesn't this build the community? You're burned from the same fire, and may show your injuries.
Well, of course 'split' has its own fire, since a newbie won't read 8 pages of regex syntax when trying to understand split ("\\|"), and will instead expect it to split around '|' and '\'.

Compactness seems to be a point, but Sun could decide to give StringTokenizer a 'toArray' or 'splitAll' - Method too, which returns an Array of Strings.
OK - I agree in advance, could but wouldn't.

The 'ST.nextElement'-Method looks very suspicious - hmmm.

In C there is a similar function 'strtok' - might be a kind of father for StringTokenizer.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12791
    
Since String.split() builds a new Pattern every time, it is bound to be slower than StringTokenizer. If you have a lot of Strings to operate on, building the Pattern once and using the Pattern split() method would be the way to go for maximum speed.
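
A minimal sketch of the "build the Pattern once" approach, assuming a pipe-delimited input. The class, constant and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class PrecompiledSplit {
    // Compiled once when the class is loaded and reused for every line.
    private static final Pattern PIPE = Pattern.compile("\\|");

    static String[] fields(String line) {
        return PIPE.split(line);
    }

    public static void main(String[] args) {
        for (String line : new String[] {"one|two||four", "a|b|c"}) {
            System.out.println(fields(line).length); // 4, then 3
        }
    }
}
```

Pattern objects are immutable and safe to share between threads, so a static final field is a common place to keep one.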
Bill
ashraf karim
Greenhorn

Joined: Jun 09, 2009
Posts: 2
StringTokenizer does not detect empty tokens, and that is sometimes beneficial.
I was trying to parse a string like " This is test " where only the words are needed.
StringTokenizer returns only the words.
But String.split("\\s+") still returns an empty token at the start.
Any comments?
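
A small sketch of the behaviour described in this post and one possible way around it (trimming before splitting). The class and method names are made up for the example.

```java
// Hypothetical demo class; names are illustrative only.
class WordsDemo {
    // Leading whitespace matches the delimiter at index 0, producing a leading "".
    static String[] naive(String s) {
        return s.split("\\s+");
    }

    // Trimming first removes the leading delimiter; the isEmpty check avoids
    // the single "" element that splitting an empty string would return.
    static String[] trimmed(String s) {
        String t = s.trim();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(naive(" This is test ").length);   // 4: "", This, is, test
        System.out.println(trimmed(" This is test ").length); // 3: This, is, test
    }
}
```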
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18874
    

ashraf karim wrote:any comments?


Thanks for the info.... but you do know that this topic is over 5 years old, right?

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Jill Iyer
Greenhorn

Joined: Jun 09, 2010
Posts: 8
How can we specify multiple delimiters???
Darryl Burke
Bartender

Joined: May 03, 2008
Posts: 4571
    



luck, db
There are no new questions, but there may be new answers.
Jan Cumps
Bartender

Joined: Dec 20, 2006
Posts: 2501
    

Henry Wong wrote:
Thanks for the info.... but you do know that this topic is over 5 years old, right?
...
Henry
Six years old now.


OCUP UML fundamental and ITIL foundation
youtube channel
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18874
    

Jill Iyer wrote:How can we specify multiple delimiters???


For StringTokenizer, there is a constructor, with a parameter, that allows you to specify possible delimiter characters.

String.split() takes a regular expression, which can be used to define everything from the very simplest of patterns to the ridiculously complex. A list of possible delimiter characters falls under the simple category.
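
A short sketch of both options, assuming comma and semicolon as the delimiters. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; names are illustrative only.
class MultiDelimDemo {
    // StringTokenizer: every character in the second argument is a delimiter.
    static List<String> withTokenizer(String s) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, ",;");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // String.split(): a regex character class lists the delimiter characters.
    static List<String> withSplit(String s) {
        return Arrays.asList(s.split("[,;]"));
    }

    public static void main(String[] args) {
        System.out.println(withTokenizer("a,b;c")); // [a, b, c]
        System.out.println(withSplit("a,b;c"));     // [a, b, c]
    }
}
```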

Henry
Sagy Drucker
Greenhorn

Joined: Dec 13, 2011
Posts: 3
I have run a few simple tests of string tokenizing.

The result is conclusive:
StringTokenizer is MUCH faster than regex or String.split().

Results for 1000 iterations on a large text (h:mm:ss.SSS):
StringTokenizer: 0:00:01.586
Pattern: 0:00:02.925
String.split: 0:00:02.776

This makes sense, since split uses the regex Pattern.

Hope this is useful.
Joanne Neal
Rancher

Joined: Aug 05, 2005
Posts: 3602
    
Jan Cumps wrote:
Henry Wong wrote:
Thanks for the info.... but you do know that this topic is over 5 years old, right?
...
Henry
Six years old now.

Seven years old now


Joanne
Sagy Drucker
Greenhorn

Joined: Dec 13, 2011
Posts: 3
oh, true.
i didn't notice.
i read it while googling stringTokenizer...
well.. never too late
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7892
    

Sagy Drucker wrote:the result is conclusive:
StringTokenizer is MUCH faster than regex, or String.split().
results:
for 1000 iterations on a large text:
StringTokenizer: 0:00:01.586 seconds
using pattern: 0:00:02.925 seconds...

So you've just spent an hour (I reckon it would take me at least that to write a comprehensive test) to prove that String.split() would take 1.2 seconds longer to check a thousand large strings than a class whose use has now been discouraged for 4 releases (I checked back to 1.4.2).

Optimization is fun, but it's worth remembering that your time is more valuable than any old computer's. You might also want to check out my quote below.

Winston


Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here
Sagy Drucker
Greenhorn

Joined: Dec 13, 2011
Posts: 3
I see your point, and you are correct.
But two things:
1. Writing a few for loops with split and StringTokenizer took me 5 minutes (10 minutes tops).

2. At my job, we need to process millions upon millions of strings, so even if it saves us a little time, we might feel it in the long run.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7892
    

Sagy Drucker wrote:1. writing a few for loops with split and stringTokenizer, took me 5 minutes. (10 minutes top)

Sounds like a fairly cursory test then.

2. at my job, we need to process millions of millions of strings, so even if it saves us a little time, we might feel it in the long run.

Hmmm. 20 minutes of computer time per million as against using a class that may well get deprecated? I think I'd let the machines chug a bit more myself, especially since this particular test is so...well...particular.

Between them, String.split(), java.util.regex.Pattern and java.util.regex.Matcher provide a lot more variety than you'll ever get out of StringTokenizer, and they also have the great advantage of being more familiar to new Java bods.

Winston
 