Hello... Does anyone have any suggestions on the fastest (and hopefully most efficient) way to parse a string? Let's say I have a string that is comma delimited, and I wanted to convert it to a Collection. Also, the elements in the string that are comma delimited are of unequal length. For example - item1,items22,item333,item55555 I was thinking of using an array of characters, but I don't know the speed implication of for loops versus creating sub-strings using String.substring(int,int). Any suggestions?
Ilja Preuss
author
Sheriff
Joined: Jul 11, 2001
Posts: 14112
posted
0
Use java.util.StringTokenizer - it's optimized for exactly this type of parsing. [ September 26, 2002: Message edited by: Ilja Preuss ]
The soul is dyed the color of its thoughts. Think only on those things that are in line with your principles and can bear the light of day. The content of your character is your choice. Day by day, what you do is who you become. Your integrity is your destiny - it is the light that guides your way. - Heraclitus
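For readers who have not used it, here is a minimal sketch of the StringTokenizer approach suggested above, applied to the sample string from the question. The class and method names (TokenizerExample, parse) are made up for illustration, and the code uses generics, which arrived in Java after this thread was written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, not taken from the thread.
class TokenizerExample {

    // Splits a comma-delimited string into a List using StringTokenizer.
    static List<String> parse(String input) {
        List<String> items = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, ",");
        while (st.hasMoreTokens()) {
            items.add(st.nextToken());
        }
        return items;
    }

    public static void main(String[] args) {
        // Sample input from the original question.
        System.out.println(parse("item1,items22,item333,item55555"));
    }
}
```

The tokens can be of any length; StringTokenizer only looks for the delimiter characters.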
Blake Minghelli
Ranch Hand
Joined: Sep 13, 2002
Posts: 331
posted
0
Just a warning about StringTokenizer if you have never used it before... The default behavior ignores empty "tokens". For example: "token1,token2,,token3" A StringTokenizer created on that string will return 3 tokens.
Blake Minghelli, SCWCD
"I'd put a quote here but I'm a non-conformist"
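The warning above is easy to check. The sketch below first counts the tokens in Blake's example string, then shows one possible workaround: asking StringTokenizer to return the delimiters as tokens too (the three-argument constructor) and inserting an empty field whenever two delimiters appear in a row. The class and method names are hypothetical, and the workaround assumes a single-character delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class, not taken from the thread.
class EmptyTokenDemo {

    // Default behavior: empty fields between adjacent delimiters are skipped.
    static int defaultCount(String input) {
        return new StringTokenizer(input, ",").countTokens();
    }

    // Workaround sketch: return delimiters as tokens and rebuild empty fields.
    // Assumes delim is a single character, e.g. ",".
    static List<String> fieldsKeepingEmpties(String input, String delim) {
        List<String> fields = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, delim, true);
        boolean lastWasDelim = true; // the start of the input acts like a delimiter
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            if (tok.equals(delim)) {
                if (lastWasDelim) {
                    fields.add(""); // two delimiters in a row: empty field
                }
                lastWasDelim = true;
            } else {
                fields.add(tok);
                lastWasDelim = false;
            }
        }
        if (lastWasDelim) {
            fields.add(""); // input was empty or ended with a delimiter
        }
        return fields;
    }

    public static void main(String[] args) {
        String input = "token1,token2,,token3";
        System.out.println(defaultCount(input));
        System.out.println(fieldsKeepingEmpties(input, ","));
    }
}
```

For "token1,token2,,token3" the default count is 3, while the workaround yields four fields with an empty string in the third position.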
Jim Yingst
Wanderer
Sheriff
Joined: Jan 30, 2000
Posts: 18670
posted
0
If you really want the fastest parsing possible, you can probably improve on StringTokenizer a little bit, because StringTokenizer spends a little bit of time checking for multiple delimiters, and even checking to see if the set of delimiters has changed since the last time nextToken() was called. You can omit this for your situation, and thereby speed things up a bit, I imagine. But I doubt you'll see a big difference, so don't spend too much time on it unless you're sure performance is a real problem.

I'd probably just store the input as a String, and use indexOf(',', startPos) to find delimiters, and substring(int, int) to create a String for each token. You could also store the input as a char[] array; I'm not sure if that will end up any faster or not. You'd have to try both ways and measure, I suppose.

Now in terms of development speed (rather than execution speed), the easiest solution is probably

String[] tokens = inputStr.split(",");

Try it; you may well find it's already fast enough for you. (You need to be using SDK 1.4 though.) It also fixes the annoying "feature" of StringTokenizer which Blake mentioned.
"I'm not back." - Bill Harding, Twister
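The indexOf/substring approach Jim describes might look roughly like the sketch below. The class and method names are hypothetical, and no claim is made here about how it compares in speed with StringTokenizer or split(); as Jim says, that has to be measured.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not taken from the thread.
class IndexOfSplitter {

    // Walks the string with indexOf and cuts out each token with substring.
    // Empty tokens (adjacent, leading, or trailing delimiters) are kept.
    static List<String> parse(String input, char delimiter) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = input.indexOf(delimiter, start)) != -1) {
            tokens.add(input.substring(start, end));
            start = end + 1; // skip past the delimiter
        }
        tokens.add(input.substring(start)); // remainder after the last delimiter
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(parse("item1,items22,item333,item55555", ','));
        System.out.println(parse("token1,token2,,token3", ','));
    }
}
```

One difference worth knowing when comparing this with split(","): split keeps empty strings in the middle of the input but drops trailing ones, so "a,b,".split(",") has two elements, whereas the sketch above returns three. Calling split(",", -1) keeps the trailing empty strings as well.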
Ron Ditch
Ranch Hand
Joined: May 16, 2002
Posts: 33
posted
0
Thanks Jim, that's what I was looking for.
Thomas Paul
mister krabs
Ranch Hand
Joined: May 05, 2000
Posts: 13974
posted
0
You should keep in mind that StringTokenizer was designed to parse Java programs. The delimiter to split on was assumed to be a space. The reason StringTokenizer behaves the way it does by default is that multiple consecutive spaces don't mean anything special in Java source.
Hi, what if I want to parse the records of a file? Wouldn't StringTokenizer be a killer? I want to monitor a log file and reformat the records for output based on a pattern submitted by a user.
Jim Yingst
Wanderer
Sheriff
Joined: Jan 30, 2000
Posts: 18670
posted
0
Tom's comment may be a bit misleading - it's possible to use StringTokenizer to parse a lot of things other than Java code. But it has a number of limitations - nowadays it's probably more powerful and flexible to learn how to parse using the classes in java.util.regex (at least, for anything more complicated than the split() method I showed above).
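As a rough illustration of the java.util.regex route for the log-file question, the sketch below reformats one record using a compiled Pattern and capturing groups. The log layout ("date time level message"), the output layout, and the class and method names are all assumptions made up for this example; the thread does not say what the real records or the user-submitted pattern look like.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not taken from the thread.
class LogLineReformatter {

    // Assumed record layout: "<date> <time> <level> <message>".
    // Compiled once and reused for every line.
    private static final Pattern LINE =
            Pattern.compile("^(\\S+) (\\S+) (\\S+) (.*)$");

    // Rearranges a record as "[level] date time: message".
    // Lines that do not match the assumed layout are returned unchanged.
    static String reformat(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return line;
        }
        return "[" + m.group(3) + "] " + m.group(1) + " " + m.group(2)
                + ": " + m.group(4);
    }

    public static void main(String[] args) {
        System.out.println(reformat("2002-11-11 12:00:01 ERROR Disk full"));
    }
}
```

Compiling the Pattern once and reusing it for each record avoids recompiling the expression per line, which matters when monitoring a large file.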
John Coffey
Greenhorn
Joined: Nov 11, 2002
Posts: 2
posted
0
I have some sample code to test out "log" parsing. It looks like StringTokenizer isn't too good as far as performance is concerned. Using jdk 1.4.1, I got the following results:
Can anyone come up with a faster version? Is there a better IO class? First a utility to create a big log file: