Hi all, I hope I don't cause too much trouble with this one, but here goes:
I am reading in a large collection of files (let's say ASCII .txt files, for the sake of this example). For my I/O, I am using JNI to do buffered reading in C land and get back a collection of Strings representing the lines of the files. This is sufficiently fast for my purposes, so I am satisfied with this part of the solution for now.
The question then is how to parse/process these lines in Java. Currently, too much time is being spent processing the files. I need to split/tokenize by good old ASCII 0x20 (space). The number of tokens on a line is the driving force in determining how long a line takes to process (duh!).
I have tried Pattern.split, String.split, an old class on the web called SimpleTokenizer, and the "legacy" StringTokenizer class. The StringTokenizer beats all the others hands down as far as speed goes for the task I am doing. With the number of files I have to process, there is no way I am going to use split even if it is considered proper Java.
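For anyone who wants to reproduce the comparison, here is a rough micro-benchmark along the lines of what I ran. It is only a sketch (the string, iteration count, and class name are made up, not my real data), and it doesn't use a proper harness, so take the numbers with a grain of salt:

```java
import java.util.StringTokenizer;

public class TokenizeBench {
    // Hypothetical sample line and iteration count, just for illustration.
    static final String LINE = "alpha beta gamma delta epsilon zeta eta theta";
    static final int ITERATIONS = 200_000;

    public static void main(String[] args) {
        // Accumulate token counts so the JIT can't discard the work entirely.
        long t0 = System.nanoTime();
        long splitTokens = 0;
        for (int i = 0; i < ITERATIONS; i++) {
            // Goes through the regex machinery on the JDKs of that era.
            splitTokens += LINE.split(" ").length;
        }
        long splitNs = System.nanoTime() - t0;

        t0 = System.nanoTime();
        long stTokens = 0;
        for (int i = 0; i < ITERATIONS; i++) {
            StringTokenizer st = new StringTokenizer(LINE, " ");
            while (st.hasMoreTokens()) {
                st.nextToken();
                stTokens++;
            }
        }
        long stNs = System.nanoTime() - t0;

        System.out.printf("String.split:    %d tokens, %d ms%n",
                splitTokens, splitNs / 1_000_000);
        System.out.printf("StringTokenizer: %d tokens, %d ms%n",
                stTokens, stNs / 1_000_000);
    }
}
```

On my files, StringTokenizer comes out well ahead; a single timed loop like this is crude (no warm-up separation, no statistical repeats), but the gap was big enough that it didn't matter.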
I suppose my question is: does anything faster than StringTokenizer exist? Way back when (2003), SimpleTokenizer supposedly beat StringTokenizer, but now in 2010 I am not finding that to be the case.
Just to throw out a code snippet, here is what I am doing:
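The snippet itself didn't make it into the post, so here is a hedged reconstruction of the StringTokenizer approach described above. The class and method names are mine, not the poster's; it just shows splitting a line on ASCII 0x20 as the text describes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class LineParser {
    // Hypothetical sketch of the approach described in the post --
    // tokenize one line on a single space (ASCII 0x20).
    static List<String> tokenize(String line) {
        StringTokenizer st = new StringTokenizer(line, " ");
        // countTokens() lets us pre-size the list and avoid regrowth.
        List<String> tokens = new ArrayList<>(st.countTokens());
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Note: StringTokenizer collapses runs of delimiters, so the
        // double space below yields no empty token.
        System.out.println(tokenize("foo bar  baz")); // prints [foo, bar, baz]
    }
}
```

One behavioral difference worth remembering: unlike String.split, StringTokenizer never returns empty tokens for consecutive delimiters, which may or may not be what a given file format needs.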