JavaRanch » Java Forums » Java » Performance
performance issue with substring method

saravanan s
Greenhorn

Joined: Nov 05, 2001
Posts: 5
Hi,
I have to extract a substring from a large string, so I am using the substring() method. But it is taking a lot of time. Is there any alternative way to deal with this issue?
Regards
saran
Peter den Haan
author
Ranch Hand

Joined: Apr 20, 2000
Posts: 3252
Originally posted by saravanan s:
I have to extract a substring from a large string, so I am using the substring() method. But it is taking a lot of time.
Actually, String.substring() is blazingly fast. The only thing that would be faster still is to avoid object creation completely by keeping track of the substring start and end positions in variables.
Are you sure it's the substring() operation that is the bottleneck, and not string searching or garbage collection?
- Peter
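Peter's index-tracking suggestion can be sketched like this. This is a minimal illustration, not code from this thread; the sample string and variable names are invented. The idea is to examine a region of the string in place, via its start and end offsets, rather than materialising it as a new String object:

```java
public class SubstringVsIndices {
    public static void main(String[] args) {
        // Invented sample data for illustration.
        String line = "foo=00042;bar=7";

        // substring() allocates a new String object for the region.
        int start = line.indexOf('=') + 1;
        int end = line.indexOf(';', start);
        String value = line.substring(start, end); // new object: "00042"

        // Tracking indices avoids the allocation entirely:
        // inspect the region in place with charAt().
        boolean allZero = true;
        for (int i = start; i < end; i++) {
            if (line.charAt(i) != '0') { allZero = false; break; }
        }
        System.out.println(value + " allZero=" + allZero);
    }
}
```

Whether this pays off depends on whether object creation is actually the bottleneck, which is exactly the profiling question raised above.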
saravanan s
Greenhorn

Joined: Nov 05, 2001
Posts: 5
Thank you very much for your reply.
Yes, only the substring() method is taking a lot of time, because I have to process some 1 million records.
The statement that is consuming the most time is:
tempStr = tempStr.append(rLine.substring(n, ilength)).append(value).append(endelements[k]);
Here rLine is 20,000 bytes long, so I have to retrieve the element at a particular position.
Regards
Saran
Peter den Haan
author
Ranch Hand

Joined: Apr 20, 2000
Posts: 3252
Originally posted by saravanan s:
tempStr = tempStr.append(rLine.substring(n, ilength)).append(value).append(endelements[k]);
Are you sure it's this line that takes so much time? Sure as in "profiler"? Because there's nothing here that should take an awful lot of time, except possibly the toString() conversion of "value" and "endelements[k]" if they are objects with complex or inefficient toString() implementations.
What may take a significant amount of time is garbage collection. The substring() method creates a String object for each invocation (backed by the rLine char array). Unless they are already Strings, conversion of "value" and "endelements[k]" will create intermediate String objects as well. You may be creating more intermediate Strings in the surrounding code, which means millions of little objects all in all - this may well swamp the garbage collector.
Unless a profiler has shown the single line quoted above to be the bottleneck, and nothing else, I'd like to see the rest of your loop.
- Peter
[This message has been edited by Peter den Haan (edited November 05, 2001).]
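For what it's worth, later JDKs make it possible to avoid the temporary String from substring() altogether: since Java 5, StringBuilder and StringBuffer can append a region of a CharSequence directly. A minimal sketch, with invented sample data standing in for rLine:

```java
public class AppendRegion {
    public static void main(String[] args) {
        // Invented sample line; stands in for the poster's rLine.
        String rLine = "HEADER<TAG>00042</TAG>TRAILER";
        int n = rLine.indexOf('>') + 1;       // start of the tag content
        int ilength = rLine.indexOf("</", n); // end of the tag content

        StringBuilder tempStr = new StringBuilder();

        // Original pattern: rLine.substring(n, ilength) creates a
        // temporary String just to feed it to append().
        // Since Java 5 the region can be appended directly instead:
        tempStr.append(rLine, n, ilength);

        System.out.println(tempStr); // prints 00042
    }
}
```

This removes one intermediate String per append, which matters when the loop runs millions of times.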
saravanan s
Greenhorn

Joined: Nov 05, 2001
Posts: 5
Hi,
This is my code. I don't know what is wrong with it.

Regards
Saran

[This message has been edited by Jim Yingst (edited November 05, 2001).]
Peter den Haan
author
Ranch Hand

Joined: Apr 20, 2000
Posts: 3252
Originally posted by saravanan s:
This is my code.I don't know what is wrong with this.
Well, I don't know either (as in "profiler" ) but I can spot a few things that are likely to be far more important than the line of code you showed. I hope others will be able to shed more light.
What I'll do is first highlight a few pieces of code and then try to look at the bigger picture (from low importance to high importance).
Surely the m2 search should start at position m1 + 10, and that only if m1 >= 0. This is temporary string #1, by the way.
Temporary string #2 (something proportional to elements.length times).
Yikes! That array needs to be allocated and initialised every time. To wipe your StringBuffer, don't null it, delete(0, tempStr.length()) it.
Temporary strings #3 and #4 (something proportional to elements.length times). If your file contains 1 million lines, and there are a few million tags, you're creating millions upon millions of temporary strings.
I'm not entirely sure what is more of a bottleneck: the object creation, the giant StringBuffer, or the exhaustive string searches you're doing.
My first line of attack, however, would be the algorithm. Unless the elements array contains just two or three elements, there's a lot to be gained there. Isn't there some way you can recognize elements? Are they words? Are they tags? If so, then you can loop through your line in search of possible elements, isolate each candidate and try to find it in a HashMap with the elements (take care not to go overboard on object creation, though). There's a good chance a single run through rLine will suffice to process all elements - it depends a bit on what the code is supposed to do.
I must confess I find it pretty hard to see what that is; OK, it's obviously looking for tags and doing things with their contents, but...
Anyway, I hope this helps.
- Peter
saravanan s
Greenhorn

Joined: Nov 05, 2001
Posts: 5
Hi peter,
Thank you very much for your reply.
I have tried your suggestion of removing the content of the string buffer instead of assigning null, but it is degrading the performance! I don't know why.
My objective in this program:
There are some 150 tags in each record, and I have to remove the leading zeros from some 80 tags' values. For that job I am using the above code. The average record length is 20,000 bytes and the number of records to process is 10 million. To process 5,000 records the above code takes 114 seconds. Is there any way, code, or algorithm available to optimize the performance?
Thanks for your valuable suggestion peter.
Regards
saran
Peter den Haan
author
Ranch Hand

Joined: Apr 20, 2000
Posts: 3252
Instead of StringBuffer.delete(), I should've said StringBuffer.setLength(0). Not sure what I was thinking yesterday. Not that delete() should be slow, but setLength() is cleaner and a bit faster still.
Ah, someone helpfully revived an old thread that jogged my memory. It was not such a hot idea after all, sorry. (You can prevent the performance decrease by using tempStr.substring(0) instead of tempStr.toString(), but the net gain from the entire operation will be pretty limited, if anything). Read the thread if you'd like to understand why.
Anyway, you have 10 million records, 150 tags, 80 of which you must recognise, and the purpose is to delete the leading zeros of the numeric values contained in the tags. That's 1,500,000,000 iterations of your for loop, half a dozen temporary objects per iteration... good grief. I'm sure this can be sped up considerably. Let me have a think.
- Peter

[This message has been edited by Peter den Haan (edited November 06, 2001).]
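To make the buffer-reuse point concrete, here is a minimal sketch of the setLength(0) pattern, with invented sample data. (On pre-Java-5 JDKs, where StringBuffer.toString() shared the buffer's char array, clearing a buffer after toString() forced a defensive copy - that is the performance decrease discussed above.)

```java
public class BufferReuse {
    public static void main(String[] args) {
        String[] records = { "007", "0042", "19" }; // invented sample data
        StringBuffer buf = new StringBuffer(64);

        for (String rec : records) {
            buf.setLength(0); // reuse the existing backing array
            buf.append("<rec>").append(rec).append("</rec>");
            System.out.println(buf);
        }
        // Allocating new StringBuffer() per record would create a fresh
        // backing array on every iteration; setLength(0) keeps the old one.
    }
}
```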
Peter den Haan
author
Ranch Hand

Joined: Apr 20, 2000
Posts: 3252
The clue to improving performance is to get rid of your loop over the tags (elements). Unfortunately you didn't tell us what your tags look like or how they are structured. For simplicity I'm going to assume that a tag starts with '<' and ends with '>', cannot contain '>' itself (not even quoted or escaped), and that the tags you're interested in aren't nested.
The "tagHandlerMap" is a HashMap that maps tag names (e.g. "STRIPTHIS" for the "<STRIPTHIS>" tag) to a TagHandler object that does whatever processing needs doing. An implementation that recognises an end tag conforming to the XML convention (e.g. "</STRIPTHIS>") and strips leading zeroes from its content could be written along those lines. The constructor takes the name of the tag the handler is for. This handler only processes the content if it can find the corresponding end tag. For each tag you need to process, you instantiate a handler and put it in the tagHandlerMap.
The code outlined above (warning: improvised, so it won't work first time and probably won't even compile) should perform a lot better. It only performs a single run over rLine and creates fewer intermediate objects in each iteration. And if you need to perform different types of processing in the future, it's also a lot easier to enhance. You can shave off a few more CPU cycles, but probably only at the expense of legibility.
I made a few assumptions that may or may not be true. If tags can be nested, you need to replace "sb.append(rLine.substring(pos, endTagStart));" with a recursive call to the tag processing code. If end tags do not follow the XML convention, it is straightforward to modify the tag handler constructor to take arbitrary start and end tags. If tags are actually words delimited by whitespace or something else, you have to chop up your rLine according to whatever the delimiters are. If tags don't have clear delimiters, all is not lost either (but I'll go into that in a follow-up article if necessary).
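The code blocks from this post did not survive, so the following is only one possible reconstruction of the approach described, under the stated assumptions ('<'/'>' delimited, unnested tags, XML-style end tags). All class and method names (TagProcessor, TagHandler, process, handle) are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the handler-based design described above; the original
// code was lost, so names and structure here are assumptions.
public class TagProcessor {

    /** Strips leading zeros from the content of one named tag. */
    static class TagHandler {
        private final String endTag;

        TagHandler(String tagName) {
            this.endTag = "</" + tagName + ">";
        }

        /** Returns the position just past the end tag, or -1 if absent. */
        int handle(String line, int contentStart, StringBuilder out) {
            int endTagStart = line.indexOf(endTag, contentStart);
            if (endTagStart < 0) return -1; // no end tag: leave untouched
            int i = contentStart;
            // Strip leading zeros but keep at least one digit.
            while (i < endTagStart - 1 && line.charAt(i) == '0') i++;
            out.append(line, i, endTagStart);
            out.append(endTag);
            return endTagStart + endTag.length();
        }
    }

    public static String process(String rLine, Map<String, TagHandler> handlers) {
        StringBuilder out = new StringBuilder(rLine.length());
        int pos = 0;
        while (pos < rLine.length()) {
            int lt = rLine.indexOf('<', pos);
            if (lt < 0) { out.append(rLine, pos, rLine.length()); break; }
            int gt = rLine.indexOf('>', lt);
            if (gt < 0) { out.append(rLine, pos, rLine.length()); break; }
            out.append(rLine, pos, gt + 1);          // copy up to the tag
            String name = rLine.substring(lt + 1, gt);
            TagHandler h = handlers.get(name);       // single HashMap lookup
            int next = (h != null) ? h.handle(rLine, gt + 1, out) : -1;
            pos = (next > 0) ? next : gt + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, TagHandler> map = new HashMap<>();
        map.put("AMT", new TagHandler("AMT"));
        String in = "<REC><AMT>00042</AMT><NAME>Bob</NAME></REC>";
        System.out.println(process(in, map));
        // → <REC><AMT>42</AMT><NAME>Bob</NAME></REC>
    }
}
```

A single pass over rLine replaces the per-tag loop, and the only per-tag allocation left is the small tag-name substring used for the HashMap lookup.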
Hope this does give you some ideas if nothing else.
- Peter
[This message has been edited by Peter den Haan (edited November 06, 2001).]
saravanan s
Greenhorn

Joined: Nov 05, 2001
Posts: 5
Hi peter,
Thank you very much for your effort. I tried the option of setting tempStr.setLength(0);
it is not giving performance like new StringBuffer(), but it is not degrading performance like delete(0, length) either.
Very good suggestion, Peter. I will try your logic and will let you know the result. Would you please give me your e-mail id, so that it would be helpful for me to get your valuable suggestions?
Thank you very much for your suggestions.
Regards
Saran
 