GeeCON Prague 2014*
The moose likes Performance and the fly likes String find and replace Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


JavaRanch » Java Forums » Java » Performance
Bookmark "String find and replace" Watch "String find and replace" New topic
Author

String find and replace

B M Tejaswi
Greenhorn

Joined: Jul 02, 2012
Posts: 11
Am trying to write code which replaces the words in really large input string with the encoded word.The encoded pair is given. I would like you guys to give me some inputs as to how it can be done better(optimization). Below is the code


Input string is :
This component is standard compliant|3rdparty api fails===component=C|standard=S|3rdparty=3|compliant=I

output :
This C is S I|3 api fails

string after '===' is the encoding pairs from which the LHS of '===' is encoded.

Any suggestions?
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14193
    
  20

Can you isolate the problem? It's a lot of work to try to understand all the code you posted and things are missing, such as your Constants class, which would be especially interesting since you're passing values that are defined in there to String.split(...).

Can you give a concrete example of where String.split(...) doesn't do what you expected? What are the exact inputs and outputs?


Java Beginners FAQ - JavaRanch SCJP FAQ - The Java Tutorial - Java SE 8 API documentation
B M Tejaswi
Greenhorn

Joined: Jul 02, 2012
Posts: 11
Thanks for the reply.
AM able to achieve the expected output given the input..the only problem is how it can be optimized/how it can be further enhanced so that the runtime might reduce given a very long input string.
Below is the code snippet for Constants class:
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Tejaswi Bm wrote:Thanks for the reply.
AM able to achieve the expected output given the input..the only problem is how it can be optimized/how it can be further enhanced so that the runtime might reduce given a very long input string.


Have you measured the performance and found that it doesn't meet your already-defined, specific requirements? If not, then there's no point in trying to optimize it. You'd just be guessing at what might be wrong and trying to make it "better" without having an actual target.

Having said that, one thing does pop to mind. I'm not going to read your code, but if you're using String.split() or String.replaceAll(), you're compiling the regex every time. If you're reusing the same regex then you might get better performance by calling Pattern.compile() once, reusing the same Pattern each time, and then creating a Matcher for each input String and calling its methods as needed.

B M Tejaswi
Greenhorn

Joined: Jul 02, 2012
Posts: 11
Thanks Jeff!!
But i dont think my code is optimized and can handle really huge input string. I have used a HashMap to store the tokens, but i feel the same can be achieved with 'something else'. I am not aware of that 'something else'. So if you could point me in right direction??
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    
    8

My first thought would be to throw out your replace() method and simply use the replace() method which is already built into the String class.

Quite often people will write their own code rather than use built-in methods, disregarding the fact that the built-in methods were written by people with more experience and have been improved over the years. Often that's a mistake. At any rate you now ought to compare your code to the built-in code and see which is more "optimized" (by which it looks like you mean runs faster, rather than uses less memory or something else).
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    
    6

Tejaswi Bm wrote:Thanks Jeff!!
But i dont think my code is optimized and can handle really huge input string. I have used a HashMap to store the tokens, but i feel the same can be achieved with 'something else'. I am not aware of that 'something else'.


Why? In what way does HashMap not meet your needs, or in what way does it seem inappropriate to you? There may be something better, but I can't recommend anything without a clear picture of how HashMap is falling short.

If you're associating a key with a value, Map is the way to go. If you don't care about the iteration order of the keys, values, or pairings, then HashMap is the default workhorse Map implementation to choose.
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11351
    
  16

Tejaswi Bm wrote:I don't think my code is optimized and can handle really huge input string.

without actual evidence - actual measurements of how long things take - what you "think" is irrelevant. this kind of this has been discussed ad nauseam in the performance forum (where this thread really belongs).

Never try and do optimization without measuring your performance. You will ALWAYS be served better writing clean, readable, understandable code first. OK...second. FIRST you should define what your performance requirements are. Then write your clean, readable, understandable code. Then test it to see if it meets your requirements.

If it does - you're done.

If not, use a profiler to find out exactly where it is slow, and spend your time optimizing that.


There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12791
    
    5
Personally I would read the input String as a stream, parsing out words and writing to an output stream - for each word, do a hash table lookup to see if it has a replacement - if so substitute in the output stream. Completely insensitive to problems of string replace in a really large string in memory.

Things to watch out for - case differences, word separators.

Bill
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7892
    
  21

Tejaswi Bm wrote:But i dont think my code is optimized and can handle really huge input string. I have used a HashMap to store the tokens, but i feel the same can be achieved with 'something else'. I am not aware of that 'something else'. So if you could point me in right direction??

Well, like the others I suspect you're obsessing about something which you don't yet know is a problem.

However, if you're absolutely bound and determined to research this problem as an exercise. I'd say they are a few things you need to concentrate on:
1. Searching: I'm not sure about how regexes (ie, Matcher.find()) work, but I do know that indexOf() is sometimes not an optimal search because it's sequential. For more general searches you may want to look at something like the Boyer-Moore-Horspool algorithm, but in your case (searching for a delimiter), it's almost certainly overkill, and I would say indexOf() should be be just fine.

2. Storage of your "token" information: A HashMap sounds perfectly reasonable for your tokens and replacements, but I'd say you don't want to split() the String you're searching.

My advice, starting from the beginning of the String:
a. Find a delimiter.
b. Check if the substring before it matches any of your tokens to be replaced. If it does, copy the replacement token to the output; otherwise just copy the substring.
c. Copy the delimiter to the output.
And repeat that with the substring starting at the character after the delimiter until there are no more delimiters to find. Chances are that you'll still have a "last token" to process, in which case do check (b) on it.

In your case you may have a little bit more work to do, but the basic pattern is the same, and it will run in O(n) time.

Winston


Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here
B M Tejaswi
Greenhorn

Joined: Jul 02, 2012
Posts: 11
With the suggestions that i have got from the learned ones in this forum, i tried to alter the code and have come up with the following code :


I have performed some test and below are the reults-
Number of input words in string = 120000
Number of total encoding pairs = 4
Number of encoded words in output = 20000
Number of iterations = 1000
Total time in millis = 1009033
Avg time in millis = 1009

but am not happy with the results..I want an avg time of around 600 millis..please help
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7892
    
  21

Tejaswi Bm wrote:With the suggestions that i have got from the learned ones in this forum, i tried to alter the code
...
but am not happy with the results..I want an avg time of around 600 millis..please help

Well first, your coding style could do with some improvement: Specifically, don't use '_' in field names. It may be great for Python, but it looks awful to most Java programmers. Also, StringTokenizer is a legacy class, and its use is discouraged. You can do exactly the same thing with a Scanner; although either one will be quite a bit slower than the technique I suggested in my previous post.

It seems to me that you're overthinking this thing by a mile. Substitution is very straightforward:
1. Find a token.
2. Check if it's in your HashMap:
  • if it is, add its substitute to the output.
  • 3. If it isn't, check if it's numeric:
  • if it is, add its "encrypted" version to the output.
  • 4. Otherwise, add it to the output as is.

    It also appears that you're replacing all your delimiters with spaces. That's fine, but I didn't see any suggestion that that's what you wanted.

    My advice: Write down ALL the rules for your "replace" in English before you write another line of code.

    Also: have a look at String.substring(). In general, its the fastest way of returning a substring (or eliminating part of a String) because it doesn't involve any copying of the internal character array.

    I'd also suggest that a numericTokenMap is overkill, and probably a lot slower than simply substituting numeric characters from an array, eg:particularly if you're only interested in the characters '0' to '9' (which is NOT the same thing as a 'numeric' character).

    HIH

    Winston
    William Brogden
    Author and all-around good cowpoke
    Rancher

    Joined: Mar 22, 2000
    Posts: 12791
        
        5
    My advice: Write down ALL the rules for your "replace" in English before you write another line of code.
    ++ good advice

    Be sure to include rules for handling recognizing words with differences in case/CASE/Case ...

    Your method is slow because it involves creation of a number of objects and tries to start with the full string in memory - for speed go with stream processing as Winston and I suggested.

    Bill
    B M Tejaswi
    Greenhorn

    Joined: Jul 02, 2012
    Posts: 11
    But i cannot read the input as stream since i need to prepare the encoding pairs..so i have to read the entire input string and then split based on delimiter!correct me if am wrong..if stream can be tokenized then please do guide me with the code.
    William Brogden
    Author and all-around good cowpoke
    Rancher

    Joined: Mar 22, 2000
    Posts: 12791
        
        5
    In your OP, you said the "encoding pair" is given. Where does a "encoding pair" come from?

    The Java io package has the StringReader class which will let you read a String as a stream of characters.

    Have you followed Winston's advice to write down ALL the rules of your desired "replace" function?

    Bill

    B M Tejaswi
    Greenhorn

    Joined: Jul 02, 2012
    Posts: 11
    Encoding pair is given as part of input string. Encoding pair is separated by "===" in the input string.
    Winston Gutkowski
    Bartender

    Joined: Mar 17, 2011
    Posts: 7892
        
      21

    B M Tejaswi wrote:Encoding pair is given as part of input string. Encoding pair is separated by "===" in the input string.

    Right, so there is some sort of format to your input that your program will have to deal with. This is all part of the "rules" that I mentioned above.

    Have you written them ALL down?

    Winston
     
    GeeCON Prague 2014
     
    subject: String find and replace