It's not a secret anymore!*
The moose likes Performance and the fly likes String optimization weirdness Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login
JavaRanch » Java Forums » Java » Performance
Bookmark "String optimization weirdness" Watch "String optimization weirdness" New topic
Author

String optimization weirdness

Arron Ferguson
Greenhorn

Joined: Feb 01, 2005
Posts: 28
Ok, so I thought I know Java ... but I guess I don't. I altered some code I grabbed from a Google search, altered it to fit my needs and came up with this:

import java.util.Hashtable;
import java.util.Enumeration;

public class EntityResolver2
{
private static Hashtable<String, String> XMLEntityLookupTable;

private static String temp = null;

static
{
XMLEntityLookupTable = new Hashtable<String, String>();
XMLEntityLookupTable.put("&", "&");
XMLEntityLookupTable.put("'", "'");
XMLEntityLookupTable.put(">", ">");
XMLEntityLookupTable.put("<", "<");
XMLEntityLookupTable.put("\"", """);
}

public static String replaceIllegalCharacters(String str)
{
// get the keys:
Enumeration<String> illegalChars = XMLEntityLookupTable.keys();

temp = str;

while(illegalChars.hasMoreElements())
{
// got the keys
String illegalChar = illegalChars.nextElement();
String entityReplacement = XMLEntityLookupTable.get(illegalChar);
temp = replace(temp, illegalChar, entityReplacement);
}

return temp;
}

private static String replace(String str, String pattern, String replace)
{
int s = 0;
int e = 0;

StringBuffer result = new StringBuffer();

while ((e = str.indexOf(pattern, s)) >= 0)
{
result.append(str.substring(s, e));
result.append(replace);
s = e + pattern.length();
}
result.append(str.substring(s));
return result.toString();
}

}

After looking at if for a while I decided, hey, this could be optimized. After all, I'm using a Hashtable (synchronized), Enumeration (also synchronized), StringBuffer (another one that's synchronized) a couple of while loops, two method calls within the class and a whole whack of method calls to other objects of the other classes (the synchronized ones). So, I remembered my techniques from the old C programming days and decided to go "old school" on this thing and used char arrays like this:

public class EntityResolver
{
private static char[] cdataArray;

private static char[] temp;

public static String replaceIllegalCharacters(String cdata)
{
cdataArray = cdata.toCharArray();

int size = cdataArray.length;
for(int i = 0; i < size; i++)
{
char c = cdataArray[i];
switch(c)
{
case '<':
replace('<', i, "<");
break;
case '>':
replace('>', i, ">");
break;
case '\'':
replace('\'', i, "'");
break;
case '"':
replace('"', i, """);
break;
case '&':
replace('&', i, "&");
break;
}
size = cdataArray.length;
}

return new String(cdataArray);
}

private static void replace(char replace, int index,
String replacement)
{
temp = new char[cdataArray.length -1 + replacement.length()];
int tempLength = temp.length;

int i = 0;
for(; i < index; i++)
{
temp[i] = cdataArray[i];
}

char[] replacementChars = replacement.toCharArray();
int replacementCharsLength = replacementChars.length;
for(int j = 0; j < replacementCharsLength; j++)
{
temp[i] = replacementChars[j];
i++;
}

index++;
for(int k = i; k < tempLength; k++)
{
temp[k] = cdataArray[index];
index++;
}

cdataArray = temp;
}

}

I mean this has got to be faster right? I mean, I'm not using objects, I've severely culled my method calls which can be quite costly in performance and I've created local variables for the length values for arrays to cut down on referencing. The compare values are char literals instead of sitting in a synchronized hashing function class (ala HashTable). Maybe I've just been up too long but I have no idea why the char array version is slower.

They both functionally perform the same operation and they both work. By the way, what they are doing is accepting a string (of any size), scanning it for illegal XML characters ('<', '>', '&', ''', '"') and replacing them with entities ('<', '>', '&', ''', '"'). and returning the "legal" string.



Bored? Any takers?

a.

Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24187
    
  34

Is there some reason you're not using the String.replace() method? The vast majority of your code could be replaced by a short loop that calls this method once for each escapeable character.


[Jess in Action][AskingGoodQuestions]
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18977
    
  40

After all, I'm using a Hashtable (synchronized), Enumeration (also synchronized), StringBuffer (another one that's synchronized) a couple of while loops, two method calls within the class and a whole whack of method calls to other objects of the other classes (the synchronized ones).


I am assuming that you are using JDK 1.5, since you are using generics.

First, with JDK 1.5, synchronization has been greatly improved. In fact, the time to obtain an uncontended lock is almost non-existant. Since you are single threaded, all of your lock grabs are uncontended.

Second, the optimization has also been improved. I too, am "old school", but have learned that it is not very productive, trying to beat the optimizer.

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Moving to our Performance forum...


The soul is dyed the color of its thoughts. Think only on those things that are in line with your principles and can bear the light of day. The content of your character is your choice. Day by day, what you do is who you become. Your integrity is your destiny - it is the light that guides your way. - Heraclitus
Arron Ferguson
Greenhorn

Joined: Feb 01, 2005
Posts: 28
Thank you for the replies. Sorry I didn't put this in the performance forum in the first place. But I still haven't seen any silver bullet that clearly, definitively explains why this happens. On the contrary, the rabbit hole is getting deeper. Ernest, you suggested the replace method in the String class. I assume you're talking about the:



At line 2018 (String.java). This replace method (if you look through the JDK src), calls Pattern.compile(...), which, in turn, creates a new Pattern object (line 840 in Pattern.java) which calls its own private constructor (line 1115 of Pattern.java), which calls its private compile method, which calls normalize, which finally calls up sun.text.Normalizer by saying:



Although there's no src for the Normalizer, this isn't by any means where the buck stops since this method (still inside of the normalize method) calls up a StringBuilder (actually a couple of them). Although I don't have a code analyzer handy, I suspect that the "Java-way" of doing this creates dozens, probably even hundreds of objects to get the task done - the Pattern class alone has 81 inner classes.

Out of all of the classes that I've seen referenced (I can't trace the Normalizer since it's source code is not available), only the String class's method is native and it's only called in toUpperCase and toLowerCase - neither of which I'm making use of.

Even more confusing than it was before now that I've looked through some of the code.


... Why oh why didn't I take the blue pill
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24187
    
  34

As you've recounted, compiling a regular expression can take a lot of work. But you can compile it once, and use it many times, and you gain back some of what you've lost in efficient search and replace. In your case, your "regular expressions" are just simple string matches, and so compiling them is cheap, anyway.

I don't actually know about Java's regexp engine, but a decent regexp engine can use something like the Boyer-Moore search algorithm to find matches. Boyer-Moore is considerably faster then the "look at every character" technique you're using for search strings longer than one character; in your case, though, I guess that's a wash.

But my point is that the first thing you want to do when you're coding is to write the simplest possible routine, one that uses the standard library when possible. When you run the program, if it is determined that the performance isn't good enough, then and only then should you start trying to fine-tune things by replacing standard algorithms.

It seems like you've got a fear of small objects, and a fear of allocation, and a fear of function calls, and a fear of a whole bunch of little things that JVMs are very deliberately optimized so as to make them as cheap as possible. Don't worry about allocating lots of little objects until you've written a program, profiled it, and measured that it's too slow because it's creating lots of little objects. Worrying about it ahead of time just leads to crufty, unnecessarily complicated code that people will curse when it comes time to maintain.
Arron Ferguson
Greenhorn

Joined: Feb 01, 2005
Posts: 28
Ernest,

Thanks for that reply. Yes, you are right about the regular expressions - compiling them is good and is what the runtime does. Of course the whole treatment of strings in Java is quite specialized too.

I'm not really afraid of objects, function calls, etc., believe it or not this was one of my few non-object solutions but it was one of those things that was "bugging me" why it gave such bad performance. This has reminded me of just how far it's all come and once again that it's very difficult to outsmart the VM.
Cormac Blackwell
Greenhorn

Joined: Oct 14, 2005
Posts: 5
Well you are creating a new char array in a loop and everytime you find a character you want to replace you are copying the whole string!
Try creating a single char array at the beginning which is twice the size you think you will need, then for each char either copy it or the replacement in (and resize the output array only if necessary and double the size if you have to resize it at all). Then you can contruct a string from the section of the array you have used. Also you probably should store the replacement strings as char arrays if you are just going to call toCharArray on them.

Actually I have a vague memory that toCharArray isn't that fast so you could be better off switching on string.charAt and appending to a single StringBuilder.
Harald Kirsch
Ranch Hand

Joined: Oct 14, 2005
Posts: 37
Hi,

in your "old-style" solution you copy lots of characters from one array to another in a for-loop, like

You could try to replace this with System.arraycopy calls and see if it is still much slower than the HashMap-based solution.

And apart from that, lets count how often you loop over the whole cdata string. You do it once in replaceIllegalCharacters plus once for each offending character. The HashMap-solution loops only 5 times.

Harald.


Harald.
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: String optimization weirdness