• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Ron McLeod
  • Paul Clapham
  • Tim Cooke
  • Devaka Cooray
Sheriffs:
  • Liutauras Vilda
  • paul wheaton
  • Rob Spoor
Saloon Keepers:
  • Tim Moores
  • Stephan van Hulst
  • Tim Holloway
  • Piet Souris
  • Mikalai Zaikin
Bartenders:
  • Carey Brown
  • Roland Mueller

String optimization weirdness

 
Greenhorn
Posts: 28
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Ok, so I thought I know Java ... but I guess I don't. I altered some code I grabbed from a Google search, altered it to fit my needs and came up with this:

import java.util.Hashtable;
import java.util.Enumeration;

public class EntityResolver2
{
private static Hashtable<String, String> XMLEntityLookupTable;

private static String temp = null;

static
{
XMLEntityLookupTable = new Hashtable<String, String>();
XMLEntityLookupTable.put("&", "&");
XMLEntityLookupTable.put("'", "'");
XMLEntityLookupTable.put(">", ">");
XMLEntityLookupTable.put("<", "<");
XMLEntityLookupTable.put("\"", """);
}

public static String replaceIllegalCharacters(String str)
{
// get the keys:
Enumeration<String> illegalChars = XMLEntityLookupTable.keys();

temp = str;

while(illegalChars.hasMoreElements())
{
// got the keys
String illegalChar = illegalChars.nextElement();
String entityReplacement = XMLEntityLookupTable.get(illegalChar);
temp = replace(temp, illegalChar, entityReplacement);
}

return temp;
}

private static String replace(String str, String pattern, String replace)
{
int s = 0;
int e = 0;

StringBuffer result = new StringBuffer();

while ((e = str.indexOf(pattern, s)) >= 0)
{
result.append(str.substring(s, e));
result.append(replace);
s = e + pattern.length();
}
result.append(str.substring(s));
return result.toString();
}

}

After looking at if for a while I decided, hey, this could be optimized. After all, I'm using a Hashtable (synchronized), Enumeration (also synchronized), StringBuffer (another one that's synchronized) a couple of while loops, two method calls within the class and a whole whack of method calls to other objects of the other classes (the synchronized ones). So, I remembered my techniques from the old C programming days and decided to go "old school" on this thing and used char arrays like this:

public class EntityResolver
{
private static char[] cdataArray;

private static char[] temp;

public static String replaceIllegalCharacters(String cdata)
{
cdataArray = cdata.toCharArray();

int size = cdataArray.length;
for(int i = 0; i < size; i++)
{
char c = cdataArray[i];
switch(c)
{
case '<':
replace('<', i, "<");
break;
case '>':
replace('>', i, ">");
break;
case '\'':
replace('\'', i, "'");
break;
case '"':
replace('"', i, """);
break;
case '&':
replace('&', i, "&");
break;
}
size = cdataArray.length;
}

return new String(cdataArray);
}

private static void replace(char replace, int index,
String replacement)
{
temp = new char[cdataArray.length -1 + replacement.length()];
int tempLength = temp.length;

int i = 0;
for(; i < index; i++)
{
temp[i] = cdataArray[i];
}

char[] replacementChars = replacement.toCharArray();
int replacementCharsLength = replacementChars.length;
for(int j = 0; j < replacementCharsLength; j++)
{
temp[i] = replacementChars[j];
i++;
}

index++;
for(int k = i; k < tempLength; k++)
{
temp[k] = cdataArray[index];
index++;
}

cdataArray = temp;
}

}

I mean this has got to be faster right? I mean, I'm not using objects, I've severely culled my method calls which can be quite costly in performance and I've created local variables for the length values for arrays to cut down on referencing. The compare values are char literals instead of sitting in a synchronized hashing function class (ala HashTable). Maybe I've just been up too long but I have no idea why the char array version is slower.

They both functionally perform the same operation and they both work. By the way, what they are doing is accepting a string (of any size), scanning it for illegal XML characters ('<', '>', '&', ''', '"') and replacing them with entities ('<', '>', '&', ''', '"'). and returning the "legal" string.



Bored? Any takers?

a.

 
author and iconoclast
Posts: 24207
46
Mac OS X Eclipse IDE Chrome
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Is there some reason you're not using the String.replace() method? The vast majority of your code could be replaced by a short loop that calls this method once for each escapeable character.
 
author
Posts: 23956
142
jQuery Eclipse IDE Firefox Browser VI Editor C++ Chrome Java Linux Windows
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

After all, I'm using a Hashtable (synchronized), Enumeration (also synchronized), StringBuffer (another one that's synchronized) a couple of while loops, two method calls within the class and a whole whack of method calls to other objects of the other classes (the synchronized ones).



I am assuming that you are using JDK 1.5, since you are using generics.

First, with JDK 1.5, synchronization has been greatly improved. In fact, the time to obtain an uncontended lock is almost non-existant. Since you are single threaded, all of your lock grabs are uncontended.

Second, the optimization has also been improved. I too, am "old school", but have learned that it is not very productive, trying to beat the optimizer.

Henry
 
author
Posts: 14112
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Moving to our Performance forum...
 
Arron Ferguson
Greenhorn
Posts: 28
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Thank you for the replies. Sorry I didn't put this in the performance forum in the first place. But I still haven't seen any silver bullet that clearly, definitively explains why this happens. On the contrary, the rabbit hole is getting deeper. Ernest, you suggested the replace method in the String class. I assume you're talking about the:



At line 2018 (String.java). This replace method (if you look through the JDK src), calls Pattern.compile(...), which, in turn, creates a new Pattern object (line 840 in Pattern.java) which calls its own private constructor (line 1115 of Pattern.java), which calls its private compile method, which calls normalize, which finally calls up sun.text.Normalizer by saying:



Although there's no src for the Normalizer, this isn't by any means where the buck stops since this method (still inside of the normalize method) calls up a StringBuilder (actually a couple of them). Although I don't have a code analyzer handy, I suspect that the "Java-way" of doing this creates dozens, probably even hundreds of objects to get the task done - the Pattern class alone has 81 inner classes.

Out of all of the classes that I've seen referenced (I can't trace the Normalizer since it's source code is not available), only the String class's method is native and it's only called in toUpperCase and toLowerCase - neither of which I'm making use of.

Even more confusing than it was before now that I've looked through some of the code.


... Why oh why didn't I take the blue pill
 
Ernest Friedman-Hill
author and iconoclast
Posts: 24207
46
Mac OS X Eclipse IDE Chrome
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
As you've recounted, compiling a regular expression can take a lot of work. But you can compile it once, and use it many times, and you gain back some of what you've lost in efficient search and replace. In your case, your "regular expressions" are just simple string matches, and so compiling them is cheap, anyway.

I don't actually know about Java's regexp engine, but a decent regexp engine can use something like the Boyer-Moore search algorithm to find matches. Boyer-Moore is considerably faster then the "look at every character" technique you're using for search strings longer than one character; in your case, though, I guess that's a wash.

But my point is that the first thing you want to do when you're coding is to write the simplest possible routine, one that uses the standard library when possible. When you run the program, if it is determined that the performance isn't good enough, then and only then should you start trying to fine-tune things by replacing standard algorithms.

It seems like you've got a fear of small objects, and a fear of allocation, and a fear of function calls, and a fear of a whole bunch of little things that JVMs are very deliberately optimized so as to make them as cheap as possible. Don't worry about allocating lots of little objects until you've written a program, profiled it, and measured that it's too slow because it's creating lots of little objects. Worrying about it ahead of time just leads to crufty, unnecessarily complicated code that people will curse when it comes time to maintain.
 
Arron Ferguson
Greenhorn
Posts: 28
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Ernest,

Thanks for that reply. Yes, you are right about the regular expressions - compiling them is good and is what the runtime does. Of course the whole treatment of strings in Java is quite specialized too.

I'm not really afraid of objects, function calls, etc., believe it or not this was one of my few non-object solutions but it was one of those things that was "bugging me" why it gave such bad performance. This has reminded me of just how far it's all come and once again that it's very difficult to outsmart the VM.
 
Greenhorn
Posts: 5
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Well you are creating a new char array in a loop and everytime you find a character you want to replace you are copying the whole string!
Try creating a single char array at the beginning which is twice the size you think you will need, then for each char either copy it or the replacement in (and resize the output array only if necessary and double the size if you have to resize it at all). Then you can contruct a string from the section of the array you have used. Also you probably should store the replacement strings as char arrays if you are just going to call toCharArray on them.

Actually I have a vague memory that toCharArray isn't that fast so you could be better off switching on string.charAt and appending to a single StringBuilder.
 
Ranch Hand
Posts: 37
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Hi,

in your "old-style" solution you copy lots of characters from one array to another in a for-loop, like

You could try to replace this with System.arraycopy calls and see if it is still much slower than the HashMap-based solution.

And apart from that, lets count how often you loop over the whole cdata string. You do it once in replaceIllegalCharacters plus once for each offending character. The HashMap-solution loops only 5 times.

Harald.
 
It runs on an internal combustion engine. This ad does not:
We need your help - Coderanch server fundraiser
https://coderanch.com/wiki/782867/Coderanch-server-fundraiser
reply
    Bookmark Topic Watch Topic
  • New Topic