Hi there,
I have quite a tricky problem with some text transformation that's happening in a class I'm working with. This is more a request for comments than a request for a solution.
Anyway, I have a utility class which takes a String of arbitrary length and transforms it in several different ways before returning a String result. The class is called TextParser and does some of the following things:
1) Transforms plain text into HTML
2) Uses a number of regular expressions to identify hyperlinks, email addresses, names, dates and so on, and marks these up in HTML.
My problem is two-fold:
1) For very large Strings the time taken for all of the regular expressions to run is extremely high. For example, for a String taken from a 2.2MB file, the time taken for the class to finish is 44 minutes. This is obviously a major problem if we expect to be able to deal with files of this size.
2) For very large Strings combined with server heap sizes of around 128MB, this class often causes OutOfMemoryErrors. These always originate from parts of the class that use StringBuffers, particularly areas which do some form of find/replace operation. This problem was also found with the above 2.2MB file. Again, this is a problem if we expect to be able to deal with large files. It's no good simply allocating more memory if in a month someone causes the same problem by submitting a monster 8MB file.
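To illustrate the kind of find/replace pass in question, here is a minimal sketch. It is not the actual TextParser code: the class name SinglePassMarkup, the two patterns and the generated markup are all made up for the example. It assumes several expressions can be folded into one precompiled alternation, so the input is scanned once and the output is built in a single pre-sized StringBuffer rather than being copied once per expression. It sticks to calls available in 1.4.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SinglePassMarkup {
    // Hypothetical, deliberately simple patterns, compiled once and reused.
    // Group 1 = hyperlink, group 2 = email address.
    private static final Pattern LINK_OR_EMAIL = Pattern.compile(
            "(https?://[^\\s<>\"]+)|([\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+)");

    public static String markUp(String text) {
        Matcher m = LINK_OR_EMAIL.matcher(text);
        // One output buffer, sized up front to reduce reallocation.
        StringBuffer out = new StringBuffer(text.length() + text.length() / 4);
        while (m.find()) {
            // Copy the unmatched text before this match; the empty replacement
            // avoids having to escape '$' and '\' in the markup.
            m.appendReplacement(out, "");
            if (m.group(1) != null) {
                out.append("<a href=\"").append(m.group(1)).append("\">")
                   .append(m.group(1)).append("</a>");
            } else {
                out.append("<a href=\"mailto:").append(m.group(2)).append("\">")
                   .append(m.group(2)).append("</a>");
            }
        }
        // Copy whatever follows the last match.
        m.appendTail(out);
        return out.toString();
    }
}
```

Whether this helps depends on where the 44 minutes actually go. If a single expression is backtracking badly, merging passes won't fix that, so profiling each expression separately would be the first step.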
I have pondered a few solutions to these problems and would appreciate any comments or observations any of you have.
1) Find a new regular expression library. I've had a look around and I've seen several. At the moment we are using the Sun regex package (java.util.regex), which to my surprise performs very well compared to other Java regex libraries. The only other two that look worth considering are:
a) http://jakarta.apache.org/oro/index.html which some benchmarks suggest is slower, and
b) http://www.brics.dk/~amoeller/automaton/ which is blisteringly fast, but with Javadocs I can't get my head around.
2) Possible use of streams and temporary files. I was wondering if I could somehow stream a large string in from the source file/db rather than sticking it all in memory in a dirty great big StringBuffer. Then maybe I could use regular expressions to perform find/replace operations on the part of the string in memory. I have no idea where to start with this.
3) Stop using StringBuffer. It's synchronized. Why is it synchronized, ffs? I can't use StringBuilder from 5.0, as we haven't migrated to 5.0 yet, but I was thinking of writing a StringBuffer class of our own that isn't synchronized. That still doesn't solve the problem of having very large Strings in memory though, as it could still cause OutOfMemoryErrors given a large enough string.
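As a starting point for idea 2), here is a rough sketch of streaming the text through a Reader and Writer so that only one line is held in memory at a time. Everything in it is an assumption for illustration: the class name StreamingMarkup, the single URL pattern, and above all the premise that no match ever spans a line break. It would not work as-is for expressions that need to see more than one line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamingMarkup {
    // Hypothetical pattern; assumes a hyperlink never contains a line break.
    private static final Pattern URL = Pattern.compile("https?://[^\\s<>\"]+");

    public static void transform(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher m = URL.matcher(line);
            int last = 0; // end of the previous match within this line
            while (m.find()) {
                // Write the text between matches straight to the output.
                out.write(line, last, m.start() - last);
                out.write("<a href=\"" + m.group() + "\">" + m.group() + "</a>");
                last = m.end();
            }
            // Write the remainder of the line after the last match.
            out.write(line, last, line.length() - last);
            // readLine() strips the terminator, so every line ends in '\n' here.
            out.write('\n');
        }
        out.flush();
    }
}
```

In practice the Reader could be a FileReader over the source file and the Writer a BufferedWriter over a temporary file, so memory use would depend on the longest line rather than the file size. A file with no line breaks at all would defeat this, and would need fixed-size chunks with some overlap instead.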
Has anyone had any similar problems to these?
Cheers
Jon