Hi there, I have quite a tricky problem with some text transformation that's happening in a class I'm working with. This is more a request for comments than a request for a solution.
Anyway, I have a utility class which takes a String of arbitrary length and transforms it in several different ways before returning a String result. The class is called TextParser and does some of the following things:
1) Transforms plain text into HTML.
2) Uses a number of regular expressions to identify hyperlinks, email addresses, names, dates and so on, and marks these up in HTML.
My problem is two-fold:
1) For very large Strings the time taken for all of the regular expressions to run is extremely high. For example, for a String taken from a 2.2MB file, the time taken for the class to finish is 44 minutes. This is obviously a major problem if we expect to be able to deal with files of this size.
2) For very large Strings combined with server heap sizes of around 128MB, this class often causes OutOfMemoryErrors. These always originate from parts of the class that use StringBuffers, particularly areas which do some form of find/replace operation. This problem was also found with the above 2.2MB file. Again, this is a problem if we expect to be able to deal with large files. It's no good simply allocating more memory if in a month someone causes the same problem by submitting a monster 8MB file.
I have pondered a few solutions to these problems and would appreciate any comments or observations any of you have.
1) Find a new regular expression library. I've had a look around and I've seen several. At the moment we are using the Sun regex package classes, which to my surprise perform considerably well compared to other Java regex libraries. The only other 2 which seem to perform better are:
2) Possible use of streams and temporary files. I was wondering if I could somehow stream a large string in from the source file/db rather than sticking it all in memory in a dirty great big StringBuffer. Then maybe I could use regular expressions to perform find/replace operations on the part of the string in memory. I have no idea where to start with this.
3) Stop using StringBuffer. It's synchronized. Why is it synchronized, ffs? I can't use StringBuilder from 5.0, as we haven't migrated to 5.0 yet, but I was thinking of writing a StringBuffer class of our own that wasn't synchronized. That still doesn't solve the problem of having very large Strings in memory though, as it could still cause OutOfMemoryErrors given a large enough string.
Maybe instead of using regular expressions, you could use a lexer? That would process the string one character/one token at a time, which should be considerably faster than trying to match the whole string to a complicated regexp. One lexer I like, and found easy to start with, is JFlex. [ December 07, 2005: Message edited by: Ulf Dittmer ]
If you don't have any control over the file size, then no, you shouldn't be reading the whole thing into memory. If you know the tokens you're looking for will never span multiple lines, your best bet is to read it in, process it and write it back out one line at a time. Otherwise, you can probably find another good breaking point, like two or more consecutive linefeeds.
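To make the line-at-a-time idea concrete, here is a rough sketch. It assumes tokens never span a line break, and the class name, method names and the URL pattern are all made up for illustration (the real TextParser patterns will differ). It also does no HTML escaping of the plain text, which a real version would need.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.regex.Pattern;

class LineByLineMarkup {
    // Hypothetical, deliberately loose URL pattern; compiled once and reused.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    // Reads one line, marks it up, writes it out, and forgets it.
    // Only the current line is held in memory, never the whole document.
    static void process(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(markUp(line));
            writer.write('\n'); // normalises line endings to \n
        }
        writer.flush();
    }

    // Wraps every URL on the line in an anchor tag; $0 is the whole match.
    static String markUp(String line) {
        return URL.matcher(line).replaceAll("<a href=\"$0\">$0</a>");
    }
}
```

In practice the Reader and Writer would be a FileReader on the source and a FileWriter on a temporary file, so memory use is bounded by the longest line rather than by the file size.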
You say you're using several regexes; I assume that means you're making several passes over the text, one for each regex. If that's the case, you should combine all the regexes into one big alternation, and only do one pass. Using capturing groups, you can determine what kind of token you matched afterward. Doing the replacements this way requires using lower-level Matcher API like group(), appendReplacement() and appendTail(), but it can speed things up enormously. I'll elaborate if you're interested.
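A minimal sketch of that single-pass approach, in case it helps. The two sub-patterns are simplified placeholders, not production-quality URL or email regexes, and the class name is invented. Note that Matcher.quoteReplacement() only exists from Java 5 onward; on 1.4 you would have to escape backslashes and dollar signs in the replacement yourself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SinglePassMarkup {
    // One alternation, one pass. Group 1 = URL, group 2 = email address.
    // Both sub-patterns are illustrative placeholders.
    private static final Pattern TOKENS = Pattern.compile(
            "(https?://\\S+)" + "|" + "([\\w.]+@[\\w.]+\\.\\w+)");

    static String markUp(CharSequence text) {
        Matcher m = TOKENS.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            if (m.group(1) != null) {
                // The URL alternative matched.
                replacement = "<a href=\"" + m.group(1) + "\">" + m.group(1) + "</a>";
            } else {
                // Otherwise the email alternative matched.
                replacement = "<a href=\"mailto:" + m.group(2) + "\">" + m.group(2) + "</a>";
            }
            // Copies the unmatched text since the last match, then the replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        // Copies whatever follows the final match.
        m.appendTail(out);
        return out.toString();
    }
}
```

Whichever group is non-null tells you which kind of token was found, so each extra token type costs one more alternative and one more branch rather than another full pass over the text.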
Also, your regexes may be slower than they have to be. No insult intended, but it's very easy to shoot yourself in the foot with regexes.
Switching to another of the "standard" Java regex packages won't help; according to Friedl, java.util.regex is the fastest one overall. As for the BRICS package, I agree with you about the documentation--I can't even see how to retrieve the matched text! That package wasn't written for mere mortals like us.
Oh, and the fact that StringBuffer is synchronized is probably not a consideration; synchronization isn't nearly as expensive as it used to be. You definitely shouldn't worry about that until you're sure you're handling the IO and the string manipulations as efficiently as possible.
As a newbie to Java (Head First) I'm struggling with the same file/regex issue. Concerning the StringBuilder/StringBuffer issue, I found that the appendReplacement() method of Matcher only accepts StringBuffers, not StringBuilders, even though the Matcher API claims to operate on a character sequence (StringBuilder and StringBuffer both implement CharSequence). The Pattern.matcher() method, however, does accept CharSequences.
In short you can create a Pattern, compile it, match it and have the Matcher find it, but you can't do any replacements. Or (more likely), as a newbie I'm horribly misunderstanding something.
The CharSequence that you supply to the matcher() method is what you match against; the StringBuffer that you supply to appendReplacement()/appendTail() is what you use to build the result of the replacement. The two are completely separate. The match target can be a CharSequence because these classes only read it, never write to it (CharSequence doesn't specify any mutation methods). The appendXXX() methods, on the other hand, require a mutable class. At the time the regex package was introduced, StringBuilder didn't exist yet, so we're stuck with StringBuffer. But, as I said earlier, that probably won't be a problem.
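A small sketch of that separation, with invented class and method names: the match target here can be any CharSequence (a StringBuilder in the usage below), while the result is accumulated in a separate StringBuffer. Pattern.quote() and Matcher.quoteReplacement() are Java 5 additions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchTargetVsResult {
    // Upper-cases every whole-word occurrence of 'word' in 'input'.
    // 'input' is only read; the output goes into a separate StringBuffer.
    static String upperCaseWord(CharSequence input, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        Matcher m = p.matcher(input);            // any CharSequence is accepted here
        StringBuffer result = new StringBuffer(); // appendReplacement needs a mutable target
        while (m.find()) {
            m.appendReplacement(result, Matcher.quoteReplacement(m.group().toUpperCase()));
        }
        m.appendTail(result);
        return result.toString();
    }
}
```

For example, passing new StringBuilder("a cat and a catalog") with the word "cat" returns "a CAT and a catalog", and the StringBuilder that was matched against is left unchanged.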