This week's book giveaway is in the Other Open Source APIs forum. We're giving away four copies of Storm Applied and have Sean Allen, Peter Pathirana & Matthew Jankowski on-line! See this thread for details.
Usually I can find solutions to problems by just searching this site; failing in that, a Google search usually works. But this time I've been foiled at every turn. So I'll see if you folks can offer a solution.
My problem is this: my program takes two XML files as inputs. It has been reading them happily for more than a week now. But today I did something that radically changed things. Whereas previously the two XML files were read and parsed in less than a second, now they're taking 10 to 20 seconds.
My web searches found one possible solution: disable the auto-checking of the parser by adding these lines of code:
A number of sites offered this solution, but it didn't work for me. Another site had a suggestion about using BufferedInputStreams rather than plain old everyday InputStreams. That didn't help, either.
What strikes me as most significant is the fact that XML parsing was working just fine earlier. So I decided to revert to an earlier version that I know worked just fine. Using Time Machine I restored the version I had before I started working this morning. Then, to be doubly safe, I restarted my Mac, relaunched Eclipse, and ran it with the old code. The same thing happened! It was still running slow!
At this point I really freaked. Trying to find SOME sort of firm ground, I brought up an old project that reads an XML file. It ran just fine, parsing the XML file lightning fast. So I set to work trying to divine what possible difference there was between the two XML parsing routines. I found one tiny difference; I duplicated it in the slow program and it didn't help at all. The two programs diverged where they started extracting specific data, but the basis structure of both is identical. Here's the version that ran slow:
At this point, my best explanation is that my computer has been taken over by an international gang of sadists who like to mess with people's minds.
Does anybody have any better explanations?
Does your xml files contain any xsd or dtd declarations? Are those xsd/dtd files hosted locally (within some jar) or are they somewhere on the internet? Sometimes, when parsing an xml, the parser goes looking for the xsd/dtd and that can usually take time or even fail (in case of no network connectivity).
There are various ways, you could narrow it down, the use of a profiler is one option.
Isn't that about as plain vanilla as I can get? I tried cutting out the URL in the second line, leaving merely <expressionSet>, but that didn't help, either. The thing is still horribly slow. I just *know* that this has to be something stupid on my part, some obvious clunker... but I can't figure it out.
Yes, that's just a namespace URI, so it doesn't correspond to an actual URL and the parser won't attempt to resolve it as a URL. But there might be other actual URLs which are being resolved by the parser elsewhere in the system. So Jaikiran's idea of network access might still be the answer.
I was going to suggest running something which tracks your network traffic, and that might be the ultimate answer, but at first it would be simpler to make sure you are not connected to the web, and then try timing the parse again. If the network theory is right then either it should hang, or else terminate with an error.
To rule out one other external factor, make sure you don't have any anti-virus or some such firewall kind of software running (especially on Windows OS) when this program is running. I say this because I have seen some posts where Java programs ran slower than usual when the anti-virus software was up.
If that doesn't change things, then I think you might want to attach profiler or even add some System.out.print messages at relevant points within the program to print out the timing information for each of those important blocks in the code. That will help you narrow down which part of the program is taking time.
Lying in bed awake at 5:00 AM, an idea occurred to me: what if I'm using libraries that clash? Here's the relevant portion of my imports:
This appears to be the same as that on the code that parses the XML quickly. Still, I worry that there are two different sources for parsing XML. I ran through the various declarations and discovered that there are multiple versions of "Document", "Node", and other such classes; it appears that there are versions from the javax library as well as from org.w3c.dom.css. Are there any guidelines as to organizing these? Looking over the Java docs at http://docs.oracle.com/javase/7/docs, I found that there are some major differences between the javax versions and the w3c versions. I searched around this site and found nothing. Searching around the Internet, I came across this article http://www.vogella.com/tutorials/JavaXML/article.html, which urges me to use StaX, and provides sample code. I am tempted to try this, but I'd like to get some advice on this before I take the leap. Is this fellow correct in asserting that the DOM and SAX approaches are obsolete? Is StaX the "only way to fly" these days?
I'll get to work on documenting time delays at various parts in the code.
Wow! Here's a shocker: the slowdown has nothing to do with XML -- it's with this small routine:
When I run this, there's a delay of perhaps 4 seconds between "Kb" and "Kc" the first time it's called, followed by delays of a second or two on subsequent executions. In other words, it's all in the library call "ImageIO.read(tFile)". What in the world is going on? These files are all less than 50KB. They're in the standard res directory along with all the other data sources. This code has been executing lickety-split since Day One, yet now it's suddenly gotten lazy.
I'm at quite a loss to explain this change. I'll continue digging.
Have you checked memory? Many times, memory leaks(or just plain overusage of memory) manifests as perofrmance problems.
The thing about Java is that the GC is always running in the background. ANd when GC thinks that it needs to do a major collection, it will pause everything and do a major collection. For the most part, if you aren not leaking memory, and not overusing memory, GC behaves. It will run only when it absolutely has to. However, what happens when you have too many objects in memory is that that GC keeps running again and again. It's not uncommon for a memory leak to cause a program to spend 95-99% of processing time in GC. I've known programs that have run days with reduced throughput before they reported OOM.
If you feel like a gang of sadists have taken over your program and is making everything run slow, then it's worth checking memory. The sadist might just be GC!
Of course! Yes, that's easily the best explanation. I'll check out memory usage and report back.
[5 minutes later]
Well, that didn't last long. The whole thing takes 574KB. That's the entire directory containing source code, object classes, images, XML files -- everything that is in any way connected to this project. Still, I'll go through and calculate total memory allocations.
[5 minutes later]
No, total memory allocations are in the range of hundreds of K at the most. The greatest storage is going for images, which amount to less than 400K.
Rats. I thought we were on to something.
Now, the perp line of code (bi = ImageIO.read(tFile);) first executed around 08:35:15, and was used several times until 08:35:28. As you can clearly see from the graphs, heap memory shot up to 15 MB at that time, and then quickly fell when it ended. The garbage collector kicked in once, it would seem. But look at CPU usage: it peaked at 100% during the time when ImageIO.read() was at work. So we're definitely NOT waiting for some URL to connect. Somehow ImageIO.read() is getting hung up.
I'll keep gathering information on this as it occurs to me.
Yet more information: I have learned that the amount of time taken by ImageIO.read() is roughly proportional to the size of the image, with a proportionality constant of 0.2 seconds per KB. My images run about 80 KB in size, so that explains why it takes 16 seconds for those big images. The small images, however, take almost no time. Indeed, it looks to me as if the line has an X-intercept of about 12 KB, meaning that it loads any image smaller than 12 KB in a time so short that I can't perceive it.
Could it be that my ImageIO lib has been corrupted?
Well, yeah it makes perfect sense that reading an image will be a CPU and memory intensive operation, and also the amount of time and memory spent in loading the image is going to be proportional to the size of the image. There is no surprise there.
The question is
a) is ImageIO slower than what you expect it to be? if you think that's the case, then make a SSSCE that just runs ImageIO in a loop and measures the throughput of ImageIO is isolation and compare it with ImageIO in your program
b) can you make ImageIO any faster? Depending on what file formats you are using, you might find that you might be able to load images faster by changing the format
c) do you need to do ImageIO in the critical path? Maybe you can preload the images
Mr. Lalwani, I don't think you've gotten the essence of my problem. This behavior in ImageIO.read() is nowhere near what it should be. If this were standard behavior, then loading a typical digital photograph would consume about 30 minutes. That's absurd. There is something wrong with my own implementation. I am beginning to suspect that the problem lies with a corruption in my libraries. I am now investigating that possibility.
It all depends. a digital photograph stored as a GIF file is going to take a lot longer to load than a photograph stored as a JPEG file. Image IO is a lot more complicated than the API lets on. There are a lot of variables that determine how a particular file will perform. The API straight up hides all the complexity from you. The problem could be the format of the files.
At this point, I would make a small sample that does ImageIO and nothing else. You can play around with differrent kind of files, and also make check whether the problem is in the ImageIO library itself or in the way you are using it.
Well, I have solved the problem, but I don't know how. I replaced everything in the workspace with the stuff from January 13th. The project that has given me so much trouble didn't exist on January 13th. I re-assembled the project from scratch, copying and pasting text from the old source files (which I had saved) into the new source files. Now everything works as before -- the whole thing loads in a flash.
I call this a "Neanderthal solution" -- I have no idea why it worked, but it did work, which is good enough for now. Nevertheless, I shall be very careful about keeping track of exactly what changes I make.
Thanks for the suggestions. I now believe that this problem really wasn't with Java, but was instead somehow idiosyncratic with my system.
Next time, I would recommend to try to get some thread dumps to analyze what were the states of the thread. This way would be easier to find out. I already got some problem of the lack (or very restrict) of internet access was the root cause (due the XML validation).
I am increasingly confident that problem lay with my computer, a 2006 Mac Pro running OS 10.7.5. I had noticed that it was having more and more problems with the latest software. I finally broke down and bought a new iMac, and now everything is working flawlessly -- except, of course, for those things that I screw up all by myself.