JavaRanch » Java Forums » Java » Performance

Need advice on improving performance

Maximus Moothiringus
Ranch Hand

Joined: Jun 07, 2008
Posts: 46


Hey all,

I need some help in improving the performance of this program that I am writing.

I have an input file containing about 20000 lines. Each line has 12 fields representing different data. I need to read each line and apply about 80 rules to it which make sure the information in the line is correct (and correct it if not), then pass the information on. All these rules (except a handful) query one of about 25 databases to perform the required check. I need the processing to complete within 10-12 minutes, which is when I receive the next file for processing.

Here is what I have done up to now.

I read the file line by line and store it into an array. For each line, I split it into the 12 fields and then apply all the rules to the line one by one.

Since I figured doing JDBC queries all the time would be a bad idea, I run all the JDBC queries before I start processing the input and pass the data in as Strings. I store each table in a HashMap with the table's primary key as the key and the row content (taken as a string separated by a delimiter) as the value. I have 15 tables stored as HashMaps this way. I get values from them using the

map.get(key)

method, which does what's required pretty well. The only thing I am not able to do is complete the processing within 10-12 minutes.
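For clarity, here is a simplified sketch of what my cache does (the delimiter and contents are simplified; the real rows have 12+ fields):

```java
import java.util.HashMap;
import java.util.Map;

public class RowCache {
    private final Map<String, String> cache = new HashMap<String, String>();

    // Preload: one delimited string per row, keyed by the table's primary key.
    // (In the real program these values come from the upfront JDBC queries.)
    void put(String primaryKey, String delimitedRow) {
        cache.put(primaryKey, delimitedRow);
    }

    // Lookup: every access re-splits the delimited string,
    // so each rule check pays a parsing cost.
    String[] lookup(String primaryKey) {
        String row = cache.get(primaryKey);
        return row == null ? null : row.split("\\|");
    }
}
```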

I ran the application with about 1500 records. This took about 1 minute to complete. I also added a print inside the loop so I could see where the program was at any moment (bad way, I know!). It was running fast enough.

When the input is, say, 15000 (one extra zero, for those who didn't notice), the application takes about 40 minutes to complete. The print inside the loop was also behaving weirdly, and I can't seem to figure out why. After about 2000-3000 lines, the printing halts for about a second after every 10-15 input lines. That means the application is taking a break every 15 records or so. Does this have anything to do with the HashMap that I am using?

Also, I did one more test just to convince myself the HashMap is a reasonable option. There are 15 or so tables which I need to query. As I said, I run the queries, turn the output into delimited strings, and put them in a HashMap. I measured the time taken to put all the data into a HashMap, iterate through it with an Iterator, and print it all. This whole thing took only about 2-3 seconds for all the tables, which comes to about 72000 records.

Can you suggest any improvements I can make to this program? Or any other way the same processing could be done?

Would it help to override hashCode(), or maybe write my own get function for the HashMap? Please suggest.

Please do ask if you need any more clarification on what I am doing.

Thanks in advance

Max


-- Maximus Moothiringus
Preparing for SCJA!!
Finally figured out that giving <br/> splits the profile biography but not the signature
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4646

Just a guess, but it sounds like you are memory thrashing.
Something in your algorithm is O(n^2) or worse.

First, try to reduce N. Try to block the input into smaller units, so N will be smaller.

The standard optimization rules hold: profile the code, see what is slow, fix that, and repeat until done.
Maximus Moothiringus
Ranch Hand

Joined: Jun 07, 2008
Posts: 46

Hmm... that seems to be a decent explanation. I know it's a lot of memory, but I have to get the data processed, and I am sure querying the database all the time won't do much good.

I feel that the current code runs reasonably well. Profiling is something I have not done much, but yes, off to that then.

I will get back with some reports on that.

Max
Peter Taucher
Ranch Hand

Joined: Nov 18, 2006
Posts: 174
Your database cache based on string manipulation (Tokenizer or split?) is probably very bad. Can't you put a collection of Strings (or even better, a specific data-holder object for each table) as the values in your HashMap? In that case you don't need to process delimiters on each lookup.

Like Pat said, generally you'd want to look at your process with a profiler (NetBeans, JProfiler, etc.) to pin down performance problems and/or memory leaks. Creating instances is generally very time-consuming, as is string manipulation (a single line of code may create several new objects).
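For example, a data holder might look like this (the table and column names here are invented, just to show the shape; the point is that the delimited string is parsed once, at load time):

```java
// Hypothetical holder for one row of one of the lookup tables.
public class CustomerRow {
    private final String name;
    private final String region;
    private final int limit;

    public CustomerRow(String name, String region, int limit) {
        this.name = name;
        this.region = region;
        this.limit = limit;
    }

    public String getName()   { return name; }
    public String getRegion() { return region; }
    public int getLimit()     { return limit; }

    // Parse the cached delimited string exactly once per row,
    // when the cache is built - not on every rule check.
    public static CustomerRow fromDelimited(String row) {
        String[] f = row.split("\\|");
        return new CustomerRow(f[0], f[1], Integer.parseInt(f[2]));
    }
}
```

The rules then call the getters directly, with no per-lookup split.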


Censorship is the younger of two shameful sisters, the older one bears the name inquisition.
-- Johann Nepomuk Nestroy
Maximus Moothiringus
Ranch Hand

Joined: Jun 07, 2008
Posts: 46

Yes, I am starting off with some profiling. Just finished installing JProfiler.

Also, Peter, how do you think I can minimize the split() calls that I am doing?

Here's what I am doing right now.
I am getting the string from the HashMap and then splitting it to get each column.

Do you suggest that, instead of putting in Strings, I put in objects with getters/setters matching the database design, so access will be easier?

Max
Peter Taucher
Ranch Hand

Joined: Nov 18, 2006
Posts: 174
Max Moothiringodan wrote:Do you suggest that, instead of putting in Strings, I put in objects with getters/setters matching the database design, so access will be easier?

Yes, that would be my suggestion.
Maximus Moothiringus
Ranch Hand

Joined: Jun 07, 2008
Posts: 46

@Peter
I'll try doing that also..

One other question came up while I was experimenting with the code.

I didn't change any part of the code up to now, but experimented with different input sizes.

I ran the whole code with an input of 1000 lines; it ran in a minute or so.
I ran the code again with 1000 more lines added to the input. I can see clearly that the increase in time is not linear. How does this happen? Is it because of the HashMap or the split()? Or is it both?

Max


-edit-
Oh!! I forgot!! That's what profiling is for!! Back to profiling.
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 15952

One thing that's worth looking at is how "perfect" your hashes are. If the Hashtables are sub-optimal, you'll spend more time chasing overflow links.

Hopefully, you have some way to compile your rules. If you have to interpret them each time you use them, that can be VERY expensive.

Having a periodic "dent" in response is generally something you get when you're running tight on memory and GC kicks in. If there's a way of segregating the processing so that you have a smaller working set at any given time, that may be advantageous.


Customer surveys are for companies who didn't pay proper attention to begin with.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12760
How many database connections do you have open "at one time"?

Are you using a connection pool?

Bill
Maximus Moothiringus
Ranch Hand

Joined: Jun 07, 2008
Posts: 46

William Brogden wrote:How many database connections do you have open "at one time"?
Are you using a connection pool?
Bill

As I said, I am querying the database before the input file is processed. All the results are made into a string and passed into the code that I am referring to. So, to answer your questions there are no database connections that remain open while the processing is happening. The maximum that can happen are some file reads..

Max
Maximus Moothiringus
Ranch Hand

Joined: Jun 07, 2008
Posts: 46

Hey All,

an update.

I profiled the code. The HashMap is functioning fine: the hashes seem to be good and unique, and HashMap gets are pretty fast.

Like Peter said, String manipulation is eating up most of the time. I have to work on that and change the code to implement the object approach that Peter suggested.

@Pat
Yes, I found some places with a lot of calls to the string append functions and split(), which combined eat up a lot of time. I have found a way to reduce them through the use of StringBuffer and so on.
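Roughly, the difference is this (a sketch, not the actual rules code; the fields are made up):

```java
public class LineBuilder {
    // Building output with += creates a new String (and a hidden
    // buffer) on every iteration - O(n^2) copying over the loop.
    static String slowJoin(String[] fields) {
        String out = "";
        for (String f : fields) {
            out += f + "|";
        }
        return out;
    }

    // One StringBuffer reused for the whole line: each append is
    // amortised O(1), so the loop is O(n).
    static String fastJoin(String[] fields) {
        StringBuffer sb = new StringBuffer();
        for (String f : fields) {
            sb.append(f).append('|');
        }
        return sb.toString();
    }
}
```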

@Tim
Yes, JProfiler says GC is running at 60% most of the time. I think the heavy use of Strings allocates and frees a lot of memory, which makes GC run constantly. I intend to use other functions and objects to reduce that.

Will implement these changes and profile and post more updates.

Tim Holloway wrote:One thing that's worth looking at is how "perfect" your hashes are. If the Hashtables are sub-optimal, you'll spend more time chasing overflow links.

Could you please expand on this? Or maybe provide links which I can read through?
Tim Holloway wrote:Hopefully, you have some way to compile your rules. If you have to interpret them each time you use them, that can be VERY expensive.

This one too.


Max
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12760
there are no database connections that remain open


Do I understand you to say you are opening a new DB connection for each line and then closing it?

Bill
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 15952

Tim Holloway wrote:
Hopefully, you have some way to compile your rules. If you have to interpret them each time you use them, that can be VERY expensive.


If you're using something like a Rules Engine or a pattern matcher (such as the regex matcher), there's generally an ability to convert the rules from their native language to a more efficient internal form.

On the other hand, if the rules are hard-coded in Java, the best you can do is optimize the code, since the micro-level optimization is already being taken care of by the Java compiler.
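For the regex case, that conversion just means compiling the pattern once rather than per line (the patterns here are made up, purely to illustrate):

```java
import java.util.regex.Pattern;

public class CompiledRules {
    // Compile once, at class-load time - not inside the per-line loop.
    private static final Pattern DATE =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2}");
    private static final Pattern SPLITTER = Pattern.compile("\\|");

    static boolean looksLikeDate(String field) {
        return DATE.matcher(field).matches();
    }

    // String.split(regex) compiles its pattern on each call
    // (on the JVMs of this era); a precompiled Pattern avoids that.
    static String[] splitLine(String line) {
        return SPLITTER.split(line);
    }
}
```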
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11153

Tim Holloway wrote:One thing that's worth looking at is how "perfect" your hashes are. If the Hashtables are sub-optimal, you'll spend more time chasing overflow links.

I THINK what Tim was talking about you already answered. If your hash isn't good, you'll get lots of entries for a given hash value. For example, if my hash returned '1' 50% of the time, and 2-10 5% of the time each, that '1' bucket would be pretty big. Your code would spend a lot of time searching that one bucket for the thing you actually want.

You seem to imply that your hash is working, and that things are fairly well spread out.
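To illustrate with a deliberately terrible hash (a made-up key class, not your code): the map still answers correctly, but every get degrades from O(1) to a linear scan of one giant bucket.

```java
// Hypothetical key whose hashCode() is constant: every entry
// lands in the same bucket, so lookups scan the bucket linearly.
public class BadKey {
    private final String id;

    public BadKey(String id) { this.id = id; }

    @Override
    public int hashCode() { return 1; }   // every key hashes to '1'

    @Override
    public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id.equals(id);
    }
}
```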


There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
Maximus Moothiringus
Ranch Hand

Joined: Jun 07, 2008
Posts: 46

William Brogden wrote:
there are no database connections that remain open


Do I understand you to say you are opening a new DB connection for each line and then closing it?

Bill

Nope, you got it all wrong. There are, as I said, about 16000 lines I need to process, and for each line I need to verify the data against data from the databases. Since a JDBC query inside the loop for each input would be a huge headache, I do it all before I start processing the input file.

The data in the databases is constant (in the sense that it changes only once every 5-6 months, whereas an input file arrives every 10-15 minutes); it is there just for verification of the input data. I know what conditions I have to use for querying the databases, and I know their primary keys. So every time the databases are updated, I query them, get all the rows, make each into a string, and write them into a file, which I read every time I process an input file. That file I load into a HashMap (each line being one entry, mapped to its key), and I then use get to fetch the whole row. How I process the rest is still to be decided.

So, to answer your question: I use the DB before the input file is processed, meaning I am not opening a DB connection while the processing is happening.





Max
Maximus Moothiringus
Ranch Hand

Joined: Jun 07, 2008
Posts: 46

fred rosenberger wrote:You seem to imply that your hash is working, and that things are fairly well spread out.

Okay, I understand that. When I said my hash is working fine, I was basing that on the JProfiler output for the code. It says the get function takes only 0.something % (a very meagre amount) of the whole time. I don't know if that is just because split() and the other string manipulations take so much time, and thus a larger percentage of the whole.

Max
Kees Jan Koster
JavaMonitor Support
Rancher

Joined: Mar 31, 2009
Posts: 251
Dear Max,

How much memory are you giving this application? How much of that is used? A simple test might be to double the heap space for the app and see if that brings down the GC times.

Kees Jan


Java-monitor, JVM monitoring made easy <- right here on Java Ranch
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11153

do you need to read the whole file into memory? Can you read one line, process it, then read the next?
Maximus Moothiringus
Ranch Hand

Joined: Jun 07, 2008
Posts: 46

fred rosenberger wrote:do you need to read the whole file into memory? Can you read one line, process it, then read the next?

Reading the file and the databases is done by another piece of software, which passes the strings to this Java code that I am writing. That cannot be changed, as far as I understand - business reasons.

Kees Jan Koster wrote:Dear Max,

How much memory are you giving this application? How much of that is used? A simple test might be to double the heap space for the app and see if that brings down the GC times.

Kees Jan

Never fiddled with that one. Also, as I said, I may pass the output on to another system. From what I understand, the Java code most probably goes into a middleware system which calls the function with the required inputs. I am not fully aware of the whole architecture, but I'll try changing that and see the output.

Max

- update
I am now working on changing all Strings to StringBuffers. I will also change all the string manipulations, and try to use objects inside the HashMap.
Kees Jan Koster
JavaMonitor Support
Rancher

Joined: Mar 31, 2009
Posts: 251
Dear Max,

Please test with assigning more RAM and let me know. If you never touched your memory settings, you are probably running with the default (128M I believe). Try giving it a gig or two. Read up on the -Xmx command line switch.
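For example (the jar name here is made up; use however you normally launch the app):

```shell
# Default heap is small; give the JVM more and watch whether GC time drops.
# -Xms sets the initial heap size, -Xmx the maximum.
java -Xms512m -Xmx2g -jar rule-checker.jar input.txt
```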

Kees Jan
Maximus Moothiringus
Ranch Hand

Joined: Jun 07, 2008
Posts: 46

Yep, Will do that.
I will have to try that out on a friend's PC. Mine is a pretty worn out one, with very little processing power.

Later
Max
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12760
Ah - I see I really had the wrong idea of the sequence of events.

If we really have to optimize the memory used, let's look at the need for Strings in the first place. If all your characters are in the ASCII set, i.e. one byte each, perhaps your hash value could hold a byte[] instead of a String, using half the memory. The keys would still be Strings, but the value would only be turned into a String as needed.

How big is the file of extracted database rows? That should give an idea of maximum memory use.

Bill
Maximus Moothiringus
Ranch Hand

Joined: Jun 07, 2008
Posts: 46

William Brogden wrote:If we really have to optimize the memory used, let's look at the need for Strings in the first place. If all your characters are in the ASCII set, i.e. one byte each, perhaps your hash value could hold a byte[] instead of a String, using half the memory. The keys would still be Strings, but the value would only be turned into a String as needed.

Never thought of things like these. I understand that heavy String usage is bad. Would using a byte array be better than objects? Because if I have it as a byte array, I have to convert it back to a String, then split it, and so on. What would you suggest?
William Brogden wrote:
How big is the file of extracted database rows? That should give an idea of maximum memory use.

Of all the tables, only two are big. The rest would be like 400 rows maximum; the biggest one is about 60000 rows. That big table (a HashMap, by the way) is accessed by only one or two rules.

Max
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12760
I meant the file size in bytes. The question was really aimed at figuring the minimum memory requirements to get the whole thing into some sort of data structure in memory.

And there is always the complication that Strings may be intern()ed, thus taking less space whenever values are duplicated.

I think I would wait for the results of trying more memory before trying any tricks.

Incidentally, since HashMaps are Serializable, you might save startup time by storing a serialized object rather than generating the maps anew each time.

Another thought: split a line into a String[] matching your db columns once, and store that as your value in the HashMap. That would take advantage of the intern()ed individual String values and minimize String operations during lookup.

Bill
Let us know how it works out, I love to hear about performance optimizations.
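Something along these lines (the delimiter and names are just for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnCache {
    private final Map<String, String[]> rows =
            new HashMap<String, String[]>();

    // Split each extracted row once, at load time, and keep the
    // String[] (one element per db column) as the map value.
    void load(String primaryKey, String delimitedRow) {
        // limit -1 keeps trailing empty columns
        rows.put(primaryKey, delimitedRow.split("\\|", -1));
    }

    // Rules then index columns directly - no split() inside the loop.
    String column(String primaryKey, int columnIndex) {
        String[] row = rows.get(primaryKey);
        return row == null ? null : row[columnIndex];
    }
}
```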
Maximus Moothiringus
Ranch Hand

Joined: Jun 07, 2008
Posts: 46

Yes, increasing the memory improved performance a bit, and I see that changing from String to StringBuffer makes a lot of difference.

I am still hung up on Pat's comment about what is running in O(n^2). I am profiling the code to find out what's happening.

Will let you know more as and when improvements occur..

-edit-
@Bill,
Yes, putting String[] into the HashMap instead of String makes a lot of sense, and I don't have to change much to get it. That ought to remove the split()s I am doing inside the loop, which could be one factor, per Pat's inference.

@everyone
Would you recommend putting String arrays or special holder objects inside the HashMap? Which would you recommend, keeping in mind both access and memory considerations?


Max
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12760
It occurs to me that you should be building your extract from the database as a Serializable HashMap in the first place - skipping any creation and parsing of strings and never writing that text file.

Let the order of items in the String[] be that of the database columns and you essentially have high speed in memory lookup of a row.

Java serialization always surprises me with how fast it is. In my experiments with phonetic lookup, I serialize a HashMap where the key is a phonetic code and the value is a String[] of words that generate that code: 74,550 common words in an 871KB serialized HashMap.

Bill
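In sketch form (the file name and map contents are made up): write the already-built map once after each database refresh, then read it back at startup instead of re-parsing a text file.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

public class MapStore {
    // Serialize the whole map in one writeObject call.
    static void save(HashMap<String, String[]> map, File file)
            throws IOException {
        ObjectOutputStream out =
                new ObjectOutputStream(new FileOutputStream(file));
        try {
            out.writeObject(map);
        } finally {
            out.close();
        }
    }

    // Deserialize it back; the String[] values come back intact.
    @SuppressWarnings("unchecked")
    static HashMap<String, String[]> load(File file)
            throws IOException, ClassNotFoundException {
        ObjectInputStream in =
                new ObjectInputStream(new FileInputStream(file));
        try {
            return (HashMap<String, String[]>) in.readObject();
        } finally {
            in.close();
        }
    }
}
```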


Maximus Moothiringus
Ranch Hand

Joined: Jun 07, 2008
Posts: 46

William Brogden wrote:It occurs to me that you should be building your extract from the database as a Serializable HashMap in the first place - skipping any creation and parsing of strings and never writing that text file.

Well, I think I mentioned that the class I am writing is actually going to be called by another piece of software, and that software already does all of the above. I don't have to do anything: the jar that I am creating receives all these database rows and the input file as strings. This class is just meant for the checks that I have to perform on the input I get. It is invoked by the other software, which gets all the inputs from the user, queries the databases (reading each row as a string, by the way), and passes everything the jar requires to it as strings. From what I know, the software is pretty good at it: the timings I received show that all this is done in less than 2 seconds, after which it passes everything on to the class that I write.

The output, too: I just have to pass out the output string, and the software decides what to do with it. I think it just writes it out to a file; I am not sure. Meaning what I have is actually strings as input and strings as output, and lots of checks in between.

For ease of explanation, I described my test scenario, where I take the string inputs from a test main function which reads all the database files and passes the strings into the class in question.

I am sorry if that part of the system confused everyone.

William Brogden wrote:Let the order of items in the String[] be that of the database columns and you essentially have high speed in memory lookup of a row.
Bill

Yes, it is so currently. I changed the system to reflect that.



Java serialization is another area I have not experimented with much. Does it mean I serialize each HashMap, store it somewhere, and refer to it later? Will I be able to serialize just these data structures?

Max
Maximus Moothiringus
Ranch Hand

Joined: Jun 07, 2008
Posts: 46

Hi All,

I put in some time to change the code. The HashMap now stores a String array as the value, which means I don't have to call the split function for every check that I perform, for every input line. Once I did this, performance improved greatly.

I changed all the string appends to use StringBuffer. This also contributed to the performance gain, because less memory was churned.

I also increased the heap size when running, which might have helped cut time as well.

Now the whole processing completes within a minute, including all the checks on the inputs and printing out the error messages. This is a really great improvement.

I think if I use intern() and the other suggestions you put forward, I might be able to bring it down even further.

Lessons learnt:

  • Profile your code - find the slow functions and the memory leaks
  • Check how the slow functions can be improved - and if they can't be, how to call them less
  • Check how memory use can be minimised

Thanks everyone for your support!! I think the problem is now resolved!!

Max