We are facing 100% CPU utilization on our client's production server. The server always has to be restarted, or else everything hangs. Our investigation found that this happens during the peak usage hours of our application deployed on that server. To find the cause, we got hold of the core dumps generated on the server and used an analyzer to inspect them. Based on the results, we did some code optimization and rework to try to solve the issue. The thing is, this workaround didn't work. From the start we had a hunch that a server hardware upgrade was needed, and we already recommended this to the client. Since they don't want to do it because of the cost, they keep "pushing" us to do whatever we can to optimize the application.
The question is: where do you draw the line between an application optimization solution and a server upgrade solution? When do you say, "OK, this server really needs a hardware upgrade; no amount of optimization in our application can fix this problem"? What kind of smoking-gun proof can be shown to the client to convince them to upgrade their servers?
Also, are there good resources on the net, or books, where I can read and learn about this kind of thing? I was recently promoted to lead architect of our development team and, to be honest, I'm still inexperienced at solving these kinds of issues.
Start with some numbers. What is the CPU power now? Is anything running on it that can be offloaded? Can you add clones? (This costs money because you need another server.) How much does the upgrade or second server cost in comparison with your team's salary? In other words, how many hours/days/weeks do you need to work on "optimizing the code" before it is cheaper to upgrade the server?
That is not to say the answer is always to throw hardware at the problem. You should look at where the bottleneck is. At some point, though, there won't be any low-hanging fruit left, and you need to have the above discussion.
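The cost comparison in that discussion is simple arithmetic. Here is a minimal sketch of it; the function name and all the figures are made up for illustration, so plug in your own hardware quote and team cost.

```python
def break_even_hours(hardware_cost: float, hourly_team_cost: float) -> float:
    """Team-hours of optimization work that cost as much as the hardware."""
    if hourly_team_cost <= 0:
        raise ValueError("hourly_team_cost must be positive")
    return hardware_cost / hourly_team_cost


# Hypothetical figures: a 6,000 upgrade, three engineers at 50 per hour each.
hours = break_even_hours(hardware_cost=6000, hourly_team_cost=3 * 50)
print(f"Optimizing is cheaper only if it takes fewer than {hours:.0f} team-hours")
```

If the honest estimate for the remaining optimization work is well above that number, the upgrade is the cheaper option, and that is a figure a client can follow.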
Your biggest problem is that you are doing this on production. You are asking the right question: "What kind of data could be provided to our client to convince them to upgrade the server instead?" That is exactly the question you should be asking, but it should have been asked before the software was released to the client, not after.
Ultimately it comes down to sizing. In the beautiful little ideal world that exists in my mind when I build scalable software, before engineers release the software to anyone outside their group, they have a good idea of how much hardware they need, how the application behaves under no load, moderate load, and heavy load, and at what load the application will break on the given hardware. Once you have this data, you can go to your client and tell them: this is what we have measured; these are the limits of the application on your hardware; once the load starts crossing those limits, this is how you need to size up your hardware; and under no condition can the load exceed this much.
Yes, there are always improvements that you can make in the software, and once you have this data, you can use it to improve your own throughput. Having this data ready before you deploy the software to the client also makes the client feel confident: yes, we need to invest this much money in the hardware, but the engineers look like they've got their shit together, and we are confident that they will be working on reducing the hardware footprint in the next release.
So, coming to your question, "What kind of data should be collected?" Ultimately, any process running on any computer needs three things: CPU, memory, and IO. Under a given load, it will use some CPU, some memory, and some IO. You will also have some throughput, i.e., under a certain load, doing a unit of work takes some amount of time. A unit of work could be one request, or it could be 100 bytes of data processed; it depends on what your application is. Generally, in a web application, a unit of work is defined as one HTTP request. So, for each HTTP request, you measure how much CPU, memory, and IO it consumes, and how long it takes. This is the low-load performance characteristic of your application: one piddly HTTP request, and you measure how it behaves.
Now, generally speaking, as you increase load, your CPU, memory, and IO usage will increase. Two simultaneous requests will generally use double the CPU, memory, and IO of one request, right? Ten times the load, ten times the resource usage, right? Well, at some point you will hit a brick wall. You are not going to create more memory than what is physically assigned to your process. You are not going to create more CPUs than your hardware physically has. So you need to measure your load characteristics, i.e., how your CPU, memory, and IO usage increase as you increase load.
Oh, and while you are measuring resource usage under load, also measure throughput. Ideally, as long as resource usage is under the maximum, your throughput should stay constant. In practice, there are always bottlenecks in any application, and your throughput might drop slightly as load increases. You need to measure high-load throughput too.
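To make the load measurement concrete, here is a rough sketch in Python that runs a unit of work at increasing concurrency and records wall time, CPU time, and throughput at each level. `handle_request` is a hypothetical stand-in for whatever your real unit of work is, and the concurrency levels and request counts are arbitrary. It only covers CPU and timing; memory and IO would have to come from your OS or monitoring tools.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request() -> int:
    """Hypothetical stand-in for one unit of work (e.g. one HTTP request)."""
    return sum(i * i for i in range(20_000))


def measure(concurrency: int, requests: int = 40) -> dict:
    """Run `requests` units of work with `concurrency` worker threads and
    report wall time, process CPU time, and throughput."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of the whole process
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(handle_request) for _ in range(requests)]
        for future in futures:
            future.result()  # wait for completion, re-raise any error
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return {
        "concurrency": concurrency,
        "requests": requests,
        "wall_s": wall,
        "cpu_s": cpu,
        "throughput_rps": requests / wall,
        "cpu_s_per_request": cpu / requests,
    }


# Step the load up and watch where throughput stops improving.
for level in (1, 2, 4, 8):
    r = measure(level)
    print(
        f"concurrency={r['concurrency']:>2}  "
        f"throughput={r['throughput_rps']:8.1f} req/s  "
        f"cpu/request={r['cpu_s_per_request'] * 1000:6.2f} ms"
    )
```

The level at which throughput stops rising while CPU per request stays the same (or climbs) is the brick wall described above. Against a real server you would drive the load from a separate machine with a load-testing tool, but the numbers you collect are the same.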
This is what you need to give to your client. Basically, you are saying: to service 10K requests per second, I need this much memory/CPU/IO. If you don't have that much memory, go buy some more. If you don't have the money to buy memory, don't accept this many requests. Yes, they can always come back and say, "Oh, I think it needs too much memory. Instead of needing 1M per request, it should take 100 bytes per request." At this point, you either tell them, "Yes, if we do this in 100 bytes per request, these are the tradeoffs we will need to make, and this is how the application will deteriorate," or you tell them, "Yes, I absolutely agree with you. We are working on improving performance, and this is the plan that we have in place to do it. Please invest in hardware or limit your load for some time while we improve our performance."
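Once you have per-request numbers, turning them into a sizing statement is a small calculation. Below is a sketch under simplifying assumptions: resource usage scales linearly with load, memory is held only while a request is in flight (so in-flight requests = requests per second × latency), and IO is ignored. The class, the function names, and every figure are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PerRequestCost:
    cpu_seconds: float      # CPU time consumed by one request
    memory_bytes: float     # memory held while one request is in flight
    latency_seconds: float  # wall time one request takes


def max_requests_per_second(cost: PerRequestCost, cores: int, memory_bytes: float) -> float:
    """Highest sustainable request rate before CPU or memory runs out."""
    cpu_limit = cores / cost.cpu_seconds
    # In-flight requests = rate * latency, and each one holds memory_bytes.
    memory_limit = memory_bytes / (cost.memory_bytes * cost.latency_seconds)
    return min(cpu_limit, memory_limit)


def required_resources(cost: PerRequestCost, target_rps: float) -> tuple:
    """(cores, memory in bytes) needed to sustain target_rps."""
    cores = target_rps * cost.cpu_seconds
    memory = target_rps * cost.latency_seconds * cost.memory_bytes
    return cores, memory


# Hypothetical measurements: 4 ms CPU, 1 MB, and 50 ms latency per request.
cost = PerRequestCost(cpu_seconds=0.004, memory_bytes=1_000_000, latency_seconds=0.05)

limit = max_requests_per_second(cost, cores=4, memory_bytes=2_000_000_000)
print(f"Current box tops out near {limit:.0f} requests/second")

cores, memory = required_resources(cost, target_rps=10_000)
print(f"10K requests/second needs about {cores:.0f} cores and {memory / 1e9:.1f} GB")
```

Real systems rarely scale this cleanly, so treat the output as a starting point for the conversation with the client and confirm it against the high-load measurements.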
Thanks a lot for your comprehensive answer to my question, Jayesh. I appreciate it. Can you recommend any books or online materials where I could learn more about this kind of thing? I feel I will have to deal with more issues like this in the future.