Our client has a user-facing web application running on JBoss. There is a separate admin application (in its own EAR), but it is deployed on the same JBoss server as the user-facing web application.
They need a screen to upload large amounts of data into the database. Their original files were in Excel format, with sizes over 60 MB.
We suggested the following to them:
a. Change the upload format to CSV. This brought file sizes down to 25-30 MB.
b. Run the upload process as an MDB (message-driven bean), so that data is processed asynchronously and the admin web app does not stop responding.
We also suggested the following:
a. Host the admin app on a different machine, so that the user-facing site does not slow down during data processing.
b. We can provide an incremental upload feature, and they should upload files in chunks of 4-5 MB, especially if they have to use a web page to upload such files. They don't buy this argument, though.
c. Data processing can be a separate script instead of a part of the admin web application. They can FTP files to a designated location and this script will process those files.
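To make option (c) more concrete, here is a minimal Java sketch of a drop-folder processor. The class name DropFolderImporter, the FileHandler interface and the incoming/processed directory layout are all hypothetical and only for illustration; the real import logic would go into the handler. It assumes the uploader only places complete files in the folder (for example by uploading under a temporary name and renaming afterwards), which this code does not check.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: processes CSV files dropped into a folder (e.g. by FTP).
final class DropFolderImporter {

    // Hypothetical callback that does the actual import of one file.
    interface FileHandler {
        void handle(Path file) throws IOException;
    }

    // Handles every *.csv file in 'incoming', then moves it to 'processed'
    // so it is not picked up again. Returns the new locations of the files.
    static List<Path> processPending(Path incoming, Path processed, FileHandler handler)
            throws IOException {
        Files.createDirectories(processed);

        // Take a snapshot of the pending files first, so that moving files
        // does not interfere with the directory iteration.
        List<Path> pending = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(incoming, "*.csv")) {
            for (Path file : files) {
                pending.add(file);
            }
        }

        List<Path> done = new ArrayList<>();
        for (Path file : pending) {
            handler.handle(file);
            Path target = processed.resolve(file.getFileName());
            Files.move(file, target, StandardCopyOption.REPLACE_EXISTING);
            done.add(target);
        }
        return done;
    }
}
```

Something like this could be run from cron or a scheduler on a separate machine, which would also cover suggestion (a).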
I have the following questions:
Q1 - Have you seen uploads of such large data files to a web application? From what I have seen, sites like Zoho CRM or Salesforce do not support data imports of this size; they mostly fail or stop responding.
Q2 - Is there a set of guidelines or best practices for uploading large data files of this nature? How do insurance companies, or others with enormous data sets, accomplish such tasks (what is the architecture of such programs)?
Just to clarify, you're asking about uploading, but it seems that the problem is actually with the processing?
While a 30MB (or 60MB) file size isn't exactly small, I don't think uploading it should cause any particular problem assuming that the I/O on the server is written efficiently. (If you are concerned about file size, CSVs might be compressed considerably by ZIP or GZIP.)
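As a rough illustration of the compression idea, here is a minimal Java sketch using the JDK's java.util.zip classes. The class name CsvGzip and its method names are made up for this example; compress would run on the client before upload and decompress on the server before processing. Both stream the data, so the whole file is never held in memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for gzipping a CSV before upload and unpacking it afterwards.
final class CsvGzip {

    // Writes a gzip-compressed copy of 'source' to 'target'.
    static void compress(Path source, Path target) throws IOException {
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
            Files.copy(source, out);
        }
    }

    // Writes the decompressed content of 'source' to 'target',
    // replacing 'target' if it already exists.
    static void decompress(Path source, Path target) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(source))) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

How much this saves depends entirely on the data; repetitive CSV text usually compresses well, but it is worth measuring with a real file.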
Processing is a different matter. Certainly if you run CPU-intensive jobs on a webserver it's going to impact the performance of the site. There are lots of ways to tackle this: use a different machine, use "renice" to make sure the job doesn't eat a lot of CPU time, investigate why it uses so much CPU time and try to make it more efficient, move processing to the night so it doesn't impact users, ...
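To sketch two of these ideas in Java: the hypothetical class below reads the CSV line by line and hands the rows to a sink in fixed-size batches (so memory use stays flat and the sink can, for example, do one batched database insert per call), and it can run that work on a minimum-priority background thread. All names here (BatchedCsvProcessor, process, processInBackground, the sink) are assumptions for illustration. Thread priority is only a hint to the scheduler, and inside an application server you would normally use the container's own facilities rather than start threads yourself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: streams a CSV file and processes it in batches.
final class BatchedCsvProcessor {

    // Reads 'csv' line by line and passes the lines to 'sink' in batches of
    // at most 'batchSize'. Returns the total number of lines read.
    static long process(Path csv, int batchSize, Consumer<List<String>> sink)
            throws IOException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        long rows = 0;
        List<String> batch = new ArrayList<>(batchSize);
        try (BufferedReader reader = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                rows++;
                if (batch.size() == batchSize) {
                    sink.accept(batch);
                    batch = new ArrayList<>(batchSize);
                }
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // final, partially filled batch
        }
        return rows;
    }

    // Runs process(...) on a daemon thread at minimum priority.
    // The priority is a hint only; the OS may ignore it.
    static Thread processInBackground(Path csv, int batchSize, Consumer<List<String>> sink) {
        Thread worker = new Thread(() -> {
            try {
                process(csv, batchSize, sink);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }, "csv-import");
        worker.setPriority(Thread.MIN_PRIORITY);
        worker.setDaemon(true);
        worker.start();
        return worker;
    }
}
```

The right batch size depends on the database and row size, so treat any particular number as something to tune.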
Thanks for the reply, Ulf. You are right, the main problem is the processing on the server. And zipping the file before upload will reduce the load on the server for receiving and saving the file before processing starts.
Thanks for the other ideas as well. Renice seems like a good thing, and I am definitely advocating for a separate/dedicated machine for such processes because they won't agree to uploading/processing data at night.
Varun Chopra wrote:Thanks for reply Ulf. You are right, main problem is processing on server. And zipping the file before upload will reduce the load on server to receive/save the file before processing starts.
Right, but it does add a bit to processing (CPU) time, which may not help if the process is CPU-bound rather than IO-bound (which is, admittedly, more common).
Thanks for other ideas as well. Renice seems like a good thing...
Hmm. It might do, but personally I'd only use it as a last resort.
Assuming that it's a cousin of 'nice', you have a few things to consider:
1. It's only going to work on Unix-like systems such as Linux; it isn't available on Windows.
2. It may compromise the system's ability to process all processes optimally, and may also add some CPU weight.
3. It may not do anything at all. As I recall, nice does not guarantee that the OS will change the time-slicing for your process; it's simply a suggestion, rather like calling System.gc() in Java.
You're usually fine if you only use it to decrease priorities; but you can impact global system performance if you use it to increase them.
However, it's been a while since I was a sysadmin on Linux, so I may be out-of-date.
Varun Chopra wrote:I was trying to understand how other systems are doing it and what's the right architecture. I am sure this is a common problem.
Yes it is, but unfortunately, there's no "right architecture". Believe me, if there was, the inventor of it would be as well-known (and rich) as Bill Gates.
Like all "throughput" problems, you need to study your current process and architecture in detail and work out where the bottlenecks are.
Pretty much everything that Ulf gave you in his post is a well-known remedy and, of them all, I would say that compression is the most generic; but if most of your payload is made up of data that is already compressed, for example JPEG images, it probably won't make any difference at all.
Throwing extra hardware (e.g., dedicated servers) and memory at the problem can also be very cost-effective, but you need to be sure that you do it rationally. A dedicated server, for example, can be badly let down by a lack of connection/network bandwidth.
Another possible alternative along the same lines: partition your system.
Most modern Linuxes, and many Unixes (don't know about Windows), allow you to split one physical machine into multiple virtual machines that behave like separate systems. This allows you to concentrate hardware upgrades on a single box, and to put critical "pipeline" processes in separate VMs that can communicate at something close to memory speed, because the traffic between them never leaves the box.