JavaRanch » Java Forums » Java » Performance

Correct way to upload large data file

Varun Chopra
Ranch Hand

Joined: Jul 10, 2008
Posts: 211
Our client has a user-facing web application running on JBoss. There is a separate admin application (in its own EAR) deployed on the same JBoss server as the user-facing application.
They need a screen to upload large amounts of data into the database. Their original files were in Excel format, with sizes over 60 MB.
We suggested the following to them:

a. Change the upload format to CSV - this brought file sizes down to 25-30 MB
b. Hand the upload off to an MDB (message-driven bean) so the data is processed asynchronously and the admin web app does not stop responding

We also suggested the following:

a. Host the admin app on a different machine so that the user-facing site does not slow down during data processing
b. Provide an incremental upload feature so they upload files in chunks of 4-5 MB, especially if they use a web page to upload such files - they don't buy this argument though
c. Make data processing a separate script instead of part of the admin web application; they can FTP files to a designated location and the script will process them
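Whatever format is chosen, the upload itself should be streamed to disk rather than buffered whole in memory. As a rough sketch (not the client's actual code - the method and buffer size are illustrative assumptions), copying the request stream through a small fixed buffer keeps memory use constant regardless of file size:

```java
import java.io.*;

public class StreamingUpload {
    // Copy an upload stream to disk through a small fixed buffer so the
    // whole file is never held in memory, regardless of its size.
    static long saveToFile(InputStream in, File dest) throws IOException {
        long total = 0;
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(dest))) {
            byte[] buf = new byte[8192]; // 8 KB buffer; tune as needed
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] fake = new byte[1_000_000]; // stand-in for a large upload
        File tmp = File.createTempFile("upload", ".csv");
        tmp.deleteOnExit();
        long written = saveToFile(new ByteArrayInputStream(fake), tmp);
        System.out.println(written);
    }
}
```

In a servlet you would pass the request's input stream (or the part stream from a multipart parser) as `in`; the same copy loop applies either way.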

I have the following questions:

Q1 - Have you seen uploads of such large data files to a web application? In my experience, sites like Zoho CRM or Salesforce do not handle data imports of this size - they mostly fail or stop responding.
Q2 - Is there a set of guidelines or best practices for uploading large data files of this nature? How do insurance companies and others with enormous data sets accomplish such tasks (what is the architecture of such programs)?

Thanks


-Varun -
(My Blog) - Online Certifications - Webner Solutions
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42360
Just to clarify, you're asking about uploading, but it seems that the problem is actually with the processing?

While a 30 MB (or 60 MB) file isn't exactly small, I don't think uploading it should cause any particular problem, assuming the I/O on the server is written efficiently. (If you're concerned about file size, CSV can be compressed considerably with ZIP or GZIP.)
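To put a number on that: repetitive text such as CSV typically shrinks dramatically under GZIP. A minimal sketch using the standard `java.util.zip` classes (the sample data is made up, but the compression ratio is typical for tabular text):

```java
import java.io.*;
import java.util.zip.*;

public class CsvCompression {
    // Gzip a byte array in memory.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    // Decompress a gzipped byte array back to the original bytes.
    static byte[] gunzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Build a synthetic CSV: 10,000 rows of id,name,status
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) sb.append(i).append(",name-").append(i).append(",ACTIVE\n");
        byte[] csv = sb.toString().getBytes("UTF-8");
        byte[] zipped = gzip(csv);
        System.out.println("raw=" + csv.length + " gzipped=" + zipped.length);
    }
}
```

The client would gzip before upload and the server would decompress before processing; the round trip is lossless.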

Processing is a different matter. Running CPU-intensive jobs on a web server will certainly impact the site's performance. There are lots of ways to tackle this: use a different machine, use "renice" to make sure the job doesn't eat a lot of CPU time, investigate why it uses so much CPU time and try to make it more efficient, move processing to the night so it doesn't impact users, ...
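If the processing runs inside the same JVM as the web app, there is a rough in-process analogue of "renice": run the heavy work on low-priority threads so the scheduler favours request threads. This is a sketch of the idea, not anything from the original application, and (like nice itself) thread priority is only a hint to the OS scheduler:

```java
import java.util.concurrent.*;

public class LowPriorityProcessing {
    // Build a pool whose worker threads run at minimum priority, so
    // bulk-import work yields to normal-priority request threads.
    static ExecutorService lowPriorityPool(int threads) {
        return Executors.newFixedThreadPool(threads, r -> {
            Thread t = new Thread(r, "bulk-import");
            t.setPriority(Thread.MIN_PRIORITY);
            t.setDaemon(true); // don't block JVM shutdown
            return t;
        });
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = lowPriorityPool(1);
        // Verify the worker really runs at minimum priority.
        Future<Integer> prio = pool.submit(() -> Thread.currentThread().getPriority());
        System.out.println(prio.get());
        pool.shutdown();
    }
}
```

In a container-managed setup (such as the MDB approach above) you would let the container manage threads instead; this applies only where you control the thread pool yourself.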


Ping & DNS - my free Android networking tools app
Varun Chopra
Ranch Hand

Joined: Jul 10, 2008
Posts: 211
Thanks for the reply, Ulf. You're right, the main problem is processing on the server. Zipping the file before upload will also reduce the load on the server when receiving/saving the file before processing starts.

Thanks for the other ideas as well. Renice seems like a good option, and I am definitely advocating a separate/dedicated machine for such processes, because they won't agree to uploading/processing data at night.


Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8043

Varun Chopra wrote:Thanks for the reply, Ulf. You're right, the main problem is processing on the server. Zipping the file before upload will also reduce the load on the server when receiving/saving the file before processing starts.

Right, but it does add a bit to processing (CPU) time, which may not help if the process is CPU-bound rather than I/O-bound (though I/O-bound is, admittedly, the more common case).

Thanks for other ideas as well. Renice seems like a good thing...

Hmm. It might do, but personally I'd only use it as a last resort.

Assuming that it's a cousin of 'nice', you have a few things to consider:
1. It's probably only going to work on Linux.
2. It may compromise the system's ability to process all processes optimally, and may also add some CPU weight.
3. It may not do anything at all. As I recall, nice does not guarantee that the OS will change the time-slicing for your process; it's simply a suggestion, rather like the System.gc() command in Java.

You're usually fine if you only use it to decrease priorities; but you can impact global system performance if you use it to increase them.

However, it's been a while since I was a sysadmin on Linux, so I may be out-of-date.

HIH

Winston

Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here
Napa Sreedhar
Ranch Hand

Joined: Jan 29, 2002
Posts: 62
Use load tools specific to the database for loading large files. (If more data logic is involved, you might need an ETL script to trigger the loading process.)

If you need a screen, the ETL script needs to share job start times, end times, file location, etc.
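When a native load tool isn't an option and rows go in through JDBC, the usual technique is to split the parsed rows into fixed-size batches for `addBatch()`/`executeBatch()`. The batching step itself needs no database, so here is an illustrative sketch of just that part (the method name and batch size are my own, not from any post above):

```java
import java.util.*;

public class BatchLoader {
    // Split a list of parsed rows into fixed-size batches - the shape
    // needed for JDBC executeBatch() or a bulk-load API.
    static <T> List<List<T>> batches(List<T> rows, int size) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += size) {
            out.add(rows.subList(i, Math.min(i + size, rows.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 2_500; i++) rows.add(i);
        List<List<Integer>> b = batches(rows, 1_000);
        System.out.println(b.size() + " batches, last has " + b.get(b.size() - 1).size());
    }
}
```

Each batch would then be bound to a PreparedStatement and executed inside one transaction, committing per batch so a failure doesn't roll back the whole file.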


Varun Chopra
Ranch Hand

Joined: Jul 10, 2008
Posts: 211
Thanks Winston and Napa for your thoughts.
I was trying to understand how other systems do it and what the right architecture is. I am sure this is a common problem.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8043

Varun Chopra wrote:I was trying to understand how other systems do it and what the right architecture is. I am sure this is a common problem.

Yes it is, but unfortunately, there's no "right architecture". Believe me, if there were, its inventor would be as well-known (and rich) as Bill Gates.

Like all "throughput" problems, you need to study your current process and architecture in detail and work out where the bottlenecks are.

Pretty much everything Ulf gave you in his post is a well-known remedy and, of them all, I would say compression is the most generic; but if most of your payload is made up of, for example, JPEG images, it probably won't make any difference at all.

Throwing extra hardware (e.g., dedicated servers) and memory at the problem can also be very cost-effective, but you need to be sure you do it rationally. A dedicated server, for example, can be badly let down by a lack of connection/network bandwidth.

Another possible alternative along the same lines: partition your system.
Most modern Linuxes, and many Unixes (I don't know about Windows), allow you to split your OS into multiple virtual machines that behave like separate systems. This lets you concentrate hardware upgrades on a single box and put critical "pipeline" processes in separate VMs which communicate at memory speed.

But, as I say, it's just an alternative.

HIH

Winston
 