Correct way to upload large data file

 
Varun Chopra
Ranch Hand
Our client has a user-facing web application running on JBoss. There is a separate admin application (in its own EAR) deployed on the same JBoss server as the user-facing application.
They need a screen to upload a large amount of data into the database. Their original files were Excel spreadsheets, each over 60 MB.
We suggested the following:

a. Change the upload format to CSV - this brought the file sizes down to 25-30 MB
b. Make the upload process asynchronous via an MDB (message-driven bean), so that the admin web app does not stop responding while the data is processed (a rough sketch follows)
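For illustration, the MDB half of (b) might look something like this - a minimal sketch only, assuming a JMS queue at "queue/uploadQueue" and a hypothetical CsvImporter helper:

import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;

// Sketch only: the upload servlet saves the file to disk, returns
// immediately, and posts the saved file's path to this queue.
@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationType",
        propertyValue = "javax.jms.Queue"),
    @ActivationConfigProperty(propertyName = "destination",
        propertyValue = "queue/uploadQueue")
})
public class CsvUploadProcessor implements MessageListener {

    public void onMessage(Message message) {
        try {
            // The message carries only the path of the uploaded file.
            String filePath = ((TextMessage) message).getText();
            new CsvImporter().importFile(filePath); // hypothetical helper
        } catch (Exception e) {
            // Real code would log and retry, or park the file for inspection.
            throw new RuntimeException(e);
        }
    }
}

The servlet's only job is to save the file and enqueue its path, so the HTTP request returns in milliseconds regardless of file size.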

We also suggested the following:

a. Host the admin app on a different machine, so that the user-facing site does not slow down during data processing
b. Provide an incremental upload feature so they can upload files in chunks of 4-5 MB, especially since they have to use a web page to upload such files - they don't buy this argument, though
c. Make data processing a separate script rather than part of the admin web application: they FTP files to a designated location and the script processes them (see the watcher sketch after this list)
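Suggestion (c) could be a small standalone Java program; a sketch under assumptions - files land in /data/uploads via FTP, and a hypothetical processFile routine does the real import:

import java.nio.file.*;

// Standalone watcher (Java 7+): processes files dropped via FTP,
// completely outside the admin web application's JVM.
public class UploadDirectoryWatcher {

    public static void main(String[] args) throws Exception {
        Path dropDir = Paths.get("/data/uploads"); // assumed FTP target
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dropDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watcher.take(); // blocks until something arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                Path file = dropDir.resolve((Path) event.context());
                processFile(file);
            }
            key.reset();
        }
    }

    private static void processFile(Path file) {
        // Placeholder: parse the CSV and load it into the database here.
        System.out.println("Processing " + file);
    }
}

Because it runs in its own JVM (possibly on another box), it cannot starve the web applications of CPU.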

I have the following questions:

Q1 - Have you seen uploads of such large data files to a web application? Sites like Zoho CRM or Salesforce don't seem to support data imports of this size - the imports mostly fail or the site stops responding.
Q2 - Is there a set of guidelines/best practices for uploading large data files of this nature? How do insurance companies and others with enormous data sets accomplish such tasks (what is the architecture of such programs)?

Thanks
 
Ulf
Rancher
Just to clarify, you're asking about uploading, but it seems that the problem is actually with the processing?

While a 30 MB (or 60 MB) file isn't exactly small, I don't think uploading it should cause any particular problem, assuming the I/O on the server is written efficiently. (If you are concerned about file size, CSV can be compressed considerably with ZIP or GZIP.)
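Unpacking a gzipped upload on the server is only a few lines with the standard library; a rough sketch (Java 7's try-with-resources; file names and error handling assumed):

import java.io.*;
import java.util.zip.GZIPInputStream;

// Unpack an uploaded .csv.gz to a plain CSV before processing.
public class GunzipUpload {

    public static void gunzip(File gzipped, File target) throws IOException {
        try (InputStream in = new GZIPInputStream(new FileInputStream(gzipped));
             OutputStream out = new FileOutputStream(target)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}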

Processing is a different matter. Certainly, if you run CPU-intensive jobs on a web server, it's going to impact the performance of the site. There are lots of ways to tackle this: use a different machine, use "renice" to make sure the job doesn't eat a lot of CPU time, investigate why it uses so much CPU time and try to make it more efficient, move the processing to nighttime so it doesn't impact users, ...
 
Varun Chopra
Ranch Hand
Thanks for the reply, Ulf. You are right, the main problem is the processing on the server. And zipping the file before upload will reduce the load on the server when receiving/saving the file before processing starts.

Thanks for the other ideas as well. Renice seems like a good option, and I am definitely advocating a separate/dedicated machine for such processes, because they won't agree to uploading/processing the data at night.


 
Winston Gutkowski
Bartender

Varun Chopra wrote:Thanks for the reply, Ulf. You are right, the main problem is the processing on the server. And zipping the file before upload will reduce the load on the server when receiving/saving the file before processing starts.


Right, but compression does add a bit to processing (CPU) time, which may not help if the process is CPU-bound rather than IO-bound (IO-bound being, admittedly, the more common case).

Thanks for the other ideas as well. Renice seems like a good option...


Hmm. It might do, but personally I'd only use it as a last resort.

Assuming that it's a cousin of 'nice', you have a few things to consider:
1. It's probably only going to work on Linux/Unix.
2. It may compromise the system's ability to schedule all processes optimally, and may also add some CPU overhead of its own.
3. It may not do anything at all. As I recall, nice doesn't guarantee that the OS will change the time-slicing for your process; it's simply a suggestion, rather like Java's System.gc() call.

You're usually fine if you only use it to decrease priorities; but you can impact global system performance if you use it to increase them.
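If you'd rather stay inside the JVM, a loose analogue is Thread.setPriority(), which comes with exactly the same caveat - the priority is only a hint that the scheduler is free to ignore. A minimal sketch (the processing logic itself is assumed):

// In-JVM cousin of renice: run the import on a low-priority thread.
// Like nice, the priority is only a hint; the scheduler may ignore it.
Thread importJob = new Thread(new Runnable() {
    public void run() {
        // long-running CSV processing goes here
    }
});
importJob.setPriority(Thread.MIN_PRIORITY);
importJob.start();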

However, it's been a while since I was a sysadmin on Linux, so I may be out-of-date.

HIH

Winston
 
Napa
Ranch Hand
Use load tools specific to the database for loading large files. (If more data logic is involved, you might need an ETL script to trigger the loading process.)

If you need a screen, the ETL script needs to report job start times, end times, file locations, etc.
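To make that concrete: every major database ships a bulk loader that is far faster than row-by-row inserts, and some can be driven from plain JDBC. A sketch assuming MySQL (the table, file path, and the allowLoadLocalInfile connection property are assumptions; Oracle's equivalent is SQL*Loader, PostgreSQL's is COPY):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Trigger the database's native bulk loader from plain JDBC (MySQL shown).
public class BulkLoad {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb?allowLoadLocalInfile=true",
                "user", "password");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "LOAD DATA LOCAL INFILE '/data/uploads/customers.csv' " +
                "INTO TABLE customers " +
                "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' " +
                "IGNORE 1 LINES"); // skip the CSV header row
        }
    }
}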


 
Varun Chopra
Ranch Hand
Thanks, Winston and Napa, for your thoughts.
I was trying to understand how other systems do this and what the right architecture is. I am sure this is a common problem.
 
Winston Gutkowski
Bartender

Varun Chopra wrote:I was trying to understand how other systems do this and what the right architecture is. I am sure this is a common problem.


Yes it is, but unfortunately there's no single "right architecture". Believe me, if there were, its inventor would be as well-known (and rich) as Bill Gates.

Like all "throughput" problems, you need to study your current process and architecture in detail and work out where the bottlenecks are.

Pretty much everything Ulf gave you in his post is a well-known remedy. Of them all, I would say that compression is the most generic; but if most of your payload is made up of, for example, JPEG images, it probably won't make any difference at all.

Throwing extra hardware (e.g., dedicated servers) and memory at the problem can also be very cost-effective, but you need to be sure you do it rationally. A dedicated server, for example, can be badly let down by a lack of connection/network bandwidth.

Another possible alternative along the same lines: partition your system.
Most modern Linuxes, and many Unixes (I don't know about Windows), allow you to split a box into multiple virtual machines that behave like separate systems. This lets you concentrate hardware upgrades on a single box, and put critical "pipeline" processes in separate VMs which communicate at memory speed.

But, as I say, it's just an alternative.

HIH

Winston
 