File Processing Queries

 
Vaibhav Gargs
Ranch Hand
Hi All,

We have the following requirements:

1. We will receive files at a common path from third-party systems.
2. We will read the files placed at the common path.
3. We will then perform some processing logic on the data.
4. Finally, we will dump the data into the database.

Now, the main problem area is performance and scalability. The file sizes will vary from 10 GB to 60 GB, and these files will contain millions of transactions.

So, my queries are:

1. What is the optimal approach to designing the solution from a performance and scalability perspective?
2. We need to process the files in the minimum possible time, and the file sizes are expected to double in the next 4-5 years, so the application should keep working without any performance issues.

Tech Stack: Java 7, Spring, Oracle DB 11g, IBM WebSphere Application Server

Please share your experiences/thoughts on how to achieve this in the best possible manner. Thanks in advance.

-VGarg
 
salvin francis
Bartender
The implementation can be a simple daemon thread running continuously in the background, checking for a fresh file. When a file is found, it can trigger an event in your code to process the file.

Assuming that the files have no relation to each other, a separate thread can be spawned to process each file, and you can use a thread pool to reuse the threads. Your routine to read the file will depend on what constitutes a single transaction: whether it's one line per transaction or some kind of separator between transactions. A file's read speed will depend on how fast the disk or solid-state device supports reading.
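
For illustration, here is a minimal sketch of that idea in Java 7, assuming the files land in a drop directory like /data/incoming; the path, the pool size of 4, and the processFile() body are placeholders:

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Daemon that watches a drop directory and hands each new file to a pool worker.
public class FileWatcherDaemon {

    private static final Path INCOMING = Paths.get("/data/incoming"); // assumed drop directory
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4); // assumed pool size

    public static void main(String[] args) throws IOException, InterruptedException {
        WatchService watcher = FileSystems.getDefault().newWatchService();
        INCOMING.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watcher.take(); // blocks until a new file event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                    continue; // events were lost; a real system might rescan the directory here
                }
                final Path file = INCOMING.resolve((Path) event.context());
                POOL.submit(new Runnable() { // Java 7: anonymous class, no lambdas
                    @Override
                    public void run() {
                        processFile(file);
                    }
                });
            }
            key.reset(); // re-arm the key so further events are delivered
        }
    }

    private static void processFile(Path file) {
        // placeholder: parse the transactions and write them to the database
        System.out.println("Processing " + file);
    }
}

One caveat: ENTRY_CREATE fires as soon as the file appears, but a 10-60 GB file may still be in the middle of being written. In practice you would want the third party to signal completion, for example by writing to a temporary name and renaming, or by dropping a marker file.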
 
salvin francis
Bartender
On a side note, if this system is being designed right now, don't you think it's better to use the latest Java version? Java 7 was released in 2011 and received its last public update in 2014.
 
Vaibhav Gargs
Ranch Hand

salvin francis wrote: The implementation can be a simple daemon thread running continuously in the background, checking for a fresh file. When a file is found, it can trigger an event in your code to process the file.

Yes Salvin, I have created a poller which keeps polling the common directory for new files, and once a file is found, it triggers an event.

salvin francis wrote: Assuming that the files have no relation to each other, a separate thread can be spawned to process each file, and you can use a thread pool to reuse the threads. Your routine to read the file will depend on what constitutes a single transaction: whether it's one line per transaction or some kind of separator between transactions. A file's read speed will depend on how fast the disk or solid-state device supports reading.

Each file has a header and a trailer record, and in between there can be any number of transactions. So, I am not sure how we can ensure that we read a complete record into the buffer and not an incomplete one. The file sizes are currently around 40-50 GB and are expected to double in the future.
 
Vaibhav Gargs
Ranch Hand

salvin francis wrote: On a side note, if this system is being designed right now, don't you think it's better to use the latest Java version? Java 7 was released in 2011 and received its last public update in 2014.

Unfortunately, we don't have the liberty to upgrade Java. All the systems are running on JRE 7, so we are bound to use that.
BTW, does JDK 8 have any better features for such problems?
 
salvin francis
Bartender

Vaibhav Gargs wrote: BTW, does JDK 8 have any better features for such problems?

Yes, you can read about it here: https://docs.oracle.com/javase/tutorial/essential/io/fileio.html
Specifically, you can look at: https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#newBufferedReader-java.nio.file.Path-java.nio.charset.Charset-
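
For example, a line-by-line read with that API could look like the sketch below; the file path and the handle() method are just placeholders. Note that this newBufferedReader(Path, Charset) overload is already available in Java 7:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LineByLineRead {

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("/data/incoming/transactions.txt"); // assumed input file
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                handle(line); // only one line is held in memory at a time
            }
        }
    }

    private static void handle(String line) {
        // placeholder for per-line processing
    }
}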
 
salvin francis
Bartender

Vaibhav Gargs wrote: ...Each file has a header and a trailer record, and in between there can be any number of transactions. So, I am not sure how we can ensure that we read a complete record into the buffer and not an incomplete one....

So, if I understand correctly, you can read the file sequentially line by line, and when you encounter a specific set of characters, the lines accumulated so far can be converted into a Record object and processed. That's not too difficult, right?
The file size does not matter; you are just reading it a line at a time.
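
Here is a rough sketch of that accumulate-until-trailer idea. The "HDR" and "TRL" prefixes and the Record class are made up for illustration; substitute whatever actually delimits a record in your files:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class RecordParser {

    // hypothetical record type: the transactions between one header and its trailer
    static class Record {
        final List<String> transactions;
        Record(List<String> transactions) { this.transactions = transactions; }
    }

    public static void parse(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            List<String> buffer = null;
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("HDR")) {                          // header: start a new record
                    buffer = new ArrayList<String>();
                } else if (line.startsWith("TRL") && buffer != null) { // trailer: record is complete
                    process(new Record(buffer));
                    buffer = null;
                } else if (buffer != null) {
                    buffer.add(line);                                  // a transaction line
                }
            }
        }
    }

    private static void process(Record record) {
        // placeholder: apply the business logic, then persist to the database
    }
}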
 
salvin francis
Bartender
What if a 60 GB file does not have the trailer record? Will you load the complete file into memory? You probably need a guard condition against these kinds of scenarios.
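
One possible guard is to cap how many transactions a single record may buffer before its trailer shows up, as in this sketch; the limit of one million is an assumption, so pick a bound from your file spec:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Bounded buffer: refuses to keep accumulating if the trailer never arrives.
public class BoundedRecordBuffer {

    private static final int MAX_TXNS_PER_RECORD = 1_000_000; // assumed limit from the file spec

    private final List<String> transactions = new ArrayList<String>();

    void add(String transactionLine) throws IOException {
        if (transactions.size() >= MAX_TXNS_PER_RECORD) {
            throw new IOException("No trailer after " + MAX_TXNS_PER_RECORD
                    + " transactions; file looks malformed, aborting instead of exhausting memory");
        }
        transactions.add(transactionLine);
    }
}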
 
salvin francis
Bartender

salvin francis wrote: ...Yes, you can read about it here: https://docs.oracle.com/javase/tutorial/essential/io/fileio.html...

I stand corrected here: NIO.2 was part of Java 7.

The lines method, which returns a Stream<String>, is part of Java 8:
https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#lines-java.nio.file.Path-
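
For reference, if you do eventually move to Java 8, a streaming pass over a file could look like this; the path and the HDR/TRL prefixes are the same made-up placeholders as above:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LinesDemo {

    public static void main(String[] args) throws IOException {
        // Files.lines is lazy: even a 60 GB file is streamed one line at a time.
        // The try-with-resources closes the underlying reader when the stream is done.
        try (Stream<String> lines = Files.lines(Paths.get("/data/incoming/transactions.txt"))) {
            long txnCount = lines
                    .filter(l -> !l.startsWith("HDR") && !l.startsWith("TRL"))
                    .count();
            System.out.println("Transactions: " + txnCount);
        }
    }
}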
 
Vaibhav Gargs
Ranch Hand
Experts, please share your views and experiences...
 
Marshal
I'd suggest that your best bet, if your current solution doesn't work quickly enough, is to get faster hardware. And make sure that the network connection between your processing machine and the database machine is fast and reliable.
 
Marshal
I merged your stuff with the following thread. I hope that is okay by you.
 
Vaibhav Gargs
Ranch Hand
We are working on proposing a solution for the following system:

1. The system will receive files at a shared path on a daily basis. The file sizes will be huge, ranging from 10-50 GB. File formats can be text or CSV as of now.
2. The files need to be read and dumped into the corresponding database tables after applying some business logic.
3. Once the data is persisted in the database, other systems can invoke our services - SOAP, REST, or MQ - with the appropriate request, and our system will respond.

We are looking for a solution that is performance-oriented and has no bottlenecks as the system scales.

Please share your thoughts on an appropriate design, tech stack, etc.
 