JavaRanch » Java Forums » Java » Java in General
A Design Question

Zahid Maqbool
Greenhorn

Joined: Jun 30, 2011
Posts: 4
Hi,

I have a problem that has left me banging my head against the wall. Anyway, here is my situation.

I need to scan a directory continuously, and as soon as a CSV file arrives in it, I need to process that file and pass some data from it to another application. Since this external application can only process a limited amount of data at a time, I can only send part of the CSV file's data; when it's done processing that data, I need to send some more, and so on and so forth.
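A minimal sketch of that scanning loop, using java.nio.file.WatchService (Java 7+); the folder handling and the println are placeholders for the real parsing stage:

```java
import java.io.IOException;
import java.nio.file.*;

public class CsvDirectoryWatcher {

    // Only plain *.csv files are handed to the next stage.
    static boolean isCsvFile(Path p) {
        return p.getFileName().toString().toLowerCase().endsWith(".csv");
    }

    // Blocks forever, reporting each new CSV dropped into 'dir'.
    // Real handling (parse, load, forward) would replace the println.
    static void watch(Path dir) throws IOException, InterruptedException {
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
        while (true) {
            WatchKey key = watcher.take();       // wait for the next batch of events
            for (WatchEvent<?> event : key.pollEvents()) {
                Path created = dir.resolve((Path) event.context());
                if (isCsvFile(created)) {
                    System.out.println("New CSV to process: " + created);
                }
            }
            key.reset();                          // re-arm the key for further events
        }
    }
}
```

WatchService avoids polling the directory on a timer, though on some platforms it falls back to polling internally.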

Up to here it looks pretty straightforward. However, the problem is that I could have multiple CSV files arriving in the directory, though all of them follow a standard format. To add to the problem, I need to send data to the external application based on the percentage of records from each CSV file. For example, here is the structure of the CSV file:

CustomerID, EmailID, ContactNumber, campaignID

CampaignID remains the same for all records in one file, i.e. each file has its own unique CampaignID.

Now let's say there are 3 files in the directory, i.e. File A, File B and File C, with 5000, 3000 and 2000 records respectively, and that I can only send 300 records at a time to the external application.

Now when I send 300 records, I need to populate them from all 3 files based on their share of the total, i.e. 150 from File A (50%), 90 from File B (30%) and 60 from File C (20%).
I also need to keep track of the records I send, so I don't send them again.
Also, the external application gives me a response back later in the day for each campaignID in another CSV file, so I need to store these records from the files in a database as well and provide reporting.
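The 150/90/60 split above can be computed with a largest-remainder allocation, which guarantees the per-file quotas always add up exactly to the batch size even when the percentages don't divide evenly. A sketch (class and method names are illustrative, not from the thread):

```java
import java.util.Arrays;

public class QuotaAllocator {

    // Split 'batchSize' slots across files in proportion to their record counts.
    // Uses the largest-remainder method so the quotas always sum to batchSize.
    static int[] allocate(long[] recordCounts, int batchSize) {
        long total = Arrays.stream(recordCounts).sum();
        int n = recordCounts.length;
        int[] quota = new int[n];
        double[] remainder = new double[n];
        int assigned = 0;
        for (int i = 0; i < n; i++) {
            double exact = (double) batchSize * recordCounts[i] / total;
            quota[i] = (int) Math.floor(exact);   // integer part first
            remainder[i] = exact - quota[i];      // keep the fraction for tie-breaking
            assigned += quota[i];
        }
        // Hand any leftover slots to the files with the largest fractions.
        while (assigned < batchSize) {
            int best = 0;
            for (int i = 1; i < n; i++) {
                if (remainder[i] > remainder[best]) best = i;
            }
            quota[best]++;
            remainder[best] = -1;                 // each file gets at most one extra slot
            assigned++;
        }
        return quota;
    }
}
```

For the example in the post, allocate(new long[]{5000, 3000, 2000}, 300) yields {150, 90, 60}.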

The interface option with the external application is either ActiveMQ or web services. I have experience with both, so that's not a problem. The problem is choosing the right approach to send records percentage-wise and keeping track of them. Also, I am not sure when to load the data into the database: should I do it after processing the file, or before processing and then read from the database instead of the file?

I need to design this keeping in mind that there could be up to 500 CSV files arriving within as little as an hour, containing a million records in total. I will really appreciate any help.
Wim Vanni
Ranch Hand

Joined: Apr 06, 2011
Posts: 96

Welcome to the Ranch, Zahid.

Your design question is quite a challenge and I'm letting it sink in and have it run in a background thread for a while. I'll get back to you as soon as that thread comes up with some conclusions.

Wim
Zahid Maqbool
Greenhorn

Joined: Jun 30, 2011
Posts: 4
Wim Vanni wrote:Welcome to the Ranch, Zahid.

Your design question is quite a challenge and I'm letting it sink in and have it run in a background thread for a while. I'll get back to you as soon as that thread comes up with some conclusions.

Wim


Ok. I'll wait for your response and try to think more in the meantime about this. Thanks.
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11257

There are a couple of things to keep in mind...

each time you get ready to send the next batch of records, you need to do a few things:

1) find the total number of records
2) find out how many records are in each file
2a) build a list of all the files...
3) compute how many records from each file need to go
4) open the necessary files and read the correct number of records
5) send the records across
6) verify the records made it(?)
7) mark off which records you've sent.
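Step 7, marking off which records were sent, can be as simple as persisting a per-file offset of how many records have gone out; a sketch of pulling the next batch from an in-memory list of lines, assuming the offset is stored elsewhere (file or database):

```java
import java.util.List;

public class BatchReader {

    // Return up to 'count' records from 'lines', starting at 'offset'
    // (the number of records already sent from this file). Persisting the
    // offset before sending means a restart never re-sends the same records,
    // which addresses the fault-tolerance question above.
    static List<String> nextBatch(List<String> lines, int offset, int count) {
        int from = Math.min(offset, lines.size());
        int to = Math.min(offset + count, lines.size());
        return lines.subList(from, to);
    }
}
```

Whether the offset is committed before or after the send decides which failure mode you get: at-most-once (records may be lost) or at-least-once (records may be duplicated).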

How fault-tolerant do you need to be? In other words, if you send 150 of the 300, and your program bombs, how bad is it if you re-send the same 150 again?

There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
Zahid Maqbool
Greenhorn

Joined: Jun 30, 2011
Posts: 4
fred rosenberger wrote:

How fault-tolerant do you need to be? In other words, if you send 150 of the 300, and your program bombs, how bad is it if you re-send the same 150 again?


If I resend the 150 again, they are going to shoot me!!

I also need to store these in a database. Would you recommend storing after all the files are processed, before processing, or as each batch of records is sent?
Wim Vanni
Ranch Hand

Joined: Apr 06, 2011
Posts: 96

3 processes:
1. Observe/Pickup
2. Scheduled Enqueueing
3. Listen for success/failure

1. Observe/Pickup
This process has the single purpose of picking up the CSV files, parsing the data they contain and storing that data in the database (status: new).

2. Scheduled Enqueueing
Here a scheduled process (each minute, for example) gets the necessary numbers from that table:
- the total number of records still to be processed (status: new)
- the number of records per CampaignID (status: new)
Create a subset (300 records, as you mentioned) that follows your percentage requirements and enqueue those records. (Note that enqueueing should be understood as a neutral 'forward it to the external app' step; it could be MQ or web services or ...)
Each record gets a new status: enqueued or enqueue_error.

3. Listen for success/failure
(Hopefully a success or failure message comes back from the external app.) Catch this returning message and update the status of the record accordingly.
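One way to keep those statuses honest across the three processes is a small state machine; a sketch (the CONFIRMED and FAILED names beyond the thread's new/enqueued/enqueue_error are illustrative):

```java
public enum RecordStatus {
    NEW, ENQUEUED, ENQUEUE_ERROR, CONFIRMED, FAILED;

    // Legal transitions in the three-process design sketched above.
    boolean canMoveTo(RecordStatus next) {
        switch (this) {
            case NEW:           return next == ENQUEUED || next == ENQUEUE_ERROR;
            case ENQUEUE_ERROR: return next == ENQUEUED;              // retry
            case ENQUEUED:      return next == CONFIRMED || next == FAILED;
            default:            return false;                         // terminal states
        }
    }
}
```

Rejecting illegal transitions in one place makes it much harder for a retry job and the acknowledgment listener to step on each other's updates.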

----

The above is somewhat along the lines of what Fred wrote. Obviously you'll have to adapt to the details of your situation.

I wanted to add that I'm puzzled by the "max 300 at a time" and "percentage-wise spread over available csv's" requirements. The first seems unnecessary if you push data to the app with a webservice or MQ; or is it that 300 records in one batch get sent? The second must have something to do with not using a first-in-first-out (FIFO) system to comply with some functional requirement.

Anyway, hope this helps.

Cheers,
Wim
Zahid Maqbool
Greenhorn

Joined: Jun 30, 2011
Posts: 4
Wim Vanni wrote:
I wanted to add that I'm puzzled by the "max 300 at a time" and "percentage-wise spread over available csv's" requirements. The first seems unnecessary if you push data to the app with a webservice or MQ; or is it that 300 records in one batch get sent? The second must have something to do with not using a first-in-first-out (FIFO) system to comply with some functional requirement.

Anyway, hope this helps.

Cheers,
Wim


Thanks for your detailed reply. This seems to have solved most of my problems. Yes, this maximum of 300 at a time is a limit that applies if I push via the web service. Using MQ I can push any number, but I still need to make sure they are distributed percentage-wise. The external app sends SMS to customers, and it needs to serve different campaigns so that all campaign business owners get the feeling that their campaign is also running; that's the reason there is no FIFO.

I just have one problem now: how do I get records percentage-wise from the database? I couldn't really come up with an efficient way, and then I thought of using the RAND() function, i.e.



I ran some tests using the above query and, moreover, the percentage distribution comes out almost the same. Do you see any problem with the above method?
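The query itself did not survive in this copy of the thread; a query in the spirit Zahid describes (MySQL's RAND() with a per-campaign limit) might look like the following, where the table and column names are assumptions:

```java
public class RandomBatchQuery {

    // Hypothetical reconstruction of the lost query: pick 'limit' unsent
    // records for one campaign in random order (MySQL syntax). The 'records'
    // table and 'status' column are assumed names, not from the thread.
    static String forCampaign(String campaignId, int limit) {
        return "SELECT CustomerID, EmailID, ContactNumber"
             + " FROM records"
             + " WHERE campaignID = '" + campaignId + "' AND status = 'NEW'"
             + " ORDER BY RAND()"
             + " LIMIT " + limit;
    }
}
```

In real code the campaign ID would go through a PreparedStatement placeholder rather than string concatenation; also note that ORDER BY RAND() sorts the whole candidate set, which is fine at thousands of rows but gets expensive at millions.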
Arun Chidam
Ranch Hand

Joined: May 08, 2006
Posts: 79
Hi Zahid,

We do something similar for our system-generated status messages.

Yes, the setup/design is somewhat similar to what Wim specified, but I see a bottleneck in his design where he stores all records in the DB in the 1st step. I feel it will lead to too much load on database reads and writes, especially when you think about the thousands of records received. Here are my suggestions.


Split the process into 4 steps.

1st Process
------------------------
a) identify each file which needs to be processed from the input folder (say, a "MessageIn" folder)
b) loop through each file and store the count (number of lines) and name in a hash map
c) identify the percentage of each file to be retrieved, as per the threshold (i.e. 300 in your case)
d) now navigate through each file, insert the records into another file (the name should identify the time it was created) and remove those lines from the existing file
e) store the new file in a separate folder (say, a "MessageOut" folder)
f) repeat this process (say, every 5 minutes)
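Steps (a) and (b) of the 1st process might look like this sketch, which maps each CSV in the input folder to its line count:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class InputScanner {

    // List every *.csv in 'folder' and record its line count,
    // i.e. the hash map of file name -> number of records from step (b).
    static Map<Path, Long> countRecords(Path folder) throws IOException {
        Map<Path, Long> counts = new HashMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder, "*.csv")) {
            for (Path p : files) {
                try (Stream<String> lines = Files.lines(p)) {
                    counts.put(p, lines.count());   // one record per line
                }
            }
        }
        return counts;
    }
}
```

With 500 files an hour, counting lines this way re-reads every file; caching the counts after the first scan would avoid that.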

2nd Process
----------------------------
a) pick each file, in order of time created, from the MessageOut folder
b) navigate through each message and post it to your external application, storing the record in the database with the "sent" flag set to true
c) if it fails for some reason, still store it in the database, but set the "sent" flag to false (in order to retry later)
d) once all messages are sent, move the file to a different folder (say, a "MessageSent" folder)
e) repeat this process at the specified time interval

3rd process
--------------------------------
a) wait for the response message from your external application
b) read each acknowledgment message from the CSV and update those records in the database with another flag (say, "confirmed")
c) repeat this process

4th process (not sure if it is required)
------------------------------
a) identify those messages which were not sent, using sent flag = false
b) resend those messages and update the records
c) again, repeat this process

SCJP 1.4, SCWCD 1.4
Wim Vanni
Ranch Hand

Joined: Apr 06, 2011
Posts: 96

Sounds to me like moving the performance issue ('load') from the database to filesystem I/O ;-)

In the second step I mentioned getting some statistics: the total number of records and the number per campaign ID. These will let you compute the amount to send per campaign, much like the math you used in your example; pretty straightforward.

Wim