I am new to Spring and Spring Batch and don't know enough to tell whether the following is possible in Spring Batch. We currently use Talend ETL for this, but I am not 100% happy with it.
Database: MySQL. All machines run Linux.
Clients FTP flat files to multiple FTP servers. A cron job queries a scheduler table every minute. On a 'hit', a procedure is triggered that checks the client's directory on each FTP server for files and starts processing. (Question: I could replace the scheduling with Quartz, but it looks like the cronExpression is set in an XML file, so how would I centrally manage and easily add scheduling parameters for new clients, for example through a web interface?)
The file is picked up, backed up on the file system and stored as a blob in a database table whose name is determined by today's date and the abbreviation (from the scheduler table) of the manufacturer the client is sending data for, e.g. ABC_2012_Nov. So we do not know the database until the file is picked up. [For future scalability we look up an architecture schema table to determine which machine this client's dbs are on. (Currently all dbs are on one machine, but the idea is that if a client's processing demands get too large we can set up another machine, move its dbs over, update the architecture schema to point to the new machine, and away we go...)]
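A minimal sketch of the naming and routing step described above, in plain Java rather than Spring Batch. The class name `FileStoreRouter`, the in-memory map standing in for the architecture schema table, and the host names are all hypothetical; only the `ABC_2012_Nov` naming pattern comes from the description.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Map;

class FileStoreRouter {
    // Hypothetical stand-in for the architecture schema table: manufacturer -> host.
    private final Map<String, String> hostByManufacturer;

    FileStoreRouter(Map<String, String> hostByManufacturer) {
        this.hostByManufacturer = hostByManufacturer;
    }

    // Builds a name such as ABC_2012_Nov from the abbreviation and the pickup date.
    static String fileTableName(String manufacturer, LocalDate pickupDate) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy'_'MMM", Locale.ENGLISH);
        return manufacturer + "_" + pickupDate.format(fmt);
    }

    // Looks up which machine currently hosts this manufacturer's databases.
    String hostFor(String manufacturer) {
        String host = hostByManufacturer.get(manufacturer);
        if (host == null) {
            throw new IllegalArgumentException("No host registered for " + manufacturer);
        }
        return host;
    }
}
```

Because the host is resolved at run time, moving a client's databases would only require changing the lookup data, not the processing code.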
The FTP file processing procedure then determines from the architecture schema which machine to ssh to, connects, and executes the next procedure, handing it the row id of the logging table. This and subsequent procedures now run on the local client db machine.
The local procedure looks up the logging table row to get the client details, e.g. client number, file name(s) etc. Each file is then extracted from the ABC_2012_Nov table, ready for conversion. To add to this, we have instances where instead of one client per file we get a corporate file with multiple clients in it. Currently this is handled by checking a db table to see whether the file is a corporate type; if so, the file is parsed out into client-specific files.
So at this point we know each file is client specific, and a db connection is set up per looped file, e.g. ABC_14_2012 (14 = client id). Each file row is converted into a POJO and inserted (with duplicate data checks) into ABC_14_2012. Some queries against other db tables (table names are dynamic) are run to update/insert lookup row fields. [Though we have a fixed file layout schema, not all clients adhere to it exactly. Some files have rows of length 314, some 324 and some 329 (consistent within a file). Another group uses an EDI-like file layout with header, detail and footer sections, so Batch would need to know or determine which ItemReader schema(?) to use to create the common file layout POJO that represents a file row.]
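One way to picture the layout decision is a small selector that inspects the first line of a file. This is a plain Java sketch, not a Spring Batch API: the class `LayoutSelector`, the enum names, and especially the `HDR` marker used to detect the EDI-like layout are assumptions for illustration, as is measuring length in characters.

```java
import java.util.HashMap;
import java.util.Map;

class LayoutSelector {
    enum Layout { FIXED_314, FIXED_324, FIXED_329, EDI_LIKE }

    private static final Map<Integer, Layout> BY_LENGTH = new HashMap<>();
    static {
        BY_LENGTH.put(314, Layout.FIXED_314);
        BY_LENGTH.put(324, Layout.FIXED_324);
        BY_LENGTH.put(329, Layout.FIXED_329);
    }

    // Picks a layout from the first line of a file.
    // Assumption: EDI-like files start with a header record marked "HDR" (hypothetical marker).
    static Layout select(String firstLine) {
        if (firstLine.startsWith("HDR")) {
            return Layout.EDI_LIKE;
        }
        Layout layout = BY_LENGTH.get(firstLine.length());
        if (layout == null) {
            throw new IllegalArgumentException("Unrecognised record length: " + firstLine.length());
        }
        return layout;
    }
}
```

Since the length is consistent within a file, the decision only needs to be made once per file, and each layout can then map its rows to the same common POJO.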
Next we do a count to see whether any inserts need validation and of what type, e.g. public validation or private validation. Depending on which types are present, the procedure hands off to one or both validation procedures.
The validation procedure looks up the logging table to get the client details, e.g. which db to connect to, and connects dynamically. It then processes each row against the validation rules (examples of rules would be duplicates, invoice date test, quantity test, client test etc., some of which connect to another database and query dynamically named client-specific tables) and, provided all rules pass, inserts the row into other db tables.
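The per-row rule checks could be modelled as a list of named predicates. The sketch below is plain Java with a hypothetical `InvoiceRow` shape; the two rules shown (invoice date not in the future, quantity greater than zero) are invented stand-ins for the real invoice date and quantity tests, whose exact conditions are not stated here.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class RowValidator {
    // Hypothetical row shape; the real file layout POJO will have more fields.
    static final class InvoiceRow {
        final int clientId;
        final LocalDate invoiceDate;
        final int quantity;

        InvoiceRow(int clientId, LocalDate invoiceDate, int quantity) {
            this.clientId = clientId;
            this.invoiceDate = invoiceDate;
            this.quantity = quantity;
        }
    }

    // Rules are kept in insertion order so failures are reported predictably.
    private final Map<String, Predicate<InvoiceRow>> rules = new LinkedHashMap<>();

    RowValidator(LocalDate today) {
        // Illustrative rules only; the actual tests are assumptions.
        rules.put("invoice date not in the future", r -> !r.invoiceDate.isAfter(today));
        rules.put("quantity is positive", r -> r.quantity > 0);
    }

    // Returns the names of all rules the row fails; an empty list means the row may be inserted.
    List<String> failures(InvoiceRow row) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Predicate<InvoiceRow>> rule : rules.entrySet()) {
            if (!rule.getValue().test(row)) {
                failed.add(rule.getKey());
            }
        }
        return failed;
    }
}
```

Rules that need a database lookup would fit the same shape, with the predicate closing over whatever connection or lookup object it needs.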
Currently each procedure is segmented so that if one step fails it can be re-run, and the following steps will then be called. This is why each one looks up the logging table based on the id it is given.
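The re-run behaviour described above can be sketched as a pipeline that records, per logging row id, how many steps have completed. This is a plain Java illustration under assumptions: `RestartablePipeline` and its in-memory progress map are hypothetical stand-ins for the logging table. Spring Batch keeps comparable execution state in its JobRepository so that a failed job can be restarted.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongConsumer;

class RestartablePipeline {
    // Steps run in the order they were added; each receives the logging row id.
    private final Map<String, LongConsumer> steps = new LinkedHashMap<>();
    // In-memory stand-in for the logging table: log row id -> number of completed steps.
    private final Map<Long, Integer> completedSteps = new HashMap<>();

    void addStep(String name, LongConsumer step) {
        steps.put(name, step);
    }

    // Runs the remaining steps for this id. If a step throws, earlier progress is kept,
    // so calling run again resumes at the step that failed.
    void run(long logRowId) {
        int done = completedSteps.getOrDefault(logRowId, 0);
        int index = 0;
        for (LongConsumer step : steps.values()) {
            if (index++ < done) {
                continue; // already completed in an earlier run
            }
            step.accept(logRowId);
            completedSteps.put(logRowId, index);
        }
    }

    int completed(long logRowId) {
        return completedSteps.getOrDefault(logRowId, 0);
    }
}
```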
What I would like to see happen (if Spring Batch can do it) is:
Each FTP server would act as a Batch Master (for lack of a better word), both checking the same file processing scheduler. On a hit each would back up the file(s), if any, and insert the file into the dynamically determined file db. It would then hand off processing to the dynamically determined machine (the idea of remote chunking?) while being kept updated via messaging on the progress of the node processing the data (Batch Admin?), which could be seen via a web interface.
Other thoughts: some of the db sql lookups ideally would be run in parallel.
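As a rough illustration of running independent lookups in parallel, the sketch below uses `CompletableFuture` from the JDK rather than any Spring Batch feature. The class `ParallelLookups` is hypothetical, and each `Supplier` stands in for one SQL lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

class ParallelLookups {
    // Starts every lookup on the common fork-join pool, then waits for all of them.
    // Results are returned in the same order as the input list.
    static <T> List<T> runAll(List<Supplier<T>> lookups) {
        List<CompletableFuture<T>> futures = new ArrayList<>();
        for (Supplier<T> lookup : lookups) {
            futures.add(CompletableFuture.supplyAsync(lookup));
        }
        List<T> results = new ArrayList<>();
        for (CompletableFuture<T> future : futures) {
            results.add(future.join()); // blocks until this lookup has finished
        }
        return results;
    }
}
```

This only pays off when the lookups do not depend on each other's results and each uses its own database connection.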
As I said, I don't know enough about Spring / Spring Batch to know a) whether this can be done and b) whether it can be done without being horrendously complicated and forcing a very steep learning curve. The 'Use Cases' on the Spring Batch website don't seem close enough to what I want, and I don't know enough to judge from the Spring Batch samples.
Yes. Spring Batch can do all that. I have one question: is storing the file in a blob just temporary storage before it is processed? If so, you don't have to do that in Spring Batch; you can go straight from the file to wherever the final destination is.
Spring Batch can do remote chunking and parallel processing, but you do have to read up on it to learn it. But in the end it is just about configuration.
Hope that helps. Beyond that, I can only state the obvious: read the documentation. Spring Batch actually has the best documentation of all the Spring projects.