Structuring a Multi-Threaded Application: Parse from .csv and write to a database

Zak Tacc
Greenhorn

Joined: Feb 01, 2010
Posts: 25
I'm up for a job and there's a pre-interview assignment that's got me a bit confused. I've never done anything multi-threaded before, but I've got to wow my interviewer with this assignment.


The assignment:

"Write a Java program that reads a csv file, parses the data, distributes the data to multiple threads, where each line of data in the file is written to a database row. The program should not exit until the entire file is processed and written to the database. Each data set should be assigned to a specific thread, and only the owning thread should process the data. Metrics should be kept in terms of success/failures. Failures should be placed in a cache and retried once and only once by a thread other than the thread that made the original attempt."


The assignment is very open-ended, so I'm supposed to come up with my own .csv for it. So far I was thinking that I could parse a .csv (thinking of using the Super CSV framework for this) of a bunch of user data (username/password/date of birth etc.) and then write it to a MySQL database. What's got me scratching my head is the multi-threaded aspect of this project.
Important Questions
  • How should I structure this assignment? (how many classes/what would be the function of each class)
  • How should I incorporate the multi-threaded aspect of this project? (how many threads/what will each thread be responsible for?)

I really want to do well on this assignment, so any help you could give would be greatly appreciated.

Thanks in advance
    Joe Areeda
    Ranch Hand

    Joined: Apr 15, 2011
    Posts: 318
        

    Zak,

    If I were the interviewer looking at an assignment like that, I'd spend about 2 minutes watching you run the program and about 10 minutes looking at your code. Then, if I liked what I saw, maybe another 10-15 minutes talking to you about it.

    So first of all KEEP IT SIMPLE and make sure it works.

    I'd be much more impressed by some JUnit tests and javadoc entries than by how complicated the csv file is or how many columns in the mysql table.

    When I used to hire programmers before retiring, I wanted most of all people that could communicate with me and the other programmers. Then I wanted code that was maintainable.

    The trick to multi-threaded applications is that they can't be written in a debugger, they must be designed properly. Threading bugs are very difficult to duplicate because they depend on timing.

    So I would plan on 3 threads, with 2 background classes (one to read the file, another to write to the DB), one class to represent a row, and whatever you need for the user interface. I would first debug everything in a single thread, then create one background thread to read the csv and another to insert data into the database. It sounds from the spec that they want multiple threads to do simultaneous inserts into the database. I can't imagine why, but it's easy.

    The basic structure I would use would be based on a LinkedBlockingQueue (http://download.oracle.com/javase/6/docs/api/java/util/concurrent/LinkedBlockingQueue.html).

    The GUI would allow you to open a file with a menu command, then:
    -Create a linked blocking queue with enough entries to support the number of tasks you choose (say 3 times the number of tasks)
    -Create a thread to run a CSVReader task that puts Rows into the queue
    -Create multiple DBInserter threads that pull records from the queue and insert them into the DB.

    There are a few ways to terminate something like this. I would probably pass the CSVReader a list of all the DBInserter tasks it was feeding and have it set a Done flag when it reached the end of file.
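A minimal sketch of the queue-based structure and the Done flag described above. Some loud assumptions: the class and method names are made up for illustration, the "insert" only counts rows where a real DBInserter would run a JDBC insert, the CSV parsing is a naive split on commas, and one shared flag stands in for setting a flag on each DBInserter task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: names are illustrative and the "insert" only counts rows.
class CsvPipelineSketch {

    /** Pushes every line through a bounded queue to inserterCount threads; returns rows handled. */
    static int process(List<String> lines, int inserterCount) {
        // Bounded queue sized at 3 entries per inserter task.
        BlockingQueue<String[]> queue = new LinkedBlockingQueue<>(3 * inserterCount);
        AtomicBoolean done = new AtomicBoolean(false);   // the "Done flag"
        AtomicInteger inserted = new AtomicInteger();

        Thread reader = new Thread(() -> {
            try {
                for (String line : lines) {
                    // Naive split; a real CSV parser also handles quoting and escapes.
                    queue.put(line.split(","));          // blocks while the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                done.set(true);                          // end of input reached
            }
        }, "csv-reader");

        Runnable inserter = () -> {
            try {
                // Stop only when the reader has finished AND the queue is drained.
                while (!done.get() || !queue.isEmpty()) {
                    String[] row = queue.poll(50, TimeUnit.MILLISECONDS);
                    if (row != null) {
                        inserted.incrementAndGet();      // a real task would insert the row here
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        List<Thread> threads = new ArrayList<>();
        threads.add(reader);
        for (int i = 0; i < inserterCount; i++) {
            threads.add(new Thread(inserter, "db-inserter-" + i));
        }
        for (Thread t : threads) {
            t.start();
        }
        // Joining every thread keeps the program alive until the whole file is processed.
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return inserted.get();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "alice,pw1,1990-01-01", "bob,pw2,1985-05-05", "carol,pw3,1979-09-09");
        System.out.println("rows handled: " + process(lines, 2));
    }
}
```

The inserters use a timed poll rather than take() so they can re-check the flag. With a plain take(), an inserter could block forever on an empty queue after the reader has finished.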

    Just an idea. If you want to discuss it or ignore it fine with me.

    Joe


    It's not what your program can do, it's what your users do with the program.
    Zak Tacc
    Greenhorn

    Joined: Feb 01, 2010
    Posts: 25
    That helps a lot, thank you. I definitely would have done it differently had I read your thoughts first.


    The way I structured it was that I grabbed the .csv, read a line from it, passed that to a CSVparser class (which extends Thread) where a new thread is started. In this thread, the line from the .csv file (a row) is parsed and inserted into a database.

    So that gives me 248 threads, a thread for each row in the database.

    Was that a bad way to design it? And more importantly, do you think it's worth redesigning?


    Thank you

    EDIT: I'm also a bit worried cause nothing failed, because the assignment seems to imply that something should fail.
    Joe Areeda
    Ranch Hand

    Joined: Apr 15, 2011
    Posts: 318
        

    Zak Tacc wrote:The way I structured it was that I grabbed the .csv, read a line from it, passed that to a CSVparser class (which extends Thread) where a new thread is started. In this thread, the line from the .csv file (a row) is parsed and inserted into a database.

    So that gives me 248 threads, a thread for each row in the database.

    Was that a bad way to design it? And more importantly, do you think it's worth redesigning?


    Thank you

    EDIT: I'm also a bit worried cause nothing failed, because the assignment seems to imply that something should fail.


    Well in my opinion the main reason to give an exercise like this to a job applicant is to see how they think, how understandable and maintainable their code is, and to see how they present and discuss it. I doubt they're expecting production quality design and implementation.

    There are 2 issues I can think of with your approach.

    One is efficiency. Each record has to start a thread and each thread has to open and close a database connection. That will take longer than the insert itself. So I would have limited the number of threads to something much smaller. If you were to do a series of timing tests and plot total time versus number of threads, I think you'd see the time go down almost linearly as you go from 1 thread to 2 or 3 times the number of cores on your system, then continue down to some point, then start to rise again. So there is an optimal number and I'd guess it's less than 248.
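A rough sketch of the timing test suggested above, using a fixed-size thread pool instead of one thread per row. The class name and the 2 ms sleep that stands in for a database insert are assumptions for illustration only. A sleep doesn't contend for connections or CPU, so this won't reproduce the eventual rise in total time. Only a run against the real database would show where the optimum sits.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: simulatedInsert() is a stand-in for a real JDBC insert.
class PoolTimingSketch {

    /** Runs rowCount simulated inserts on poolSize threads; returns {rows completed, elapsed ms}. */
    static long[] timeInserts(int rowCount, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger completed = new AtomicInteger();
        long start = System.nanoTime();

        for (int i = 0; i < rowCount; i++) {
            pool.execute(() -> {
                simulatedInsert();
                completed.incrementAndGet();
            });
        }

        pool.shutdown();   // accept no new tasks; the queued ones still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("timed out waiting for inserts");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }

        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return new long[] { completed.get(), elapsedMs };
    }

    private static void simulatedInsert() {
        try {
            Thread.sleep(2);   // assumed cost of one insert, purely illustrative
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // One line per pool size; plot these to look for the optimum.
        for (int poolSize : new int[] { 1, 2, 4, 8, 16 }) {
            long[] result = timeInserts(248, poolSize);
            System.out.println(poolSize + " threads: " + result[0] + " rows in " + result[1] + " ms");
        }
    }
}
```

With a fixed pool, each worker thread could also hold one database connection open for its whole lifetime, which avoids the open and close per record.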

    The other problem is scalability. If there is no limit on the size of the input csv file, the number of threads will keep growing until you run out of memory or database connections or bump into some other limit.

    Whether or not it's worth the effort to improve your program is something I won't even guess at.

    I like to write programs using what is now called Extreme Programming and used to be called Rapid Prototyping with Stepwise Refinement. The idea behind both is to get something working as quickly as possible so the designers and the end users have something real to compare against the requirements. Then concentrate your efforts on the big issues that people notice.

    It's 50-50 in my mind whether a slight improvement in the code will count more than a good discussion of how to improve the code. In other words no matter how much time you spend on this, have the attitude of "let's see how it works in real life and then we'll know how to make it better" as opposed to "this is great stuff, love it or pound sand".

    I'm not surprised a few hundred inserts into a clean database produced no errors. You could generate one by, for example, putting a unique primary key in the csv file and then including one or two duplicates. Maybe you could also set the number of simultaneous connections mysql allows to a small number. I really wouldn't worry too much about not testing the error handling code as long as it exists and you can talk about it.
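One way the duplicate-key idea could exercise the retry requirement, as a sketch under stated assumptions. The in-memory set stands in for a table with a unique primary key, all names are invented for illustration, and each worker is a single-threaded executor so that a row is only ever handled by the thread it was assigned to. A failed row goes into a failure cache and is retried exactly once on a different worker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a Set plays the role of a table with a unique primary key.
class RetryOnceSketch {

    /** A failed key plus the index of the worker that made the first attempt. */
    static final class Failure {
        final String key;
        final int owner;
        Failure(String key, int owner) { this.key = key; this.owner = owner; }
    }

    /** Returns {successes, permanent failures} after one attempt and at most one retry per row. */
    static int[] run(List<String> keys, int workerCount) {
        if (workerCount < 2) {
            throw new IllegalArgumentException("a retry needs a second worker");
        }
        Set<String> table = ConcurrentHashMap.newKeySet();
        Queue<Failure> failureCache = new ConcurrentLinkedQueue<>();
        AtomicInteger successes = new AtomicInteger();
        AtomicInteger failures = new AtomicInteger();

        // One single-threaded executor per worker: row i is owned by worker i % workerCount.
        ExecutorService[] workers = new ExecutorService[workerCount];
        for (int i = 0; i < workerCount; i++) {
            workers[i] = Executors.newSingleThreadExecutor();
        }
        try {
            List<Future<?>> firstPass = new ArrayList<>();
            for (int i = 0; i < keys.size(); i++) {
                String key = keys.get(i);
                int owner = i % workerCount;
                firstPass.add(workers[owner].submit(() -> {
                    if (table.add(key)) {
                        successes.incrementAndGet();
                    } else {
                        failureCache.add(new Failure(key, owner));   // duplicate key: cache it
                    }
                }));
            }
            waitFor(firstPass);

            // Retry each cached failure once, on the next worker rather than the owner.
            List<Future<?>> retries = new ArrayList<>();
            Failure failure;
            while ((failure = failureCache.poll()) != null) {
                String key = failure.key;
                int other = (failure.owner + 1) % workerCount;
                retries.add(workers[other].submit(() -> {
                    if (table.add(key)) {
                        successes.incrementAndGet();
                    } else {
                        failures.incrementAndGet();                  // failed twice: give up
                    }
                }));
            }
            waitFor(retries);
        } finally {
            for (ExecutorService worker : workers) {
                worker.shutdown();
            }
        }
        return new int[] { successes.get(), failures.get() };
    }

    private static void waitFor(List<Future<?>> futures) {
        for (Future<?> future : futures) {
            try {
                future.get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        }
    }

    public static void main(String[] args) {
        int[] metrics = run(Arrays.asList("1", "2", "3", "2"), 2);
        System.out.println("successes=" + metrics[0] + " failures=" + metrics[1]);
        // prints "successes=3 failures=1"
    }
}
```

A duplicate key fails again on the retry, so it ends up as a permanent failure in the metrics. A transient error, such as running out of connections, is the kind that a retry on another thread could actually recover from.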

    Again, I'm just one guy giving his opinion. I'm not claiming to have The Answer.

    Good luck and be sure to report back on how the interview went.

    Joe
    Joanne Neal
    Rancher

    Joined: Aug 05, 2005
    Posts: 3742
        
    Zak Tacc wrote:a CSVparser class (which extends Thread)

    ExtendingThreadVsImplementingRunnable
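For anyone who wants the short version in code, here is a small sketch of the Runnable approach. The names are invented for illustration and the "parsing" is just a comma split. The point it tries to show is that a task written as a Runnable doesn't care what runs it, so the same class works with a plain Thread or with an executor.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: RowParser is an illustrative task, not the poster's CSVparser class.
class RunnableSketch {

    /** The work itself, with no knowledge of which thread will run it. */
    static final class RowParser implements Runnable {
        private final String line;
        private final AtomicInteger fieldCount;

        RowParser(String line, AtomicInteger fieldCount) {
            this.line = line;
            this.fieldCount = fieldCount;
        }

        @Override
        public void run() {
            fieldCount.addAndGet(line.split(",").length);   // naive parse: count the fields
        }
    }

    /** Runs the task on a dedicated thread. */
    static int parseOnThread(String line) {
        AtomicInteger fields = new AtomicInteger();
        Thread thread = new Thread(new RowParser(line, fields));
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return fields.get();
    }

    /** Runs the very same task class on an executor instead. */
    static int parseOnPool(String line) {
        AtomicInteger fields = new AtomicInteger();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            pool.submit(new RowParser(line, fields)).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return fields.get();
    }

    public static void main(String[] args) {
        System.out.println(parseOnThread("alice,secret,1990-01-01"));   // prints 3
        System.out.println(parseOnPool("alice,secret,1990-01-01"));     // prints 3
    }
}
```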


    Joanne
    Joe Areeda
    Ranch Hand

    Joined: Apr 15, 2011
    Posts: 318
        



    Joanne suggests some good reading.

    I'd probably also add the tutorial on thread pooling http://download.oracle.com/javase/tutorial/essential/concurrency/pools.html

    I don't have a good link but some research on deadlocks and semaphore usage will probably be good.
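On the semaphore half of that, a small sketch of one common use: capping how many threads hold a scarce resource at once, such as a limited number of database connections. Everything here is an assumption for illustration, including the names and the 5 ms sleep that stands in for work done while holding a connection. It doesn't cover deadlocks.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the permits model a small number of allowed database connections.
class SemaphoreSketch {

    /** Runs the tasks on a pool and returns the highest number that held a permit at once. */
    static int maxConcurrent(int tasks, int threads, int permits) {
        Semaphore connections = new Semaphore(permits);
        AtomicInteger inUse = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    connections.acquire();               // wait for a free "connection"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                try {
                    int now = inUse.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(5);                     // pretend to use the connection
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    inUse.decrementAndGet();
                    connections.release();               // always hand the permit back
                }
            });
        }

        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("timed out waiting for tasks");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // 8 threads compete for 3 permits, so the peak should never exceed 3.
        System.out.println("peak concurrent holders: " + maxConcurrent(20, 8, 3));
    }
}
```

Releasing the permit in a finally block matters here. A permit that is never released is lost for good, and the remaining threads eventually wait forever.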

    I still think a lot of work making refinements to your program is of questionable value. If the interviewer is any good she will be able to get a good idea of your level of experience and understanding of the topic.

    I could be wrong. Who knows, maybe this selection process will pass these programs on to members of the team, who vote on the best program and hire that person. I think that would be a bad way to do it in the long run, but they haven't asked me how to hire people.

    Joe

     