aspose file tools*
The moose likes Performance and the fly likes Parsing thousands of files. Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login
JavaRanch » Java Forums » Java » Performance
Bookmark "Parsing thousands of files." Watch "Parsing thousands of files." New topic
Author

Parsing thousands of files.

Himanshu Gupta
Ranch Hand

Joined: Aug 18, 2008
Posts: 598

I have thousands of files which I have to parse and collect the ip and domains.

My approach to do this is as follows:
  • Collect the paths to the files and store then in a thread safe list.
  • Spawn 5-6 Threads which will read and remove the file name, make the file object and will start reading the file content.
  • Use regex to find the information in a given pattern and if found store them in an in memory object.
  • Continue reading from list until it is not empty.


  • Is there any better way to do this kind of stuff?
    If yes then please share.

    Also for the in memory results which I am going to keep is there some better way to handle that. The in memory object which is designed to contain the results will grow in size and we may need to flush it and write the results to some file after its size goes beyond some decided value.

    Moreover how to decide how many threads will be useful? Can we have some logic depending upon the number of files in list to be scanned?

    Thanks in advance for reading this and sharing your thoughts.

    Himanshu Gupta


    My Blog SCJP 5 SCWCD 5
    Ralph Cook
    Ranch Hand

    Joined: May 29, 2005
    Posts: 479
    It strikes me that you might be creating a pathological case. Seeking is an expensive disk operation, and by having multiple threads read from different files, you might be setting up a situation in which one thread does a read, then another thread takes over, makes the disk seek to a different cylinder, do a read, then the first thread gets control back, makes the disk seek back to the original place, etc. Whether this slows things down, or slows them down significantly, depends on your environment, but it seems to me to be possible.

    Given that, I would probably have one thread do the file reads. I would have it read a large memory buffer, and then hand that buffer to another class in a different thread to parse.

    If I could, I would have the parsing class not care whether it got the entire file; its job is to parse that buffer and report on it; if it's an entire file, great, if not, the next large buffer's worth of that file will be parsed by the next object.

    This way, you overlap the things that are compute-intensive (reading bytes and examining them for your strings) with the reads without risking extra seeks and still getting the benefit of parsing while waiting for I/O to complete.

    If it is practical, I would also defrag the disk before I started; hopefully that optimizes the disk reads...

    rc

    William Brogden
    Author and all-around good cowpoke
    Rancher

    Joined: Mar 22, 2000
    Posts: 12769
        
        5
    I like that approach but I don't think there is any way to reason your way to an optimum solution, you are just going to have to measure stuff.

    I think Ralph pointed out the problem of having more than one Thread reading, but this gets involved with operating system peculiarities.

    Between file reading, conversion to UNICODE, and REGEX you have 3 expensive processes, only experimentation can guide you. Perhaps instrumenting your code with JAMON or similar can gather the information.

    Please report back what you found, this is a fascinating problem.

    Bill
    fred rosenberger
    lowercase baba
    Bartender

    Joined: Oct 02, 2003
    Posts: 11231
        
      16

    I am not an expert on threads by ANY means, but my understanding is that "multi-threading" is not a panacea, and can indeed slow things down. Where it helps is when you have lots of idle time while waiting for stuff - input from users, disk reads, etc.

    If you write your program single-threaded and then do some profiling, you may find that the disk is being used 100% of the time. Your reader/parser code might be waiting a lot for those disk reads. Having more than one might not do any good, if one can keep up with the disk reads.

    So, I'd think the trick is to find out where the bottle neck is (as is true with all optimization), see who is waiting for who, and go from there.


    There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
    Ralph Cook
    Ranch Hand

    Joined: May 29, 2005
    Posts: 479
    William Brogden wrote:I like that approach but I don't think there is any way to reason your way to an optimum solution, you are just going to have to measure stuff.

    I think Ralph pointed out the problem of having more than one Thread reading, but this gets involved with operating system peculiarities.

    Between file reading, conversion to UNICODE, and REGEX you have 3 expensive processes, only experimentation can guide you. Perhaps instrumenting your code with JAMON or similar can gather the information.

    Please report back what you found, this is a fascinating problem.

    Bill


    While you may not be able to "reason your way to an optimum solution", you can avoid some things based on our understanding of how the computers work.

    Assuming you are running on a modern box, assuming the disk you are talking about is one of these standard spinning-platter affairs, and assuming you don't have to use remote libraries on another machine somehow or other such unexpected stuff, there are some things we can say about your problem.

    1. disk activity (open, read, seek, etc.) is going to be slow compared to cpu activity. Your original problem statement mentioned "thousands of files", so I think we're safe in saying that optimizing for many disk operations is called for. Multi-threading is good for speeding up programs that can be put to work computing one thing while the I/O gets other things.

    2. other activities that have been mentioned: unicode conversion (though I don't know where that came from), regular expression handling, etc., are all compute-intensive, and could be memory-intensive depending on how they're done. Multi-threading is no good for improving performance by separating out different compute-intensive operations, other things being equal. If your CPU is soaked, it's soaked, and switching to another thread is not going to make things any faster. It may slow things down.

    We don't have to measure to know these things. We can make a good guess as to where the problems will be, and a way or two in which multi-threading will NOT help, and I think those should be used in the original design. Even if you're going to write some code to benchmark, it helps to have a good starting place, and avoiding the pathological cases is usually not that hard with some design aforethought.

    rc
    William Brogden
    Author and all-around good cowpoke
    Rancher

    Joined: Mar 22, 2000
    Posts: 12769
        
        5
    Constructing Java Strings while reading text files involves character by character conversion of bytes to UNICODE.

    My old copy of Bulka's Performance book mentions this as surprisingly time consuming - I don't know if modern Java is any faster.

    I have optimized ASCII text manipulation by working directly with bytes instead of Strings.

    Bill
    prem pillai
    Ranch Hand

    Joined: Nov 02, 2007
    Posts: 87

    Ralph Cook wrote:
    We don't have to measure to know these things. We can make a good guess as to where the problems will be, and a way or two in which multi-threading will NOT help, and I think those should be used in the original design. Even if you're going to write some code to benchmark, it helps to have a good starting place, and avoiding the pathological cases is usually not that hard with some design aforethought.




    Ralph , what you said is 100% logical ... yes such things should be evaluated and used in the original design itself. Measurement is an important step in performance tuning but that doesn’t necessarily mean that we should write-off application of logical reasoning. After all, this thread is talking about performance considerations to be made while opting a solution ; Tuning comes at a later stage…
    steve souza
    Ranch Hand

    Joined: Jun 26, 2002
    Posts: 860
    I recently created a program that parsed web logs. With a little tuning I parsed about 10 MB a second. For my needs that was plenty fast enough. Can you come up with a number that would be acceptable? As I recall the reading of the file was the most expensive part of the operations as compared to Regular Expressions or any other cpu operations, so tuning IO had the biggest impact on performance. If this is true for you multi-threads may not help much. Of course as always you should time things to know where to tune.

    I never had to make my program multi-threaded nor would I without knowing I needed to. It shouldn't take much effort to start with a small program that simply reads the log and time it. Then add simple versions of your other algorithms and time them. You should quickly see if you have bottle necks in your design or not.

    I usually build monitoring into my application (using JAMon) not just for performance reasons, but I also use it as a form of logging. Certainly I would do that if I had an application that has to be fast.


    http://www.jamonapi.com/ - a fast, free open source performance tuning api.
    JavaRanch Performance FAQ
    steve souza
    Ranch Hand

    Joined: Jun 26, 2002
    Posts: 860
    I would be interested if anyone has any numbers for performance benefits of NIO. My program was fast enough, but I always wondered if NIO would have been a good way to go had I needed better performance. NIO seems to be lower level and more complex to code in.
    steve souza
    Ranch Hand

    Joined: Jun 26, 2002
    Posts: 860
    Something else I had to consider was how not to reprocess files that had already been processed (for example you have to restart your program). I did this by writing all processed file names to a file, and I would check this file before processing any files to ensure they hadn't been processed already.
    Ralph Cook
    Ranch Hand

    Joined: May 29, 2005
    Posts: 479
    steve souza wrote:I recently created a program that parsed web logs. With a little tuning I parsed about 10 MB a second. For my needs that was plenty fast enough. Can you come up with a number that would be acceptable? As I recall the reading of the file was the most expensive part of the operations as compared to Regular Expressions or any other cpu operations, so tuning IO had the biggest impact on performance. If this is true for you multi-threads may not help much. Of course as always you should time things to know where to tune.


    I am in favor of considering whether something is fast enough before spending time trying to make it faster, whether it be by threading or optimization of some other sort (pun intended).

    It is not necessary to "come up with a number that would be acceptable", certainly not in advance of improving things. It can be enough to know that it isn't as fast as you would like it, and you think you can make it fast enough with a reasonable amount of programming. You need numbers for some things -- like knowing whether you've succeeded! -- but I don't think it rates an "of course".

    It is an interesting point about "tuning I/O" versus using multiple threads, and I think it is (yet) another area where thinking about things a bit ahead of time (it used to be called software design) is very useful, along with knowing what kinds of improvements to expect from what kinds of changes.

    If you can do reads to large buffers instead of small ones, that could make a huge difference. No multi-threading, or in fact much programming, required here. Easy to try, much easier than messing with multiple threads.

    Now, if there is enough I/O that you want to compute while waiting for your hardware to do things, THAT's where multiple threads can come into its own, as I mentioned in an earlier response.

    rc
    Himanshu Gupta
    Ranch Hand

    Joined: Aug 18, 2008
    Posts: 598

    I tried my earlier suggested approach using 3 threads on a I3 Processor machine and it processed 16.9 MB of unzipped data files in 13663 milliseconds for the first run and 1097 milliseconds for the second subsequent run.
    I restarted my system to confirm and after restarting the process was completed in 4931 milliseconds. I know that working with threads will never give same time in all of the runs but I was aiming to get some average value.

    Is this major difference of time is due to the loading of the files in primary memory?
    If the code is needed to analyse this or any other tool can help then please let me know.

    I am moving towards creating a single thread for reading the files into some large buffer and having multiple threads for the processing. I request all if I can get some link or help to design such implementation then it will save my time. Also which Java Class can be used as a buffer?

    My current Implementation reads file in the following way.

    Ralph Cook
    Ranch Hand

    Joined: May 29, 2005
    Posts: 479
    If you're running on Windows, that fits with other time-testing I've done that involved file reading. There's some major caching of file contents going on somewhere, is what I concluded, though I would have expected "restarting the system" to wipe the caches, perhaps it doesn't. I think your successive times after the first might converge to an average.

    You might look at BufferedReader; it has a readLine method, and makes no attempt at the mark and reset functionality that BufferedInputStream has. Both of them take an int value for the size of their buffer in their respective constructors, so you can try it with different values.
    steve souza
    Ranch Hand

    Joined: Jun 26, 2002
    Posts: 860
    It is not necessary to "come up with a number that would be acceptable", certainly not in advance of improving things.

    Maybe a more accurate statement would have been 'it is hard to help you unless we know more about the performance you are trying to attain'. Certainly, being as you can make a pretty quick assessment of what the performance needs of his program are in this case it would help us if he did it. For example: 1000 files*once a day*5mb avg file size=5000mb a day. Given that performance target I don't think the extra complexity warrants threads (I'm not even convinced threads will help him). I was able to get 10mb a second processing on a similar program so a little calculation would show that 5000mb a day/10mb a second=50 seconds. I would suggest writing a short prototype program what his performance would be and see if that it is acceptable.

    I agree that it is best to make good design decisions from the start, but often it is a fine line between good design and premature optimization. Because threads often don't have the expected performance improvements on IO bound processes (or due to synchronization issues due to the use of shared data structures) I would create performance tests first to see if threads help. I would build performance monitoring from the start in this case. Anytime you introduce threads into your design for performance reasons you have elevated performance to be a key attribute to your program and should track it. And I wouldn't introduce threads until I proved they helped. Often there are simpler ways to improve performance. Let me give an example.

    My log parsing program was processing 7 MB second and by passing the "-server" argument to the command line I was able to increase performance to 9.3 MB per second with no code changes. I then upgraded from jdk 1.5 to jdk 1.6 and was able to improve performance to 10.3 MB per second. Both changes were easy to make and had a bigger impact than threads would make in my case as the big bottleneck was IO and not processing.

    Tim Holloway
    Saloon Keeper

    Joined: Jun 25, 2001
    Posts: 16014
        
      20

    I haven't heard any actual numbers in years, but in simpler times, the difference between disk I/O and CPU activity was over 1000-to-1, and while both disks AND CPU have gotten faster, I doubt it's any better now, and is quite possibly a lot worse.

    Modern-day disk drives have a fair amount of on-board cache. Although cache primarily benefits writing, there's also read-ahead cache, which is generally tunable. Read-ahead cache is especially valuable in cases where large numbers of sectors are being read in sequentially. In many filesystems, one logical block can be multiple physical sectors, so at a minimum, read-ahead should be set to be at least one logical block size. Bump it up further if you have large files and/or a filesystem layout where data is mostly sequential.

    Both Windows and Linux will take any idle RAM that's laying around and attempt to employ it as filesystem cache. Mostly that won't help on read activities, but a little tuning may gain something.

    Multi-threading, as mentioned, can have a negative performance input because it can result in a lot of seek and spin latencies. However, use of multiple disks on multiple controllers can help here, as can employment of suitable scheduling algorithms (Linux allows you to pick an I/O scheduler). A good scheduler can arrange incoming requests so as to minimize the amount of seeks required.

    There are some disk/filesystem benchmark suites available. It's not a bad idea to find one that can be set up to approximate as closely as possible your production enviroment and set it off doing measurements and test runs. You may get some useful stats that way.

    And, of course, the ultimate test is to measure the live data processes.


    Customer surveys are for companies who didn't pay proper attention to begin with.
     
    Don't get me started about those stupid light bulbs.
     
    subject: Parsing thousands of files.