Efficient Method for comparison

Pradeep Kadambar
Ranch Hand

Joined: Oct 18, 2004
Posts: 148
What is an efficient method for determining which files in a list are new, given a large database that keeps entries for already-processed file paths? How can the speed of this operation be increased?

Jeanne Boyarsky
author & internet detective
Marshal

Joined: May 26, 2003
Posts: 30753

Pradeep,
First thing is to make sure you have an index on the file path column. Then do a query like:
"select fileId from files where filePath = ?"

Using the prepared statement allows the execution plan to be cached. If you have a lot of files, you can use a query like:
"select fileId from files where filePath in (?,?,?,?,...?)"

Doing this batching with a set number (like 100) is faster than individual queries and retains the benefits of a prepared statement.
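A minimal sketch of that batched lookup, assuming a table named files with a filePath column as in the queries above; the connection handling, batch size of 100, and class name are illustrative, not from the original post:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.*;

public class FileLookup {
    private static final int BATCH_SIZE = 100;

    /** Returns the subset of candidatePaths already present in the files table. */
    public static Set<String> findKnownPaths(Connection con, List<String> candidatePaths)
            throws Exception {
        Set<String> known = new HashSet<String>();
        if (candidatePaths.isEmpty()) {
            return known;
        }
        // Build "select filePath from files where filePath in (?,?,...,?)" once
        StringBuilder sql = new StringBuilder("select filePath from files where filePath in (");
        for (int i = 0; i < BATCH_SIZE; i++) {
            sql.append(i == 0 ? "?" : ",?");
        }
        sql.append(")");

        PreparedStatement ps = con.prepareStatement(sql.toString());
        try {
            for (int start = 0; start < candidatePaths.size(); start += BATCH_SIZE) {
                // Fill the parameters; the last batch is padded by repeating the final path
                for (int i = 0; i < BATCH_SIZE; i++) {
                    int idx = Math.min(start + i, candidatePaths.size() - 1);
                    ps.setString(i + 1, candidatePaths.get(idx));
                }
                ResultSet rs = ps.executeQuery();
                while (rs.next()) {
                    known.add(rs.getString(1));
                }
                rs.close();
            }
        } finally {
            ps.close();
        }
        return known;
    }
}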


[Blog] [JavaRanch FAQ] [How To Ask Questions The Smart Way] [Book Promos]
Blogging on Certs: SCEA Part 1, Part 2 & 3, Core Spring 3, OCAJP, OCPJP beta, TOGAF part 1 and part 2
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12803
Seems to me that if new files appear in the system with their last-modified dates set correctly, your first pass would be to look at the dates.
Bill
Pradeep Kadambar
Ranch Hand

Joined: Oct 18, 2004
Posts: 148
Yup, I do check the directory's modified date. If it is greater than the time of the previous check, I consider the files under it. Note that just copying or moving a file into a directory under Windows won't change the file's modification date, so I store the file path, size, and modification date in the database.

This information is compared against the list to determine new files as well as modified files.
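A minimal sketch of collecting those three values for one directory with plain java.io.File; the class name and the String[] record shape are just for illustration:

import java.io.File;
import java.util.*;

public class DirectoryScanner {
    /** Returns one {path, size, lastModified} entry per regular file in dir. */
    public static List<String[]> scan(File dir) {
        List<String[]> records = new ArrayList<String[]>();
        File[] files = dir.listFiles();
        if (files == null) {
            return records; // not a directory, or not readable
        }
        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            if (f.isFile()) {
                records.add(new String[] {
                    f.getAbsolutePath(),
                    String.valueOf(f.length()),
                    String.valueOf(f.lastModified())
                });
            }
        }
        return records;
    }
}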
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
What do you mean by "large"? Thousands of entries? Millions?

What do you already have, and how fast is it? How fast do you need it to be?


The soul is dyed the color of its thoughts. Think only on those things that are in line with your principles and can bear the light of day. The content of your character is your choice. Day by day, what you do is who you become. Your integrity is your destiny - it is the light that guides your way. - Heraclitus
Pradeep Kadambar
Ranch Hand

Joined: Oct 18, 2004
Posts: 148
The large table contains about 100,000 records. What I do now is create a temp table from the list of files. Then, to determine the ones that are new in the list, I do an inner join on the file path and then an outer join on the result.

This returns a list of entries that are only in the temp table and not in the main table. Then I insert the new documents and process them.


This process takes about 7 minutes for a list of 3,000 documents. Can you suggest a method to reduce this to a few seconds?

:roll:
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
How much RAM are you allowed to use? One fast approach might be to build a HashSet of all the records in RAM, but I'd guess that could easily consume hundreds of MB.
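A minimal sketch of that in-memory idea, using the files/filePath naming from the earlier posts; the class name and the usage lines are illustrative:

import java.sql.*;
import java.util.*;

public class KnownPathCache {
    /** Loads every known file path once so later membership checks are O(1). */
    public static Set<String> load(Connection con) throws SQLException {
        Set<String> paths = new HashSet<String>();
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("select filePath from files");
        while (rs.next()) {
            paths.add(rs.getString(1));
        }
        rs.close();
        st.close();
        return paths;
    }
}

// Usage: a candidate is new if it is not already in the set.
//   Set<String> known = KnownPathCache.load(con);
//   boolean isNew = !known.contains(candidatePath);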

Do you have any indices on the tables?

Do you know which of the sql statements are consuming how much time? (Google for p6spy for a nice free tool to find out.)
David Harkness
Ranch Hand

Joined: Aug 07, 2003
Posts: 1646
If you use a profiler to first determine which parts are taking the most time, then you'll be in a better position to optimize.

For example, assume that you are checking a lot of directories in a tree (rather than one directory holding all of the documents to be stored). For each directory, inserting the names of all files and performing a query is a very expensive operation. If any single directory has a high chance of not having any new files, then finding a quick way to remove it as a candidate for having new files would pay off.

One way to do this would be to create a hash code for the directory (combine the hash codes of all file names or hash all the file modification times or find some other quick hash calculation). To check if a directory might have new documents, calculate its hash and compare it to the one in the database. If it's the same, it's very unlikely that the directory has any new documents and can be skipped.
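A minimal sketch of that directory-hash idea; the combining formula and class name are illustrative, not the poster's code:

import java.io.File;
import java.util.Arrays;

public class DirectoryHash {
    /** Combines file names and modification times into one value to store per directory. */
    public static long hash(File dir) {
        File[] files = dir.listFiles();
        if (files == null) {
            return 0L;
        }
        Arrays.sort(files); // make the result independent of listing order
        long h = 17;
        for (int i = 0; i < files.length; i++) {
            h = 31 * h + files[i].getName().hashCode();
            h = 31 * h + files[i].lastModified();
        }
        return h;
    }
}

// If hash(dir) matches the value stored for that directory, it very likely has
// no new or changed files, and the database work for it can be skipped.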

But without profiling, you're just grasping at straws.
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

I didn't see anyone mention creating indices for the database.
That's often a plus for speed (and a minus for size).


http://home.arcor.de/hirnstrom/bewerbung
Jeanne Boyarsky
author & internet detective
Marshal

Joined: May 26, 2003
Posts: 30753

Originally posted by Jeanne Boyarsky:
First thing is to make sure you have an index on the file path column.

Stefan,
I did mention it, but something this important can't be mentioned often enough.

Pradeep,
100,000 rows isn't that many. After some tuning, you will definitely get it down to the order of seconds. Note that many databases come with their own profilers/query-explain tools.
Pradeep Kadambar
Ranch Hand

Joined: Oct 18, 2004
Posts: 148
Guys, the SQL query itself executes within a second, but the delay comes from iterating through the result set.

What I noticed is that if I use a simple natural join on the table, the query execution is slower, but iterating through the result set is faster; about 1/10th faster.

Does the result set have something to do with the nature of the query?

Jeanne Boyarsky
author & internet detective
Marshal

Joined: May 26, 2003
Posts: 30753

Originally posted by Pradeep KG:
Does the result set have something to do with the nature of the query?

It depends on the database. Some can return the first few rows right away and get the rest as you iterate through. This only works for some queries though, so I suspect that is what is happening here.
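One related JDBC knob is the fetch size, a hint to the driver about how many rows to retrieve per round trip while you iterate; whether it helps depends on the driver and the query. A small sketch, with an arbitrary value of 500:

import java.sql.*;

public class FetchSizeExample {
    public static void iterate(Connection con) throws SQLException {
        Statement st = con.createStatement();
        st.setFetchSize(500);                // hint only; drivers may ignore it
        ResultSet rs = st.executeQuery("select filePath from files");
        while (rs.next()) {
            String path = rs.getString(1);   // process each row as it arrives
        }
        rs.close();
        st.close();
    }
}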
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12803
How do new files get into these directories anyway? Do users just copy them in? Can you create a formal way to add files and use that to maintain your work list?
Bill
Pradeep Kadambar
Ranch Hand

Joined: Oct 18, 2004
Posts: 148
The directory is a repository from which the application picks up the files. Users just dump files or directories into it.

The task is to find new or modified files; these are tracked in the database.
David Harkness
Ranch Hand

Joined: Aug 07, 2003
Posts: 1646
Which suggestions have you tried? What effect did they have on the timings? Do you know which parts take the most time? What percentage of time is spent inserting the list of files for each directory compared to executing the outer join query to find the new files and iterating the results?

What do you think about the idea I proposed above about creating a hash of each directory to avoid performing all the database work for cases where a directory hasn't changed?
Pradeep Kadambar
Ranch Hand

Joined: Oct 18, 2004
Posts: 148
Hashing won't help because I have already taken care of this by storing the time of the previous scan.

On the next run I compare this with the modification time of the directories and take only the ones whose modification time is greater than the stored entry for that folder.

This reduces the candidates in the comparison.

Any optimizations on the database (SQL Server) side or the Java side would be useful.

Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Why are you storing the filenames in a database?

I suspect that having a plain text file in each directory and processing that fully in memory might actually be more efficient in this case...
Pradeep Kadambar
Ranch Hand

Joined: Oct 18, 2004
Posts: 148
I need to store the details in the database because I want to track these files by their unique ID. This will be required in further processing.

I have to ensure that the files in the folder and my table are synchronized, i.e. up to date.

So I had to use this approach. If I use a text file local to each directory, there cannot be centralized tracking of files.

David Harkness
Ranch Hand

Joined: Aug 07, 2003
Posts: 1646
You could combine the approaches. Store the file names you've processed in the database for tracking within the application. Also store a file in each directory with the names of files you've already processed (same set as in DB) and use that to determine which files to process.

Heck, you could even put the processed files in a Set (one per directory, same as above) and serialize it into a BLOB in the database. This keeps all tracking information in the database.
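A minimal sketch of that Set-in-a-BLOB idea; the directory_state table, its columns, and the class name are assumptions for illustration (and save assumes a row for the directory already exists):

import java.io.*;
import java.sql.*;
import java.util.*;

public class DirectoryState {
    /** Serializes the set of processed file names for one directory into a BLOB column. */
    public static void save(Connection con, String dirPath, Set<String> processed)
            throws SQLException, IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(new HashSet<String>(processed)); // HashSet is Serializable
        out.close();

        PreparedStatement ps = con.prepareStatement(
            "update directory_state set processed_files = ? where dir_path = ?");
        ps.setBytes(1, bytes.toByteArray());
        ps.setString(2, dirPath);
        ps.executeUpdate();
        ps.close();
    }

    /** Reads the stored set back; returns an empty set if nothing is stored yet. */
    public static Set<String> load(Connection con, String dirPath)
            throws SQLException, IOException, ClassNotFoundException {
        PreparedStatement ps = con.prepareStatement(
            "select processed_files from directory_state where dir_path = ?");
        ps.setString(1, dirPath);
        ResultSet rs = ps.executeQuery();
        Set<String> result = new HashSet<String>();
        if (rs.next()) {
            byte[] data = rs.getBytes(1);
            if (data != null) {
                ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data));
                result = (Set<String>) in.readObject();
                in.close();
            }
        }
        rs.close();
        ps.close();
        return result;
    }
}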

And again I'll recommend profiling so you know which parts take the most time. Is it inserting the file names for a directory or doing the outer join and iterating the results that takes longer?
Pradeep Kadambar
Ranch Hand

Joined: Oct 18, 2004
Posts: 148
Nice approach; this reduces my suspect-documents list, and that is an optimization.

But the bottleneck in the program has been querying the database.

What would you suggest to get the list of unique files in this scenario? What would be the best approach to get the unique entries in the temporary table?

Well, David, thanks for that approach. That definitely reduces the computation.

Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Can you show us the SQL queries you use?
Pradeep Kadambar
Ranch Hand

Joined: Oct 18, 2004
Posts: 148
I am using the following query to find the entries that are in Temp table and not in Documents table.
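A sketch of the kind of outer-join query described above, assuming tables named Temp and Documents that each have a FilePath column:

select t.FilePath
from Temp t
left outer join Documents d on d.FilePath = t.FilePath
where d.FilePath is null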



Hope this helps.

Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

Isn't that equivalent to:
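One common equivalent, assuming the same Temp and Documents tables, would be a NOT EXISTS form along these lines:

select t.FilePath
from Temp t
where not exists (select 1 from Documents d where d.FilePath = t.FilePath)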
 