What will be the most efficient method to solve this?

Himanshu Gupta
Ranch Hand

Joined: Aug 18, 2008
Posts: 598

Given two log files, each containing a billion usernames (each username appended to the log file as it arrives), how would you find the usernames that exist in both files in the most efficient manner?


My Blog SCJP 5 SCWCD 5
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11444

My first question would be: how do you define 'efficient'? Lines of code? Time from reading the problem to providing the answer? Time it takes the computer to chug out the answer, once you start your program?

Next, you'd need to know the format of the names. Are they sorted? Are there duplicates within a single file (i.e. does the name 'fred' appear twice in file 'a')? Do both files have the exact same number of names? How long are the names?

My first thought would be to build some kind of tree out of the names in one file, and then run the names from the other file against it.

Of course, if the provided names are both already sorted, it's easier:


Keep the current name from each file, advance whichever file is behind, report a match when the two names are equal, and loop until either file returns EOF.
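
A minimal sketch of that loop in Java, assuming one username per line and plain lexicographic order; the file names are placeholders:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class SortedIntersect {
        public static void main(String[] args) throws IOException {
            BufferedReader a = new BufferedReader(new FileReader("a.log"));
            BufferedReader b = new BufferedReader(new FileReader("b.log"));
            String x = a.readLine();
            String y = b.readLine();
            while (x != null && y != null) {
                int cmp = x.compareTo(y);
                if (cmp < 0) {
                    x = a.readLine();          // a is behind; advance it
                } else if (cmp > 0) {
                    y = b.readLine();          // b is behind; advance it
                } else {
                    System.out.println(x);     // common username
                    x = a.readLine();
                    y = b.readLine();
                }
            }
            a.close();
            b.close();
        }
    }

One pass over each file, so it runs in O(N) with constant memory.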


There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
Himanshu Gupta
Ranch Hand

Joined: Aug 18, 2008
Posts: 598

Thanks Fred for your reply. What I was thinking is: what if the file has duplicates... will a BST still work? I think it will not.

Even if the files do not contain any duplicates, it will take a long time to build a BST representing each file.


BST: Binary Search Tree
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3018
Himanshu Gupta wrote: Thanks Fred for your reply. What I was thinking is: what if the file has duplicates... will a BST still work? I think it will not.

Why not? If a given username is already in the tree, ignore it. The instructions don't seem to ask us to count how many times a name occurs, after all. Or, if you do need to count, modify the BST to include a count value. It will then function much like a TreeMap, where the key is a username and the value is the number of times the username has been seen.
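
As a sketch, assuming names are read one per line, the JDK's TreeMap can stand in for the hand-rolled BST-with-count (the file name is whatever log you're loading):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    public class NameCounts {
        // Build a name -> occurrence-count map from one file;
        // a duplicate just bumps the count instead of breaking the tree.
        static Map<String, Integer> countNames(String fileName) throws IOException {
            Map<String, Integer> counts = new TreeMap<String, Integer>();
            BufferedReader in = new BufferedReader(new FileReader(fileName));
            String name;
            while ((name = in.readLine()) != null) {
                Integer c = counts.get(name);
                counts.put(name, c == null ? 1 : c + 1);
            }
            in.close();
            return counts;
        }
    }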
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3018
Himanshu Gupta wrote: Even if the files do not contain any duplicates, it will take a long time to build a BST representing each file.

Well, "long time" is all relative. If we're counting developer time, I bet we'd get decent results just using whatever sorting tools a given developer is already familiar with (like maybe "cat usernamefile | sort", for the Unix-inclined), sorting both lists of names, waiting for the result, and then using the algorithm Fred outlined to find out what's common to both lists.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3018
On the other hand, if the objective is to create a batch job that will run every night on a highly-stressed machine, then developer time probably drops out of the picture, and we may need to do more work. Fred's algorithm is quite good if the contents are already sorted. But I doubt they are - the problem statement said we had a log file, and these are user names. I've never seen a log file that wasn't in chronological order, at least approximately. (In multithreaded environments, the logged order may be a little off.) So I think we should assume that sorting the usernames has not been done for us already. Generally, the best sorts you'll get are O(N*log(N)). Fred's algorithm is O(N), quicker than the sort itself (for sufficiently large values of N). So overall, any sort-dependent algorithm is O(N*log(N)).
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3018
Can we achieve O(N)? I'm pretty sure we can. Using HashSets, it would be pretty easy to add all of list 1 into a HashSet (O(1) * N = O(N)), then check each element of list 2 to see if it's in the set (again, O(1) * N = O(N)). Overall, that's O(N). The only problem is, with a billion names in each list, it may not be possible to store all the names from one list in memory. (Unless, of course, there are many duplicates. Duplicates actually make this problem easier.)

So I guess you'd need to find (or make) a file-backed HashSet/HashMap equivalent. Doesn't seem too hard - is there a term for this sort of thing? It's probably part of most decent database systems already; I just don't know what it's called there. Internally, you could use file names based on hash codes. Maybe part of the hash code goes into the file name, and part goes into an offset within a single file. That way you don't end up with too many tiny files. It all depends on how much time you want to spend on optimization to find the best path.
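
A rough sketch of that bucketing idea, with an assumed bucket count (tune it so each bucket fits in memory; a real version would also worry about OS file-handle limits) and placeholder file names. A name common to both inputs necessarily hashes into the same-numbered bucket, so each bucket pair can be intersected independently, in memory:

    import java.io.*;
    import java.util.HashSet;
    import java.util.Set;

    public class HashBucketIntersect {
        static final int BUCKETS = 1024;  // assumed; pick so each bucket fits in memory

        // Split one big file into BUCKETS smaller files, routing each name by hash code.
        static void partition(String inFile, String prefix) throws IOException {
            PrintWriter[] out = new PrintWriter[BUCKETS];
            for (int i = 0; i < BUCKETS; i++) {
                out[i] = new PrintWriter(new BufferedWriter(new FileWriter(prefix + i)));
            }
            BufferedReader in = new BufferedReader(new FileReader(inFile));
            String name;
            while ((name = in.readLine()) != null) {
                out[(name.hashCode() & 0x7fffffff) % BUCKETS].println(name);
            }
            in.close();
            for (PrintWriter w : out) {
                w.close();
            }
        }

        public static void main(String[] args) throws IOException {
            partition("a.log", "a-");   // file names are placeholders
            partition("b.log", "b-");
            for (int i = 0; i < BUCKETS; i++) {
                Set<String> seen = new HashSet<String>();
                BufferedReader a = new BufferedReader(new FileReader("a-" + i));
                for (String s; (s = a.readLine()) != null; ) {
                    seen.add(s);
                }
                a.close();
                Set<String> reported = new HashSet<String>();
                BufferedReader b = new BufferedReader(new FileReader("b-" + i));
                for (String s; (s = b.readLine()) != null; ) {
                    if (seen.contains(s) && reported.add(s)) {
                        System.out.println(s);  // common username, printed once
                    }
                }
                b.close();
            }
        }
    }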
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3018
Also, it may be worthwhile to check how quickly your favorite database can find the answer for you. It's possible you'll get the benefit of both reduced development time and good performance.
Gabriel Claramunt
Ranch Hand

Joined: May 26, 2007
Posts: 375

If you have access to unix:
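
Something along these lines (a sketch; file names are placeholders):

    sort a.log > a.sorted
    sort b.log > b.sorted
    comm -12 a.sorted b.sorted   # print only the lines the two sorted files share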


Pretty effective in terms of development time


Gabriel
Software Surgeon
 