• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Jeanne Boyarsky
  • Ron McLeod
  • Paul Clapham
  • Liutauras Vilda
Sheriffs:
  • paul wheaton
  • Rob Spoor
  • Devaka Cooray
Saloon Keepers:
  • Stephan van Hulst
  • Tim Holloway
  • Carey Brown
  • Frits Walraven
  • Tim Moores
Bartenders:
  • Mikalai Zaikin

What will be the most efficient method to solve this?

 
Ranch Hand
Posts: 598
Android Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Given two log files, each with a billion usernames (each username appended to the log file), find the usernames existing in both documents in the most efficient manner?
 
lowercase baba
Posts: 13089
67
Chrome Java Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
My first question would be 'how do you define efficient? lines of code? time from reading the problem to providing the answer? time it takes the computer to chug out the answer, once you start your program?

next, you'd need to know the format of the names. Are they sorted? Are there duplicates in a single file (i.e. does the name 'fred' appear twice in file 'a')? do both files have the exact same number of names? How long are the names?

my first thoughts would be to build some kind of tree out of the names in one file, and then run the other names against it.

Of course, if the provided names are both already sorted, it's easier:


loop until the next of either file returns EOF.
 
Himanshu Gupta
Ranch Hand
Posts: 598
Android Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Thanks Fred for your reply. What I was thinking is what if the file has duplicates... will BST work??.. I think it will not .

In case it does not contain any duplicates then it will take a long time to make BST representing each file.


BST: Binary Search Tree
 
Master Rancher
Posts: 4806
72
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Himanshu Gupta wrote:Thanks Fred for your reply. What I was thinking is what if the file has duplicates... will BST work??.. I think it will not .


Why not? If a given username is already in the tree, ignore it. The instructions don't seem to ask us to count how many times a name occurs, after all. Or if you do need to count, then modify the BST to include a count value. Thus is will function much like a TreeMap, where the key is a username, and the value is the number of times the username was already counted.
 
Mike Simmons
Master Rancher
Posts: 4806
72
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Himanshu Gupta wrote:In case it does not contain any duplicates then it will take a long time to make BST representing each file.


Well, "long time" is all relative. If we're counting developer time, I bet we'd get decent results just using whatever sorting tools a given developer is already familiar with (like maybe "cat usernamefile | sort", for the Unix-inclined), sorting both lists of names, waiting for the result, and then using the algorithm Fred outlined to find out what's common to both lists.
 
Mike Simmons
Master Rancher
Posts: 4806
72
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
On the other hand, if the objective is to create a batch job that will run every night on a highly-stressed machine, then developer time probably drops out of the picture, and we may need to do more work. Fred's algorithm is quite good if the contents are already sorted. But I doubt they are - the problem statement said we had a log file, and these are user names. I've never seen a log file that wasn't in chronological order, at least approximately. (In multithreaded environments, the logged order may be a little off.) So I think we should assume that sorting the usernames has not been done for us already. Generally, the best sorts you'll get are O(N*log(N)). Fred's algorithm is O(N), quicker than the sort itself (for sufficiently large values of N). So overall, any sort-dependent algorithm is O(N*log(N)).
 
Mike Simmons
Master Rancher
Posts: 4806
72
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Can we achieve O(N)? I'm pretty sure we can. Using HashSets, it would be pretty easy to add all of list 1 into a HashSet (O(1) * N = O(N)), then check each element of list 2 to see if it's in the set (again, O(1) * N = O(N)). Overall, it's O(N). The only problem is, if there are a billion names in each list, it may not be possible to store all the names from one list in memory. (Unless, of course, there are many duplicates. Duplicates actually make this problem easier.) So I guess you'd need to find (or make) a file-backed hashset/hashmap equivalent. Doesn't seem too hard - is there a term for this sort of thing? It's probably a part of most decent database systems already; I just don't know what it's called there. Internally, you could use filenames based on hash codes. Maybe part of the hashcode goes into the file name, and part of the hashcode goes into an offset within a single file. That way you don't have to have too many different tiny files. It all depends on how much time you want to spend on optimization, to find the best path.
 
Mike Simmons
Master Rancher
Posts: 4806
72
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Also, it may be worthwhile to check how quickly your favorite database can find the answer for you. It's possible you'll get the benefit of both reduced development time, and good performance optimization.
 
Ranch Hand
Posts: 376
Scala Monad
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
If you have access to unix:


Pretty effective in terms of development time
 
If you are using a rototiller, you are doing it wrong. Even on this tiny ad:
a bit of art, as a gift, that will fit in a stocking
https://gardener-gift.com
reply
    Bookmark Topic Watch Topic
  • New Topic