GeeCON Prague 2014*
The moose likes Performance and the fly likes Find shortest string with all the words in any order Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


JavaRanch » Java Forums » Java » Performance
Bookmark "Find shortest string with all the words in any order" Watch "Find shortest string with all the words in any order" New topic
Author

Find shortest string with all the words in any order

Red Ronin
Greenhorn

Joined: Feb 18, 2012
Posts: 1
Hi all,

this is my 1st post here, so sorry if i missed forum to post this question.

I was solving this code question (like for interview):
"Given a large document and a short pattern consisting of a few words (eg. W1 W2 W3), find the shortest string that has all the words in any order (for eg. W2 foo bar dog W1 cat W3 -- is a valid pattern)"
this is from here: http://www.glassdoor.com/Interview/Given-a-large-document-and-a-short-pattern-consisting-of-a-few-words-eg-W1-W2-W3-find-the-shortest-string-that-has-all-QTN_216408.htm

Here is my code. I tested this using Peter Norvig's big.txt (~1mil words) and it runs just under 3sec for my test case.

Is this fast enough? Other comments? Thankx in advance!


Ireneusz Kordal
Ranch Hand

Joined: Jun 21, 2008
Posts: 423
Red Ronin wrote:
Is this fast enough? Other comments? Thankx in advance!

If performance satisfies you (or your customer) ..... then yes, it is fast enough.
In other case there are some improvements possible.

First, you don't need 6 different regex patterns to check this condition,
only one is sufficient:


Follow this link to study regex patterns, it's really cool: http://www.regular-expressions.info/lookaround.html

Second, don't use regex at all,
simple 3 calls to a function String.contains() would be much more faster:
http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#contains%28java.lang.CharSequence%29


Third, the given problem is: "find the shortest string".
In your case, a "string" is equivalent to one line from the file.
So, if you have found a line that matches the condition, and the length of this line is X,
then you need to check only lines that have length < X.
For each new line check if it size is < X, if no - then just skip this line and take the next one.
If you find a new line shorter than X, that meet the condition too, then set X = length on this new line.
In this way you avoid searching substrings within the string probably for 50% or even more lines.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7892
    
  21

Ireneusz Kordal wrote:Third, the given problem is: "find the shortest string".
In your case, a "string" is equivalent to one line from the file.
So, if you have found a line that matches the condition, and the length of this line is X,
then you need to check only lines that have length < X.

Hmmm. Not so sure about that (although I totally agree that contains() is better than a regex).

If you used indexOf() and stored the results, you could then check the span between the first and last occurrence to work out your "shortest string", maybe something like:but the wording of the problem is very open to interpretation.

Winston

BTW: The above could be optimized along the lines that Ireneusz suggested to eliminate the third search if the first two produce a span that is greater than the one you already have; but personally I suspect it'll make the code a lot messier for not a lot of gain.


Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here
Victor M. Pereira
Ranch Hand

Joined: Mar 02, 2012
Posts: 50
What if the shortest string is between lines? What if the shortest string is a sub string in a line? ex: a w1 w2 w1 w3 x

I believe the shortest Time possible is O(n) for this problem, since divide and conquer would give O(n log n).

Your algorithm would be the almost correct one since checking for 6 patterns is still consider constant order, you would only have to change your line reader and evaluate more like the pseudo code from Ireneusz Kordal.

Performance means the fastest and right one.


To be completely honest this is a limited example of maximum or minimum sub sequence where S is only of w1, w2, w3 and a that represents any other word.


regards,
Victor M. Pereira
 
wood burning stoves
 
subject: Find shortest string with all the words in any order