This week's book giveaway is in the OO, Patterns, UML and Refactoring forum. We're giving away four copies of Refactoring for Software Design Smells: Managing Technical Debt and have Girish Suryanarayana, Ganesh Samarthyam & Tushar Sharma on-line! See this thread for details.

I want to generate random strings, of random length (up to some arbitrary maximum), where the probability of any possible string is the same.

Sounds simple, huh? Well, maybe it is, but I can't think of an easy way.

My current algorithm is: -

1. Randomly choose the length, N, between zero and the maximum, M.

2. Initialise storage for N characters.

3. Randomly choose each of the N characters, from the C possible characters.

This does *not* meet the requirements, because a particular short string is much more likely to come up than a long string.

I can think of a most-bogus solution, along the lines of Bogosort. That is: compute all the possible strings, then choose one at random. However, the memory and computational costs of this are ridiculous.

Any ideas?

Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.

Sounds like you can't exceed the number of possible strings of the minimum length. So if we only use lower case alpha characters and the minimum length is 1 we have to choose from a set of 26 strings per possible length. If your upper length is 2 we have 42 strings to choose from.

get the number n of possible strings at minimum length seed the RNG using n - must be repeatable to maintain the odds, no? pick a random string index in range of 1, n pick a random length generate n strings return the nth string

Does that makes sense?

A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi

Peter Chase
Ranch Hand

Joined: Oct 30, 2001
Posts: 1970

posted

0

Thanks for your interest.

Originally posted by Stan James: Does that makes sense?

Well, not if I understand correctly what you're saying.

The number of possible strings of length L, where each character is one of C characters is C^L. That's C to the power of L, not C times L as you seem to be saying. So, even with only 26 characters, the number of possible strings rapidly becomes comically huge, even with relatively modest values of C and L. For instance, 26^6 is 308915776; that's a lot of strings.

In my actual application (this is not just a puzzle for fun), C is 94 and Lmax is 300!

If I have misunderstood what you're saying, then perhaps you could post some more-detailed pseudo-code (or real code) to illustrate your solution. [ January 23, 2006: Message edited by: Peter Chase ]

You must choose length N with probability C^N * (C-1) / (C^(M+1) - 1).

Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791

posted

0

My C*(LMax-LMin+1) (26*2) was to allow 26 1-character strings, and 26 2-character strings. Seeding the RNG for repeatability and starting from the first random every time ensures that we will create only 26 unique 2-character strings, not C^L random possibilities. That keeps the probability of getting a particular string a consistent 1/42 whether length is 1 or 2. Did I correctly read that as the goal?

LMax goes out of the equation because LMin is the constraint, but for your numbers I'd have to be prepared to get 94^LMin random numbers which would probably run too long at 100% CPU.

Keep in mind: Music major here. This was fun but probably just passed my math skills.

If you would like a truely random string... I would suggest that you ask someone like to to spell a number of words and what is produced can guaranteed to be nothing but random letters of an arbituary length.

I can think of a most-bogus solution, along the lines of Bogosort. That is: compute all the possible strings, then choose one at random. However, the memory and computational costs of this are ridiculous.

Any ideas?

Actually, you don't have to compute all the possible strings. You can choose a random number and compute the string based on that number.

For example, if you had to generate a random lower case string of max length 2, then there are 702 possible strings. Strings 0-25 would be single character string a,b,c,d,....z. String 26 - 701 would be 2 character strings aa,ab,ac,ad....zz. So, you select a random number from 0-701, and use number to figure out the string.

For example, if your random number generator gave you 523, you follow these steps a) 523 mod 26 = 3 b) character 3 is d, so, next character is d, random string = d c) 523/26 = 20 d) 20 mod 26 = 20 e) 20 character is t, so next character is t, random string = td f) 20/26 = 0, so stop processing

If you want random strings of max length 3, you use similar logic except that you generate a random number between 0-18271, because the number of combinations are 18272. You don't have to limit yourself to 26 characters. You can use any number of characters. You will have to compute the number of combinations based on the number of characters

you're gonna have a problem here... a 20 character string is going to require a random number between 0 and 2,432,902,008,176,640,000. since the OP said "arbitrary length", i would assume 20, 50 or even 100 it not out of the question.

or, wait... i think i did 20!, i should have done 26^20, which gives us

19,928,148,895,209,409,152,340,197,376 [ January 25, 2006: Message edited by: fred rosenberger ]

There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors

Jayesh Lalwani
Ranch Hand

Joined: Nov 05, 2004
Posts: 502

posted

0

Originally posted by fred rosenberger: you're gonna have a problem here... a 20 character string is going to require a random number between 0 and 2,432,902,008,176,640,000. since the OP said "arbitrary length", i would assume 20, 50 or even 100 it not out of the question.

or, wait... i think i did 20!, i should have done 26^20, which gives us

19,928,148,895,209,409,152,340,197,376

[ January 25, 2006: Message edited by: fred rosenberger ]

Use multiple random numbers then. Alternatively, generate two-10 character segments then. Or four-5 character segments. Introduce the possibility of segments of zero length.

Fred, I agree with you at a theoritical level. At some point, a randomly generated dictionary would be more cost-effective than a true random string generator.

Peter Chase
Ranch Hand

Joined: Oct 30, 2001
Posts: 1970

posted

0

Originally posted by Paul Clapham: You must choose length N with probability C^N * (C-1) / (C^(M+1) - 1).

For the record, this one sounds the most feasible so far. I'm not sure that I quite see what N, M and C are, but guess I could work it out.

George Harris
Ranch Hand

Joined: May 05, 2003
Posts: 84

posted

0

How about if you were to loop though max size C. You then get a random number (0 to number of possible characters) where 0 would mean you skip that position, and then give the result. For example in for a max string length of 4...

2 . 16 . 1 . 0 would give the string 'bpa' 5 . 12 . 8 . 24 would give 'elhx'

It would also require that if the string had an embedded 0,

2 . 0 . 16 . 1 the string would be rejected and a restart was required.

2 . 0 . 0 . 0 would be valid however. [ January 25, 2006: Message edited by: George Harris ]

Peter Chase
Ranch Hand

Joined: Oct 30, 2001
Posts: 1970

posted

0

George H: I think you've got a reasonable solution there. Takes a bit of thinking through, but I think it works. Not sure whether I prefer this solution, or the probability distribution one given earlier. Yours is more ingenious, certainly.

Originally posted by Peter Chase: For the record, this one sounds the most feasible so far. I'm not sure that I quite see what N, M and C are, but guess I could work it out.

They are what you said they were in your original question.

However if C is 94, then the probability of choosing a string of anything less than the maximum length M (or Lmax as you called it later) is almost exactly 1/94 (regardless of what M is) and the probability of choosing a string of the maximum length is 93/94. So you might simplify the problem to only choose strings of length M -- a major simplification.

I have not done any math, but you mayhave a problem with George's solution...

if i undertand it correctly, he's saying whatever your max length is, generate a random string the full length. then, scan it to see if there is an imbedded 0.

for large max lengths, there are going to be a LOT of strings you'll have to reject, i think. it may take a while to find a valid one.

i'm not SURE of this, it's just my intuition (whatever that's worth...).

Jayesh Lalwani
Ranch Hand

Joined: Nov 05, 2004
Posts: 502

posted

0

Originally posted by fred rosenberger: I have not done any math, but you mayhave a problem with George's solution...

if i undertand it correctly, he's saying whatever your max length is, generate a random string the full length. then, scan it to see if there is an imbedded 0.

for large max lengths, there are going to be a LOT of strings you'll have to reject, i think. it may take a while to find a valid one.

i'm not SURE of this, it's just my intuition (whatever that's worth...).

I beleive you are right

For a string of length 3 Total number of combinations = 27^3 = 19683 Total number of valid combintaions = 18278 Total number of invalid combos = 1405 = 7%

For string of length 5 Total = 14348907 Valid = 12356630 Invalid = 1992277 = 13%

It keeps on growing. For string of length 20, 51.11% of the combos will be invalid. For string of length 63, 90.35% of the combos will be invalid.

The time required for George's solution does indeed increase dramatically with string length. However even at 100 chars, I can generate 10000 of them in about 5 seconds on my ancient 500 MHz Pentium III machine. A length of 150 takes 30 sec, and length 200 takes 3 minutes. Obviously it's going to get worse and worse, but remember, those numbers are all for 10000 of the things. And how long do the strings really need to be, anyway? This is using uppercase chars only. Time goes down considerably if we include uppercase as well, since that decreases the chance of an invalid string at each character position. 10000 200-char strings take under 10 seconds. The number of strings shorter than max length stays around 1/27 for uppercase-only, and 1/53 for mixed case. As expected.

"I'm not back." - Bill Harding, Twister

Peter Chase
Ranch Hand

Joined: Oct 30, 2001
Posts: 1970

posted

0

In the real application, there are 94 characters to choose from and a maximum length of 300. Having 94, rather than 26, characters, makes things a lot better.

I reckon that George's method could be used for the application, but would eat up a lot more CPU than the probability distribution method, which just adds a little bit of maths to my original algorithm.

Originally posted by Paul Clapham: They are what you said they were in your original question.

However if C is 94, then the probability of choosing a string of anything less than the maximum length M (or Lmax as you called it later) is almost exactly 1/94 (regardless of what M is) and the probability of choosing a string of the maximum length is 93/94. So you might simplify the problem to only choose strings of length M -- a major simplification.

I reckon I can do better, with the same basic algorithm: -

I'm finessing away the test for len becoming less than one, because in practice it will never happen. The probability (1 in 94^300) is so insanely minute that one can say that with total confidence.

If there were fewer characters to choose from or a smaller maximum length, then such a test would be necessary. [ January 28, 2006: Message edited by: Peter Chase ]

... at least until random.nextInt (94) isn't broken

[ January 28, 2006: Message edited by: Stefan Wagner ] [ January 28, 2006: Message edited by: Stefan Wagner ]

Peter Chase
Ranch Hand

Joined: Oct 30, 2001
Posts: 1970

posted

0

Originally posted by Stefan Wagner: ... at least until random.nextInt (94) isn't broken

Did I make a mistake? I admit I didn't compile or run my code. I am assuming that "random" is an instance of java.util.Random.

George Harris
Ranch Hand

Joined: May 05, 2003
Posts: 84

posted

0

Ok, a slight modification to my previous algorithm. Instead of rejecting embedded zeros, I would suggest shifting the new non-zero to the left and continuing from there. For example for a string length 10 we would start and get the following

23 . 16 . 0 . 0 . 0 . 19 ...

Instead of rejecting this string, we would move the 19 to the left and continue.

23 . 16 . 19 . 7 . etc.

I believe this will still result in the correct distribution of string lengths, with a significant improvement in processing.

Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671

posted

0

George, I don't think that works. Consider an extreme case: we want to generate strings of max length 2, using an alphabet with only two characters, A, and B. The possible strings (excluding zero-length) are:

Length 1: A, B Length 2: AA, AB, BA, BB

So the chance of generating a 1-char string should be 2/6 = 1/3.

Now try your algorithm (assuming I've understood it correctly):

0.0 : [reject] 0.1 : A 0.2 : B 1.0 : A 1.1 : AA 1.2 : AB 2.0 : B 2.1 : BA 2.2 : BB

Looks to me like there's a 4/8 = 1/2 chance of getting length 1, which is not correct. On the other hand if we reject 1.0 and 2.0, the remaining options given the required 2/6 = 1/3 chance for length 1. [ February 01, 2006: Message edited by: Jim Yingst ]

George Harris
Ranch Hand

Joined: May 05, 2003
Posts: 84

posted

0

Jim, I think you missed a step. In your example

0.1 : A

this would not necessarly be A, the first char would be A, but the second character would still have to be processed by a random number generator so,

0.1 could end up 1.0 - A 1.1 - A.A 1.2 - A.B

All that I was suggesting is that we don't have to start from scratch, only that we keep the 'good' number and still get the rest.

Unless I'm missing something here, we should be able to generate each random string in time O(n) where n is the length of the string, and stil have a uniform probability distribution across all string lengths and values, yes? Don't see doing it in less that O(n), but no reason for it to be more than O(n). For multiple strings k that suggests O(kn), although you might be able to play mod games to keep it at O(n). [ February 01, 2006: Message edited by: Reid M. Pinchback ]

Reid - SCJP2 (April 2002)

Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671

posted

0

Reid - yes, it's definitely possible to do this in O(n). I'd probably prefer to choose the string length first (probability for each length given in Paul Clapham's formula) and then randomly generate a string of that spsecific length. There are various ways to implement this, depending on how precise we want to be with the probabilities - but O(n) is definitely possible.

George's original algorithm is not O(n) certainly - I'm not sure what it is offhand, but it's notably worse than O(n). However I was intrigued because it is pretty simple to understand, and I was wondering just how fast it actually was. So I tested it a bit to see, and found it wasn't nearly as slow as I might have guessed. Given that it was pretty easy to understand, I wouldn't completely rule it out of consideration. However for a high-performance solution, I'd definitely go with something else.

As for George's revised algorithm - um, maybe. I'm still not sure I understand just how it works. The "and continuing from there" when he first described it was a bit too vague for me. I made a guess what it might mean, which was apparently wrong. However I think the whole thing is still a bit fuzzy. Which of the following contain "embedded zeros" which justify shifting and rerolling, and which are non-embedded zeros that actually reduce the length of the string?

0.1.1 1.0.1 1.1.0 0.0.0

Or more to the point, can you show code which implements the algorithm you're thinking of? There are several more choices to make in various special cases, and so far I can't see a version which will really give equal probabilities of all strings, but I could well be missing something.

George Harris
Ranch Hand

Joined: May 05, 2003
Posts: 84

posted

0

A quick version of the code that I was discussing in the 'shift' approach:

maxLength is the max string size and numChars is the number of available characters.

[ February 02, 2006: Message edited by: George Harris ] [ February 03, 2006: Message edited by: George Harris ]

Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791

posted

0

Step way back and help me out. (Music major, remember?)

First, the original goal of making all strings equally likely. If I have a length range of 1 or 2 and 2 characters I can get

A B AA AB BA BB

If I choose length at random I have an even chance of length=1->A|B or length=2->AA|AB|BA|BB. When I make 1000 strings I'll get A 25% of the time and AA 12.5% of the time. Fails the requirement, right?

I tried to solve this by saying there are only 2 strings possible at length=1, so I made only 2 strings possible at length=2. Now I have an even chance of A|B or AA|AB. AB and BB are right out, but who cares.

Was that even close? How does the latest algorithm do the trick?

George Harris
Ranch Hand

Joined: May 05, 2003
Posts: 84

posted

0

If you run the code that I provided, 1,000,000 runs, with a two character string, you get the following results:

time = 2471 ms

a = 16% b = 16% aa = 16% ab = 16% ba = 16% bb = 16%

I also tried to get 1,000,000 strings of maxLength 200 from and using any character from " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ;:'[]{},<.>/?-_=+`~1234567890!@#$%^&*()\\\"|";

It took:

time = 29627 ms

using only lowercase characters it took:

time = 30198 ms [ February 02, 2006: Message edited by: George Harris ]

Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791

posted

0

Cool. I had to modify Jim Y's example from way up the thread to show the shifted possibilities:

Now there are two ways to get each possible value. I woonta believed it. [ February 02, 2006: Message edited by: Stan James ]

Reid M. Pinchback
Ranch Hand

Joined: Jan 25, 2002
Posts: 775

posted

0

Ok, I've got an O(n) solution. The basic idea is that you can set up a recurrence relationship to solve the problem. To set up a bit of notation:

R(i) is a random string of length at most 'i'. M(i) is a random string of exactly length 'i' Pm(i) is the size of the population of M Pr(i) is the size of the population of R

Then R(i+i) will be randomly chosen between R(i) and M(i+1), and the relative probabilities of the choice are determined by Pm(i+1) and Pr(i).

The code is below. I used longs for the population sizes; you could probably muck around with using doubles instead for the probabilities. In fact you'd have to do that if the max string length gets too long because the integer-valued population coefficients would get too large to represent as longs.

Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671

posted

0

Hmmm... Reid, I tried adjusting the inputs a bit to get a smaller range of strings to look at:

This gives me output:

That doesn't look like a very even distribution to me. Am I misunderstanding how you intend this to be used? [ February 02, 2006: Message edited by: Jim Yingst ]

Reid M. Pinchback
Ranch Hand

Joined: Jan 25, 2002
Posts: 775

posted

0

That's bizarre. I've used smaller and larger alphabets, shorter and longer strings, fewer or greater number of iterations, and except for the bit of random noise you'd expect I get a uniform distribution. Definitely for a million iterations it should be pretty flat. I'll sleep on it, see if any light bulbs turn on.

Oh, and I thought of a different problem - it isn't O(n) because of the cost of generating each new string in an iteration - those are proportional in cost to the length. Easy solution though; you just use the recurrence for the probabilities, generate one max-length string, and use one random number to decide on the length (weighting the lengths by the probabilities).

Reid M. Pinchback
Ranch Hand

Joined: Jan 25, 2002
Posts: 775

posted

0

Well, I found what was causing the wierdness in the distribution, but I can't see *why* it is causing the wierdness. The last change I made was to shift to using long instead of int for the coefficients to allow for bigger strings and alphabets, which meant I needed to generate a random long instead of a random int. Unfortunately there isn't a Random.nextLong(range) method, so I just modded by the range - and that seems to be torqueing the distribution. If I shove the int version back in:

Then it works perfectly, although now the maximum string length you can handle will be around 16 or 17 characters. Modding long version by the range is twisting the distribution. Odd; for small ranges I'd have expected the residues to still be pretty uniformly distributed, but clearly that isn't the case.

Originally posted by Reid M. Pinchback: Unfortunately there isn't a Random.nextLong(range) method, so I just modded by the range - and that seems to be torqueing the distribution.

I think the API documentation for Random's nextInt() method sheds some light on why that is, and possibly suggests how you could write your own nextLong() method.

choose a random number between 0 and N (Math.random()*N)

also lets assume that character set frm which characters will be chosen is frm ascii values 0 to 199

let k be the no choosen

then choose 3k random nos (between 0 and 9)(Math.random()*10) and then make pairs of 3 nos each and select character corresponding to that ascii character

if no chosen is more than 199 then use the character with ascii value of no%200