• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Jeanne Boyarsky
  • Ron McLeod
  • Paul Clapham
  • Liutauras Vilda
Sheriffs:
  • paul wheaton
  • Rob Spoor
  • Devaka Cooray
Saloon Keepers:
  • Stephan van Hulst
  • Tim Holloway
  • Carey Brown
  • Frits Walraven
  • Tim Moores
Bartenders:
  • Mikalai Zaikin

random shuffle of weighted list

 
Saloon Keeper
Posts: 10705
86
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows ChatGPT
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I'm having trouble coming up with a design for this and I've been pondering this off and on for several months.

Generically the problem is that I have a list of entries and each entry has a numeric weight assigned to it. I want to shuffle the list to a mostly random order but where entries of a greater weight have a higher probability of appearing earlier on in the shuffled list.

An example would be a playlist that is a list of songs where the weight is the popularity in the range of 1-10.

As  a possible mental picture of how this might work is to take the original input list and build a new list from it where each entry is repeated by the weight number, so songs with a weight of '3' would each appear in the new list '3' times, etc.. Then shuffle the new list. Then the tricky part, making a final list with distinct songs in the same order as they first appeared in the "new  list".

An approach like this might work but I was hoping for something more mathematically based so that the weights could have very large ranges or be floating point values.

I'm open to suggestions. (Should I have posted this in the Beginner forum?)
 
Master Rancher
Posts: 4806
72
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

I dislike the repeated retries inside the shuffle() method - as we get close to the end, set(add) value usually has no effect since the value is already in there.  But it works, and is reasonably simple...
 
Carey Brown
Saloon Keeper
Posts: 10705
86
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows ChatGPT
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Mike, I'm impressed. It's taken me a while to wrap my head around it but I think I've got it. I was not familiar with NavigableMap before but I've used TreeMap. This is a slick use of TreeMap that I was unaware of.

A few things I noted:

It seems that your cumulative weights, to work properly, would need the initial Map of weights in increasing weight order. Correct? Otherwise it would seem like your "cumulativeWeight" would have large jumps and gaps.

I see "cumulativeWeight" as "a" way of distributing keys on a floating point scale. I could see using some function f(w) that might compute a weight key based on some formula, e.g. squared(w).

The scenarios I've imagined, e.g. playlist, should have unique keys but the value, the weight, is probably not unique for a number of different scenarios. In which case it seems like you wouldn't want to accumulate duplicate weights and for the current cumulativeWeight key the value would need to be a List<T>.

Thanks for your effort it gave me a new way to view the problem.
 
Mike Simmons
Master Rancher
Posts: 4806
72
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Carey Brown wrote:It seems that your cumulative weights, to work properly, would need the initial Map of weights in increasing weight order. Correct? Otherwise it would seem like your "cumulativeWeight" would have large jumps and gaps.



No, while the individual weights are not assumed to be in any order, the cumulative weight is always increasing.  Assuming all weights are positive, at least.  Each time we put a new cumulativeWeight into the map, we've added another weight to it, so it keeps going up.

Imagine the following map of weights:

A -> 6
B -> 3
C -> 1

In this order, when we make the inverse cumulative distribution, we get:

0 -> A
6 -> B
9 -> C

And total weight is 10.

Which means that when we generate a random number in the range 0 <= x < 10 it will be interpreted as:


0 <= x < 6  --> A (60% chance)
6 <= x < 9  --> B (30% chance)
9 <= x < 10 --> C (10% chance)

So the probabilities are as we require.



What if the initial weight map had been in the reverse order?

C -> 1
B -> 3
A -> 6

Then our inverse cumulative distribution would be:

0 -> C
1 -> B
4 -> A

And the total weight is again 10.

Which means that when we generate a random number in the range 0 <= x < 10 it will be interpreted as:

0 <= x <  1 --> C (10% chance)
1 <= x <  4 --> B (30% chance)
4 <= x < 10 --> A (60% chance)

With the result that the probabilities of each outcome are the same as with the first sort order.  The cumulative distribution is different, but the overall probabilities for each outcome would be the same.

The same applies to any other sort order you can imagine for the original map.

Now I think there is some benefit to sorting the map in increasing order - the calculation of total weight will be more accurate, because with floating point it's better to add small numbers before large numbers.  But that should be a minor effect; I'm not sure it's worth the trouble to do the sort.

Carey Brown wrote:I see "cumulativeWeight" as "a" way of distributing keys on a floating point scale. I could see using some function f(w) that might compute a weight key based on some formula, e.g. squared(w).



Yep!  Technically the cumulative function is the integral of the weight function.  Lots of fun calculus problems behind this, if you're interested.

Carey Brown wrote:The scenarios I've imagined, e.g. playlist, should have unique keys but the value, the weight, is probably not unique for a number of different scenarios. In which case it seems like you wouldn't want to accumulate duplicate weights and for the current cumulativeWeight key the value would need to be a List<T>.



No, the cumulative weight, always increasing, will be unique as long as the individual weights are positive.

Carey Brown wrote:Thanks for your effort it gave me a new way to view the problem.



You're welcome!

I still want to improve it to minimize the number of retries it has completing the sort...
 
Carey Brown
Saloon Keeper
Posts: 10705
86
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows ChatGPT
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Mike Simmons wrote:

Carey Brown wrote:The scenarios I've imagined, e.g. playlist, should have unique keys but the value, the weight, is probably not unique for a number of different scenarios. In which case it seems like you wouldn't want to accumulate duplicate weights and for the current cumulativeWeight key the value would need to be a List<T>.


No, the cumulative weight, always increasing, will be unique as long as the individual weights are positive.


I  don't think I explained myself well enough. You have a key "A", in my world I have a whole list of keys with the same weight. So "A" could be implemented as a List.

My inept description was a way to keep individual key/weights in the Map in weight order but as they are processed keep all the keys of the same weight together as a List.
 
Mike Simmons
Master Rancher
Posts: 4806
72
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Ah, I see.  Your explanation was probably fine; I just read too quickly.  Yes, if there are many duplicate weights, that might be a more efficient way to do it.  Though I don't think it's much of a problem to have duplicates, really, unless we're running up against memory requirements.
 
Carey Brown
Saloon Keeper
Posts: 10705
86
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows ChatGPT
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I re-read your explanation on cumulative vs probability and now I see why duplicates are a non-issue. Thanks for your patience.
 
Carey Brown
Saloon Keeper
Posts: 10705
86
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows ChatGPT
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Here's a possible reason for keeping a List<T> for each unique weight (assuming duplicate weights would appear often otherwise).
Here the while loop will repeat until the set has been completely filled. For a very large amount of data waiting for random numbers to hit all the slots may take a while. Whereas with fewer slots and each slot manages a List<T> it's not a big deal.
 
Mike Simmons
Master Rancher
Posts: 4806
72
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
New and improved (I think):

"quantileMap" is the renamed inverseCumulativeDistribution.  The Quantile function is another term for inverse cumulative distribution function.  

With a RECALCULATION_RATIO of 2, this will recalculate the quantileMap and related fields whenever more than half (by weight) of the map is also in the usedSet. Each time we recalculate, we discard all entries marked "used". That way we do more work along the way, but don't have a bunch of retries at the end.  I haven't measured to see how it compares to the original version; I'm just going by gut instinct here. It's entirely possible the first version performs better overall.
 
Bartender
Posts: 5465
212
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Also fun is to have a random Comparator, one that breaks all the Comparator rules.

For instance:

(in Record)

and then:

I haven't done any statistic controls yet to see if this meets the requirements. Like if we have "a" 6 AND "b" 4, whether in 60 % of the cases a comes before b.
 
Mike Simmons
Master Rancher
Posts: 4806
72
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Carey Brown wrote:Here's a possible reason for keeping a List<T> for each unique weight (assuming duplicate weights would appear often otherwise).
Here the while loop will repeat until the set has been completely filled. For a very large amount of data waiting for random numbers to hit all the slots may take a while. Whereas with fewer slots and each slot manages a List<T> it's not a big deal.


So, on the one hand, this is the thing that was bothering me about the first version of the code.  The new version periodically recalculates the map in order to get rid of all the entries for elements that have already been chosen.  It's a tradeoff, but I think it's worth it.

On the other hand... this comment makes me wonder if I've misunderstood your requirements.  If you have a list of elements with the same priority, once you chose that list, do you need to keep everything in the list together in the shuffle?  So if you have

A -> 10
B -> 1
C -> 1

Are you saying that you can have results like

ABC
ACB
BCA
CBA

But not

BAC
CAB

?

That would change the results substantially, I think.  But if that's not what you mean, I'm not sure how it would work, putting all elements with the same weight in a list.
 
Carey Brown
Saloon Keeper
Posts: 10705
86
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows ChatGPT
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Mike Simmons wrote:The new version periodically recalculates the map in order to get rid of all the entries for elements that have already been chosen.  It's a tradeoff, but I think it's worth it.


I have one list in mind that potentially has tens of thousands of entries so I think the trade off is well worth it.
 
Carey Brown
Saloon Keeper
Posts: 10705
86
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows ChatGPT
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Mike Simmons wrote:On the other hand... this comment makes me wonder if I've misunderstood your requirements.  If you have a list of elements with the same priority, once you chose that list, do you need to keep everything in the list together in the shuffle?  So if you have
A -> 10
B -> 1
C -> 1

In this example B & C would be in a list under weight '1'. So retrieving the next element involves two random numbers, the first selects a random weight from the TreeMap as you already have laid out. This gets you to a list of one or more elements all of the same weight where you use a second random number to get to a specific element in the list while also removing it from the list.

So if you show the key of weight first it would look more like:
10 : [ A ]
1 : [ B, C ]

I can see the implementation clearly enough but I'm less clear on whether this approach would change the probabilities.
 
Carey Brown
Saloon Keeper
Posts: 10705
86
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows ChatGPT
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Had to think about it for a bit but I don't think the list will work, at least, not in the way I envisioned them.
If you have
A -> 10
B -> 1
C -> 1
As lists it would become
10 : [ A ]
2 : [ B, C ]
And if '2' came up  randomly and 'B' was selected then it would all need to be recalculated as
10 : [ A ]
1 : [ C ]
Best to stick with Mike's last incarnation.
 
reply
    Bookmark Topic Watch Topic
  • New Topic