• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Liutauras Vilda
  • Bear Bibeault
  • Junilu Lacar
  • Martin Vashko
Sheriffs:
  • Jeanne Boyarsky
  • Tim Cooke
  • Knute Snortum
Saloon Keepers:
  • Ron McLeod
  • Tim Moores
  • Stephan van Hulst
  • Tim Holloway
  • Carey Brown
Bartenders:
  • Scott Selikoff
  • salvin francis
  • Piet Souris

Cartesian product of given vectors | Ways to improve performance

 
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I was brushing up some fundamentals and was working on finding all the permutation and combination of given vector of vectors. Following is my code:



The code works fine with small input, but as soon as I increase the input data, I get out of memory error. Is there a way to improve this code? Or make it more performant?
 
Saloon Keeper
Posts: 10853
234
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Well, the example input you gave will just result in a ridiculously large product. Your memory complexity is in Ө(m^n), where m is the number of elements in the largest set, and n is the number of sets.

Even if you used an optimized algorithm that could construct the product in a reasonable amount of time, you still need to keep 34,828,517,376 vectors in memory. If we assume that a reference is 4 bytes large, you need around 130 GB to store just the references to the vectors, and then we haven't even considered the contents of the vectors.

What you want is simply not feasible.
 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Stephan van Hulst wrote:Well, the example input you gave will just result in a ridiculously large product. Your memory complexity is in Ө(m^n), where m is the number of elements in the largest set, and n is the number of sets.


Understood.

Stephan van Hulst wrote:Even if you used an optimized algorithm that could construct the product in a reasonable amount of time, you still need to keep 34,828,517,376 vectors in memory. If we assume that a reference is 4 bytes large, you need around 130 GB to store just the references to the vectors, and then we haven't even considered the contents of the vectors.



My apologies for diverting this thread to other direction, but can you please tell me how did you calculate this?

In my dataset: m=9, and n=12. Hence m^n would be: 2,82,429,536,481.

How did you calculate 130 GB?


Stephan van Hulst wrote:What you want is simply not feasible.


Okay. If I reduce the input size from 12 sets to 4 sets. How would I optimize my cartesian method?
 
Bartender
Posts: 3666
151
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I don't know your requirements, but must you store the full Cartesian Product? For any quadruple (x, y, z, w) it is easy to calculate the corresponding cp.
 
Marshal
Posts: 66507
251
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Punit Jain wrote:. . . In my dataset: m=9, and n=12. Hence m^n would be: 2,82,429,536,481.

How did you calculate 130 GB?

. . . Stephan said assuming 4 bytes per reference, which makes 1,129,718,145,924, which is approx. 1130GB, even more unmanageable than he thought initially. Of course, in order to store than many references, you would need a 64‑bit file system, and that will double the size of the collection. Since that amount of storage is only feasible in a linked list or deeply nested arrays, that will increase the storage requirement yet again. I shall let you work out how long it would take to iterate a 280GB linked list; have you got a few days to spare?

By the way: instead of using ^, look at this old thread and find out how you can write 9¹² or mⁿ.
 
Stephan van Hulst
Saloon Keeper
Posts: 10853
234
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
The Big Theta expression I gave expresses an order of growth, it's not a function you can use to calculate the actual memory requirements.

The input you used has a set of 9 elements, a set of 6 elements a set of 8 elements, and repeats this 4 times.

The number of times that the inner loop is executed is (9*8*6)^4. That's also the number of vectors stored. Multiply by 4 to get the number of bytes, and divide by 1024^3 to get the number of gigabytes.
 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Piet Souris wrote:I don't know your requirements, but must you store the full Cartesian Product? For any quadruple (x, y, z, w) it is easy to calculate the corresponding cp.



But eventually, I will need the full cartesian product. In order to have all the combinations, I will have to have all the sets. Can you please elaborate?
 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Campbell Ritchie wrote:. . . Stephan said assuming 4 bytes per reference, which makes 1,129,718,145,924, which is approx.



My bad. Got it now.

Campbell Ritchie wrote:Since that amount of storage is only feasible in a linked list or deeply nested arrays, that will increase the storage requirement yet again. I shall let you work out how long it would take to iterate a 280GB linked list; have you got a few days to spare?



Since my system RAM is just 6 GB, it will never run on my system.  
 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Stephan van Hulst wrote:The Big Theta expression I gave expresses an order of growth, it's not a function you can use to calculate the actual memory requirements.

The input you used has a set of 9 elements, a set of 6 elements a set of 8 elements, and repeats this 4 times.

The number of times that the inner loop is executed is (9*8*6)^4. That's also the number of vectors stored. Multiply by 4 to get the number of bytes, and divide by 1024^3 to get the number of gigabytes.



Cool. Thanks so much.
 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
If I reduce the number of sets to 3-4 in my input data. Can this code be optimized?
 
Sheriff
Posts: 24729
59
Eclipse IDE Firefox Browser MySQL Database
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Punit Jain wrote:But eventually, I will need the full cartesian product. In order to have all the combinations, I will have to have all the sets. Can you please elaborate?



Do you need the full cartesian product, in memory, all at the same time?

Remember that there are about 35,000,000,000 vectors in that full cartesian product. We've already established that they don't all fit into memory. So perhaps you don't need them all at the same time. Maybe you don't actually expect to use all 35 billion of them, so you could just calculate the ones you actually use, at the time you use them.

 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Paul Clapham wrote:

Punit Jain wrote:But eventually, I will need the full cartesian product. In order to have all the combinations, I will have to have all the sets. Can you please elaborate?



Do you need the full cartesian product, in memory, all at the same time?

Remember that there are about 35,000,000,000 vectors in that full cartesian product. We've already established that they don't all fit into memory. So perhaps you don't need them all at the same time. Maybe you don't actually expect to use all 35 billion of them, so you could just calculate the ones you actually use, at the time you use them.



Let say I have 12 sets. Out of those 12, I want the cartesian product of only 2 sets, so I will compute only for those 2. Now tomorrow I want for another 2 then compute for another 2. Basically filtering my data based on my need. Is that what you mean? Correct me if I am wrong,
 
Paul Clapham
Sheriff
Posts: 24729
59
Eclipse IDE Firefox Browser MySQL Database
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Piet Souris wrote:I don't know your requirements...



I don't know your requirements either. I only know that the fact is that you can't store all of the data in memory, and you don't have enough time to store it all somewhere else. So computing relevant subsets as they are needed might be a good strategy. Which subsets are the relevant ones, I can't tell.
 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Paul Clapham wrote:

Piet Souris wrote:I don't know your requirements...



I don't know your requirements either. I only know that the fact is that you can't store all of the data in memory, and you don't have enough time to store it all somewhere else. So computing relevant subsets as they are needed might be a good strategy. Which subsets are the relevant ones, I can't tell.



Okay. my bad I should have written this earlier. I don't have exact requirements. I just picked a random problem wrote some code and now trying to optimize it for my own understanding. Will be helpful for me to crack interviews as well as I am about to graduate.  
 
Paul Clapham
Sheriff
Posts: 24729
59
Eclipse IDE Firefox Browser MySQL Database
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Punit Jain wrote:Okay. my bad I should have written this earlier. I don't have exact requirements. I just picked a random problem wrote some code and now trying to optimize it for my own understanding.



Yes, I suspected that was the case. Working on "random problems" is a useful way to improve your skills, but when you don't have any requirements then you can't really ask what you should do.
 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Paul Clapham wrote:

Punit Jain wrote:Okay. my bad I should have written this earlier. I don't have exact requirements. I just picked a random problem wrote some code and now trying to optimize it for my own understanding.



Yes, I suspected that was the case. Working on "random problems" is a useful way to improve your skills, but when you don't have any requirements then you can't really ask what you should do.



Okay. Let me add specific questions.
 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Okay so here are some requirements/questions:

Following is my code:



Following is the output:

Total time took: 0.006
Total combinations in memory: 3888
Total space (bytes): 342992


I understand above output is not the time and space complexity in terms of Big O. In terms of Big O, the worst case space complexity would be O (mⁿ) as Stephan also mentioned is in his answer. The time complexity would also be same.

Questions:

1.) Is there a way to improve the performance? Would there be a better data structure for this?
2.) Is there a way to improve the Big O?
3.) I have three nested loops in my code. Can they be reduced?

Note: To get the space I am just doing some simple computation. Space is approximate.  
 
Stephan van Hulst
Saloon Keeper
Posts: 10853
234
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
You still haven't given requirements. Questions about performance are meaningless as long as it's unknown what a correct program should output.
 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Stephan van Hulst wrote:You still haven't given requirements. Questions about performance are meaningless as long as it's unknown what a correct program should output.



It should output all the combinations of given input (4 sets of String). Basically instead of void, will return combinations in cartesian(String[][] input).
 
Stephan van Hulst
Saloon Keeper
Posts: 10853
234
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Okay, so you don't actually need to keep them in memory. You can just iterate through the combinations and output them directly. Note that this can still take an incredibly long time (especially when printing to a file or to standard output, I/O operations are very expensive).

I wrote an application that performs a given operation on all combinations. It doesn't keep combinations it's already seen in memory, it just uses them and then allows them to be garbage collected. For the data set that you gave in your last post, it prints a solution instantaneously if the operation given is a simple count:

It runs in approximately 2 seconds if instead of counting all combinations, I print them all:
 
Marshal
Posts: 7304
497
Mac OS X VI Editor BSD Linux
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
@OP If that's for an interview, avoid such writings as: if (input.length <= 0), how that can ever be less than 0?
 
Stephan van Hulst
Saloon Keeper
Posts: 10853
234
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I just calculated that my application, when using AtomicLong instead of AtomicInt, should actually be able to count all combinations in the original dataset in roughly 2½ hours. I'm letting it run to see if I'm right. I'll report back later.
 
Stephan van Hulst
Saloon Keeper
Posts: 10853
234
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I didn't actually expect my guess to be this close. :P
just-over-two-and-a-half.png
[Thumbnail for just-over-two-and-a-half.png]
 
Piet Souris
Bartender
Posts: 3666
151
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
I had this in my library, but I never checked for efficiency:

(@Stephan: that was from before I promised to practise ? super T, ? extends T   )
 
Piet Souris
Bartender
Posts: 3666
151
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Punit Jain wrote:

Piet Souris wrote:I don't know your requirements, but must you store the full Cartesian Product? For any quadruple (x, y, z, w) it is easy to calculate the corresponding cp.



But eventually, I will need the full cartesian product. In order to have all the combinations, I will have to have all the sets. Can you please elaborate?


Well, I wondered why you would need the full Cartesian Product.

But what I meant was: if you have N lists, with sizes S(1), ...., S(N), then you can convert easily each long L into the N-tuple (i1, i2, ..., iN), and the corresponging product would then be simply List(1).get(i1) + ... +List(N).get(iN).
so you can't have a more compact representation then just the input lists.

Can you give a use case where you need to have available the full Cartesian Product?
 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Stephan van Hulst wrote:

   prefixes.forEach(prefix ->
     forEachCombination(remainingInput, smallerCombination -> {
       List<T> combination = smallerCombination;
       combination.add(0, prefix);
       action.accept(combination);
     })
   );
 }


[/code]



Okay. I am totally lost here. So its recursively calling forEachCombination for the number of sets and returning every combination to the main. But how exactly its working? Also, is it possible to convert it to an iterative solution?  In my solution, instead of storing all the combinations into combinations vector, I should just return every combination?
 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Liutauras Vilda wrote:@OP If that's for an interview, avoid such writings as: if (input.length <= 0), how that can ever be less than 0?



Yeah, it will never be less than 0.
 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Piet Souris wrote:
Well, I wondered why you would need the full Cartesian Product.



What I was thinking was, I will have all the combinations in memory and whenever I need one, I will get it from memory but it's inappropriate.

Piet Souris wrote:
Can you give a use case where you need to have available the full Cartesian Product?



Not really sure because I just picked this problem randomly.
 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Piet Souris wrote:
But what I meant was: if you have N lists, with sizes S(1), ...., S(N), then you can convert easily each long L into the N-tuple (i1, i2, ..., iN), and the corresponging product would then be simply List(1).get(i1) + ... +List(N).get(iN).
so you can't have a more compact representation then just the input lists.



I still didn't get it.

Any example, please.
 
Paul Clapham
Sheriff
Posts: 24729
59
Eclipse IDE Firefox Browser MySQL Database
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Punit Jain wrote:

Piet Souris wrote:
Well, I wondered why you would need the full Cartesian Product.



What I was thinking was, I will have all the combinations in memory and whenever I need one, I will get it from memory but it's inappropriate.



I agree. I can't imagine that some requirement will say that you need a particular element of the Cartesian product. If you can say I need [A, B, C, D], then you already have [A, B, C, D]. Unless you've built some kind of index which enables you to say I need element number 2,037,655 or something like that. Which all seems quite improbable to me.
 
Campbell Ritchie
Marshal
Posts: 66507
251
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Paul Clapham wrote:. . . I need element number 2,037,655 or something like that. Which all seems quite improbable to me.

I can imagine circumstances under which that would be a reasonable requirement. I have a model checker for the Dining Philosophers problem in a procedural language, and that reduces the state of the philosophers and their forks to a single int. You convert it back with the remainder operator (called modulo across the Channel). If you want elements A₁, B₂, C₃ and D₄, and the sizes are A, B, C, D = 10 11 12 13, you can calculate the index with 4 + 3 × 13 + 2 × 13 × 12 + [1 × ]13 × 12 × 11. That will be out of 13 × 12 × 11 × 10 = 17160 discrete indices. As long as the number datatype you are using is large enough that you don't suffer an overflow error, you can reverse the procedure and calculate element number 2,037,655 in constant time. If we have input vectors size 7 8 9 10 11 12 13, you would use 2,037,655 % 13, 2,037,655 ÷ 13 % 12, 2,037,655 ÷ 13 ÷ 12 % 11, etc. as the indices, the first result on the right and the last result on the left. Those indices would multiply to give 13! ÷ 6! = 8648640 different combinations.

At least I think I have got the arithmetic right.
 
Stephan van Hulst
Saloon Keeper
Posts: 10853
234
  • Likes 1
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Punit Jain wrote:Okay. I am totally lost here. So its recursively calling forEachCombination for the number of sets and returning every combination to the main. But how exactly its working?


forEachCombination(input, action) takes a list of collections and an action, and will perform the action on each combination in the Cartesian product of all the collections in the input list. It does this by removing the first collection from the input list, creating smaller combinations from the remaining collections, and then prefixing each smaller combination with an element from the first collection that was removed to create a larger combination. Then it calls the action to be performed on that larger combination. It creates the smaller combinations by calling itself recursively, but instead of the original action to be performed, it supplies a new action that simply builds the combination.

It doesn't return anything to the main method. The main method supplies an action that needs to be performed on every combination. After the action is performed, the combination is thrown away.

Also, is it possible to convert it to an iterative solution?


You can convert recursive solutions to iterative solutions by making the recursion stack explicit. However, the code for that usually isn't pretty and the recursive solution is much easier to understand. If you're not comfortable with recursion, I strongly recommend that you practice because it's an incredibly important tool.

In my solution, instead of storing all the combinations into combinations vector, I should just return every combination?


How are you going to return every combination? You can't return every combination without storing them somewhere, and you can't return individual combinations from a method that is supposed to return all of them (you CAN do this in C#, which is a pretty cool feature).

You could pass an action from the main method that you want to perform on every combination, but because your algorithm walks through the tree of partial combinations breadth-first, it still needs to keep all partial combinations in memory, which again will crash your application. The fact of the matter is that you can't solve this problem in Ө(n) space complexity (n being the number of sets in the input, the size of those sets being of no consequence) without using recursion or a stack machine.
 
Stephan van Hulst
Saloon Keeper
Posts: 10853
234
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Piet Souris wrote:I had this in my library, but I never checked for efficiency:


You can reduce its memory requirements by deferring all the terminal operations. Get rid of the reduction and collection operators, and just return streams. If at the very end you have a stream of combinations, the client can use their own reduction operation.

(@Stephan: that was from before I promised to practise ? super T, ? extends T   )


I'll leave it as an exercise to you to explain why in my method signature I used extends twice in the type of the input parameter, but why I didn't write or for the action parameter.
 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Stephan van Hulst wrote:You can convert recursive solutions to iterative solutions by making the recursion stack explicit. However, the code for that usually isn't pretty and the recursive solution is much easier to understand. If you're not comfortable with recursion, I strongly recommend that you practice because it's an incredibly important tool.



Let me practice recursion. Will post the solution here.


Stephan van Hulst wrote:How are you going to return every combination? [/quote

I mean when I am making combinations, I will put my action in there to perform a task for the combinations. For example

 
Punit Jain
Ranch Hand
Posts: 1102
3
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator

Stephan van Hulst wrote:
You could pass an action from the main method that you want to perform on every combination, but because your algorithm walks through the tree of partial combinations breadth-first, it still needs to keep all partial combinations in memory, which again will crash your application. The fact of the matter is that you can't solve this problem in Ө(n) space complexity (n being the number of sets in the input, the size of those sets being of no consequence) without using recursion or a stack machine.



I see.
 
Tell me how it all turns out. Here is a tiny ad:
Java file APIs (DOC, XLS, PDF, and many more)
https://products.aspose.com/total/java
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
Boost this thread!