This week's book giveaway is in the Agile and Other Processes forum.
We're giving away four copies of DevSecOps Adventures: A Game-Changing Approach with Chocolate, LEGO, and Coaching Games and have Dana Pylayeva on-line!
See this thread for details.
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Ron McLeod
  • Paul Clapham
  • Devaka Cooray
  • Tim Cooke
Sheriffs:
  • Rob Spoor
  • Liutauras Vilda
  • paul wheaton
Saloon Keepers:
  • Tim Holloway
  • Tim Moores
  • Mikalai Zaikin
  • Carey Brown
  • Piet Souris
Bartenders:
  • Stephan van Hulst

Using a file as a set.

 
Ranch Hand
Posts: 1071
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Hi all.

I need a Set implementation that can hold roughly 2.4 trillion items. I'm guessing the best way to do that is write a set that is back by a File (which is why I'm posting in this forum).

I'll basically be inserting all the items into the Set, and then reading them out and populating either another of the same kind of Set (for long term storage, a master list) or a database.

I'm looking for advise as the to best way to back the Set with a File. I'm also looking for sorted order, the data has a natural sort order I'll use, so it will wind up being some kind of tree set.

Any help would be appreciated, or it you think I'm on the wrong track let me know.
 
Ranch Hand
Posts: 1780
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
2.4 trillion -- as in 1,000,000,000,000? Isn't that why god invented databases? As far as the "set" requirement, couldn't you achieve that with a unqiueness constraint?
 
author and iconoclast
Posts: 24207
46
Mac OS X Eclipse IDE Chrome
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
You do realize that even if you can read and process a million items a second, that processing 2.4 trillion items will take 667 hours? And that even at one byte per item, that one file would be 2.4 terabytes in size, pushing the limits of both consumer disk storage and some modern operating systems? I'm not saying this is not possible, I'm just saying that this is a very serious problem that will need some cutting-edge development to get working.
 
(instanceof Sidekick)
Posts: 8791
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I think a file based system would just end up reinventing the database, or at least a very complicated indexed file. I remember enough mainframe VSAM internals to talk through some of the issues and challenges, but I consider it just enough to know better than to try it. The book for the Turbo Pascal Database Toolbox is a great overview of B-Tree indexed files.
 
Ranch Hand
Posts: 961
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I guess those 2.4 trillion items are certainly coming out from somewhere.

Where are you currently storing all that data and why can you not simply manipulate it in its current storage?
 
Steven Bell
Ranch Hand
Posts: 1071
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I do understand the size of the problem, although I didn't think to calculate the disk space needed, I stopped at the 'This just isn't going to fit in memory' stage. Maybe a little more information would help.

I'm working on a program that generates Sudoku solutions (it doesn't generate puzzles or solve puzzles, just solutions). Each valid solution can be modified to create a new valid solution. The possible permutations run just over 2.4 trillion. Once I get the algorithms all worked out (which I'm about half way there), I plan to break things up and have the work done in a grid system. Currently I can generate around 1 million permutations in 5 seconds.

I need the set restraints as a validation that the permutations have run correctly. I know exactly how many unique permutations there are, and if the numbers don't match I have a problem.

This all started when I had the idea that there are only a small number of unique patterns in the solution of sudoku puzzles. Not sure if that's true yet, but that's part of what I want to find out.

Considering the disk size issue, maybe I need to look at some sort of compressed storage.

P.S. For the interested the permutations run as follows:
Each set of row (1-3, 4-6, 7-9) can be swaped around without invalidating the puzzle, same with columns. Also each section can be swaped around, same with each column section. Then you can swap any two (or more) symbols and still have a valid puzzle (swap all the 2's and 4's for example). Also the puzzle can be rotated

This leads to 3!*3!*3!*3! = row permutations, same for column permutations
4 roational positions.
9! symbol swaps.

1296*1296*4*362880 = 2,437,996,216,320
 
Marshal
Posts: 28304
95
Eclipse IDE Firefox Browser MySQL Database
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Currently I can generate around 1 million permutations in 5 seconds.

Time to bring out the envelope again so we can do some calculations on the back of it. So 2.4 trillion permutations is 2.4 million times that, or 12 million seconds. That's somewhere over 3,000 hours -- close to 20 weeks in fact.
[ May 10, 2006: Message edited by: Paul Clapham ]
 
Steven Bell
Ranch Hand
Posts: 1071
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Originally posted by Paul Clapham:
Time to bring out the envelope again so we can do some calculations on the back of it. So 2.4 trillion permutations is 2.4 million times that, or 12 million seconds. That's somewhere over 3,000 hours -- close to 20 weeks in fact.

[ May 10, 2006: Message edited by: Paul Clapham ]



That's why I said I'll be moving the processing to a grid computing system. This project is born part of the sudoku habit and part of my desire to play with grid computing.

And I'm not overly concerned with the time issue here. This is just a fun project for me.

P.S. The 2.4 trillion is just the permutations from one valid solution. There will many valid solutions. That number is actully one of my goals here.
[ May 10, 2006: Message edited by: Steven Bell ]
 
Steven Bell
Ranch Hand
Posts: 1071
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I'm thinking I'm solving the wrong problem here.

Rather than finding a solution, computing all it's permutations and storing them, what I need to do is search for solutions and test them against previously found solutions to see if they are a permuataion or a new solution. It'll be much easier to manage memory resources, and there is no need to store permutations, only the unique solutions. I think this will also make it easier to break the work up into small units.

The only downside is there would be a fair amount of duplicate work done processing permutations to test new solutions.
 
Paul Clapham
Marshal
Posts: 28304
95
Eclipse IDE Firefox Browser MySQL Database
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

And I'm not overly concerned with the time issue here. This is just a fun project for me.

Well, do have fun. But keep the back of that envelope handy. From what I've seen, there does seem to be a small number of patterns in published Sudoku puzzles, but maybe that's more because it's easy to generate them that way. But to get there, you're going to have to go past enumerating all possible solutions and start enumerating all solvable problems.
 
All of life is a contant education - Eleanor Roosevelt. Tiny ad:
New web page for Paul's Rocket Mass Heaters movies
https://coderanch.com/t/785239/web-page-Paul-Rocket-Mass
reply
    Bookmark Topic Watch Topic
  • New Topic