This week's giveaway is in the Android forum.
We're giving away four copies of Android Security Essentials Live Lessons and have Godfrey Nolan on-line!
See this thread for details.
The moose likes Java in General and the fly likes Fortune: Best way to access them? Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Android Security Essentials Live Lessons this week in the Android forum!
JavaRanch » Java Forums » Java » Java in General
Bookmark "Fortune: Best way to access them?" Watch "Fortune: Best way to access them?" New topic
Author

Fortune: Best way to access them?

Antonio Rodriguez
Greenhorn

Joined: Dec 10, 2003
Posts: 3
I'm learning some advanced Java by trying to recreate a fortune-like program. I plan to have a GUI interface (somewhat like the freeware SmartBee). My main dilemma is lanning how to access a fortune quickly from a set of files. One solution I've encountered is creating an index file that rides alongside the fortune file, but I was hoping that there would be a more elegant solution that doesn't involve that.

The fortunes are just text files, with "%" and a carriage return used as separators. The idea is to choose randomly from all the fortunes contained in that file, or potentially choose a random fortune from a list of files. The main dilemma is that the number of fortunes is very large. One file I'm using has more than 110,000 different fortunes, so reading them all into some sort of collection isn't feasible.

Any help is deeply appreciated. A pointer to where I can find a good solution would be excellent.
[ June 24, 2004: Message edited by: Antonio Rodriguez ]
David Moose
Greenhorn

Joined: Jun 11, 2004
Posts: 4
I suggest download some free java database, load all the data into that database and use jdbc to select a fortune from that.
Tim West
Ranch Hand

Joined: Mar 15, 2004
Posts: 539
Or, grab the source of the fortune that comes on most Linux distros and see how that does it. Although it's probably in C, the general algorithm may well be translatable.



--Tim
Darin Niard
Ranch Hand

Joined: Jun 08, 2004
Posts: 118
I don't see why there would be a problem with using a collection. 110,000 strings isn't that much. If you are worried about lookup speed, use a Hashtable/Map.
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24183
    
  34

Originally posted by Darin Niard:
110,000 strings isn't that much.


At a 100 characters or so each, two bytes per character, plus say a dozen bytes of overhead per object, that's about 23 megabytes. Funny how the definition of "not that much" has changed!


[Jess in Action][AskingGoodQuestions]
Darin Niard
Ranch Hand

Joined: Jun 08, 2004
Posts: 118
:roll: Then what would you do?
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24183
    
  34

Well, the classic "fortune" program did it this way:

* Read in fortune #1, store in variable F.
* Read in fortune #2. With probabillity 1/2, assign it to F.
* Read in #3, and with probability 1/3, assign it to F.
* In general, for fortune #N, assign it to F with probability 1/N.

If you add up the fractions you'll see that each fortune has an equal chance of being selected.

This runs in time proportional to the number of fortunes, but constant space -- you never have more than two fortunes in memory. Actually, if you compute the probability first, you can null out F before reading in the new fortune, so actually you never need more than one. This was great for early UNIX back when memory was really scarce.

The more modern GNU fortune program uses the precompiled "index" files that Antonio originally postulated; this runs in both constant time and space, clearly preferable!
Ko Ko Naing
Ranch Hand

Joined: Jun 08, 2002
Posts: 3178
Originally posted by Ernest Friedman-Hill:


At a 100 characters or so each, two bytes per character, plus say a dozen bytes of overhead per object, that's about 23 megabytes. Funny how the definition of "not that much" has changed!


Nice calculation!!! Mr.Ernest is correct... How come should we put such big chunck of characters in a string object like this? I have met such kind of issue in the past before. For more info about the maximum byte that a string can contain, this thread made an interesting point for us here....


Co-author of SCMAD Exam Guide, Author of JMADPlus
SCJP1.2, CCNA, SCWCD1.4, SCBCD1.3, SCMAD1.0, SCJA1.0, SCJP6.0
Darin Niard
Ranch Hand

Joined: Jun 08, 2004
Posts: 118
Interesting, but... if the first fortune has a 50% probability then why would it show up any less than 1/2 the time? Or, am I not understanding you correctly? As for the index files, even if it picks the 5000th entry, does it not have to run through the first 4999 in order to find it? This is something about file i/o that's never been clear to me...
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24183
    
  34

Originally posted by Darin Niard:
Interesting, but... if the first fortune has a 50% probability then why would it show up any less than 1/2 the time? Or, am I not understanding you correctly?


You always read the whole file, and you perform this procedure for every fortune. The first fortune thus has up to N-1 chances to be replaced -- if it's not replaced by the second one, it may be replaced by the third one, or fourth one, and so on. And once the first one is replaced, the replacement can itself be replaced, as can that replacement. So the probability of the first fortune being selected is 1/2 - 1/3 - 1/4 - 1/5 - 1/6 ... - 1/N, which is equal to, as it turns out, 1/N.


As for the index files, even if it picks the 5000th entry, does it not have to run through the first 4999 in order to find it? This is something about file i/o that's never been clear to me...


No, you don't have to. The index file would have the file offsets of beginning of each fortune. You could use java.io.RandomAccessFile, seek() to the beginning of the chosen fortune (a more or less constant-time operation) and then read the correct number of bytes (computed by subtracting two successive start indices.)
Darin Niard
Ranch Hand

Joined: Jun 08, 2004
Posts: 118
Originally posted by Ernest Friedman-Hill:

You always read the whole file, and you perform this procedure for every fortune. The first fortune thus has up to N-1 chances to be replaced -- if it's not replaced by the second one, it may be replaced by the third one, or fourth one, and so on. And once the first one is replaced, the replacement can itself be replaced, as can that replacement. So the probability of the first fortune being selected is 1/2 - 1/3 - 1/4 - 1/5 - 1/6 ... - 1/N, which is equal to, as it turns out, 1/N.

Clever, but I think the way I've always picked a random line from a file can be slightly more efficient (n/2 average instead of n). Basically, I'd just count up to a random number between 0 and the fileLength and stop on that entry. That's what I thought of originally when he asked this question, but since it was running in a GUI, I figured he would be reusing the fortune list a lot, so speed seemed most important to him. The indexing way sounds like the best method though
[ June 24, 2004: Message edited by: Darin Niard ]
Ernest Friedman-Hill
author and iconoclast
Marshal

Joined: Jul 08, 2003
Posts: 24183
    
  34

Originally posted by Darin Niard:

Clever, but I think the way I've always picked a random line from a file can be slightly more efficient. Basically, I'd just count up to a random number between 0 and the fileLength and stop on that entry.


This only works if you know how many fortunes are in the file. The neat thing about the original fortune algorithm is that it doesn't require this knowledge -- it works in a single pass over never-before-seen data.
Darin Niard
Ranch Hand

Joined: Jun 08, 2004
Posts: 118
Yeah, I know. Thanks for letting me in on that algorithm. Good stuff.
Antonio Rodriguez
Greenhorn

Joined: Dec 10, 2003
Posts: 3
Originally posted by Ernest Friedman-Hill:
[QB]

You always read the whole file, and you perform this procedure for every fortune. The first fortune thus has up to N-1 chances to be replaced -- if it's not replaced by the second one, it may be replaced by the third one, or fourth one, and so on. And once the first one is replaced, the replacement can itself be replaced, as can that replacement. So the probability of the first fortune being selected is 1/2 - 1/3 - 1/4 - 1/5 - 1/6 ... - 1/N, which is equal to, as it turns out, 1/N.QB]


I get it. The idea is that for the first fortune to survive, it would have to be outside of the probabilities that we calculate. Consider the case in which we have 5 fortunes.

1 always gets placed in F.
2 will be placed in F 1/2 of the time. Therefore, 1 survives 1/2 of the time.
3 will be placed in F 1/3 of the time. Therefore, if 1 survived the last replacement, it would survive this replacement 2/3 of the time.
4 will be placed in F 1/4 of the time. Therefore, if 1 survived the previous replacements, it would survive this replacement 3/4 of the time.
5 will be placed in F 1/5 of the time, Therefore, if 1 survived the previous replacements, it would survive 4/5 of the time.

So the probability of 1 surviving is 1/2 * 2/3 * 3/4 * 4/5 = 1/5

Very nice programming. Kudos to the original fortune programmers, and thanks to those who responded so quickly to my question. I apologize for the late reply, but I have really sporadic Internet access.

I probably will go towards the index methodology, to get the constant space/time factor, but this was a great insight. Thank you, Mr. F-H!
 
jQuery in Action, 2nd edition
 
subject: Fortune: Best way to access them?
 
Similar Threads
What are the most important issues facing the US today?
problem using button event handler to update text in a frame
Importing
The way you explain your solution in the essay is more important than the solution itself
SCJD obstacles