
Large Files

colin shuker
Ranch Hand

Joined: Apr 11, 2005
Posts: 744
Hi, this is not so much a Java question, but I'll be doing it in Java.

I will have say 20,000,000 strings, each one around 30 characters in length.

I want to store them on the server so that a Java program can access them.

So for example, if the 17,000,000th string were required, I could, in theory, read the 17,000,000th line from one big text file.

I don't think this will be very quick though, so I thought:
perhaps break the 20,000,000 strings into 1000 files of 20,000 lines each, then read the required line from the required file.

Can anyone think of an alternative to this, preferably faster and taking up less space?

Thanks
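
For the split-file idea above, the mapping from a global line number to a file and a line within it is plain integer arithmetic. A minimal sketch, assuming zero-based numbering and exactly 20,000 lines per file; the class and method names are made up for illustration:

```java
final class SplitIndex {
    // From the post: 20,000,000 strings split into 1000 files of 20,000 lines.
    static final int LINES_PER_FILE = 20_000;

    /** Zero-based number of the file that holds the given zero-based global line. */
    static int fileNumber(long globalLine) {
        return (int) (globalLine / LINES_PER_FILE);
    }

    /** Zero-based line number inside that file. */
    static int lineInFile(long globalLine) {
        return (int) (globalLine % LINES_PER_FILE);
    }

    public static void main(String[] args) {
        long wanted = 17_000_000L - 1; // the 17,000,000th string, counted from zero
        System.out.println("file " + fileNumber(wanted) + ", line " + lineInFile(wanted));
    }
}
```

Note that the chosen file still has to be read line by line up to the wanted line, unless every line has the same length.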
Ralph Cook
Ranch Hand

Joined: May 29, 2005
Posts: 479
I can probably think of 100 alternatives to this, most of them irrelevant to the program you are presumably trying to write.

My point is that there is not enough information here to advise on a good way to do this (as opposed to just another way to do this); it will depend on lots of things. How often do you expect to do this? Are the strings all equally likely to be needed? Is this part of a program that is likely to need to optimize memory use? Or I/O? Or CPU? Is it going to run for a long time, or does your program do this once and then go off and do something else most of the time?

rc
colin shuker
Ranch Hand

Joined: Apr 11, 2005
Posts: 744
Yes, good point, I thought I would omit the details for clarity...

It's an opening book for my chess engine. Each entry contains a 64-bit Zobrist key of the position together with
a small selection of possible moves and weights.
I can wrap each entry into one number of about 30 digits, or fewer in hexadecimal.

So the opening book file(s) will only be read at most 12 times (during the start of the game), say once a minute for 12 minutes.

But I would still like it to perform quickly, say under 1 second, just to keep things fast.

I have also just been looking at RandomAccessFile in Java, and this might be a good way to do it.

Also, I don't really want to load the file into the Java program, because I need as much memory as I can get for other parts of the program that use big arrays.


Thanks again for any advice.
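
One way RandomAccessFile could serve here is with fixed-width records: if every entry occupies the same number of bytes, entry n starts at byte n times the record size, so a single seek reaches it without reading the rest of the file. A minimal sketch; the 32-byte record width (31 ASCII characters of space-padded text plus a newline), the class name, and the file name `book.txt` are assumptions for illustration, not details from the thread:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

final class FixedWidthBook {
    // Assumption: each entry is padded with spaces to 31 ASCII characters plus '\n'.
    static final int RECORD_SIZE = 32;

    /** Reads the zero-based n-th record with one seek, without loading the file. */
    static String readRecord(RandomAccessFile file, long n) throws IOException {
        byte[] buf = new byte[RECORD_SIZE];
        file.seek(n * RECORD_SIZE);   // jump straight to the start of record n
        file.readFully(buf);          // read exactly one record
        return new String(buf, StandardCharsets.US_ASCII).trim();
    }

    public static void main(String[] args) throws IOException {
        // "book.txt" is a hypothetical file name.
        try (RandomAccessFile file = new RandomAccessFile("book.txt", "r")) {
            System.out.println(readRecord(file, 16_999_999L));
        }
    }
}
```

Only one 32-byte buffer is held in memory per lookup, which fits the goal of keeping the heap free for the engine's own arrays.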
Bear Bibeault
Author and ninkuma
Marshal

Joined: Jan 10, 2002
Posts: 61095
    

Database.


colin shuker
Ranch Hand

Joined: Apr 11, 2005
Posts: 744
Can you be a bit more specific, please? Thanks.
Bear Bibeault
Author and ninkuma
Marshal

Joined: Jan 10, 2002
Posts: 61095
    

I think you'd be better off using a database rather than files for this. Databases are designed to quickly look up records in a large dataset.
Ralph Cook
Ranch Hand

Joined: May 29, 2005
Posts: 479
While I agree that this is a major use of databases, I can see wanting to avoid a general-purpose relational database management system in this case.

RDBMSs are built, necessarily, for the general case; they occupy large amounts of memory and take up extra processing power in order to make things flexible. They are good at what they do in general, and it is possible that this would be a good, or at least a possible, solution to this problem. But I would worry about saddling my heap with the objects generated by the RDBMS, which I could not control, for a chess-playing program.

A chess-playing program is one of these things that occupies all your available memory and processing power and screams for more. I would be careful about putting an RDBMS in one; if I did, I would be careful to abstract all use of it so I could replace it with a special-purpose equivalent with a minimum of trouble.

I've not done anything significant with random-access files in Java, but from reading the runtime Javadoc it appears they may suit your case. You will need some way to translate your key into the position you want to seek to, and of course you want to minimize seeks. If it were me, I would run tests with multiple seeks in files of different sizes, preferably on the most likely target OS, to try to determine whether splitting into different files makes sense.

I would guess that opening a file would be expensive compared to seeking in one that was open, and that reading would be less expensive than either of those.

Good luck with it!

rc
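
One possible way to translate a key into a seek position is to keep the records sorted by key in a single file and binary-search it on disk. A rough sketch, assuming a binary layout that is not from the thread: each record is an 8-byte key followed by an 8-byte packed payload (moves and weights), sorted by key as signed longs. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class BookLookup {
    // Assumption: 8-byte key + 8-byte payload, records sorted by key (signed order).
    static final int RECORD_SIZE = 16;

    /** Returns the payload stored for key, or null if the key is not in the book. */
    static Long find(RandomAccessFile file, long key) throws IOException {
        long lo = 0;
        long hi = file.length() / RECORD_SIZE - 1;
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            file.seek(mid * RECORD_SIZE);       // one seek per probe
            long candidate = file.readLong();   // key of the middle record
            int cmp = Long.compare(candidate, key);
            if (cmp == 0) {
                return file.readLong();         // payload follows the key
            } else if (cmp < 0) {
                lo = mid + 1;                   // wanted key is in the upper half
            } else {
                hi = mid - 1;                   // wanted key is in the lower half
            }
        }
        return null;
    }
}
```

With 20,000,000 records this takes at most about 25 probes per lookup, and the file is opened once and kept open. Whether that beats splitting into many files would still need measuring on the target OS, as suggested above.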
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14116
    

A relational database is not necessarily a huge piece of software that uses massive amounts of disk space and / or processing power.

You could use something like HSQLDB or Apache Derby, both small relational database systems that you can even run embedded in your application (which means that the database server runs in the same JVM as your application, not as a separate process that you have to connect to).

Elchin Asgarli
Ranch Hand

Joined: Mar 08, 2010
Posts: 222

What about SQLite?


Ralph Cook
Ranch Hand

Joined: May 29, 2005
Posts: 479
Certainly there are smaller and larger RDBMSs, and I have not made any survey of which ones are and are not large and so forth. I have some points that I still think are relevant here, however:

1. Any RDBMS is general purpose, and in order to maintain general-purpose flexibility, a system usually has to use more CPU cycles, memory, etc., in comparison with special-purpose code.

2. An RDBMS that is regarded as "small" is usually being compared to other RDBMS, not to doing the same job for a specific purpose with code crafted for that purpose.

3. The purpose for which the OP wants this is VERY limited for an RDBMS, and it does not seem difficult to fulfill the purpose without an RDBMS.

4. If you use any RDBMS, you lose *some* control over the use of CPU and memory that you can keep better if you craft the code for your specific purpose.

5. The program the OP is writing has EXTREME needs in both CPU and memory use. So it makes sense to examine carefully any commitment made in either of these areas at the outset.

As I said, some RDBMS *might* fulfill what he needs, but I would make more sure than usual that I could detach the entire RDBMS and replace it with special-purpose code if I ever expected the program to, for instance, play tournament chess at any level.

rc
 
 