Hi Folks, I have a problem with large files and I was wondering if anyone here could give some suggestions for a solution.
I have a very large file that contains possibly several thousand objects. They are serialized using ObjectOutputStream. The file could be as much as 1GB or greater! Each object is also quite large, possibly 5000KB per object. If I want to find an object in this file, I read each object until I find the one I need, throwing the other objects away as soon as they are read. Obviously I cannot keep everything in memory as that would bring down the machine.
So I was wondering if it's possible to somehow keep a pointer in the file so that I can search faster. Using some other serialization is not an option. It must be done with ObjectOutputStream, so I am assuming that it must be read back using ObjectInputStream. But maybe there is another way?
One option is to split the file up into smaller chunks. Then if I know that I need object 101, I can go to the file that I know contains object 101. But I still have to read each object until I get to object 101. I would rather keep just one file, but if I cannot find another option I will have to do this.
Logical error? There is no error. It is very simple. I use ObjectOutputStream to write and ObjectInputStream to read. Try to read 1000 large objects from a large file. It will take some time.
I guess what I am looking for is ideas on using something other than ObjectInputStream to read the objects back. Some way to quickly move a pointer to a certain position in the file so I don't have to read each object to get to the object I want. I am thinking of RandomAccessFile, but I am not sure where to place the pointer, i.e. where does one object end and another begin. Then there is constructing an object out of the bytes read, because I would no longer have the convenience of readObject().
Hopefully it is clear what I am trying to do? Any ideas?
This is a design problem. ObjectOutputStream is provided simply for saving an object's state. It doesn't have the functionality which matches your requirements. If you can't change the storage mechanism you are stuck. Now, if you can change the storage mechanism, say serializing objects to a RandomAccessFile, SQL database or pure object database, we could talk about some options.
Going over my head here ... extend FileOutputStream and override write to count the bytes it writes. Write an object, flush, and get the count: with offsets counted from zero, that count is where the next object starts. Write the next object, get the count again, etc. Then can you seek or skip bytes when it's time to read back in?
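For what it's worth, here is a minimal sketch of that byte-counting idea. It rests on one big assumption that goes beyond the original post: the writer can be changed so that every object gets its own ObjectOutputStream on top of the same file, which makes each record a self-contained stream with its own header. Offsets into a file written through one shared ObjectOutputStream would not be safe to jump to, since later records can contain back-references to earlier ones. The names IndexedObjectFile, CountingOutputStream, writeAll and readAt are made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.io.Serializable;
import java.nio.channels.Channels;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IndexedObjectFile {

    /** Hypothetical helper: passes bytes through and counts them. */
    static class CountingOutputStream extends FilterOutputStream {
        private long count;

        CountingOutputStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(int b) throws IOException {
            out.write(b);
            count++;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            count += len;
        }

        long getCount() {
            return count;
        }
    }

    /** Writes each object as its own serialization stream; returns each start offset. */
    static List<Long> writeAll(File file, List<? extends Serializable> objects) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (CountingOutputStream counter = new CountingOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            for (Serializable obj : objects) {
                offsets.add(counter.getCount());
                // Assumption: a fresh ObjectOutputStream per object, so no record
                // can refer back to data written for an earlier record.
                ObjectOutputStream oos = new ObjectOutputStream(counter);
                oos.writeObject(obj);
                oos.flush(); // flush, but do not close: closing would close the file
            }
        }
        return offsets;
    }

    /** Seeks to a recorded offset and reads exactly one object from there. */
    static Object readAt(File file, long offset) throws IOException, ClassNotFoundException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(offset);
            InputStream in = new BufferedInputStream(Channels.newInputStream(raf.getChannel()));
            return new ObjectInputStream(in).readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("indexed", ".bin");
        file.deleteOnExit();
        List<Long> offsets = writeAll(file, Arrays.asList("alpha", "beta", "gamma"));
        System.out.println("offsets: " + offsets);
        System.out.println("third object: " + readAt(file, offsets.get(2)));
    }
}
```

The offsets list would itself have to be saved somewhere (a small side file, say) so it survives between runs. The price of this layout is a stream header and repeated class descriptors per object, which is small next to objects of several megabytes.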
A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
An important question here is: do these objects contain references to one another? If they do, you will have a very, very hard time trying to read the serialized objects out of sequence. Even if they don't contain such references to each other - there's a very high possibility this will never work. Your objects may well refer to other objects which are shared. E.g. if you have a class with a String field, and two instances of your class refer to the same String - you will have a very hard time reading the second instance unless you read the first instance first. Because the shared string will get serialized as part of the first instance, and the second instance will just serialize a reference to the first one. That's a rather imprecise description, and I'm not sure of all the details myself, but I think it's extremely unlikely you will be able to achieve what you're asking for. The object serialization protocol is very much designed for sequential access, period. Skipping steps is not really an option for objects which were all serialized together using the same ObjectOutputStream.
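To make the shared-String point concrete, here is a small experiment one could run. It is only a sketch: Item is a hypothetical class with a single String field, not anything from the original poster's code. It writes two Items that share one String through the same ObjectOutputStream and measures how many bytes each write adds.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

class SharedReferenceDemo {

    /** Hypothetical record type holding one String field. */
    static class Item implements Serializable {
        private static final long serialVersionUID = 1L;
        final String label;

        Item(String label) {
            this.label = label;
        }
    }

    /** Returns the bytes added by each of two Items that share one String. */
    static int[] sizes() throws IOException {
        char[] chars = new char[10_000];
        Arrays.fill(chars, 'x');
        String shared = new String(chars);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bytes);
        oos.flush();
        int header = bytes.size(); // stream header only

        oos.writeObject(new Item(shared));
        oos.flush();
        int first = bytes.size() - header;

        oos.writeObject(new Item(shared));
        oos.flush();
        int second = bytes.size() - header - first;

        return new int[] { first, second };
    }

    public static void main(String[] args) throws IOException {
        int[] s = sizes();
        System.out.println("first item:  " + s[0] + " bytes");
        System.out.println("second item: " + s[1] + " bytes");
    }
}
```

The first Item carries the class description and the full String; the second comes out as only a handful of bytes, because it is written as references back to what the stream has already seen. Those few bytes mean nothing to a reader that jumps straight to them without having read the first Item.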
I think that by far, your best option is the one mentioned in the fourth paragraph of your first post here. Read the entire huge file once, and write a separate file for each object. Yes, it will take some time - but you only have to do it once. Then you never use the huge file again.
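A rough sketch of that one-time conversion, assuming the top-level objects do not reference each other heavily. The class name SplitObjectFile and the object-N.ser naming scheme are invented for the example. One caveat I am fairly sure of: an ObjectInputStream remembers the objects it has already read so it can resolve back-references, so unless the writer called reset() now and then, memory use may grow during the single pass and the JVM may need a generous heap.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;

class SplitObjectFile {

    /** Reads the big file once, front to back, and writes one file per object. */
    static int split(File bigFile, File outDir) throws IOException, ClassNotFoundException {
        int count = 0;
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(bigFile)))) {
            while (true) {
                Object obj;
                try {
                    obj = in.readObject();
                } catch (EOFException endOfFile) {
                    break; // no more objects in the stream
                }
                // Hypothetical naming scheme: object-0.ser, object-1.ser, ...
                File target = new File(outDir, "object-" + count + ".ser");
                try (ObjectOutputStream out = new ObjectOutputStream(
                        new BufferedOutputStream(new FileOutputStream(target)))) {
                    out.writeObject(obj);
                }
                count++;
            }
        }
        return count;
    }

    /** Loads a single object by its position in the original file. */
    static Object load(File outDir, int index) throws IOException, ClassNotFoundException {
        File source = new File(outDir, "object-" + index + ".ser");
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(source)))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Demo data only: three small strings stand in for the large objects.
        File dir = Files.createTempDirectory("split-demo").toFile();
        File big = new File(dir, "big.ser");
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(big))) {
            out.writeObject("a");
            out.writeObject("b");
            out.writeObject("c");
        }
        System.out.println("objects written: " + split(big, dir));
        System.out.println("object 1: " + load(dir, 1));
    }
}
```

After the conversion, fetching object 101 is just load(outDir, 101), and it costs one small read instead of a scan through the whole file.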
Note that if your objects do contain extensive references to each other (not just to a few small shared objects like Strings) then this option won't really work either, as trying to serialize one object will end up serializing them all. In which case forget about trying to write separate files - each one will be as big as the original. You will just have to read the entire file and keep everything in memory. Or find some other way of storing your data besides object serialization.