JavaRanch » Java Forums » Java » Performance
BitSet Serialization - Any idea to save disk space and better performance

Varun Chopra
Ranch Hand

Joined: Jul 10, 2008
Posts: 211
In the application I am working on, there are a lot of boolean operations involved. We have very large binary numbers (on the order of 16,000 bits each) stored in BitSet objects. In one case we have 9000 BitSet objects, with 16,000 bits each, which we serialize to disk.

So the first problem is that the serialized file takes 16 MB on disk. I converted the BitSets into BigInteger objects, but the size is still about 8 MB. Any other ideas for storing the same information in less disk space?

Secondly, those bits are later loaded into RAM for boolean operations, and at that time I convert them back into BitSets. This conversion takes time, and loading these 16 MB bitmaps into memory slows down the server. Since I need to perform each boolean operation against all 9000 bitsets, I can't load partial data. Any ideas for improving application performance here?
[ November 07, 2008: Message edited by: Varun Chopra ]

-Varun -
(My Blog) - Online Certifications - Webner Solutions
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
So first problem is serialized file is taking 16 MB space on disk. I converted BitSets into BigInteger objects, but still size is about 8 MB. Any other idea to store same info with less disk space?
9000 objects × 16,000 bits = 144 million bits = 18 MB (about 17.2 MiB)

So I can't see how you'll get storage less than 18MB even without serialisation overheads. If you have, it must be magic (or compressed).


Charles Lyons (SCJP 1.4, April 2003; SCJP 5, Dec 2006; SCWCD 1.4b, April 2004)
Author of OCEJWCD Study Companion for Oracle Exam 1Z0-899 (ISBN 0955160340 / Amazon / Amazon UK)
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
On the performance side, you'll probably do better if you open a flat file (in your own simple file format, without serialisation) using RandomAccessFile, then copy blocks of, say, 1 MB at a time into RAM for fast processing. Modern CPUs have caches of between 1 and 6 MB, so using those effectively will give the fastest computation. Also, with random access you can seek wherever you need to at any time (assuming that, whatever file format you choose, you first read the file to determine the offsets of the BitSet objects).
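A minimal sketch of that approach, assuming a fixed-size raw record format (16,000 bits = 250 longs per record; the class and method names here are hypothetical):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class RawBitSetFile {
    // 16,000 bits = 250 words of 64 bits each
    static final int WORDS_PER_SET = 250;

    // Write each bit set as a fixed-size block of raw longs (no
    // serialization headers at all -- just the payload).
    static void write(File f, long[][] sets) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            for (long[] set : sets)
                for (long word : set)
                    raf.writeLong(word);
        }
    }

    // Seek directly to set #index -- fixed-size records make the
    // offset a simple multiplication, no index table needed.
    static long[] read(File f, int index) throws IOException {
        long[] words = new long[WORDS_PER_SET];
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek((long) index * WORDS_PER_SET * Long.BYTES);
            for (int i = 0; i < words.length; i++)
                words[i] = raf.readLong();
        }
        return words;
    }
}
```

Because every record is the same size, the offset of any given bitset is just index × record size, so you can seek straight to the one you need without loading the whole file.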

Either way, having BitSets that are 16,000 bits long sounds a bit (excuse the pun!) extreme. What are you doing with binary numbers that long? I'd be surprised if some careful re-design wouldn't optimise all those problems away.

To be honest, if you need speed, you're better off avoiding objects like BitSet and just sticking to plain arrays. CPUs can perform logical and bitwise operations on array contents directly, and far more efficiently than through per-element method invocations (which carry stack overheads, for example). Most operations you perform would probably only need a couple of clock cycles each. You'll also save time on instantiation of new objects, and save RAM by avoiding all the objects on the heap.
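To illustrate the plain-array idea, a sketch operating on bare long[] words (64 bits per long, which is the same packing BitSet uses internally; names here are hypothetical):

```java
public class RawBits {
    // Bitwise AND of two equal-length word arrays into dst:
    // one machine AND per 64 bits, no per-bit method dispatch.
    static void and(long[] dst, long[] a, long[] b) {
        for (int i = 0; i < dst.length; i++)
            dst[i] = a[i] & b[i];
    }

    // Count set bits across the whole array.
    static int cardinality(long[] words) {
        int n = 0;
        for (long w : words)
            n += Long.bitCount(w);
        return n;
    }
}
```

A 16,000-bit AND is then a tight loop over 250 longs, which the JIT compiles to very plain machine code.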

If you really need speed, you're going to have to write that part of the application in C. Allocate a contiguous block of memory using malloc and then access it via pointers directly; Java arrays carry some pointer manipulation that is unnecessary here, and they're initialised to zeros, which is also unnecessary in your case. Assuming a good compiler, that should pretty much end up as decent machine assembly, which will then run faster than your current algorithm. Of course, you could also write it in assembly directly (or embed it in C using asm blocks), but that's probably a bit extreme for what you want. Then access the entire routine via JNI (which is slow per call, so only suitable if you don't need to make many native method calls).

Without knowing what you're doing with such large numbers of bits though, it's hard to say more specifically how you could really improve it.
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 2969
It sounds like we're interpreting the situation differently here. I thought that a single BitSet has 16000 bits. I can create one of those and serialize it, and the result has a size of 2072 bytes - pretty much as expected.

So I wonder if maybe Varun is doing something different than what I have understood.
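For reference, that measurement is easy to reproduce (a sketch; the ~2 KB figure is 16,000 bits / 8 = 2000 bytes of payload plus roughly 72 bytes of object-stream overhead):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.BitSet;

public class BitSetSizeDemo {
    // Serialize a single BitSet of nbits bits and return the byte count.
    static int serializedSize(int nbits) throws IOException {
        BitSet bits = new BitSet(nbits);
        bits.set(nbits - 1); // force the backing word array to full length
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(bits);
        }
        return bos.size();
    }
}
```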
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
It sounds like we're interpreting the situation differently here. I thought that a single BitSet has 16000 bits. I can create one of those and serialize it, and the result has a size of 2072 bytes
I don't think we disagree---isn't that what I said or was I unclear? Or am I misunderstanding you?!

It appears that in your case the serialization overhead is 72 bytes per BitSet. But he says he has 9000 of those objects, so 2000 × 9000 bytes gives the 18 MB I mentioned, plus serialization overheads of 72 × 9000 ≈ 648 KB. I was merely suggesting he can do away with the extra 648 KB, and with the time taken to de-serialise, by using a custom file format and import/export procedure. This is pretty much raw data, though, so there's not much work to do there...

I still can't see why 9000 x 16000 bits are needed... I don't think I've ever come across managing and doing logic on that many bits individually before!
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 2969
Ah, I see now that Varun edited his post in between when I first read it and when you responded. I was going to work during that time, and when I read your replies, I didn't reread the original. I believe Varun's original message didn't say anything about 9000 objects, so I didn't see where you were getting that. I should have taken more time. Anyway, I have no disagreement with what you've said, now. Sorry for the mix-up.
Varun Chopra
Ranch Hand

Joined: Jul 10, 2008
Posts: 211
Thanks, everyone, for the replies. My apologies, my numbers are not exact. The size of each BitSet is actually a little less than 16,000 bits, the number of BitSet objects is 89xx, and the 16 MB figure is also approximate.

My thinking was to pack these long BitSets into something smaller (like BigInteger) and then convert them back to binary as needed. So I am looking for any other ideas along similar lines that would work even better.

As for what I am doing with these bitmaps: this application has to search for products matching an attribute value, among thousands of products. For example, if there are 15,000 cars, some of which have MP3 players (of various brands) and some not, and we have to find a car with a Kenwood MP3 player fitted, then instead of firing an SQL query (which can be really slow; the app has an AJAX-driven UI, so it has to be responsive) we can do bitwise operations to get the results fast. This works really, really well for anything under 10,000 products. But when the number of products in a category goes above 10,000, the bitmaps become huge and take a lot of RAM, slowing down the system.

Hope this gives an idea of why the bitmaps are so big.
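To make the setup concrete, a minimal sketch of this kind of bitmap index (the class name and the attribute-value encoding are hypothetical, not from the actual application):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class ProductBitmapIndex {
    // One bitmap per attribute value: bit i is set iff product i has it.
    private final Map<String, BitSet> index = new HashMap<>();

    void tag(String attributeValue, int productId) {
        index.computeIfAbsent(attributeValue, k -> new BitSet()).set(productId);
    }

    // Products matching ALL of the given attribute values -- the whole
    // query is just a chain of bitwise ANDs over the bitmaps.
    BitSet matching(String... attributeValues) {
        BitSet result = null;
        for (String v : attributeValues) {
            BitSet bits = index.getOrDefault(v, new BitSet());
            if (result == null) result = (BitSet) bits.clone();
            else result.and(bits);
        }
        return result == null ? new BitSet() : result;
    }
}
```

The set bits of the result give the matching product ids directly, which is why the approach is so fast until the bitmaps themselves outgrow RAM.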

[ November 09, 2008: Message edited by: Varun Chopra ]
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4634

BitSet indices work most effectively when they are fairly dense, with 25 to 50% of the bits set. So using one for a sex field, either M or F, works well for the general population, but does not work well for selected populations, say US professional football players, since they are all men.

In your sample case, most cars have music players/radios/CDs, but I would expect only a small percentage to have a Kenwood CD player, so bit sets are less than optimal. A B*-tree of matching values, storing the relative record id (or perhaps the SQL record id that contains all the information), is better suited.

You need to match your storage and in-memory representation to the specifics of both the field itself and the typical queries that you want to optimize.
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 15632

Working extensively with bits is far enough out of the mainstream of what Java is designed for that I'd probably consider defining some JNI to permit use of whatever special bit-level instructions the target machine might offer instead of trying to shoehorn them in via the standard Java functions. On sets this large, anyway. If you were only talking a few dozen bits, I would/have/am-presently using pure Java code to manipulate bitsets.

But when you have this much data to manipulate, you may gain substantial performance/storage benefits by escaping into native mode, depending on how well Java can handle your needs vs. how well the native bit-manipulation instructions can.

Depending on your data, the persistent copy may benefit from some sort of compression technique. On a long-ago study, a company I worked for found that they could save both disk space AND CPU time by compressing data. The overhead of the compression was offset by the fact that the data-moving code had so much less data to move.

Compression isn't for everything. Every compression algorithm has a worst-case mode where you end up with more bits than you started with. So you need to select a compression algorithm that's a good fit for your data. This could be simple run-length encoding or something complex like Lempel-Ziv.
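As a sketch of the simple end of that spectrum, run-length encoding just records the lengths of alternating runs of 0s and 1s (names here are hypothetical; it pays off only when runs are long, i.e. when the bit density is very high or very low):

```java
import java.util.ArrayList;
import java.util.List;

public class RunLength {
    // Encode a bit sequence as alternating run lengths, starting with
    // the length of the initial run of 0-bits (possibly zero).
    static List<Integer> encode(boolean[] bits) {
        List<Integer> runs = new ArrayList<>();
        boolean current = false; // runs alternate starting from 0-bits
        int len = 0;
        for (boolean b : bits) {
            if (b == current) {
                len++;
            } else {
                runs.add(len); // close out the current run
                current = b;
                len = 1;
            }
        }
        runs.add(len); // final run
        return runs;
    }
}
```

On a worst-case input that alternates 0,1,0,1,... this produces one run entry per bit, i.e. more data than it started with, which is exactly the worst-case blow-up mentioned above.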


Customer surveys are for companies who didn't pay proper attention to begin with.
Bill Shirley
Ranch Hand

Joined: Nov 08, 2007
Posts: 457
Originally posted by Tim Holloway:
Depending on your data, the persistent copy may benefit from some sort of compression technique. On a long-ago study, a company I worked for found that they could save both disk space AND CPU time by compressing data. The overhead of the compression was offset by the fact that the data-moving code had so much less data to move.


Absolutely.
Disc access is by far the bottleneck.

Using ZipOutputStream/ZipInputStream, or something else from the java.util.zip package, should improve both performance and storage.
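For instance, a sketch using GZIPOutputStream from that same package (GZIP rather than ZIP here, simply because a single stream needs no entry bookkeeping; the class name is hypothetical):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class CompressDemo {
    // Compress a byte array with GZIP; sparse bitmaps, being mostly
    // zero bytes, compress extremely well.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }
}
```

Wrapping the existing ObjectOutputStream in a GZIPOutputStream is a one-line change to the serialization code, so it's cheap to measure whether the trade-off pays off for this data.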


Bill Shirley - bshirley - frazerbilt.com
if (Posts < 30) you.read( JavaRanchFAQ);
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4634

Originally posted by Tim Holloway:
This could be simple run-length encoding or something complex like Lempel-Ziv.


LZ/LZW is a really bad fit for this application. And run length encoding only works if the bit density is really high. I'd stay away from either of them, completely.

I don't agree that JNI is needed. But it may be worthwhile to just use a bit vector
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 2969
[Pat]: And run length encoding only works if the bit density is really high.

Or really low, yes?
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 15632

Originally posted by Pat Farrell:


LZ/LZW is a really bad fit for this application. And run length encoding only works if the bit density is really high. I'd stay away from either of them, completely.

I don't agree that JNI is needed. But it may be worthwhile to just use a bit vector


Since I don't have any metrics for the specific data or operations in question, I can't say. But it's worth a try.
Pat Farrell
Rancher

Joined: Aug 11, 2007
Posts: 4634

Originally posted by Mike Simmons:
[Pat]: And run length encoding only works if the bit density is really high.

Or really low, yes?


Yes, of course.
Varun Chopra
Ranch Hand

Joined: Jul 10, 2008
Posts: 211
Thanks, everyone. Sooner or later, as the data and the number of users on our site grow, I will have to try some of your suggestions for better performance. Thanks again.