GeeCON Prague 2014*
The moose likes Java in General and the fly likes Java endian and bit order and signed and unsigned and python and more confusion than I can deal with Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


JavaRanch » Java Forums » Java » Java in General
Bookmark "Java endian and bit order and signed and unsigned and python and more confusion than I can deal with" Watch "Java endian and bit order and signed and unsigned and python and more confusion than I can deal with" New topic
Author

Java endian and bit order and signed and unsigned and python and more confusion than I can deal with

tony roberts
Greenhorn

Joined: Jan 23, 2011
Posts: 9
Hello all

Please forgive the title of this post but I'm completely out of ideas and I'm on the brink of smashing my laptop and then eating it.

I'm currently working on my university final year project and I'm on the last past of what I need to do, which is to decompress some data.

In order to do this I need to read in some bytes and then obtain a bitmask and then an length and offset. I wont bore you with the exact details.

Now what I'm doing has been done before but I want to do it myself in Java rather than relying on existing code.

Now, I'm using some Python and C code as a reference but I'm completely stuck on how to do this last bit.

Here is what I'm looking at. Essentially I need to know how to do this in Java.

This the C code:



And this is the Python code:



Now I've tried all kinds of different combinations of byte endian order and shifting and messing about and I'm going round in circles. I know the '<L' means 'Little endian Unsigned Long' and the <H is 'Little Endian - Unsigned Short'

Now I know Java doesn't have unsigned primitives so I'll have to bit shift or somehow convert the byte data into something that Java can deal with but I can't for the live of me figure it out.

So any advice would be super helpful and I might just stop me from tearing my hair out.

Thanks in advance

Tony
Hauke Ingmar Schmidt
Rancher

Joined: Nov 18, 2008
Posts: 433
    
    2
Why does it matter how the bits are interpreted (signed, unsigned) when you are only working with the bits and bitmasks?

The C code could work as is, if I interpret this correctly.
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3647
    
  16

Like Hauke said, you don't have to worry about the value being signed or not if it's used as a bitmask. Otherwise, you can simply useto preserve the unsigned value.

Endianness could be an issue though. You can either use ByteBuffer to change the order of the bytes around, or you can shift them manually. What have you tried so far?
Ralph Cook
Ranch Hand

Joined: May 29, 2005
Posts: 479
tony roberts wrote:
Here is what I'm looking at. Essentially I need to know how to do this in Java.

This the C code:



Now I know Java doesn't have unsigned primitives so I'll have to bit shift or somehow convert the byte data into something that Java can deal with but I can't for the live of me figure it out.

So any advice would be super helpful and I might just stop me from tearing my hair out.

Thanks in advance

Tony


I find it helpful in these situations to break down the expression I want into component pieces and use a debugger to look at the result of each one. I wrote the following:



and discovered that the operation "m = m | x", if x is a negative byte, will set all bits to the 'left' of x to 1 in the integer result, m.

I understand you are in a hurry, so this is what I threw together:



The idea is that, after the temporary integer is set to the value of the shifted byte, you mask out all the bits that will have been set if the byte is negative, and only then do you 'or' it in to your result integer.

I hope this helps -- if it does not, perhaps it will give you enough info to phrase a more specific question.

rc
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3647
    
  16

Why not:

I quickly copied this from one of my hobby projects. It should be easy enough to edit it.

[edit]
After slightly less hasty review of the above code, they are the same, except this code uses a single statement. I guess it's a matter of preference, brevity or re-usability.

[edit2 (ffs I need to pay more attention)]
The above code uses an extra shift operation per byte, which can be avoided.
tony roberts
Greenhorn

Joined: Jan 23, 2011
Posts: 9
Thank you everyone for all your help.

I'm still getting myself super confused.

The problem I'm having is tying all my code together.

I guess it would help if I put down exactly what I need to do and then compare your answers to what I currently have.

I have a byte[] containing approx 64k of compressed data.

The first four bytes need to be read in as a 32-bit bitmask (I'm not sure of the endian, problem one)

I've also created a byteBuffer which is a wrapper for the byte[] -



Then I shift through the bitmask to check if the bit is a zero or one. -



Indicator bit starts at 32 then gets decremented by one each run through a loop ends up at zero and then reads in another 4 byte bitmask.

If a zero is detected in the bitmask then the corresponding byte in the inputbuffer gets copied into an output buffer and the pointer locations get moved along one byte:



If a one is read the the next two bytes will be metadata, containg an offset and a length. So I read in the next two bytes into a seperate byte[]:

;

Now, this is another added level of confusion, the offset is the first 13 bits of the two bytes and the length is the last three. So I've written a function to calculate this:



In the small amount of documentation of I have relating to the compression algorithm in use, it states:

For example, the metadata 0x0018 is converted into the offset b'000000000011', and the length b'000'. The offset is '-4', computed by inverting the offset bits, treating the result as a 2's complement, and converting it to an integer.


Now Java already treats ints as signed, which I believe is already 2's complement.

So what I need to do is work out the best way of reading in four bytes, putting them all together into an int for the bitmask and then shifting through looking for zero or one.

If zero, copy the inputbuffer byte to the outputbuffer in the next position.

If One, read in two bytes and then take the first 13 bits for the offset and the last three (LSB) for the length.

I think I've got all my code in place but my program keeps going outside of my output array due to the offset being wrong.

Sorry for the long post but I thought this might clarify exactly what I'm trying to do.

Thanks again

Tony
Ralph Cook
Ranch Hand

Joined: May 29, 2005
Posts: 479
tony roberts wrote:Thank you everyone for all your help.

I'm still getting myself super confused.

The problem I'm having is tying all my code together.

I guess it would help if I put down exactly what I need to do and then compare your answers to what I currently have.

I have a byte[] containing approx 64k of compressed data.

The first four bytes need to be read in as a 32-bit bitmask (I'm not sure of the endian, problem one)

From what you've written below, I would call this a bit array, not a mask; though I'm sure some people do call it a mask, I think it technically means a value that is used in 'and' or 'or' operations to mask bits in an integer in or out of a result. The value with the single bit that you use to determine whether one of the bits is 1 or 0 -- THAT's a mask. But this isn't significant to your problem, except in terms of making yourself/ourselves understood.
tony roberts wrote:

I've also created a byteBuffer which is a wrapper for the byte[] -



I did not know this feature existed -- thanks for leading me to it.

Reading the documentation for ByteBuffer, it seems to me you might be expecting it to do something it isn't going to do. From what I read, the method getInt() on that class will return an integer, reading the bytes in little or big-endian, as specified. But reading through things a byte at a time is not going to change according to the endian setting, because endian refers to the order of the bytes, and if you're only getting one, then it is just going to return the next one regardless of the endian setting. And if all you're going to do with the ByteBuffer is put it into a byte array, I don't think the endian setting of the byte buffer has any effect.
tony roberts wrote:

Then I shift through the bitmask to check if the bit is a zero or one. -



Indicator bit starts at 32 then gets decremented by one each run through a loop ends up at zero and then reads in another 4 byte bitmask.
we'll assume you mean bit position 32
tony roberts wrote:

If a zero is detected in the bitmask then the corresponding byte in the inputbuffer gets copied into an output buffer and the pointer locations get moved along one byte:



If a one is read the the next two bytes will be metadata, containg an offset and a length. So I read in the next two bytes into a seperate byte[]:

;
if the data are written so that this is a short int, then here is case where the setting as big or little endian would make a difference, IF you read a short int from the ByteBuffer instead of getting two bytes from a byte array.
tony roberts wrote:
... rest of quote omitted for brevity...


You don't say what the offset and length refer to; I'm assuming it is something outside this data stream. If they refer to bytes within this data stream, you've left that out of your description.

I think you have to figure out the endian-ness of your data. Assuming that all information about the data has been exhausted for the purpose, you might try writing out the first 20-50 bytes of your data and running through the calculations by hand, first as big-endian and then as little-endian to determine which one you have. To do that you'll have to be able to tell whether the values you get are correct, and I don't have enough information here to help you with that. If the length and offset refer to more bytes in the bytestream, then the wrong endian assumption would likely run you off the end of the buffer because a length value would be bigger than the bytes you have left, etc.

If you don't have enough information to go through the calculations manually, then you don't have enough information to write the program, and I can well imagine the frustration in trying. I have not evaluated the code you put above; you have to know what the data is before studying that is any use.

rc
tony roberts
Greenhorn

Joined: Jan 23, 2011
Posts: 9
Hey Ralph

Thanks for your input it was very clear and helpful

I'll provide more information on what I'm trying to do.

I'm trying to decompress data that is compressed Using the LZ77 algorithm. The LZ77 algorithm produces metadata to indicate how to decompress/decode the compressed data. The data that I'm reading also uses something called 'Direct2 encoding' which is used to represent the metadata in a bitmask or bitarray as you said.

The two bytes that are used for metadata contain an offset (first 13 bits) and a length (last 3 bits).

For example, if a zero is read in the bitarray/bitmask then the data in the inputbuffer at that corresponding index is copied into the output buffer. This continues along until a one is read. Two bytes are then read in.

Let's say the metadata generates this: (15, 4). This means go back 15 bytes in the stream and copy 4 bytes from that position to the end of the stream. There is more processing to do if the length is 7 but I wont go into that as it's fairly annoying

The first four bytes of the input stream are the bitmask/bitarray, they are (hex) 00 01 B6 02;

Now if I used the C code to convert them into an int;



I get the value of: -4849408

Now as this is a negative value, Java is storing it as a signed int, which means that when it's read as a bitmask/bit array it determines that there is metadata at the first postion, which might be true, but seems incorrect to me as the outputbuffer will be at postion zero and therefore any offset generated will take it outside of the array, resulting in an ArrayOutOfBounds exception.

I'm comparing the C code to some Python code and it seems that the Python code reads in the first four bytes and converts them to a Unsigned Long in Little endian.

Python also reads in the two bytes of meta data as an Unsigned Short in Little endian.

I hope that makes what I'm trying to do a little bit clearer and if you've got any more ideas please do let me know.

If however I'm just talking rubbish please do tell me too!

Thanks

Tony
Ralph Cook
Ranch Hand

Joined: May 29, 2005
Posts: 479
tony roberts wrote:Hey Ralph

...

Let's say the metadata generates this: (15, 4). This means go back 15 bytes in the stream and copy 4 bytes from that position to the end of the stream. There is more processing to do if the length is 7 but I wont go into that as it's fairly annoying

I do not know what you mean by "go back 15 bytes in the stream".

Is 'back' towards the beginning or towards the end?
Is the stream the same one from which we are reading the bit array and the other bytes?
'copy 4 bytes from that position to the end of the stream' - do you mean 'towards the end of the stream'? If not, are you copying 4 bytes or all the bytes to the end of the stream (since I am not yet sure what 'the stream' is).

tony roberts wrote:
The first four bytes of the input stream are the bitmask/bitarray, they are (hex) 00 01 B6 02;

do you mean the very first bytes of your actual data?

Regardless of which end you start at, these four bytes cannot result in a negative int in java (or C). The most significant bit of either the first or last byte would have to be 1; first byte if big-endian, last byte if little-endian. I think you may have ended up with a negative number through shifting a byte into an int (in java) and having the shift operation carry the byte's sign into the int. I showed that, and corrected for it, in my first reply.

Can you use this data to determine the endianness of your bit array? if the above is big-endian, then you are to copy 15 bytes as they are (8 bits zero, then 7 bits zero in '01') and hit a metadata short int after that; if little endian, your first meta data would be after 6 bytes (6 0 bits in '02'). Can you tell? I think the key to solving your problem lies in determining, first, just how it is the data is supposed to be read. Don't worry about java until you know that.

Endianness might also apply to your metadata; the doc could say to read two bytes, but if they mean it's a short int, then you also must determine whether the most significant byte is first or second.

Just so you know: I'm headed to work, and my opportunities for reading and responding to this are limited until late this afternoon. I will tell you here if I feel like I've helped you all I can; if I don't respond it's probably because my employer would not understand.

rc
tony roberts
Greenhorn

Joined: Jan 23, 2011
Posts: 9
Hey Ralph

Thanks for replying again.

I'm trying to determine exactly what endian the data is stored in but the file I'm reading is not publicly documented to it's difficult to tell.

Essentially I'm trying to perform a forensic analysis of a Windows XP hiberfil.sys file (Windows hibernation file).

The way the compression works is this:

Each compressed block starts with a 32 byte header, four bytes of this header contain an UnsignedInt which indicates how much compressed data is in the block. Once I've read in this UnsignedInt I create a byte[] that is this size and then read in that much data from the file.

This is correct as I've confirmed the size by looking at the bytes in a hex editor.

The decompression algorithm is documented by Microsoft (to some extent) this is what the documentation available here (here states:

To distinguish data from metadata in the compressed byte stream, the data stream begins with a 4-byte bitmask that indicates to the decoder whether the next byte to be processed is data (a "0" value in the bit), or if the next byte (or series of bytes) is metadata (a "1" value in the bit). If a "0" bit is encountered, the next byte in the input stream is the next byte in the output stream. If a "1" bit is encountered, the next byte or series of bytes is metadata that MUST be interpreted further.


At this point the inputbuffer is full of the compressed data. I then create another byte[] called outputBuffer which will hold the data as it is uncompressed/decoded.

So as I've said before, the bitmask/bit array that is the first four bytes of the input stream denotes the location of data and metadata within the inputbuffer.

So if a zero is read at bitmask/bit array position 4, then; outputBuffer[1] = inputBuffer[4]; (As the input buffer is now at position 4 after reading in 0-3 for the bitmask/array)

Now onto the offset, length stuff.

The LZ77 compression algorithm (in this case) looks for String matches that are at least 3 bytes long:

So for example the String ABCABCDEF would produce metadata reading ABC, (3,3), (6,3)

So to hopefully clarify, ACB would be written to the outputBuffer, the offset means go back 3 bytes in the outputBuffer (towards the beginning) and copy 3 bytes to the end of the output stream (going towards the end) then put the OutputBuffer position to the end of the outputBuffer. So at this point the outputBuffer would contain ABCABC.

Then the next offset and length is (6,3) which would result in going back 6 bytes and copying 'ABC' to the end of the outputBuffer.

I'm sorry if that doesn't make a lot of sense, I find it a bit hard to explain.

I know exactly what you mean though and you're right, until I can determine the endian of the data being stored then I've no hope to getting anything to work.

My only clue as to the endian of the data is how the Python code is written as it reads all bytes into Unsigned little endian primtives.

Sorry for writing ever expanding replies but I think I'm losing the plot!

Thanks

Tony
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3647
    
  16

Ralph Cook wrote:I did not know this feature existed -- thanks for leading me to it.

Reading the documentation for ByteBuffer, it seems to me you might be expecting it to do something it isn't going to do. From what I read, the method getInt() on that class will return an integer, reading the bytes in little or big-endian, as specified. But reading through things a byte at a time is not going to change according to the endian setting, because endian refers to the order of the bytes, and if you're only getting one, then it is just going to return the next one regardless of the endian setting. And if all you're going to do with the ByteBuffer is put it into a byte array, I don't think the endian setting of the byte buffer has any effect.

Actually, this is perfectly fine. ByteBuffer has an internal array and when you set the buffer's ByteOrder, from that point on it will do subsequent put and get operations as if the bytes in the array were in that endianness. Setting the ByteOrder doesn't actually make any difference to the underlying data, only how it's interpreted afterwards. The way Tony uses it should work swimmingly.

You can also determine the endianness off the system by using ByteOrder.nativeOrder(). So if the compressed data Tony works with is always created on the system where he runs the program, and it's created by a C program or something (which uses native endianness, unlike Java), he can simply use ByteBuffer.order(ByteOrder.nativeOrder()), and then start getting ints etc. from the buffer.
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3647
    
  16

Tony, I quickly wrote this up, let me know if it helps. Be warned that my class throws EOFException when the end of the file is reached.
tony roberts
Greenhorn

Joined: Jan 23, 2011
Posts: 9
Thank you Stephan!

That's absolutely fantastic, I'll see what I can do with all of that.

I've just tried a couple of methods but I'm still getting array out of bounds

I shall venture onwards!

Thanks again

Tony
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3647
    
  16

Well, can you be more clear on the file format?

From what I understand the file consists of a 4 byte header, and then the compressed data? What does the header describe? I understand some offset and length values are stored there, but you're not very clear on this.
Also, you shouldn't use the term bit mask to refer to these values. A mask is simply a value that 'masks' bits in a bit field, for instance, the 0xffff values in my above code are examples of bit masks.
tony roberts
Greenhorn

Joined: Jan 23, 2011
Posts: 9
Hello Stephan

The reason I am referring to them as bitmasks is because that's what Microsoft call them in the documentation.

I'll try to elaborate on what I'm doing and keep it concise.

Each compressed block is approx 64kb in size.

Each block starts with a 32 byte header:


Directly after the header is the start of the compressed data and the first four bytes are the 'bit-thingy' bit-array that indicates how the first 32 elements in the input stream are to be interpreted.

This is some C code that was put together from pseudo code provided by Microsoft and it what I am basing my decoding program on.



Let me know if it makes anymore sense now?

Thanks

Tony


Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3647
    
  16

Okay, so I looked around on Google and took a look at the LZ77 algorithm as well as the Direct2 encoding, and it makes a lot more sense now. Yes, the value you keep referring to is indeed a bit mask.

I think this is how you can properly read the offset and length. I haven't tested this code, but I think it should give you an idea.
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3647
    
  16

Note that in the last case (length == 25), I'm not sure whether the value that needs to be read is signed or unsigned. The document implies the value is signed, but if it's not, simply replace the statement with length = in.readUnsignedShort() + 3.
tony roberts
Greenhorn

Joined: Jan 23, 2011
Posts: 9
Thanks very much Stephan.

I think I'm 99% of the way there, just one more small problem to overcome.

Just one more little problem to address and I reckon I'm there.

I shall keep you posted.

Tony
Stephan van Hulst
Bartender

Joined: Sep 20, 2010
Posts: 3647
    
  16

I'm curious about one thing. How do you determine the end of the data? If each block consists of a repeated sequence of a bit mask followed by 32 elements, when do you know you have reached the final bit mask?
tony roberts
Greenhorn

Joined: Jan 23, 2011
Posts: 9
I have a loop that continues to read and deal with 'bitmasks' until the end of the input buffer is hit.

So it essentially reads in bitmasks until it's reached the end of the inputbuffer.

I think, I've been messing with the code for so long I've started to forget what loops where

The bitmask must also contain a "1" in the bit following the last encoded element, to indicate the end of the compressed data. For example, given a hypothetical 8-bit bitmask, the string "ABCABCDEF" should be compressed as (0,0)A(0,0)B(0,0)C(3,3)D(0,0)E(0,0)F. Its bitmask would be b'00010001' (0x11). This would indicate three bytes of data, followed by metadata, followed by an additional 3 bytes, finally terminated with a "1" to indicate the end of the stream.

The final end bit is always necessary, even if an additional bitmask has to be allocated. If the string in the above example was "ABCABCDEFG", for example, it would require an additional bitmask. It would begin with the bitmask b'00010000', followed by the compressed data, and followed by another bitmask with a "1" as the next bit to indicate the end of the stream.

tony roberts
Greenhorn

Joined: Jan 23, 2011
Posts: 9
I have spotted one problem Ralph

When I'm using yourl function foldInByte; the test files' first bitmask is: 00 01 b6 02h which gets converted to: 2b60100 which creates the correct int for the bitmask.

However, when I run it on my live file, the first bitmask bytes are: a0 01 30 00h, which foldInByte converts to 3001a0 which is missing off the two extra zeros.

I'm not sure if that's causing a problem but I'm getting more array out of bounds errors on the hiberfil.sys that aren't happening on my test file.

Any ideas?

Thanks

Tony
 
GeeCON Prague 2014
 
subject: Java endian and bit order and signed and unsigned and python and more confusion than I can deal with