
Question strings and character encodings

Robin Dee
Greenhorn

Joined: Mar 25, 2010
Posts: 7
Hello all,

I'm quite new to Java (not to programming though) and I've run into some kind of problem in a tiny app I wrote. I hope I can get some help here :-).

What's going on? Using Apache's PDFBox I extract some text from PDF files. After extracting the text from a PDF, I compute its MD5 hash and store that in a database. That works just fine, except in some special cases: if the text contains non-ASCII characters, the MD5 result differs depending on whether I run my Java app on Linux or on Windows. My guess is that this comes from the different default character encodings used by Linux and Windows.

What would be a good way to solve this issue? Can I force my string to be converted into some specific encoding (LATIN-1 for example) before applying the MD5 hash in order to guarantee identical results on Windows and Linux?

Best, Robin
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18889

When you say "the outcome of the MD5 hashing", are you comparing the array of bytes which is the outcome? Or are you converting that array of bytes to a String?

If it's the latter, then don't do that. You should only convert bytes to a String if they represent text. And an MD5 hash doesn't represent text.
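To illustrate why raw digest bytes should not be turned into a String, here is a minimal sketch. The class and method names are made up for this example, and UTF-8 is only an assumed decoding; other charsets can damage the bytes in other ways.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical demo class, not part of the original poster's code.
final class DigestRoundTrip {
    /** Returns the raw 16-byte MD5 digest of the UTF-8 bytes of the text. */
    static byte[] md5(String text) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** True if the bytes are unchanged after decoding to a String and encoding again. */
    static boolean survivesRoundTrip(byte[] raw) {
        String asText = new String(raw, StandardCharsets.UTF_8);
        byte[] back = asText.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(raw, back);
    }

    public static void main(String[] args) {
        byte[] digest = md5("hello");
        System.out.println("digest length: " + digest.length);
        // The digest contains byte values that are not valid UTF-8,
        // so decoding replaces them and the original bytes are lost.
        System.out.println("survives round trip: " + survivesRoundTrip(digest));
    }
}
```

Real text encoded as UTF-8 survives the same round trip, which is the difference between bytes that represent text and bytes that do not.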

And I think you have the string encoding concept backwards. You said:
"Can I force my string to be converted into some specific encoding"

But it's the array of bytes which is in some encoding. A String is never in any encoding, since it just represents a sequence of Unicode code points. You can certainly encode a String into an array of bytes using any encoding you like. (But Latin-1 would be a bad choice, since it can't represent all Unicode characters.)
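A small sketch of that point, using a sample string chosen for this example (the class and method names are hypothetical). The same String yields different byte arrays depending on the charset, and Latin-1 silently replaces characters it cannot represent.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original poster's code.
final class EncodingDemo {
    /** Encodes the text in the given charset, then decodes it with the same charset. */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // "café €", written with escapes so the source file encoding does not matter.
        String text = "caf\u00e9 \u20ac";

        // Same String, different byte arrays.
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);      // 9
        System.out.println(text.getBytes(StandardCharsets.ISO_8859_1).length); // 6

        // UTF-8 can represent every character; Latin-1 has no euro sign
        // and substitutes '?' for it.
        System.out.println(text.equals(roundTrip(text, StandardCharsets.UTF_8)));      // true
        System.out.println(text.equals(roundTrip(text, StandardCharsets.ISO_8859_1))); // false

        // getBytes() without an argument uses this charset, which can differ
        // between machines.
        System.out.println(Charset.defaultCharset());
    }
}
```

If the hashing code calls `getBytes()` with no argument, the default charset printed on the last line is a likely source of the Linux/Windows difference.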
Robin Dee
Greenhorn

Joined: Mar 25, 2010
Posts: 7
Paul Clapham wrote:When you say "the outcome of the MD5 hashing", are you comparing the array of bytes which is the outcome? Or are you converting that array of bytes to a String?

If it's the latter, then don't do that. You should only convert bytes to a String if they represent text. And an MD5 hash doesn't represent text.


Hi Paul,

I'm using only strings; I store the output of the PDF text extraction in a String variable, and both the input and the output of the MD5 hash are strings. The output should be a string (I don't see any harm in that, as MD5 hashes only contain ASCII chars, right?).

Thanks!
Jelle Klap
Bartender

Joined: Mar 10, 2008
Posts: 1822

This is commonly solved by obtaining a byte representation of the (usually password) String in a fixed encoding (e.g. UTF-8), applying the one-way hashing algorithm of your choosing to that byte sequence to obtain the digest, applying BASE64 encoding to that digest and storing it in the database in US-ASCII encoding.
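The steps above could be sketched roughly as follows. This is only one way to do it: the class and method names are invented for the example, and `java.util.Base64` assumes Java 8 or later (older JVMs would need a third-party Base64 encoder).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical helper class, not part of the original poster's code.
final class Md5Base64 {
    /** Hashes the UTF-8 bytes of the text with MD5 and returns the digest as Base64. */
    static String md5Base64(String text) {
        try {
            // 1. Fixed encoding, independent of the platform default.
            byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
            // 2. One-way hash over the bytes; the result is 16 raw bytes.
            byte[] digest = MessageDigest.getInstance("MD5").digest(utf8);
            // 3. Base64 turns the raw bytes into plain ASCII characters.
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Base64(""));      // 1B2M2Y8AsgTpgAmY7PhCfg==
        System.out.println(md5Base64("caf\u00e9"));
    }
}
```

Because the charset is named explicitly, the same input text should give the same stored value on Windows and Linux.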

Build a man a fire, and he'll be warm for a day. Set a man on fire, and he'll be warm for the rest of his life.
Robin Dee
Greenhorn

Joined: Mar 25, 2010
Posts: 7
Jelle Klap wrote:This is commonly solved by obtaining a byte representation of the (usually password) String in a fixed encoding (e.g. UTF-8), applying the one-way hashing algorithm of your choosing to that byte sequence to obtain the digest, applying BASE64 encoding to that digest and storing it in the database in US-ASCII encoding.


Hi Jelle,

I figured I only need to do the first step: obtain a byte representation of the String and pass that to my MD5 hasher? Isn't the output of the MD5 hasher already ASCII encoded? Or is there a good reason to convert the digest to BASE64 and store that?

Best, Robin

PS: Dutch, I presume? ;)
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18889

I don't know what MD5 hasher you are using. The result of an MD5 hash is a 16-byte array, not ASCII. However, you may be using some option which converts that array to its representation as 32 hexadecimal characters, in which case the result is ASCII.
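For reference, a minimal sketch of that conversion, with the input encoded as UTF-8 first. The class and method names are made up here, and this is not necessarily what any particular MD5 utility does internally.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, not part of the original poster's code.
final class Md5Hex {
    /** Returns the raw 16-byte MD5 digest of the UTF-8 bytes of the text. */
    static byte[] md5(String text) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Formats each byte as two lowercase hexadecimal digits. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so negative bytes are not sign-extended.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] digest = md5("hello");
        System.out.println(digest.length);   // 16
        System.out.println(toHex(digest));   // 5d41402abc4b2a76b9719d911017c592
    }
}
```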
Jelle Klap
Bartender

Joined: Mar 10, 2008
Posts: 1822

An MD5 digest has a fixed length of 128 bits, which is typically returned by the digester as an array of 16 bytes.
No character encoding is applied to those bytes, which is where the BASE64 encoder comes in.
And yes, I am Dutch.

Edit: Ugh, too slow.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39818
Too difficult for "beginning". Moving thread.
 