Strings and character encodings

 
Robin Dee
Greenhorn
Posts: 7
Hello all,

I'm quite new to Java (not to programming though) and I've run into some kind of problem in a tiny app I wrote. I hope I can get some help here :-).

What's going on? Using Apache's PDFBox I extract some text from PDF files. After extracting the text from a PDF, I MD5 the text and store the hash in a database. That works just fine, except in some special cases: if the text contains non-ASCII characters, the MD5 hash differs depending on whether I run my Java app on Linux or on Windows. My guess is that this comes from the different character encodings used by Linux and Windows.

What would be a good way to solve this issue? Can I force my string to be converted into some specific encoding (LATIN-1 for example) before applying the MD5 hash in order to guarantee identical results on Windows and Linux?

Best, Robin
 
Paul Clapham
Sheriff
Posts: 27456
When you say "the outcome of the MD5 hashing", are you comparing the array of bytes which is the outcome? Or are you converting that array of bytes to a String?

If it's the latter, then don't do that. You should only convert bytes to a String if they represent text. And an MD5 hash doesn't represent text.

And I think you have the string encoding concept backwards. You said:

Can I force my string to be converted into some specific encoding


But it's the array of bytes which is in some encoding. A String is never in any encoding, since it just represents a sequence of Unicode code points. You can certainly encode a String into an array of bytes using any encoding you like. (But Latin-1 would be a bad choice, since it can't represent all Unicode characters.)
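
To illustrate the point about encoding a String into bytes, here is a minimal sketch. The class and method names (`EncodingDemo`, `utf8`, `latin1`) are made up for this example and the sample text is arbitrary. It encodes the same String with two explicitly named charsets instead of relying on the platform default, which is what typically differs between Windows and Linux.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original poster's application.
final class EncodingDemo {

    // Encode with an explicitly named charset so the result does not
    // depend on the platform default encoding.
    static byte[] utf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // ISO-8859-1 (Latin-1) can only represent code points U+0000..U+00FF.
    // getBytes(Charset) replaces anything else with the charset's
    // replacement byte, which is '?' for ISO-8859-1.
    static byte[] latin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 \u20ac"; // "café" followed by a euro sign
        // The e-acute takes two bytes in UTF-8 and one byte in Latin-1.
        // The euro sign takes three bytes in UTF-8 and is lost in Latin-1.
        System.out.println(Arrays.toString(utf8(text)));
        System.out.println(Arrays.toString(latin1(text)));
    }
}
```

The same String yields different byte arrays, so any hash computed over those bytes differs as well. With Latin-1, two different Strings can even collapse to the same bytes once unmappable characters are replaced.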
 
Robin Dee
Greenhorn
Posts: 7

Paul Clapham wrote:When you say "the outcome of the MD5 hashing", are you comparing the array of bytes which is the outcome? Or are you converting that array of bytes to a String?

If it's the latter, then don't do that. You should only convert bytes to a String if they represent text. And an MD5 hash doesn't represent text.



Hi Paul,

I'm using only strings: I store the output of the PDF text extraction in a String variable, and both the input and the output of the MD5 hash are Strings. The output should be a String (I don't see any harm in that, as MD5 hashes only contain ASCII characters, right?).

Thanks!
 
Jelle Klap
Bartender
Posts: 1952
This is commonly solved by obtaining a byte representation of the (usually password) String in a fixed encoding (e.g. UTF-8), applying the one-way hashing algorithm of your choosing to that byte sequence to obtain the digest, applying BASE64 encoding to that digest and storing it in the database in US-ASCII encoding.
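
A minimal sketch of those steps, assuming Java 8 or later for `java.util.Base64`. The class and method names (`TextDigest`, `md5Base64`) are hypothetical, and MD5 is used only because that is what the thread is about.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical helper class, shown only to illustrate the steps.
final class TextDigest {

    static String md5Base64(String text) {
        try {
            // 1. Fixed encoding: the same bytes on every platform.
            byte[] input = text.getBytes(StandardCharsets.UTF_8);
            // 2. Hash the bytes; an MD5 digest is always 16 bytes long.
            byte[] digest = MessageDigest.getInstance("MD5").digest(input);
            // 3. Base64 turns the raw digest into plain ASCII characters
            //    that are safe to store in a text column.
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to support MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Base64("caf\u00e9"));
    }
}
```

Because the charset is named explicitly, the result no longer depends on the operating system's default encoding.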
 
Robin Dee
Greenhorn
Posts: 7

Jelle Klap wrote:This is commonly solved by obtaining a byte representation of the (usually password) String in a fixed encoding (e.g. UTF-8), applying the one-way hashing algorithm of your choosing to that byte sequence to obtain the digest, applying BASE64 encoding to that digest and storing it in the database in US-ASCII encoding.



Hi Jelle,

I figure I only need to do the first step: obtain a byte representation of the String and pass that to my MD5 hasher...? Isn't the output of the MD5 hasher ASCII-encoded? Or would there be a good reason to convert the digest to Base64 and store that?

Best, Robin

P.S. Dutch, I presume? ;)
 
Paul Clapham
Sheriff
Posts: 27456
I don't know what MD5 hasher you are using. The result of an MD5 hash is a 16-byte array, not ASCII. However, you may be using some option which converts that array to its representation as 32 hexadecimal characters, in which case the result is ASCII.
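
For reference, a minimal sketch of that hexadecimal conversion using only the JDK. The names `Md5Hex`, `toHex` and `md5Hex` are invented for this example, and the input is encoded as UTF-8, which is an assumption about what the application wants.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; the hasher actually used in the thread is unknown.
final class Md5Hex {

    // Render each byte as exactly two lowercase hexadecimal digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so negative bytes do not sign-extend.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Hash the UTF-8 bytes of the text and return 32 hexadecimal characters.
    static String md5Hex(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return toHex(md5.digest(text.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("abc")); // 900150983cd24fb0d6963f7d28e17f72
    }
}
```

The 16 raw bytes become 32 characters from the set 0-9 and a-f, which is the ASCII form most people have in mind when they talk about "an MD5 hash".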
 
Jelle Klap
Bartender
Posts: 1952
An MD5 digest has a fixed length of 128 bits, which the digester typically returns as an array of 16 bytes.
No character encoding is applied to those bytes, which is where the Base64 encoder comes in.
And yes, I am Dutch.

Edit: Ugh, too slow.
 
Marshal
Posts: 76874
Too difficult for "beginning". Moving thread.
 