Unable to correctly read UTF-8

 
Richard Hayward
Ranch Hand
Posts: 209
13
VI Editor
To experiment with UTF-8, I have a file 'testfile.utf8' consisting of these bytes (hex):
41 C2 A3 E0 A4 85 F0 90 84 B7



The file consists of 4 characters of length 1 byte, 2 bytes, 3 bytes & 4 bytes.

hex           code point   character
41            U+0041       LATIN CAPITAL LETTER A
C2 A3         U+00A3       POUND SIGN
E0 A4 85      U+0905       DEVANAGARI LETTER A
F0 90 84 B7   U+10137      AEGEAN WEIGHT BASE UNIT


I found this page UTF-8 table handy for checking these.

To read the file & display its characters I wrote the following


The first 3 characters are displayed correctly, but the 4th character, which should occupy 4 bytes, isn't read as a single character. It seems to be read as two.

Could anyone tell me what I'm doing wrong?
 
Carey Brown
Saloon Keeper
Posts: 10705
86
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows ChatGPT
Your hex print could have been simplified to


Have you tried to read the 10 bytes of your file as bytes and printing them out to see if they are what you think? Or use a hexdump utility?
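
A minimal sketch of the byte-level check suggested above, assuming the file name 'testfile.utf8' from the first post. The class name HexDump and the helper toHex are hypothetical and are not the code that was posted in the thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class HexDump {
    /** Formats each byte as two upper-case hex digits, separated by spaces. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so that bytes >= 0x80 are not sign-extended.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // No decoding happens here: the file is read as raw bytes.
        byte[] raw = Files.readAllBytes(Paths.get("testfile.utf8"));
        System.out.println(raw.length + " bytes: " + toHex(raw));
    }
}
```

If the printed bytes match the hex listing in the first post, the file itself is fine and the problem lies in how the decoded characters are handled.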
 
Richard Hayward
Ranch Hand
Posts: 209
13
VI Editor

Carey Brown wrote: Your hex print could have been simplified to


Thanks, that makes the code simpler. I was unaware of that formatting conversion.

Carey Brown wrote:
Have you tried to read the 10 bytes of your file as bytes and printing them out to see if they are what you think? Or use a hexdump utility?


Yes, my first screenshot showed the output from the Linux xxd command.
Plus, I was working with the file in a hex editor.

 
Carey Brown
Saloon Keeper
Posts: 10705
86
Eclipse IDE Firefox Browser MySQL Database VI Editor Java Windows ChatGPT
A partial clue.
The four hex values (d800dd37) are what you would see if the character were encoded in UTF-16. Out of curiosity, have you tried reading it in as UTF-16 just to see?
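
The clue can be reproduced without the file. The sketch below decodes the last four bytes from the first post as UTF-8 and prints the resulting Java char values. The class name SurrogateDemo and the helper charsAsHex are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateDemo {
    /** Returns each 16-bit char of the string as four lower-case hex digits. */
    static String charsAsHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The 4-byte UTF-8 sequence for U+10137 from the first post.
        byte[] fourBytes = {(byte) 0xF0, (byte) 0x90, (byte) 0x84, (byte) 0xB7};
        String s = new String(fourBytes, StandardCharsets.UTF_8);

        System.out.println(s.length());                      // 2 (char values)
        System.out.println(charsAsHex(s));                   // d800 dd37
        System.out.println(s.codePointCount(0, s.length())); // 1 (code point)
    }
}
```

The decoder produced one code point, but the String stores it as two char values, which is where d800 and dd37 come from.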
 
Norm Radder
Rancher
Posts: 5008
38
How would the last char held in 4 bytes that maps to 21 bits fit in a single unicode character?

Did you map the bits for the first three characters?  Did they map correctly?
I used the mapping from: https://en.wikipedia.org/wiki/UTF-8
 
Richard Hayward
Ranch Hand
Posts: 209
13
VI Editor

Norm Radder wrote: How would the last char held in 4 bytes that maps to 21 bits fit in a single unicode character?


From Wikipedia, Unicode characters with code points in the range U+10000 to U+10FFFF are encoded in 4 bytes:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The last 4 bytes in my file are, in both hex & binary format

F0       90       84       B7
11110000 10010000 10000100 10110111
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

So, the 21 bits marked x correspond to
000010000000100110111 = 10137 (hex)

It's U+10137 that I was expecting to read from the file, in those 4 bytes.

Or is that not what you were asking?

Actually, the last code point for UTF-8 4 byte characters is given at wikipedia as U+10FFFF. A tutorial on youtube gives the last code point as U+1FFFFF. Not sure yet which is correct, but I don't think that matter has a bearing on my problem.
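
The bit extraction above can be written out as shifts and masks. This is a sketch only: the class name Utf8FourByteDecode and the method decode4 are hypothetical, and the method does no validation of the lead or continuation bytes. On the question of the upper limit, the 21 payload bits of the 4-byte pattern could hold values up to 0x1FFFFF, but Unicode itself stops at U+10FFFF, which is also the value of Java's Character.MAX_CODE_POINT.

```java
final class Utf8FourByteDecode {
    /**
     * Combines the payload bits of a 4-byte UTF-8 sequence
     * (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx) into one code point.
     * Assumes the four values really are a well-formed sequence.
     */
    static int decode4(int b0, int b1, int b2, int b3) {
        return ((b0 & 0x07) << 18)   // 3 bits from the lead byte
             | ((b1 & 0x3F) << 12)   // 6 bits from each continuation byte
             | ((b2 & 0x3F) << 6)
             |  (b3 & 0x3F);
    }

    public static void main(String[] args) {
        int cp = decode4(0xF0, 0x90, 0x84, 0xB7);
        System.out.printf("U+%04X%n", cp);                        // U+10137
        System.out.printf("U+%04X%n", Character.MAX_CODE_POINT);  // U+10FFFF
    }
}
```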
 
Norm Radder
Rancher
Posts: 5008
38
A unicode character holds 16 bits.  How would the 21 bits from the 4 bytes be placed in the 16 bit unicode char?

00001 0000 0001 0011 0111 = 1 0137 (hex)  

How is the leading 1 held?  How does unicode specify that there are two char values needed to hold the one char that came from the 4 bytes utf8?
 
Richard Hayward
Ranch Hand
Posts: 209
13
VI Editor

Norm Radder wrote: A unicode character holds 16 bits.


I don't think that's true in the case of UTF-8 which is a variable length encoding.
The letter A,  code point = U+0041 for example, only needs a single byte.

 
Norm Radder
Rancher
Posts: 5008
38
I think unicode chars use 16 bits/2 bytes.  What happens when the char requires more bits like the 4 byte UTF8 char?

Read the API doc for the Character class.
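
The Character class documentation describes how code points above U+FFFF are represented as a surrogate pair of two char values. A sketch of that arithmetic follows, checked against the library's own Character.toChars. The class name SurrogateMath and the method toSurrogates are hypothetical, and the method assumes its argument is in the range U+10000 to U+10FFFF.

```java
import java.util.Arrays;

final class SurrogateMath {
    /** Splits a supplementary code point into its UTF-16 high and low surrogates. */
    static char[] toSurrogates(int codePoint) {
        int v = codePoint - 0x10000;                 // leaves a 20-bit value
        char high = (char) (0xD800 + (v >> 10));     // top 10 bits
        char low  = (char) (0xDC00 + (v & 0x3FF));   // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10137);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);  // D800 DD37

        // The library methods give the same answer and can reverse it.
        System.out.println(Arrays.equals(pair, Character.toChars(0x10137)));        // true
        System.out.printf("U+%04X%n", Character.toCodePoint(pair[0], pair[1]));     // U+10137
    }
}
```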
 
Richard Hayward
Ranch Hand
Posts: 209
13
VI Editor

Norm Radder wrote: What happens when the char requires more bits like the 4 byte UTF8 char?


The leading 11110 bits of the first byte indicate that the character is going to use 4 bytes.
The leading 10 bits of the following 3 bytes indicate that it's a continuation.

Hence the notation
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

I think such a system can continue up to a length of 6 bytes.
youtube tutorial
 
Richard Hayward
Ranch Hand
Posts: 209
13
VI Editor

Norm Radder wrote: I think unicode chars use 16 bits/2 bytes.



Ah, the java char datatype is 16 bit!

I get it.
Norm & Carey, thanks to both of you for your help!
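
With that in mind, one way to get one entry per character is to iterate over code points rather than char values. The listing from the first post is not available, so this is only a sketch of the idea and not a correction of that code. The class name ReadCodePoints and the method codePointsOf are hypothetical, and the file name is the one from the first post.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class ReadCodePoints {
    /** Decodes UTF-8 bytes and returns one int per Unicode code point. */
    static int[] codePointsOf(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        // codePoints() joins each surrogate pair back into a single value.
        return text.codePoints().toArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get("testfile.utf8"));
        for (int cp : codePointsOf(raw)) {
            // charCount is 1 for U+0000..U+FFFF and 2 above that.
            System.out.printf("U+%04X  %s  (%d Java char)%n",
                    cp, Character.getName(cp), Character.charCount(cp));
        }
    }
}
```

Reading char by char from a Reader still works, provided the loop checks Character.isHighSurrogate and combines the next char with Character.toCodePoint.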
 
Marshal
Posts: 79177
377

Richard Hayward wrote: . . . the java char datatype is 16 bit! . . .

I believe that Java® Strings default to an encoding called UTF-16. Not certain however.
 
Saloon Keeper
Posts: 15510
363
Yes, chars in Java are ALWAYS UTF-16 code units.

To output Java Strings with a specific encoding, you need to use a writer that's configured to use that encoding.
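
A minimal sketch of such a writer, using an in-memory stream so the resulting bytes can be inspected. The class name EncodeWithWriter and the method encode are hypothetical. The string literal spells the four characters from the first post, with U+10137 written as its surrogate pair.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodeWithWriter {
    /** Writes the text through a Writer configured with the given charset. */
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the writer flushes any buffered bytes into 'out'.
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "A\u00A3\u0905\uD800\uDD37";
        StringBuilder hex = new StringBuilder();
        for (byte b : encode(text, StandardCharsets.UTF_8)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());  // 41 C2 A3 E0 A4 85 F0 90 84 B7
    }
}
```

The same pattern applies to a file or the console: wrap the target stream in an OutputStreamWriter and pass the charset explicitly rather than relying on the platform default.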
 