Confusion about UTF and octal encoding

 
Ranch Hand
Posts: 658
2
Java
Hi Ranchers,
I was playing with some code and ran into something confusing.





When I read up on it, I learned that Java uses UTF-16 for Java source code encoding. But I am unable to relate this to my issue.
If anyone can also point me to a good resource on this topic, that would be great.

Thanks
 
Rancher
Posts: 4801
50
The first one is because the compiler interprets any integer value as an int.

The second two are because that's how an octal escape is defined: an escape character followed by an integer up to (I think) 255.
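The snippet being discussed isn't shown here, so the following is only a sketch of the two rules mentioned above, not a reconstruction of the original code. The class name `LiteralTypeDemo` and the overloaded `kind` helper are made up for illustration.

```java
public class LiteralTypeDemo {
    // Hypothetical overloads: the compiler picks one from the static type of the argument.
    static String kind(int value)  { return "int"; }
    static String kind(char value) { return "char"; }

    public static void main(String[] args) {
        System.out.println(kind(65));       // int: a bare integer literal is an int
        System.out.println(kind('A'));      // char
        System.out.println(kind('A' + 1));  // int: char arithmetic is promoted to int
        System.out.println(kind('\101'));   // char: an octal escape is still a char literal

        char c = 65;                        // allowed: a constant that fits in a char
        System.out.println(c);              // A
    }
}
```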
 
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu

Puspender Tanwar wrote:When I read up on it, I learned that Java uses UTF-16 for Java source code encoding. But I am unable to relate this to my issue.
If anyone can also point me to a good resource on this topic, that would be great.


Well, the first one I can think of is the Character class itself.

Characters - and especially character encodings - are NOT simple; and there's a lot of history behind them.
Originally, computers only dealt with the English alphabet (52 characters), control codes, and a few other common symbols like '/', '-' and '*', because they fit in a very small space (7 bits), which in turn fits nicely into a byte (8 bits). And for ages, there were two basic standards for encoding: ASCII (the 'A' standing for 'American') - used by a lot of early Unixes and (I think) DEC - and EBCDIC, which was used by IBM and ICL.

However, over time, especially with the advent of desktop systems, people wanted to see their own languages - French, German, Spanish, etc - represented, and these have a lot of diacritics or "accents" that English doesn't - 'é' is not the same thing as 'e' in French. Then there are the Greek and Cyrillic alphabets; and when you get to pictogram alphabets like Chinese, there are about 3,000 commonly used symbols.

Obviously, 8 bits can't cope with those sorts of numbers so, by the time Java became a reality, there was already a standard in place called Unicode, which used 16 bits (≈65,000 values) to cover most of the world's alphabets, and this was the one that Java opted for - which is why Java characters are normally TWO bytes (16 bits) long.

Problem is, with the advent of browsers, and HTML, even 16 bits isn't enough, so Unicode was extended to allow even bigger values, which is what UTF-8 and UTF-16 are all about.
It's possibly also worth mentioning that the 'TF' stands for "transformation format", since it's a format - or encoding - for streams of bytes (8) or characters (16) that you might receive from a file, or over a network or socket.
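As a rough illustration of the difference between a Java `char` (one 16-bit unit) and a Unicode code point, here is a small sketch. The class name `CodeUnitDemo` and its two helper methods are made up for this example; the sample characters are arbitrary choices.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitDemo {
    // Number of bytes the text occupies in each encoding.
    static int utf8Length(String s)  { return s.getBytes(StandardCharsets.UTF_8).length; }
    static int utf16Length(String s) { return s.getBytes(StandardCharsets.UTF_16BE).length; }

    public static void main(String[] args) {
        String ascii = "e";
        String accented = "\u00e9";                             // e with an acute accent
        String emoji = new String(Character.toChars(0x1F600));  // a code point above 0xFFFF

        System.out.println(Character.SIZE);                     // 16 bits per char

        for (String s : new String[] {ascii, accented, emoji}) {
            System.out.println(s.length() + " char(s), "
                    + s.codePointCount(0, s.length()) + " code point(s), "
                    + utf8Length(s) + " UTF-8 byte(s), "
                    + utf16Length(s) + " UTF-16 byte(s)");
        }
    }
}
```

The last string is the interesting one: a single code point that needs two `char` values in Java and four bytes in either encoding.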

None of which gets you any closer, but hopefully provides a bit of background.
My advice: Google some things like "Unicode", "character encoding", "UTF-8" and "UTF-16", and concentrate on the Wikipedia articles, because they're generally very good.

HIH

Winston
 
Java Cowboy
Posts: 16084
88
Android Scala IntelliJ IDE Spring Java
If your question is why the character code is in octal rather than decimal, then that is most likely because this way of escaping characters is a feature that Java inherited from C, which has had this feature for decades. Someone in the 1970s thought that octal character escape codes were useful, for some reason that nobody remembers anymore.

edit - the relevant section in the Java Language Specification indeed says that this comes from C:

JLS wrote:
Octal escapes are provided for compatibility with C, but can express only Unicode values \u0000 through \u00FF, so Unicode escapes are usually preferred.
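To make the range limit in that JLS sentence concrete, here is a minimal sketch. The class and constant names are invented for illustration.

```java
public class OctalRangeDemo {
    static final char OCTAL_A = '\101';    // octal 101 = decimal 65
    static final char UNICODE_A = '\u0041';
    static final char MAX_OCTAL = '\377';  // largest value an octal escape can express
    static final String PAST_MAX = "\400"; // parsed as the escape \40 followed by the digit 0

    public static void main(String[] args) {
        System.out.println(OCTAL_A == UNICODE_A);        // true
        System.out.println((int) MAX_OCTAL);             // 255
        System.out.println(PAST_MAX.length());           // 2
        System.out.println((int) PAST_MAX.charAt(0));    // 32, the space character
    }
}
```

Anything above that range has to be written as a Unicode escape instead.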


 
Marshal
Posts: 74387
334

Dave Tolls wrote:. . . escape character followed by an integer up to (I think) 255.

377, surely? It should be in the Java® Language Specification (=JLS).

Why are you using octal arithmetic in the first place? It has hardly been used for ages. Even that JLS link says to prefer \u1234 escapes. As Jesper says, the JLS says octal escapes are there for compatibility with older C‑like languages. I presume octal escapes were useful for characters like ß (=0x00df) or » (=0x00bb) which are included in the 0...255 range, but weren't available on many keyboards.
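For anyone wanting to check the 255 versus 377 arithmetic, or to see the two example characters written both ways, here is a small sketch. `OctalConversionDemo` and `toOctalEscape` are hypothetical names, not anything from the JLS.

```java
public class OctalConversionDemo {
    // Renders a value in the range 0..255 as the text of an octal escape, e.g. 223 -> \337.
    static String toOctalEscape(int value) {
        if (value < 0 || value > 0xFF) {
            throw new IllegalArgumentException("octal escapes only cover 0..255: " + value);
        }
        return "\\" + Integer.toOctalString(value);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toOctalString(255));  // 377
        System.out.println(toOctalEscape(0xDF));         // \337 (sharp s)
        System.out.println(toOctalEscape(0xBB));         // \273 (right guillemet)

        // The octal and Unicode escapes denote the same char values.
        System.out.println('\337' == '\u00df');          // true
        System.out.println('\273' == '\u00bb');          // true
    }
}
```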
 
Dave Tolls
Rancher
Posts: 4801
50

Campbell Ritchie wrote:

Dave Tolls wrote:. . . escape character followed by an integer up to (I think) 255.

377, surely? It should be in the Java® Language Specification (=JLS).



I was working in decimal...

;)
 
Puspender Tanwar
Ranch Hand
Posts: 658
2
Java
Thank you all.
Now I understand the point and will read through the wiki pages for deeper insight. But what I noticed is that I can only print up to \u00FF. Beyond that, for every Unicode value, the output is '?'. Why am I not able to print beyond \u00FF?
 
Puspender Tanwar
Ranch Hand
Posts: 658
2
Java
Problem solved: in Eclipse, go to Window -> Preferences -> General -> Workspace and, under Text file encoding, select UTF-16 or UTF-8.
 
Campbell Ritchie
Marshal
Posts: 74387
334

Puspender Tanwar wrote:. . . . But what I noticed is that I can only print up to \u00FF. Beyond that, for every Unicode value, the output is '?'. . . .

Where are you printing? The Windows® command line is notorious for being unable to render characters > 0x00ff, and even some “extended ASCII” characters come out oddly. For example £ is 0x00a3 but renders as ú on the command line. It has to do with the encoding used, which isn't Unicode but something beginning cp. If you haven't found anything really good to read about encodings, try Joel Spolsky.
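One way to see where a '?' can come from is to push text through an encoding that cannot represent it. The sketch below assumes that is roughly what happens on the way to the console; `UnmappableDemo` and `roundTrip` are made-up names, and the sample text is arbitrary.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UnmappableDemo {
    // Encodes the text with the given charset and decodes it again with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "\u00a3 \u20ac"; // pound sign, space, euro sign

        // US-ASCII has neither character, so both are replaced by '?'.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII));                     // ? ?

        // ISO-8859-1 stops at 0xFF: the pound sign survives, the euro sign does not.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1).equals("\u00a3 ?")); // true

        // UTF-8 can represent every code point, so nothing is lost.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));           // true
    }
}
```

The comparisons are printed as booleans so that the demo itself does not depend on what the console can render.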
 
Puspender Tanwar
Ranch Hand
Posts: 658
2
Java
Thanks Campbell. That's a really helpful link. Excellent explanation for a beginner.
But I have some doubts about this part:

That's where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn't it also be:

48 00 65 00 6C 00 6C 00 6F 00 ?

Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode



How can 00 48 be the same as 48 00, as stated above? As far as I know, these are two different hex numbers.
Next, the blog says that in this Unicode encoding 2 bytes are used to store a code point. Please correct me if I am wrong: 0048 is stored in 2 bytes, right? That is, 00 occupies one byte and 48 occupies the other, right?
 
Marshal
Posts: 26914
82
Eclipse IDE Firefox Browser MySQL Database

Puspender Tanwar wrote:How can 00 48 be the same as 48 00, as stated above? As far as I know, these are two different hex numbers.



But if one of them is in a big-endian representation and the other is in a little-endian representation, then they represent the same value. (You might want to google those terms to find out the distinction they are describing.)
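A quick way to see this for yourself is to encode the same string both ways and dump the bytes. This is only a sketch; `EndianDemo` and its `hex` helper are names invented for the example.

```java
import java.nio.charset.StandardCharsets;

public class EndianDemo {
    // Formats bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bigEndian = "Hello".getBytes(StandardCharsets.UTF_16BE);
        byte[] littleEndian = "Hello".getBytes(StandardCharsets.UTF_16LE);

        System.out.println(hex(bigEndian));     // 00 48 00 65 00 6C 00 6C 00 6F
        System.out.println(hex(littleEndian));  // 48 00 65 00 6C 00 6C 00 6F 00

        // Each byte order decodes back to the same text when read with the matching charset.
        System.out.println(new String(bigEndian, StandardCharsets.UTF_16BE));     // Hello
        System.out.println(new String(littleEndian, StandardCharsets.UTF_16LE));  // Hello

        // Reading with the wrong byte order gives different characters.
        System.out.println(new String(bigEndian, StandardCharsets.UTF_16LE).equals("Hello")); // false
    }
}
```

So the byte sequences differ, but the value they represent (U+0048 and so on) is the same once you know which byte comes first.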
 