A strange unicode String literal problem

 
Ranch Hand
Posts: 161
1. If I print the following Unicode string in a servlet:

String s = "\u1f26\u0323\u1f82";
out.println(s);

The Greek characters are displayed correctly in the HTML page.

2. But when I get the same Unicode string (\u1f26\u0323\u1f82) from elsewhere, that is, when I do NOT initialize String s with the literal as above, the out.println statement prints the literal text \u1f26\u0323\u1f82 on the HTML page.

In case 2, the Unicode escape sequences are not "parsed."

Why are these escape sequences not parsed?

Many thanks in advance!
 
Author and all-around good cowpoke
Posts: 13078
How are you doing this operation?

"But when I get the unicode String (\u1f26\u0323\u1f82) from elsewhere"

If you are using a Reader, I would expect it to do the transformation.
Bill
 
Benjamin Weaver
Ranch Hand
Posts: 161
Bill,

Thanks for taking a shot. Below is some code that drives the point home. Notice the commented-out line in which the String unicode is initialized with a string literal. If that line is uncommented (and the following line commented out), the text is stored correctly as UTF-8 in the file foo.txt and displayed correctly when read back from the file. But if the String unicode comes from a conversion routine (the returned string is correct), it is written to the file as the literal escape sequences, not as UTF-8, and it is displayed as literal sequences when read back from the file.

So, in this (heuristic) example, BufferedWriter does not convert the sequences.

[ July 05, 2004: Message edited by: Jim Yingst ]
 
Benjamin Weaver
Ranch Hand
Posts: 161
I made a mistake in that example code. In the actual servlet I write to the HTML page using a PrintWriter, not a BufferedWriter as shown here.
 
William Brogden
Author and all-around good cowpoke
Posts: 13078
When you have this text in your program:
String unicode = "\u1f82\u1f26\u1f82\u1f26\u1f82\u1f26\u1f82\u1f26";
the conversion is done by the compiler as it reads the source code file, before the string ever exists at run time. Therefore it is not surprising that your translateToUnicode method does not create the same thing.

Exactly what does that method do? Are you using literal unicode characters, or what?

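BufferedWriter certainly does not convert "\uXXXX" sequences. A Writer emits exactly the characters it is given; translating the escapes has to happen on the input side, before the string reaches the Writer.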
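
To make the distinction concrete, here is a minimal sketch (the class and method names are made up for illustration). The first method uses escapes that javac translates while compiling; the second builds the same visible text at run time with a doubled backslash, which is roughly what a string arriving from a form, a file, or a conversion routine looks like.

```java
final class EscapeDemo {

    // javac turns each backslash-u escape into a single char before the program runs.
    static String compiled() {
        return "\u1f26\u0323\u1f82";
    }

    // The doubled backslash yields a real backslash character followed by 'u'
    // and four hex digits, so nothing is translated by the compiler.
    static String runtime() {
        return "\\u1f26\\u0323\\u1f82";
    }

    public static void main(String[] args) {
        System.out.println(compiled().length()); // 3
        System.out.println(runtime().length());  // 18
    }
}
```

Printing the second string shows the backslashes and hex digits themselves, which matches the behaviour described in the question.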

Bill
 
Benjamin Weaver
Ranch Hand
Posts: 161
Bill,

We're getting closer to the answer, I think. Here's what the translateToUnicode() method does:

1. returns a string of literal unicode sequences, e.g. \u1f82\u1f26\u1f82\u1f26\u1f82\u1f26\u1f82\u1f26

2. #1 is the important fact, but I will explain what the method does. To produce #1, translateToUnicode() converts a "Betacode" representation of ancient Greek into the Unicode character string. Betacode lets users with primitive browsers enter Greek text using Latin ASCII characters. For example, a Greek letter alpha with an accent mark over it is written in Betacode as "A/". This Betacode has a single- or double-character Unicode equivalent, depending on the Unicode normalization form. In the normalization scheme we are using, "A/" maps to a single Unicode character: \u03AC. The Latin characters entered into a textarea in the browser are converted into a Unicode sequence of the kind cited above and either stored in a database or returned to the user in a separate HTML page or in an Applet JTextArea.
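
One way to sidestep the problem is to have the translation step emit real characters instead of escape text. The following is only a sketch under stated assumptions: the class name, the table, and the pass-through rule for unmapped input are hypothetical, and the single "A/" entry is the only mapping taken from this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BetacodeSketch {

    // Hypothetical lookup table; only the "A/" entry comes from the thread.
    private static final Map<String, Character> TABLE = new LinkedHashMap<>();
    static {
        TABLE.put("A/", '\u03AC');
    }

    // Appends the mapped character itself (a real char), not six characters of escape text.
    static String translate(String betacode) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < betacode.length()) {
            boolean matched = false;
            for (Map.Entry<String, Character> e : TABLE.entrySet()) {
                if (betacode.startsWith(e.getKey(), i)) {
                    out.append(e.getValue().charValue());
                    i += e.getKey().length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(betacode.charAt(i)); // assumption: unmapped input passes through unchanged
                i++;
            }
        }
        return out.toString();
    }
}
```

With this shape, translate("A/") returns a one-character string, so a Writer with a UTF-8 encoding can output it directly.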

I have verified that translateToUnicode() returns a correct Unicode sequence. The implementation of this method does not actually write to, then read from, a file; I included that code simply to highlight the conversion problem.


The problem to be solved is how to get Java to convert the Unicode escape sequence (e.g. \u1f26\u1f82\u1f26\u1f82\u1f26) into displayable characters, preferably in UTF-8 encoding.
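
If the escape text cannot be avoided, it can be decoded by hand at run time. Here is a minimal sketch, assuming the input only uses the backslash, 'u', four-hex-digit form; the class and method names are invented for illustration and are not part of any library.

```java
import java.nio.charset.StandardCharsets;

final class UnicodeEscapeDecoder {

    // Turns each backslash, 'u', four-hex-digit run into the single char it names.
    // Anything that does not match that exact shape is copied through unchanged.
    static String decode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int value = escapeAt(text, i);
            if (value >= 0) {
                out.append((char) value);
                i += 6; // skip the backslash, the 'u', and the four hex digits
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // Returns the code unit encoded at position i, or -1 if no escape starts there.
    private static int escapeAt(String text, int i) {
        if (i + 6 > text.length() || text.charAt(i) != '\\' || text.charAt(i + 1) != 'u') {
            return -1;
        }
        int value = 0;
        for (int k = i + 2; k < i + 6; k++) {
            int digit = Character.digit(text.charAt(k), 16);
            if (digit < 0) {
                return -1;
            }
            value = value * 16 + digit;
        }
        return value;
    }

    // Decodes the escapes, then encodes the resulting characters as UTF-8 bytes.
    static byte[] toUtf8(String text) {
        return decode(text).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String fromElsewhere = "\\u1f26\\u0323\\u1f82";
        System.out.println(decode(fromElsewhere).length()); // 3
        System.out.println(toUtf8(fromElsewhere).length);   // 8
    }
}
```

In a servlet, the response would presumably also need a UTF-8 character encoding, for example response.setContentType("text/html; charset=UTF-8") called before getWriter(), so that the PrintWriter encodes the decoded characters as UTF-8.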
 