Problem with writing a String containing both English and Non-English chars into a text file.

 
Ranch Hand
Posts: 327
Hi All,

I am on Windows XP with the latest Java SDK and the latest Eclipse.

I am reading a string from an input file that contains both English and non-English characters. (I have installed the required Windows support for the other language, so I can read it in Windows apps such as Notepad.)

While debugging in Eclipse, I see that the string seems to contain the data correctly: both the English and the non-English chars seem to be okay.

Now I am converting the string to a byte array and writing it to the output file. But when I open that file with Notepad, I see that all the non-English chars have been converted to question marks.

What might be the problem?

Here is the relevant code:



Jericho HTML Parser

 
Sheriff
Posts: 22783
I suggest you try with a proper editor first. Notepad is barely worthy of the name "text editor". In fact, it is so limited (e.g. encodings, line breaks, file size) you can barely call it a program. Notepad++ or PSPad are both free and are said to be quite good.

If you still see problems then check the encoding you are using when writing to the file. Check out String.getBytes(String) or String.getBytes(Charset)
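As a minimal sketch of that suggestion (not the original poster's code, which is not shown in the thread), the following writes a mixed-language string with an explicit charset and reads it back. The class name `WriteWithCharset`, the helper names, and the choice of UTF-8 are assumptions made for illustration; `StandardCharsets` requires Java 7 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteWithCharset {

    // Encodes the text as UTF-8 explicitly instead of relying on the
    // platform default charset, then writes the bytes to the file.
    static void writeUtf8(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    // Decodes the file with the same charset that was used to write it.
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical temp file; unicode escapes keep the source ASCII-only.
        Path file = Files.createTempFile("mixed-text", ".txt");
        String text = "English and \u0142\u00f3d\u017a"; // "English and łódź"
        writeUtf8(file, text);
        System.out.println(text.equals(readUtf8(file)));
        Files.delete(file);
    }
}
```

The editor used to open the file has to interpret it as the same charset; otherwise the bytes are correct but displayed wrongly.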
 
Joseph Sweet
Ranch Hand
Posts: 327
Thanks for the idea of String.getBytes(Charset); it worked this time.
But I do not understand the pertinence of the Charset when converting a String into bytes. Why does it matter which Charset the bytes will later be read with? Am I not just taking every byte in the String and pushing it into the next slot of the byte array? I can't see what that has to do with the Charset I would later use to interpret the byte array.

P.S. I do have the PSPad editor, but for some reason it sometimes crashes with large files.
 
Ranch Hand
Posts: 423

Joseph Sweet wrote:
But I do not understand the concept of the pertinence of Charset while converting a String into Bytes....


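Each char in a Java String is a two-byte (16-bit) UTF-16 code unit.
A Charset provides the rules for mapping char values (two bytes, 65536 possible values) to byte values (one byte, 256 possible values each). Depending on the charset, a char may map to one byte, to several bytes, or to a substitute such as '?' when the charset cannot represent it.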

Look at this example, which converts the Polish 'ł' char using different charsets:

result is:

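The Polish character 'ł' is supported by the single-byte encodings windows-1250 and ISO-8859-2 (Unicode encodings such as UTF-8 cover it as well), and for these two encodings the conversion works fine.
Encodings that lack the character give strange results, typically a '?' substituted for it.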
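The code and output originally posted here did not survive. A minimal sketch of such a comparison, not the poster's own code, might look like the following. The class name `CharsetDemo`, the helper names, and the list of charsets are assumptions; windows-1250 and ISO-8859-2 are available in typical JDKs but are not among the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;

public class CharsetDemo {

    // Formats a byte array as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    // Encodes the text with the named charset and returns the bytes as hex.
    // Chars the charset cannot represent are replaced by getBytes with the
    // charset's default replacement, which is '?' (0x3F) for ISO-8859-1.
    static String encodeHex(String text, String charsetName) {
        return toHex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String text = "\u0142"; // Polish 'ł'
        String[] names = {
            "windows-1250", "ISO-8859-2", "ISO-8859-1", "US-ASCII", "UTF-8", "UTF-16BE"
        };
        for (String name : names) {
            System.out.println(name + " -> " + encodeHex(text, name));
        }
    }
}
```

With this sketch, windows-1250 and ISO-8859-2 each produce the single byte B3, ISO-8859-1 and US-ASCII produce 3F (a question mark), UTF-8 produces the two bytes C5 82, and UTF-16BE produces 01 42. The same char becomes different bytes under each charset, which is why the charset matters when turning a String into bytes.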
 