
Encoding problem when writing to file system

 
sven coleman
Greenhorn
Posts: 2
Hi,

I'm generating a String containing some HTML with many accented characters. When I try to save it to a file, the accents and Cyrillic characters are mangled. I would like to be able to save a file like this one:

http://www.columbia.edu/kermit/utf8.html


I've tried various encodings for the OutputStreamWriter, MacRoman (I'm in front of a Mac for now) and UTF-8, without any success.

in String -> Study after Velázquez

in file -> Vel�zquez (MacRoman) Vel��zquez (UTF8)

From what I understood, the Mac filesystem is not UTF-8 based, so I just can't output that kind of character correctly...

I guess there is a big issue I'm missing here, but is it possible to produce UTF-8 output that renders correctly on Linux/Windows/Mac, and if so, how?



Thanks
[ November 02, 2006: Message edited by: sven coleman ]
 
Joe Ess
Bartender
Posts: 9258
How are you viewing the file? Are you certain that the application can properly render UTF-8?
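
One way to take the viewer out of the picture is to look at the raw bytes. The sketch below is only an illustration: the class name `ByteDumpSketch` and the method `toHex` are hypothetical, not something from this thread. It prints the bytes that a single accented letter becomes under two encodings; for a real file, the bytes could come from `Files.readAllBytes` instead.

```java
import java.nio.charset.StandardCharsets;

public class ByteDumpSketch {

    /** Formats bytes as lowercase two-digit hex values separated by spaces. */
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values print as two hex digits.
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String aAcute = "\u00e1"; // the accented letter from "Velázquez"
        // In UTF-8 the letter takes two bytes, in ISO-8859-1 only one.
        System.out.println("UTF-8:      " + toHex(aAcute.getBytes(StandardCharsets.UTF_8)));      // c3 a1
        System.out.println("ISO-8859-1: " + toHex(aAcute.getBytes(StandardCharsets.ISO_8859_1))); // e1
    }
}
```

If the saved file shows `c3 a1` where the accented letter should be, the file itself is UTF-8 and the problem is on the viewing side; a single `3f` (a literal question mark) means the character was already lost when writing.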
 
sven coleman
Greenhorn
Posts: 2
Hi Joe,

Thanks a lot for your answer. I'm new to these encoding problems and I'm lost...
As my end target is a browser, I've tried viewing the file in Firefox, both as rendered HTML and as source. I get the same hieroglyphs.
If I look at it using jEdit, the same characters appear. I guess (?) jEdit at least is using the system default encoding (MacRoman). As for Firefox, I guess (???) it should be able to interpret the file correctly if it were really in UTF-8...

As an attempt to find a workaround, I've successfully converted my messy HTML to XHTML using TagSoup. As XML is mainly UTF-8 encoded, I naively thought third-party XML libraries could handle it for me. Special characters look great in the Eclipse console, but when I save the file using the DOM4J XMLWriter:

org.dom4j.io.OutputFormat format = org.dom4j.io.OutputFormat.createCompactFormat();
format.setEncoding("UTF-8");
format.setNewlines(true);
format.setIndentSize(2);
format.setTrimText(false);

org.dom4j.io.XMLWriter xmlWriter = new org.dom4j.io.XMLWriter(
        new FileWriter("/mydir/tagSoup2.html"), format);
xmlWriter.write(docXHtml);
xmlWriter.flush();
xmlWriter.close();

I get ? in place of the accents...
I'm on a Mac, and the file system's encoding isn't UTF-8; I read that this could be the problem.
Would it help if I ran the code on Windows or Linux?

Thanks
[ November 02, 2006: Message edited by: sven coleman ]
 
alban maillere
Greenhorn
Posts: 6
Hello sven,
I'm not used to Mac systems, but I can tell you that I usually resolve accent problems (for European languages) by using ISO-8859-15 (or ISO-8859-1 if the first is not supported).

Hope it helps
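
Since the original question also mentions Cyrillic text, it may be worth checking what a Latin charset can actually represent. The sketch below uses ISO-8859-1 because the JDK guarantees it is available; the class and method names (`CharsetCoverageSketch`, `canEncode`, `roundTrip`) are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetCoverageSketch {

    /** True if every character of text has a representation in the charset. */
    public static boolean canEncode(Charset charset, String text) {
        return charset.newEncoder().canEncode(text);
    }

    /** Encodes and then decodes text with the same charset, showing what survives. */
    public static String roundTrip(Charset charset, String text) {
        // String.getBytes(Charset) silently substitutes the charset's replacement
        // byte for characters it cannot map; for ISO-8859-1 that is '?'.
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String accented = "Vel\u00e1zquez";                       // Latin letters, one acute accent
        String cyrillic = "\u041c\u043e\u0441\u043a\u0432\u0430"; // a six-letter Cyrillic word

        System.out.println(canEncode(StandardCharsets.ISO_8859_1, accented)); // true
        System.out.println(canEncode(StandardCharsets.ISO_8859_1, cyrillic)); // false
        System.out.println(canEncode(StandardCharsets.UTF_8, cyrillic));      // true
        System.out.println(roundTrip(StandardCharsets.ISO_8859_1, cyrillic)); // ??????
    }
}
```

So a Latin charset may well be enough for Western European accents, but a page that mixes them with Cyrillic needs something like UTF-8, and the question marks are what unmappable characters typically turn into.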
 
Vlado Zajac
Ranch Hand
Posts: 245
Filesystem support is only needed for file names. For file data, what matters is support in the target program. Any modern browser supports UTF-8.
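
In other words, the bytes that land in the file are decided by the Writer, not by the filesystem. A minimal sketch of writing with an explicitly named charset follows; the class `Utf8FileWriterSketch` and its methods are hypothetical names for this illustration, and the temporary file stands in for whatever path is really used.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8FileWriterSketch {

    /** Writes text to target as UTF-8, whatever the platform default encoding is. */
    public static void writeUtf8(Path target, String text) {
        // Naming the charset is the whole point: a Writer created without one
        // falls back to the platform default.
        try (Writer out = new OutputStreamWriter(
                Files.newOutputStream(target), StandardCharsets.UTF_8)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes text to a temporary file and returns the raw bytes that ended up on disk. */
    public static byte[] writeAndReadBack(String text) {
        try {
            Path tmp = Files.createTempFile("utf8-sketch", ".html");
            try {
                writeUtf8(tmp, text);
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = writeAndReadBack("Study after Vel\u00e1zquez");
        System.out.println(bytes.length + " bytes on disk");
    }
}
```

The `FileWriter` in the DOM4J snippet earlier in the thread takes no charset argument, so it encodes with the platform default no matter what `format.setEncoding("UTF-8")` says. As far as I know, `XMLWriter` also has a constructor that takes an `OutputStream` together with the `OutputFormat`, which should let it apply the format's encoding itself, but check the dom4j documentation for the version in use.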

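But the program (the browser) must somehow know the encoding of the file.
In HTML, the encoding is usually declared with a meta tag in the document head, for example: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">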
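
A common way to declare the encoding inside an HTML page is a `Content-Type` meta tag in the head. The sketch below generates such a page and writes it as UTF-8 so that the declaration and the actual bytes agree. The names `HtmlCharsetSketch`, `wrapInHtml` and `save` are hypothetical, and the title and body are inserted without any escaping, which is only acceptable for an illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HtmlCharsetSketch {

    /** Builds a small HTML page whose head declares UTF-8 as the encoding. */
    public static String wrapInHtml(String title, String bodyHtml) {
        return "<html>\n"
             + "<head>\n"
             // The declaration a browser reads when no HTTP header tells it the charset.
             + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n"
             + "<title>" + title + "</title>\n"
             + "</head>\n"
             + "<body>\n" + bodyHtml + "\n</body>\n"
             + "</html>\n";
    }

    /** Saves the page using the same encoding that the meta tag announces. */
    public static void save(Path target, String html) {
        try {
            Files.write(target, html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("charset-sketch", ".html");
        save(target, wrapInHtml("Test", "<p>Study after Vel\u00e1zquez</p>"));
        System.out.println("Wrote " + target);
    }
}
```

When the file is served over HTTP rather than opened from disk, a charset in the server's `Content-Type` header takes precedence over the meta tag, so the two should agree.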
 