
Problem preserving accented characters when writing text to file

 
Chris Gage
I saved a large Microsoft Word document containing many accented characters, en dashes, inverted commas, etc. as HTML, then used Java code to convert it to XML and split it into smaller pieces. I then tried to write these pieces out to the filesystem as ISO-8859-1. Since the original MS Word document is CP1252, which is supposed to be a superset of ISO-8859-1, I thought this would be straightforward. But when I reopen these files, all the accented characters, inverted commas, etc. have been converted to question marks.

I did some intensive googling and found many recommendations for resolving this problem. I tried several of them, but could not get any of them to work.

Here is my last iteration (still unsuccessful):



What am I doing wrong?
 
Ramon Anger
Could you please provide a sample string or your getInput() method?
 
Paul Clapham
The problem is that some of the characters you mentioned, such as the en dashes and typographic inverted commas, are not defined in ISO-8859-1, so the encoder replaces them with question marks when the text is written. You should choose a more suitable encoding. (I would suggest UTF-8.)
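
For illustration, here is a minimal sketch of the effect. It is not the original poster's code, which is not shown in the thread; the class name `EncodingRoundTrip`, the method `writeThenRead`, and the sample string are made up for this example, and it assumes the file was written through something like an `OutputStreamWriter`, which silently substitutes `?` for characters the charset cannot encode.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; not taken from the thread.
class EncodingRoundTrip {

    // Writes text to a temporary file in the given charset, then reads it
    // back with the same charset and returns what survived the round trip.
    static String writeThenRead(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                // OutputStreamWriter replaces unmappable characters with the
                // charset's substitution byte ('?' for ISO-8859-1).
                try (Writer writer = new OutputStreamWriter(Files.newOutputStream(file), charset)) {
                    writer.write(text);
                }
                return new String(Files.readAllBytes(file), charset);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // e-acute (U+00E9), en dash (U+2013), curly quotes (U+201C, U+201D)
        String sample = "caf\u00e9 \u2013 \u201cquoted\u201d";
        System.out.println(writeThenRead(sample, StandardCharsets.ISO_8859_1));
        System.out.println(writeThenRead(sample, StandardCharsets.UTF_8));
    }
}
```

With ISO-8859-1 the e-acute survives but the dash and curly quotes come back as `?`; with UTF-8 the string comes back unchanged. Note that what the console shows for the second line also depends on the console's own encoding.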
 
Chris Gage
That was it. I was so focused on the fact that ISO-8859-1 DOES support the accented characters of most European languages that I missed the fact that it DOES NOT have punctuation characters such as proper inverted commas, em dashes, en dashes, etc.

Thanks for your help, much appreciated.
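
For anyone who hits the same problem, one way to find out in advance which characters a charset will lose is to ask its encoder. This is only a sketch: the class `UnmappableFinder` and the method `unmappable` are names invented for the example, and the sample text is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of any code posted in the thread.
class UnmappableFinder {

    // Returns, in order of appearance, every character of the text that
    // the given charset cannot encode.
    static String unmappable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder lost = new StringBuilder();
        // Iterate by code point so surrogate pairs are tested as a whole.
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (!encoder.canEncode(ch)) {
                lost.append(ch);
            }
        });
        return lost.toString();
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9 \u2013 \u201cquoted\u201d";
        String lost = unmappable(sample, StandardCharsets.ISO_8859_1);
        // Print the code points rather than the characters themselves,
        // so the output does not depend on the console encoding.
        lost.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Running the check before writing makes it clear whether the target encoding is adequate, instead of discovering the question marks after the files have been saved.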
 