FileWriter & UTF-8 Encoding

John Jai
Bartender

Joined: May 31, 2011
Posts: 1776
Hi,
I have existing code that uses a FileWriter to write an XML file. The output file fails validation when it is checked against UTF-8 encoding. The processing instruction below is added at the top of the file, and that is what makes it fail: <?xml version="1.0" encoding="UTF-8"?>. I searched and found that only OutputStreamWriter lets you set the encoding; FileWriter does not. Is there a way to set UTF-8 encoding on a FileWriter? And can you explain what the docs are asking me to do here:
The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable. To specify these values yourself, construct an OutputStreamWriter on a FileOutputStream.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18708
Like this:

However it's better not to do that. Fix that code to write the XML to an OutputStream (a FileOutputStream would be fine) and the XML serializer will take care of matching the encoding it uses to the encoding declared in the document.
John Jai
Thanks Paul. When I try to do that I get an error: "The constructor FileWriter(OutputStreamWriter) is undefined". I use JDK 1.6.0_25. Am I missing something?
John Jai
To add to that, the XML is a database extract of about 125,000 lines, and it is written out buffer by buffer as the code runs. I think XMLSerializer is fine for writing a DOM document with a given encoding, but not for writing a big XML file buffer by buffer. Correct me if I'm wrong. Anyways I cannot modify the Xml writing code at this point.
Paul Clapham
No, that's me missing something. Sorry. The answer is, you can't combine FileWriter and OutputStreamWriter in that way.
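
For reference, the construction that the FileWriter documentation describes, an OutputStreamWriter on a FileOutputStream, might look like the sketch below. The class name, method name and temp-file output are hypothetical, chosen only for illustration. Note that the result is a Writer, not a FileWriter, so it only helps where the receiving code accepts a Writer.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

class Utf8WriterSketch {

    // Opens a Writer that encodes characters as UTF-8, whatever the platform default is.
    static Writer openUtf8Writer(File file) throws IOException {
        return new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), Charset.forName("UTF-8")));
    }

    public static void main(String[] args) throws IOException {
        File out = File.createTempFile("extract", ".xml"); // hypothetical output file
        Writer writer = openUtf8Writer(out);
        try {
            writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            writer.write("<name>Mot\u00f6rhead</name>\n"); // non-ASCII character on purpose
        } finally {
            writer.close();
        }
        System.out.println("Wrote " + out.length() + " bytes to " + out);
    }
}
```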

However I think you should be asking your original question, which should have been "How can I get my program to output XML with the correct encoding?" And as I already said, you shouldn't be using any kind of Writer.
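
As an illustration of writing XML to an OutputStream rather than a Writer, here is a minimal sketch using the standard StAX API (javax.xml.stream). The choice of StAX is an assumption made for this example, not something named in the thread; it is used because it streams output piece by piece instead of building a DOM. The class, method and element names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class StreamingXmlSketch {

    // Writes a tiny document as bytes. The same encoding name is used for the
    // byte encoding and for the XML declaration, so the two agree.
    static void writeBand(OutputStream out, String name) throws XMLStreamException {
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
        xml.writeStartDocument("UTF-8", "1.0");
        xml.writeStartElement("band");      // hypothetical element name
        xml.writeAttribute("name", name);   // the writer escapes &, <, > and " in the value
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.flush();                        // push buffered output to the stream
        xml.close();                        // releases the writer, not the underlying stream
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeBand(out, "Mot\u00f6rhead");
        System.out.println(out.toString("UTF-8"));
    }
}
```

In real use the ByteArrayOutputStream would be replaced by a FileOutputStream, and the elements would be written in a loop as the extract is read.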
Paul Clapham
John Jai wrote: Anyways I cannot modify the Xml writing code at this point.


Then why are you even asking the question? It has bugs, sure, but you can't change it. So onwards to the next project.
John Jai
Paul Clapham wrote:
John Jai wrote: Anyways I cannot modify the Xml writing code at this point.

Then why are you even asking the question? It has bugs, sure, but you can't change it. So onwards to the next project.

Paul, I meant that I cannot modify the utility code that creates the XML file. I need to supply a FileWriter that has UTF-8 encoding on it.
OK, this is the code snippet from that utility code, and below it is what I am trying to do.

I am trying to do this ->
I can't change the existing code. But let me try to write new code that writes the XML buffer by buffer and enables UTF-8 encoding on it. Thanks Paul!
Paul Clapham
John Jai wrote: Paul, I meant that I cannot modify the utility code that creates the XML file. I need to supply a FileWriter that has UTF-8 encoding on it...


Ah, okay, I understand now. Is that the only possible constructor you can use? If so then you'll want to get that fixed as soon as possible.

If you were writing utility code for XML, it would be a good practice to allow the user several different choices for where their data will be sent. If you were only providing a single choice, then you should choose OutputStream, so that the user still has flexibility. But providing FileWriter as the only possible choice is an appallingly bad design. You have already found one defect in that design.

However if you need a workaround while the powers-that-be are deciding whether to fix that broken design, your recourse is to set the system default encoding to UTF-8, so that your FileWriter will then use UTF-8 as its encoding. Note that all other FileWriters and FileReaders in that JVM will also use UTF-8, so you would have to evaluate the consequences of that. You do this by including

-Dfile.encoding=UTF-8

in the command line when you start your application.
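
To check that the workaround has taken effect, a small probe like the one below can print what the JVM actually uses. It assumes the default is changed through the file.encoding system property at startup; the class and method names are hypothetical.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.Charset;

class DefaultEncodingCheck {

    // Returns the encoding name a plain FileWriter picks up on this JVM.
    static String fileWriterEncoding() throws IOException {
        File tmp = File.createTempFile("probe", ".txt");
        FileWriter writer = new FileWriter(tmp);
        try {
            // getEncoding() reports a historical name such as "UTF8" or "Cp1252"
            return writer.getEncoding();
        } finally {
            writer.close();
            tmp.delete();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Default charset:     " + Charset.defaultCharset().name());
        System.out.println("FileWriter encoding: " + fileWriterEncoding());
    }
}
```

Running it once as `java DefaultEncodingCheck` and once as `java -Dfile.encoding=UTF-8 DefaultEncodingCheck` shows whether the option changes the result on a given platform and JDK version.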
Paul Clapham
By the way there are several other problems apparent just from the small piece of code you posted. For example it should be escaping quotes in attribute values, but it isn't.

Not to mention that the utility is writing <?xml version="1.0" encoding="UTF-8"?> at the top of every document whereas, knowing that it's going to be writing to a file with the system's default charset, it should have written that default charset instead of UTF-8 in the prolog.
John Jai
Paul Clapham wrote: For example it should be escaping quotes in attribute values, but it isn't.

It is escaping here, right? -> \"" + value + "\""

Paul Clapham wrote: Not to mention that the utility is writing <?xml version="1.0" encoding="UTF-8"?> at the top of every document

No, the utility is not writing it. I am the one forcing it to the top, because I got another exception: "The processing instruction must begin with the name of the target."
Also, my next requirement is to build an XML parser to read this XML and convert it into the .ntriples file format as input to an RDF store. So I thought I had better add the processing instruction with the "UTF-8" encoding before the XML parser complains that it is missing. It will, right?

One more thing: I got the idea of adding the XML declaration from your answer in this old thread, "Processing Instruction error". Coincidence :)
Paul Clapham
No, that doesn't escape quotes in attribute values. It simply surrounds them by quotes. That is certainly a requirement of XML, but it doesn't cover the case where the attribute value itself contains quotes, in which case those quotes must be escaped. For example:

in which the quotes around Motorhead should be escaped.
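
A minimal sketch of what escaping an attribute value means, as opposed to merely surrounding it with quotes. The method name, element name and sample value are hypothetical, and the sketch covers only the characters that cannot appear literally in a double-quoted attribute value; it does not handle whitespace normalization or invalid XML characters.

```java
class AttributeEscapeSketch {

    // Replaces &, < and " with their predefined XML entities.
    static String escapeAttribute(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String value = "the band \"Motorhead\""; // hypothetical attribute value
        // Surrounding quotes come from the template; inner quotes are escaped.
        System.out.println("<item note=\"" + escapeAttribute(value) + "\"/>");
        // prints: <item note="the band &quot;Motorhead&quot;"/>
    }
}
```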

And are you saying that your code is inserting the prolog <?xml version="1.0" encoding="UTF-8"?>? If so then you're the one making the mistake. You are saying the file is encoded in UTF-8, but it isn't. Just declare it correctly: don't use UTF-8, use the system's default encoding. You can get that like this:
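
One way to obtain the default encoding and declare it in the prolog is sketched below. It assumes java.nio.charset.Charset.defaultCharset() is an acceptable source for the name; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

class PrologSketch {

    // Builds an XML declaration that names the JVM's default charset,
    // which is the charset a plain FileWriter encodes with.
    static String prolog() {
        String encoding = Charset.defaultCharset().name(); // canonical name, e.g. "windows-1252"
        return "<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>";
    }

    public static void main(String[] args) {
        System.out.println(prolog());
    }
}
```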
John Jai
That was a perfect solution to my problem. Many thanks Paul!!!
And I will check on that quote-escaping issue for attributes too...
Paul Clapham
John Jai wrote: That was a perfect solution to my problem. Many thanks Paul!!!

Glad we finally got to the source of the problem.

John Jai wrote: And I will check on that quote-escaping issue for attributes too...

In the code which can't be changed?

Probably that code wasn't written as a general-purpose XML serializer, so the writer didn't worry about things which probably weren't going to happen while he or she was still an employee. However if I spotted that problem in a small segment of the code, chances are there are more problems.
 