

JavaRanch » Java Forums » Engineering » XML and Related Technologies

convert to plain text

tushar bhasme
Ranch Hand

Joined: Feb 11, 2008
Posts: 50
Hi,

We have an XML file from a client that contains some Latin characters. We can read the data, but these characters later cause problems in the application. So I wanted to know whether there is an existing API that can replace any special character with its plain-text counterpart.

Thanks.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42043
    
Wouldn't it be much better to fix the application so it knows how to handle Unicode characters (or whatever those are)?

If you really have a need to replace accented characters, then this may help: http://www.rgagnon.com/javadetails/java-0456.html
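
For readers who cannot follow the link, here is a minimal sketch of one common way to replace accented characters with their plain counterparts in Java. It is not necessarily the method described on the linked page. The class name AccentStripper and the method stripAccents are made up for this illustration; java.text.Normalizer is a standard JDK class.

```java
import java.text.Normalizer;

final class AccentStripper {
    private AccentStripper() {}

    /**
     * Decomposes each accented letter into a base letter plus combining
     * marks (Unicode NFD), then deletes the combining marks.
     */
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches any Unicode combining mark left over after decomposition.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String sample = "Jos\u00E9 Mu\u00F1oz, cr\u00E8me br\u00FBl\u00E9e";
        System.out.println(stripAccents(sample)); // prints: Jose Munoz, creme brulee
    }
}
```

Letters that have no canonical decomposition, such as ø, ß or æ, pass through unchanged, so they would need an explicit replacement table if they must become ASCII as well.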


tushar bhasme
Ranch Hand

Joined: Feb 11, 2008
Posts: 50
Yes, that option is also under consideration. Most probably the issue in the app is that at some point the data, once it comes into the application, is read as UTF-8, which mangles the special characters. We don't really know which encoding we should use when reading that data, hence we are more inclined towards converting the characters to plain text. Please let me know if anyone has a good approach to this problem.

Thanks.
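
The suspicion in the post above (data in some other encoding being read as UTF-8) can be checked with a strict decoder. The following is a rough sketch, assuming the raw bytes of the file are available; the class name Utf8Check and the method isValidUtf8 are hypothetical names chosen for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    private Utf8Check() {}

    /** Returns true only if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(latin1)); // false: a lone 0xE9 is not valid UTF-8
        System.out.println(isValidUtf8(utf8));   // true
    }
}
```

A file that fails this check is definitely not UTF-8. A file that passes is very likely UTF-8 (or plain ASCII), because text in single-byte encodings with accented letters rarely happens to form valid UTF-8 sequences.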
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42043
    
Now, we won't really know which encoding we should use while reading that data hence we are more inclined towards converting them to plain text.

This sounds odd. Are you saying you have no way of ascertaining what encoding a file is in? Where do they come from? Surely you can standardize on something?

Please let me know if anyone has a good approach to this problem.

I already posted an approach for removing accented characters; does it not work for you?
tushar bhasme
Ranch Hand

Joined: Feb 11, 2008
Posts: 50
The XML file comes from a third party and has no prolog, which is completely valid. Sorry, I did not see the link earlier.

It does solve my problem, thanks a lot.

And yes, you are right: the application should also be corrected, since the way it currently handles the data is what results in the malformed characters. We are looking into that too.

Thanks.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42043
    
the xml file comes from a third party and has no prolog which is completely valid..

Why don't you ask the 3rd party what the encoding is, or agree with them on what encoding to use?

If no encoding is specified, then it is almost certainly a Unicode variant. See http://www.w3.org/TR/xml/#sec-guessing-no-ext-info for how to decide which specific encoding is used in that case.
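
If the encoding can be agreed with the third party, as suggested above, one possible way to apply it is to decode the bytes yourself and hand the parser a character stream. This is only a sketch: the class name AgreedEncodingParser and the parameter agreed are illustrative, while DocumentBuilderFactory and InputSource are the standard JAXP/SAX classes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class AgreedEncodingParser {
    private AgreedEncodingParser() {}

    /** Parses XML whose encoding is known out of band (e.g. agreed with the sender). */
    static Document parse(InputStream in, Charset agreed) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Given a Reader, the parser reads characters directly and does not
            // try to work out the byte encoding on its own.
            InputSource source = new InputSource(new InputStreamReader(in, agreed));
            return factory.newDocumentBuilder().parse(source);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML as " + agreed, e);
        }
    }
}
```

For example, parse(stream, StandardCharsets.ISO_8859_1) would read a prolog-less Latin-1 file without mangling its accented characters, provided Latin-1 really is what the sender uses.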
tushar bhasme
Ranch Hand

Joined: Feb 11, 2008
Posts: 50
I am not really sure I understand what is mentioned in the link:

Because each XML entity not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding must begin with an XML encoding declaration, in which the first characters must be '<?xml', any conforming processor can detect, after two to four octets of input, which of the following cases apply. In reading this list, it may help to know that in UCS-4, '<' is " #x0000003C " and '?' is " #x0000003F ", and the Byte Order Mark required of UTF-16 data streams is " #xFEFF ". The notation ## is used to denote any byte value except that two consecutive ##s cannot be both 00.


As per my understanding, it's saying the document must be accompanied by the prolog (<?xml), which is exactly what is missing in the client XML.

Thanks.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42043
    
What it's saying is that a prolog is NOT optional under certain circumstances. Since your XML doesn't contain a prolog, those circumstances had better not apply to your situation (or the document would not be well-formed).

The tables tell you how to infer the encoding by looking at the first 4 bytes of the document. I still think it would be easier (and less brittle) to negotiate the encoding with the other party, or at least have them use a prolog with an explicit encoding declaration.
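
To make the "first 4 bytes" idea concrete, here is a simplified sketch of the lookup in Java. It covers only the UTF-8 and UTF-16 rows of the tables in the linked section of the XML specification; the UCS-4 rows are rejected and the EBCDIC row is not handled at all. The class name XmlEncodingGuess and the method guess are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class XmlEncodingGuess {
    private XmlEncodingGuess() {}

    /** Guesses the encoding from the first (up to four) bytes of an XML document. */
    static Charset guess(byte[] head) {
        int b0 = at(head, 0), b1 = at(head, 1), b2 = at(head, 2), b3 = at(head, 3);

        // UTF-8 byte order mark.
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return StandardCharsets.UTF_8;

        // Every UCS-4 pattern in the tables has two leading or two trailing zero bytes.
        if ((b0 == 0x00 && b1 == 0x00) || (b2 == 0x00 && b3 == 0x00)) {
            throw new IllegalArgumentException("UCS-4 is not handled by this sketch");
        }

        // UTF-16 byte order marks.
        if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
        if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;

        // UTF-16 without a byte order mark, starting with "<?".
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) return StandardCharsets.UTF_16BE;
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) return StandardCharsets.UTF_16LE;

        // No byte order mark and no UTF-16 pattern: assume UTF-8. A document that
        // starts with "<?xml" may still name another encoding in its declaration,
        // which this sketch does not read.
        return StandardCharsets.UTF_8;
    }

    /** Returns the unsigned byte at index i, or -1 if the array is too short. */
    private static int at(byte[] data, int i) {
        return i < data.length ? (data[i] & 0xFF) : -1;
    }
}
```

For a file without a prolog and without a byte order mark, this returns UTF-8, which matches the reading of the specification given above. If the bytes then turn out not to be valid UTF-8, the file is not well-formed as delivered, and negotiating the encoding with the sender remains the more robust option.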
tushar bhasme
Ranch Hand

Joined: Feb 11, 2008
Posts: 50
Agreed. But the decision lies in the hands of the managers as to how they want to deal with it: ask the third party to provide a prolog and correct the application so it does not mangle any data, OR replace the data with plain text during import.

Thanks a lot for the help.
 