Parsing Chinese Characters by using Xerces

Ed Tang
Greenhorn

Joined: Aug 04, 2010
Posts: 3
Does Xerces support Chinese characters?

My XML file is encoded in UTF-8 and contains Chinese characters in some elements. However, when I print the char[] data passed to the 'characters' method (which should be Chinese characters), I get strange characters, '???', instead. How can I get the correct Chinese characters from the char[] data?


public void characters(char[] data, int start, int length){
.......
}

p.s. The platform is AS/400.

Thanks a lot.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570

Xerces supports all characters which XML supports. And XML certainly supports Chinese characters as you can see by reading the XML Recommendation.

However, your question seems misguided to me. You complain about seeing question marks when you print those characters, but then you ask how to get them. It's equally probable that Xerces delivers the characters correctly and the encoding failure occurs when you try to print them.

So you're going to have to explain your process in a bit more detail, not just focusing on Xerces. Especially if you're printing data, which on your platform might well involve some more data conversions.
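
One way to separate the two possibilities is to dump the parsed characters as hex code points rather than printing them, so the console encoding cannot interfere. The following is only a minimal sketch under stated assumptions: it uses the JDK's built-in JAXP SAX parser rather than a standalone Xerces jar, and the class name CharactersDump, the method dumpText, and the sample element content are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class CharactersDump {

    /** Parses UTF-8 XML bytes and returns all character data as U+XXXX code units. */
    static String dumpText(byte[] xmlBytes) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] data, int start, int length) {
                // A parser may deliver one text node in several chunks, so append.
                text.append(data, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xmlBytes), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }

        // Format each UTF-16 code unit as hex; no console encoding is involved.
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                hex.append(' ');
            }
            hex.append(String.format("U+%04X", (int) text.charAt(i)));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample document; the real file would be read from disk.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<ChineseName>\u4E2D\u6587</ChineseName>";
        System.out.println(dumpText(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

If the dump shows values such as U+4E2D instead of U+003F, the parser delivered the characters intact and the damage happens later, when they are printed or written.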
Ed Tang
Greenhorn

Joined: Aug 04, 2010
Posts: 3
Here is my code...
===============================================

===============================================


===============================================
When the end of the 'ChineseName' element is detected, the record is written to an AS400 physical file.




=========================================

However, the Chinese characters are not written to the file correctly. Hex '3F3F3F' is written instead. When I print the value of chineseNameTagVal, '???' is shown in the console.


Could anyone help me? Thanks. I found some hints on the internet suggesting that I need to use the ByteArrayOutputStream and OutputStreamWriter classes to do some conversion, but I have no idea how to use these classes. Thank you so much!
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570



I thought there might be something like this.

You have a string containing Chinese characters. First you convert it to bytes using ISO-8859-1, which is the Western European character set. Since that character set does not include representations for any Chinese characters, it replaces all of them by question marks before converting them to bytes.

So already you have mangled your data beyond recognition. Converting those question-mark bytes back to chars using the CP937 encoding cannot possibly bring back the original data.
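
The loss can be reproduced in isolation. The sketch below is not the poster's code: the class name LossyRoundTrip, the helper toHex, and the three sample characters are invented for illustration. It decodes with ISO-8859-1 rather than CP937, on the assumption that CP937 may not be installed in every JRE; the point holds for any decoding charset, because the original information is already gone from the bytes.

```java
import java.nio.charset.StandardCharsets;

public class LossyRoundTrip {

    /** Returns the bytes as upper-case hex, two digits per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "\u4E2D\u6587\u540D"; // three Chinese characters (sample data)

        // ISO-8859-1 cannot represent these characters, so each one is
        // replaced by the charset's replacement byte, '?' (0x3F).
        byte[] latin1 = name.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(toHex(latin1)); // 3F3F3F

        // Decoding those bytes again yields "???", not the original string.
        String back = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(back.equals(name)); // false

        // For contrast, an encoding that covers the characters round-trips.
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(name)); // true
    }
}
```

The first line of output, 3F3F3F, has the same pattern as the hex reported in the question.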

I would just get rid of this line entirely. It's the job of the AS400 and SequentialFile objects to convert from chars (in the Java program) to bytes (in the database), not yours. Just make sure that the job where this is running and the database tables both have a suitable CCSID.
 