JavaRanch » Java Forums » Java » Beginning Java

can not read remote XML correctly, due to encoding issue

gang lee
Greenhorn

Joined: Jul 19, 2008
Posts: 12
hi guys

I am trying to read a URL that returns weather information as XML in Chinese: "http://www.google.com/ig/api?weather=dalian&hl=zh-CN"
I am using a Japanese Windows XP.

The main source code involved is:

In fact, specifying the encoding in new InputStreamReader() makes no difference: I always get garbage content (except for the English part).

Also, the following conversion does not work:


Could anybody give me some help?

thanks a lot!

lee

[edit]Add code tags. CR[/edit]
[ July 19, 2008: Message edited by: Campbell Ritchie ]
gang lee
Greenhorn

Joined: Jul 19, 2008
Posts: 12
Sorry, the code for the conversion is as follows:

I tried this because the source XML seems to be encoded in MS932.
When I do not specify an encoding in new InputStreamReader(), the encoding reported by InputStreamReader.getEncoding() is MS932.
[edit]Add code tags. CR[/edit]
[ July 19, 2008: Message edited by: Campbell Ritchie ]
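The conversion code from this post is not preserved, so the following is only a sketch of the general pitfall, not a reconstruction of it. It assumes the common pattern of decoding bytes with the wrong charset and then trying to "convert" the resulting String afterwards. The class and method names are hypothetical, and US-ASCII stands in for the wrong charset so that the example runs on any JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongDecodeIsLossy {

    // Decode with the wrong charset first, then try to repair the String
    // by re-encoding it and decoding again with the actual charset.
    static String wrongThenConvert(byte[] raw, Charset wrong, Charset actual) {
        String garbled = new String(raw, wrong);          // unmappable bytes become U+FFFD
        return new String(garbled.getBytes(wrong), actual); // the original bytes are already gone
    }

    // Decode once, directly from the raw bytes, with the actual charset.
    static String decode(byte[] raw, Charset actual) {
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        String original = "\u5927\u8FDE"; // "Dalian" in Chinese, written with escapes
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        String repaired = wrongThenConvert(raw, StandardCharsets.US_ASCII, StandardCharsets.UTF_8);
        String direct = decode(raw, StandardCharsets.UTF_8);

        System.out.println("repaired equals original: " + original.equals(repaired));
        System.out.println("direct equals original:   " + original.equals(direct));
    }
}
```

The point of the sketch is that the charset has to be right at the moment the bytes are first turned into characters; a later conversion of the String cannot be relied on to recover them.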
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
If this is a valid XML source, the encoding is mentioned in the header. If you'd simply use an XML parser (I like Dom4J), you wouldn't need to worry about the encoding at all, because it would take care of it automatically.
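As a rough illustration of this suggestion, here is a minimal sketch that uses the JDK's built-in DOM parser rather than Dom4J (a substitution, to keep the example free of external libraries). The class and method names are hypothetical, and the sample document is made up to resemble the feed. The key assumption is that the parser is handed the raw byte stream, not a Reader, so it can work out the encoding itself from the byte order mark or the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class ParserHandlesEncoding {

    // Parse from an InputStream (bytes) and return one attribute of the
    // first element with the given tag name. No charset is passed in.
    static String firstAttribute(InputStream in, String tagName, String attrName) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(in);
        Element element = (Element) doc.getElementsByTagName(tagName).item(0);
        return element.getAttribute(attrName);
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample: UTF-16LE bytes with a leading BOM (FF FE) and no
        // encoding attribute in the XML declaration.
        String xml = "\uFEFF<?xml version=\"1.0\" ?>"
                + "<xml_api_reply version=\"1\"><city data=\"\u5927\u8FDE\"/></xml_api_reply>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_16LE);

        String city = firstAttribute(new ByteArrayInputStream(bytes), "city", "data");
        System.out.println("decoded correctly: " + "\u5927\u8FDE".equals(city));
    }
}
```

With a live connection, the same method could presumably be given the stream from URLConnection.getInputStream() directly, without wrapping it in an InputStreamReader first.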


gang lee
Greenhorn

Joined: Jul 19, 2008
Posts: 12
Thanks, Preuss.
The XML is not valid, only well-formed:

<?xml version="1.0" ?>
<xml_api_reply version="1">
<weather module_id="0" tab_id="0">
<forecast_information>
<city data="Dalian, Liaoning" />

This XML comes from a Google service.
gang lee
Greenhorn

Joined: Jul 19, 2008
Posts: 12
The only way I can get readable content is with the following code:

with the URL:
"http://www.google.com/ig/api?weather=dalian&hl=ja"

But what I get is in Japanese, and I want the content in Chinese.
I tried many combinations of modifications, but none of them seems to work.

Please help.
[ July 19, 2008: Message edited by: gang lee ]
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19672
    

Well your URL does specify to give the Japanese version: hl=ja
If Google supports Chinese then you should change it to Chinese.


gang lee
Greenhorn

Joined: Jul 19, 2008
Posts: 12
Thanks, Prime.

Google does have a Chinese version of that weather information, but when I specify "zh-CN" in the URL, I get garbage content (the ASCII part is OK).
The content still seems to be encoded in MS932, which is a Japanese character set.

I suspect that the issue is due to my Japanese version of Windows XP.
So I tried setting the Accept-Language, Accept-Encoding, and other headers on my HTTP request, but I still failed to get correct content.

Can anybody else help?

The source code is not complex; can anybody give it a try?


Of course, it is better if you have a Japanese XP, or else you may not see the issue.
[ July 19, 2008: Message edited by: gang lee ]
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

I've had this issue with XML documents from Google myself. Here's what I had to do:

1. Get the URLConnection and call its getContentEncoding() method.

2. Use the value you get from that in your InputStreamReader, instead of UTF-8 as in your original post.

As Ilja Preuss said earlier (I think), that's the rule for XML documents sent over HTTP; the encoding declared by the HTTP response overrides the encoding stated or implied by the document's prolog.
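One detail worth noting about the steps above: in java.net.URLConnection, getContentEncoding() returns the value of the Content-Encoding header (for example gzip), while the character set is carried in the charset parameter of the Content-Type header, which getContentType() returns. The sketch below shows one way to pull the charset out of such a header value. The class name, method name, and fallback parameter are hypothetical, and whether the Google service actually sends a charset parameter is not known from this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {

    // Extract the charset parameter from a Content-Type header value such as
    // "text/xml; charset=GB2312". Returns the fallback when the header is
    // missing, has no charset parameter, or names an unknown charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim();
                // The value may be quoted: charset="UTF-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/xml; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/xml", StandardCharsets.ISO_8859_1));
    }
}
```

In use, the result of charsetOf(connection.getContentType(), someFallback) would be passed to the InputStreamReader constructor; the choice of fallback is left to the caller.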
gang lee
Greenhorn

Joined: Jul 19, 2008
Posts: 12
Thanks, Clapham, but unfortunately
I get null when I call getContentEncoding().
gang lee
Greenhorn

Joined: Jul 19, 2008
Posts: 12
One more thing:
when I fetch and save the output from www.google.com/ig/api?weather=dalian&hl=zh-CN,

I find that the BOM is FF FE, i.e. UTF-16LE.

But the encoding does not seem to be UTF-16,
because the browser (Firefox 2) says it is UTF-8 under View -> Character Encoding.

A discouraging issue for a beginner!

With the same URL in Internet Explorer, the View -> Encoding menu shows:
GB2312
Unicode(UTF-8)
X Unicode
Other
and the menu is greyed out, so the user cannot choose a different one.
[ July 19, 2008: Message edited by: gang lee ]
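The byte order mark check described in this post can be sketched in code. This is only an illustration under the assumption that the first few bytes of the saved response are available as a byte array; the class and method names are hypothetical. It covers the standard Unicode BOMs and returns null when none is present.

```java
class BomSniffer {

    // Return the encoding implied by a leading byte order mark, or null
    // if the bytes do not start with a known BOM.
    static String detect(byte[] head) {
        int b0 = head.length > 0 ? head[0] & 0xFF : -1;
        int b1 = head.length > 1 ? head[1] & 0xFF : -1;
        int b2 = head.length > 2 ? head[2] & 0xFF : -1;
        int b3 = head.length > 3 ? head[3] & 0xFF : -1;

        // UTF-32 must be checked before UTF-16: FF FE 00 00 starts with FF FE.
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return "UTF-32LE";
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return "UTF-32BE";
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        return null;
    }

    public static void main(String[] args) {
        // FF FE followed by '<' encoded as UTF-16LE (3C 00).
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};
        System.out.println(detect(sample));
    }
}
```

A BOM only describes the bytes as they were saved. If the file was written by a tool that re-encoded the response, the BOM reflects that tool's output rather than what the server sent, which may be one reason the browser and the saved file appear to disagree.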
 
 