I am trying to write an application that reads a file from the internet (www.example.com/file.html), does some editing, and then writes it to a file on my disk. The problem is that Central European characters are not shown correctly in the file on my disk. I know that the web page uses ISO-8859-2. I tried a few things but was not successful. How should I modify my code to get the proper result?
The trick is to get encodings right.

Mistake #1 is the InputStreamReader: it needs to know which specific encoding it is getting. Pull it from the HTTP response headers or, slightly uglier, hardcode ISO-8859-2. Check the javadoc API for the appropriate constructor.

Mistake #2 is certainly a cardinal sin in internationalised Java. You are using the write(int) method of OutputStream, which will just chop off the top 8 bits of your char and write out a byte. This basically ignores any encoding that's being used and will only ever work properly for 7-bit ASCII stuff.

What you need to do is use FileWriter instead of FileOutputStream; this will write Strings directly using your default encoding. Alternatively, if the default encoding won't do, simply wrap your FileOutputStream inside an OutputStreamWriter; you can use the latter's constructor to ask for any encoding that takes your fancy, as long as it is supported by your JRE, of course.

- Peter
[ October 11, 2003: Message edited by: Peter den Haan ]
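For reference, here is a minimal sketch of the two fixes described above: a Reader that is told the source encoding, and a Writer that is told the target encoding. The class name `Transcode` and the method `copy` are made up for illustration, and it assumes your JRE supports ISO-8859-2 (practically all do, but it is not one of the guaranteed charsets).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper, not taken from the original poster's code.
class Transcode {

    /**
     * Decodes bytes from 'in' using inCharset and re-encodes them to 'out'
     * using outCharset. Both streams are closed when the copy finishes.
     */
    static void copy(InputStream in, Charset inCharset,
                     OutputStream out, Charset outCharset) throws IOException {
        try (Reader reader = new InputStreamReader(in, inCharset);
             Writer writer = new OutputStreamWriter(out, outCharset)) {
            char[] buffer = new char[4096];
            int count;
            // Work in chars, never in raw ints pushed through OutputStream.write(int).
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count);
            }
        }
    }
}
```

You would call it with something like `Transcode.copy(url.openStream(), Charset.forName("ISO-8859-2"), new FileOutputStream("file.html"), StandardCharsets.UTF_8)`, where `url` and the output file name are placeholders. Any editing of the text belongs between the read and the write, on the decoded chars.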
posted 16 years ago
Thanks for your suggestions, Peter; I kind of got it working. Now, can you help with some code that would get the encoding of a particular file on the internet? Is there a method, or do I have to check for the <meta> tag to get the proper encoding? Thanks in advance.
Peter den Haan
posted 16 years ago
The character set used is returned as part of the HTTP headers, not necessarily of the actual response body. For instance, this JavaRanch page arrived at my browser with the following headers (courtesy of Mozilla Firebird with the Live HTTP Headers plugin):

As you see, it's the Content-Type header that (optionally) supplies you with the encoding being used on the web page. To get at the HTTP headers, don't open the input stream from the URL object but open the connection explicitly:

Hope this helps,
- Peter
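A rough sketch of opening the connection explicitly and reading the charset from the Content-Type header might look like the following. The class name `HeaderCharset` and both method names are hypothetical, and the parsing is deliberately simple: it only looks for a `charset=` parameter and uses a caller-supplied fallback when the header is missing, has no charset, or names one the JRE does not know.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;

// Hypothetical helper for illustration only.
class HeaderCharset {

    /** Returns the charset named in a Content-Type value, or the fallback. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // server sent no Content-Type header
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            // Parameter names are case-insensitive, e.g. "Charset=..."
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = param.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback; // header present, but the charset is optional
    }

    /** Opens the connection explicitly so the headers can be read first. */
    static Reader open(URL url, Charset fallback) throws IOException {
        URLConnection connection = url.openConnection();
        Charset charset = fromContentType(connection.getContentType(), fallback);
        return new InputStreamReader(connection.getInputStream(), charset);
    }
}
```

For the page in question, a sensible fallback would be `Charset.forName("ISO-8859-2")`, since that is what the page is known to use when the server does not say so.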