Reading from URL, problems with encoding

Alex Gli
Greenhorn

Joined: Sep 27, 2002
Posts: 4
I am trying to write an application that reads a file from the internet (www.example.com/file.html), does some editing and then writes it to a file on my disk. The problem is that Central European characters are not shown correctly in the file on my disk. I know that the web page uses ISO-8859-2. I tried a few things but was not successful. How should I modify my code to get the proper result?

[ October 11, 2003: Message edited by: Alex Gli ]
Peter den Haan
author
Ranch Hand

Joined: Apr 20, 2000
Posts: 3252
The trick is to get encodings right.

Mistake #1 is on the reading side. The InputStreamReader needs to know the specific encoding of the bytes it is getting: pull it from the HTTP response headers or, slightly uglier, hardcode ISO-8859-2. Check the javadoc API for the appropriate constructor.

The writing side is certainly a cardinal sin in internationalised Java. You are using the write(int) method of OutputStream, which will just chop off the top 8 bits of your char and write out a byte. This basically ignores any encoding that's being used and will only ever work properly for 7-bit ASCII stuff. What you need to do is use a FileWriter instead of a FileOutputStream; this will write Strings directly using your default encoding. Alternatively, if the default encoding won't do, simply wrap your FileOutputStream inside an OutputStreamWriter; you can use the latter's constructor to ask for any encoding that takes your fancy. As long as it is supported by your JRE, of course.
- Peter
[ October 11, 2003: Message edited by: Peter den Haan ]
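The two fixes described above can be sketched roughly as follows. This is only an illustration, not the code from the thread: the class name EncodingAwareCopy and its method names are made up for this example, and writing the output as UTF-8 is an assumption (any charset your JRE supports would do). The writeCharsAsInts method is there only to show what write(int) does to a non-ASCII char.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; the names are illustrative only.
class EncodingAwareCopy {

    // Demonstrates the problem: OutputStream.write(int) keeps only the low
    // 8 bits of each value, so any char above U+00FF is silently mangled.
    static byte[] writeCharsAsInts(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            out.write(text.charAt(i));
        }
        return out.toByteArray();
    }

    // Decodes the bytes of 'in' with sourceCharset and re-encodes the
    // characters to 'out' with targetCharset. Closes both streams when done.
    static void transcode(InputStream in, Charset sourceCharset,
                          OutputStream out, Charset targetCharset) throws IOException {
        try (Reader reader = new BufferedReader(new InputStreamReader(in, sourceCharset));
             Writer writer = new BufferedWriter(new OutputStreamWriter(out, targetCharset))) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                // Any editing of the text would happen here, on chars, not bytes.
                writer.write(buffer, 0, count);
            }
        }
    }

    // Reads the page at 'address' in the given source charset and stores it
    // on disk as UTF-8 (the target charset is an assumption of this sketch).
    static void copyUrlToFile(String address, Charset sourceCharset, Path target)
            throws IOException {
        try (InputStream in = URI.create(address).toURL().openStream();
             OutputStream out = Files.newOutputStream(target)) {
            transcode(in, sourceCharset, out, StandardCharsets.UTF_8);
        }
    }
}
```

For the page in the question, a call along the lines of EncodingAwareCopy.copyUrlToFile("http://www.example.com/file.html", Charset.forName("ISO-8859-2"), Path.of("file.html")) would hardcode the source encoding, which is the "slightly uglier" option mentioned above.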
Alex Gli
Greenhorn

Joined: Sep 27, 2002
Posts: 4
Thanks for your suggestions, Peter; I kind of got it working. Now, can you help with some code that would get the encoding of a particular file on the internet? Is there a method for that, or do I have to check for a <meta> tag to get the proper encoding?
Thanks in advance.
Peter den Haan
author
Ranch Hand

Joined: Apr 20, 2000
Posts: 3252
The character set used is returned as part of the HTTP headers, not necessarily as part of the actual response body. For instance, this JavaRanch page arrived at my browser with the following headers (courtesy of Mozilla Firebird with the Live HTTP Headers plugin):

As you see, it's the Content-Type header that (optionally) supplies you with the encoding being used on the web page. To get at the HTTP headers, don't open the input stream from the URL object; open the connection explicitly with URL.openConnection() and ask the resulting URLConnection for its content type before reading from its input stream.

Hope this helps,
- Peter
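A minimal sketch of the approach described above, assuming the server sends a usable Content-Type header. The class ContentTypeCharset and its methods are hypothetical names for this example; only URL.openConnection(), URLConnection.getContentType() and URLConnection.getInputStream() are standard API. The fallback charset covers the case where the header is missing, has no charset parameter, or names a charset the JRE does not support.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper class; the names are illustrative only.
class ContentTypeCharset {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-2". Returns 'fallback' when the value is
    // null, carries no charset parameter, or names an unsupported charset.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String name = param.substring(eq + 1).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Opens the connection explicitly, reads the Content-Type header, and
    // wraps the body in a Reader that decodes with the advertised charset.
    static Reader openReader(String address, Charset fallback) throws IOException {
        URLConnection connection = URI.create(address).toURL().openConnection();
        Charset charset = charsetFromContentType(connection.getContentType(), fallback);
        return new InputStreamReader(connection.getInputStream(), charset);
    }
}
```

Since the header is optional, a page that declares its encoding only in a <meta> tag would still fall through to the fallback here; handling that case would mean inspecting the body as well, which this sketch does not attempt.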