Unable to read charset using java.net.URLConnection

Rudy B Baylor
Greenhorn

Joined: Oct 02, 2007
Posts: 6
Using the java.net package, I am trying to read an HTML page that declares its Content-Type as

code:

<meta content="text/html; charset=euc-kr" http-equiv="Content-Type" />



It is critical for me to be able to read the charset given in the tag above.

Both urlConnection.getContentType() and urlConnection.getHeaderField("Content-Type") return just "text/html", which I believe is because they take their value from the HTTP response header rather than from the <meta> tag shown above.

Is there a way to get the values of <meta> tags beforehand, so that I can determine which charset to use while reading?

I need to read an HTML page and write it to an already initialized response object, so I have to determine the encoding of the HTML file.

Transferring bytes directly from the InputStream to the response OutputStream, which would let me ignore the encoding, does not work: response.getWriter() has already been called, so response.getOutputStream() throws IllegalStateException.

Can someone please advise how to resolve this problem?
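
Since only the response Writer is available here, one possible workaround is to decode the fetched bytes with the page's charset and pass the resulting characters to the Writer. The following is a minimal sketch under that assumption; the class name CharsetCopy and the method copy are hypothetical, and it presumes the charset name has already been determined by some other means.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper, not part of java.net or the servlet API.
final class CharsetCopy {

    /** Decodes the bytes of 'in' using 'charsetName' and writes the characters to 'out'. */
    static void copy(InputStream in, String charsetName, Writer out) throws IOException {
        // Charset.forName throws an unchecked exception if the name is illegal or unsupported.
        Reader reader = new InputStreamReader(in, Charset.forName(charsetName));
        char[] buffer = new char[8192];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        out.flush();
    }
}
```

In a servlet this might be called as CharsetCopy.copy(urlConnection.getInputStream(), "euc-kr", response.getWriter()). The Writer then encodes the characters with whatever character encoding the response itself is configured to use.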
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 39571
    
You could read the complete page into a byte array and then search it for the meta tag. (That assumes, of course, that the meta tag's characters are encoded the same way in ASCII as in the page's actual encoding.)
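
The byte-array approach described above could be sketched roughly as follows. This is only an illustration: the class MetaCharsetSniffer, its methods, the regular expression and the 4096-byte scan limit are all assumptions, not an established API. It relies on the page's encoding being ASCII-compatible (as EUC-KR is), so it would not work for something like UTF-16.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the "search the raw bytes" idea.
final class MetaCharsetSniffer {

    // Matches charset=... inside a <meta ...> tag; quotes around the value are optional.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?\\s*([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Assumed limit: only the start of the page is scanned for the tag.
    private static final int SCAN_LIMIT = 4096;

    /** Reads the whole stream into a byte array. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns the charset name declared in a meta tag, or null if none is found. */
    static String sniff(byte[] page) {
        // ISO-8859-1 maps every byte to exactly one char, so decoding never fails
        // and ASCII-only markup such as the meta tag survives unchanged.
        int limit = Math.min(page.length, SCAN_LIMIT);
        String head = new String(page, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

A caller would read the bytes once with readAll(urlConnection.getInputStream()), pass them to sniff, and then decode the same byte array with new String(page, Charset.forName(name)), falling back to a default charset when sniff returns null.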


William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12682
    
Originally posted by Rudy B Baylor:
Using urlConnection.getContentType(), urlConnection.getHeaderField("Content-Type") just returns "text/html", which I believe is because the above methods derive value from some other place rather than the <meta> tag shown above.


Exactly! To see what is going on, I recommend using the Firefox browser with the FireBug plugin, or one of the other plugins that show exactly what the browser is receiving and from where.

The LiveHttpHeaders plugin is also helpful.

Bill


Joe Ess
Bartender

Joined: Oct 29, 2001
Posts: 8713
    

What does getContentEncoding() return?
I'd expect the content type to always be "text/html" for an HTML document. What you are changing is the character encoding.


Rudy B Baylor
Greenhorn

Joined: Oct 02, 2007
Posts: 6
Hi Joe,
getContentEncoding() returns null.
I have even tried URLConnection.guessContentTypeFromStream(urlConn.getInputStream()), but that returns null as well.

Throughout our application we have been using the charset parameter of the Content-Type HTTP header to specify the encoding, which might be why Content-Encoding shows up as null.
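
When a server does send the charset as a parameter of the Content-Type header, it can be pulled out of the string returned by getContentType(). Note that getContentEncoding() reports the separate Content-Encoding header (for example gzip), not the character set. A minimal sketch, assuming a simple semicolon-separated header value; the class ContentTypeCharset and the method charsetOf are hypothetical names:

```java
import java.util.Locale;

// Hypothetical helper; handles only the simple "type/subtype; name=value" form.
final class ContentTypeCharset {

    /** Extracts the charset parameter from a Content-Type value, or returns null. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters follow the media type, separated by semicolons.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (name.equals("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }
}
```

For the page in this thread the header is just "text/html", so charsetOf would return null and the caller would have to fall back to inspecting the page content.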



Hi William,
I have tried FireBug as well as HttpFox to find out how the browser locates the required character encoding.
I checked the response headers, but they too show Content-Type as just text/html, with no attribute carrying the character encoding. My guess is that the browser takes the approach Ulf suggested: it reads the response as bytes, finds the http-equiv meta value, and then decodes the page accordingly.
Joe Ess
Bartender

Joined: Oct 29, 2001
Posts: 8713
    

Originally posted by Rudy B Baylor:
Hi Joe
getContentEncoding() returns null


That's my best guess...
I can't find it now, but somewhere the HTML spec talks about how the server is responsible for turning that tag into a header, and it hints that many servers may not do it correctly (or at all). You may want to check your server's docs and errata to see whether that's the case.
Or take Ulf's suggestion, since that's where you'll end up if the above is true.
 