Unable to read charset using java.net.URLConnection

 
Greenhorn
Posts: 6
Using the java.net package, I am trying to read an HTML page that declares its Content-Type as

code:

<meta content="text/html; charset=euc-kr" http-equiv="Content-Type" />



It is critical for me to be able to read the charset given in the tag above.

Both urlConnection.getContentType() and urlConnection.getHeaderField("Content-Type") return just "text/html", which I believe is because these methods take their value from somewhere other than the <meta> tag shown above.

Is there a way of getting the values of <meta> tags beforehand, so that one can determine which charset to use while reading?

I need to read an HTML page and write it to an already initialized response object. For that, I have to determine the encoding of the HTML file.

Copying bytes directly from the InputStream to the response OutputStream, which would let me ignore the encoding, does not work: response.getWriter() has already been called, so response.getOutputStream() throws IllegalStateException.

Could someone please advise on ways to resolve the problem?
 
Rancher
Posts: 43081
77
You could read the complete page into a byte array, and then search that for the meta tag. (That assumes, of course, that the meta tag's characters are encoded the same way in ASCII as in the page's real encoding, which holds for ASCII-compatible encodings such as EUC-KR.)
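A minimal sketch of that idea, not a definitive implementation. The class name MetaCharsetSniffer, the method sniff, the regular expression, and the 4096-byte search window are all assumptions made for illustration. The bytes are decoded as ISO-8859-1 only to search them, since that charset maps every byte to exactly one character and so never fails to decode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {

    // Hypothetical pattern: finds charset=NAME inside a <meta ...> tag, e.g.
    // <meta content="text/html; charset=euc-kr" http-equiv="Content-Type" />
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns the charset named in a meta tag near the start of the page,
     * or the fallback if no tag is found or the name is not supported.
     */
    static Charset sniff(byte[] page, Charset fallback) {
        // Only look at the first few KB (assumed to be enough to cover <head>).
        int limit = Math.min(page.length, 4096);
        // ISO-8859-1 is used purely as a lossless byte-to-char mapping for the search.
        String head = new String(page, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the fallback.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta content=\"text/html; charset=euc-kr\" "
                + "http-equiv=\"Content-Type\" /></head><body>...</body></html>")
                .getBytes(StandardCharsets.US_ASCII);
        Charset cs = sniff(page, StandardCharsets.ISO_8859_1);
        // Decode the whole page again, this time with the detected charset.
        String html = new String(page, cs);
        System.out.println(cs + ": " + html.length() + " chars");
    }
}
```

A regular expression is not an HTML parser, so this will miss unusual markup; it is meant only to show the byte-array approach.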
 
Author and all-around good cowpoke
Posts: 13078
6

Both urlConnection.getContentType() and urlConnection.getHeaderField("Content-Type") return just "text/html", which I believe is because these methods take their value from somewhere other than the <meta> tag shown above.



Exactly! To see what is going on, I recommend you use the Firefox browser with the Firebug plugin, or one of the other plugins that show exactly what the browser is getting from where.

The LiveHttpHeaders plugin is also helpful.

Bill
 
Bartender
Posts: 9626
16
What does getContentEncoding() return?
I'd expect the content type to always be "text/html" for an HTML document. What you are changing is the character encoding.
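For reference, URLConnection.getContentEncoding() reports the Content-Encoding header (for example gzip), while a character encoding sent over HTTP travels as the charset parameter of the Content-Type header. Below is a small sketch of pulling that parameter out of a header value; the class ContentTypeCharset and the method charsetOf are hypothetical names, and the parsing is deliberately simplified.

```java
import java.util.Locale;

public class ContentTypeCharset {

    /**
     * Extracts the charset parameter from a Content-Type header value,
     * or returns null if the header is missing or has no charset parameter.
     */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type (e.g. text/html); parameters follow.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=euc-kr")); // prints euc-kr
        System.out.println(charsetOf("text/html"));                 // prints null
    }
}
```

With a header of just "text/html", as reported in this thread, charsetOf returns null, so the caller would still need another source for the encoding.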
 
Rudy B Baylor
Greenhorn
Posts: 6
Hi Joe,
getContentEncoding() returns null.
I have even tried URLConnection.guessContentTypeFromStream(urlConn.getInputStream()), but that returns null as well.

Throughout our application we have been using the charset parameter of the Content-Type HTTP header to specify the encoding, and that might be the reason Content-Encoding shows as null.



Hi William,
I have tried using FireBug as well as HttpFox to get a clue as to how the browser locates the required character encoding.
I checked the response headers, but they too show only Content-Type as text/html, with no attribute containing the character encoding. My guess is that the browser takes the same approach Ulf suggested for finding the correct encoding, i.e. reading the response as bytes, looking up the http-equiv meta values, and then decoding accordingly.
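Once the charset is known, the "decode accordingly" step can be sketched as below. This is only an illustration: the class DecodedCopy and the method copy are hypothetical names, and a StringWriter stands in for the Writer obtained from response.getWriter().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodedCopy {

    /**
     * Decodes the bytes of in using the given charset and writes the
     * resulting characters to out. Returns the number of chars written.
     */
    static long copy(InputStream in, Charset charset, Writer out) throws IOException {
        Reader reader = new InputStreamReader(in, charset);
        char[] buffer = new char[8192];
        long total = 0;
        int n;
        while ((n = reader.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Sample bytes; in the thread's scenario they would come from the URLConnection.
        byte[] page = "<html>h\u00e9llo</html>".getBytes(StandardCharsets.UTF_8);
        StringWriter out = new StringWriter(); // stand-in for response.getWriter()
        copy(new ByteArrayInputStream(page), StandardCharsets.UTF_8, out);
        System.out.println(out);
    }
}
```

Because the Writer deals in characters rather than bytes, the source page's encoding only matters for decoding. The encoding the response itself is sent in is configured separately, and in the servlet API setCharacterEncoding has no effect once getWriter() has been called.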
 
Joe Ess
Bartender
Posts: 9626
16

Originally posted by Rudy B Baylor:
Hi Joe
getContentEncoding() returns null



That's my best guess...
I can't find it now, but somewhere the HTML spec talks about how the server is responsible for turning that tag into a header, and it hinted that many servers may not do it correctly (or at all). You may want to check your server's docs and errata to see if that's the case.
Or take Ulf's suggestion, since that's where you'll end up if the above is true.
 