Hello, I need some clarification on a few concepts surrounding Unicode streams within a Java program. Here is the scenario: I have an XML file which is encoded as UTF-8. I need to read it as a Unicode stream. Characters that lie outside a particular Unicode range need to be replaced with their hex equivalents. I was thinking of reading the input stream a byte at a time and checking whether each value is within the range or outside it. However, depending on the character, it could be represented by more than one byte (UTF-8, I believe, uses between 1 and 4 bytes per character). How can I tell whether the byte I am reading stands on its own (i.e. a single-byte representation) or whether I need to read the next one to work out which character it is? Could anyone please shed some light? My head is spinning!
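If you want to interpret a stream as Unicode chars, you probably want a Reader or Writer rather than a raw byte stream. I recommend wrapping the stream in an InputStreamReader with the encoding specified explicitly. The reader does the multi-byte decoding for you, so each read() returns a char rather than a single byte, and you never have to work out how many bytes a character occupies.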
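A minimal sketch of that approach, under a few assumptions that are not in the original post: the class and method names (`Utf8Escaper`, `escapeOutsideRange`) are made up for illustration, "hex equivalent" is taken to mean an XML numeric character reference such as `&#xE9;`, and the allowed range is passed in as two code point bounds. It also uses `StandardCharsets`, which needs Java 7 or later; on older JDKs the charset name "UTF-8" can be passed as a string instead.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class Utf8Escaper {

    // Hypothetical helper: decodes the stream as UTF-8 and replaces every
    // code point outside [low, high] with a reference of the form &#xHEX;
    // (that output format is an assumption, not something the thread states).
    static String escapeOutsideRange(InputStream in, int low, int high)
            throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int c;
            while ((c = reader.read()) != -1) {
                int codePoint = c;
                // read() returns UTF-16 code units, so characters above
                // U+FFFF arrive as a high/low surrogate pair.
                if (Character.isHighSurrogate((char) c)) {
                    int next = reader.read();
                    if (next == -1 || !Character.isLowSurrogate((char) next)) {
                        throw new IOException("Unpaired surrogate in input");
                    }
                    codePoint = Character.toCodePoint((char) c, (char) next);
                }
                if (codePoint < low || codePoint > high) {
                    out.append("&#x")
                       .append(Integer.toHexString(codePoint).toUpperCase())
                       .append(';');
                } else {
                    out.appendCodePoint(codePoint);
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Keep plain ASCII (U+0000 to U+007F) and escape everything else.
        System.out.println(
                escapeOutsideRange(new ByteArrayInputStream(utf8), 0x00, 0x7F));
    }
}
```

The range check works on whole code points rather than on bytes, so it behaves the same however many bytes UTF-8 used for the character.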
"I'm not back." - Bill Harding, Twister
Jim, thanks for the response. This seems to be the way forward. Do you have any comment on the performance impact of these actions if the size of the XML is moderate (10-12K)?
"Do you have any comment on the performance impact of these actions if the size of the XML is moderate (10-12K)?"
Not really. 10-12K doesn't sound very big to me; I don't think you'll notice any performance problem (unless you're processing a lot of 10-12K files). If you find you need to speed things up and you're using JDK 1.4, you can read the file through a FileChannel instead and use the Charset class to convert between bytes and chars. It's probably unnecessary though.
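For reference, the FileChannel and Charset route might look roughly like the sketch below. The class and method names (`ChannelDecoder`, `readUtf8`) are hypothetical, and the code assumes the whole file fits comfortably in memory, which is reasonable at 10-12K. It is written with try-with-resources (Java 7 or later); on 1.4 itself the channel would need to be closed in a finally block.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;

class ChannelDecoder {

    // Hypothetical helper: reads the whole file through a FileChannel and
    // decodes the bytes as UTF-8 in one step.
    static String readUtf8(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path);
             FileChannel channel = in.getChannel()) {
            // Assumes a small file, so one buffer can hold all of it.
            ByteBuffer bytes = ByteBuffer.allocate((int) channel.size());
            while (bytes.hasRemaining() && channel.read(bytes) != -1) {
                // keep reading until the buffer is full or EOF is reached
            }
            bytes.flip();
            // Charset.decode substitutes a replacement character for
            // malformed input instead of throwing.
            return Charset.forName("UTF-8").decode(bytes).toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of a UTF-8 encoded file as the first argument.
        System.out.println(readUtf8(args[0]));
    }
}
```

Once the text is decoded, the range check is the same as with a Reader; only the way the bytes get turned into chars differs.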