You are right that Java uses UTF. UTF stands for UCS Transformation Format. The API documentation for the DataInputStream class says, in part:
Data input streams and data output streams represent Unicode strings in a format that is a slight modification of UTF-8. (For more information, see X/Open Company Ltd., "File System Safe UCS Transformation Format (FSS_UTF)", X/Open Preliminary Specification, Document Number: P316. This information also appears in ISO/IEC 10646, Annex P.)
So if you are really interested, you might want to dig into the document mentioned above. But for the purposes of Java it is necessary (though it may not be sufficient) to understand that strings are represented as Unicode and that, in this modified UTF-8 format, all characters in the range '\u0001' to '\u007F' are represented by a single byte. The null character '\u0000' and characters in the range '\u0080' to '\u07FF' are represented by a pair of bytes. Characters in the range '\u0800' to '\uFFFF' are represented by three bytes. English text uses the single-byte (7-bit ASCII) characters. Accented Latin letters and other characters used by most European languages fall in the range '\u0080' to '\u07FF', so they take two bytes here, not one. Most characters of the Asian languages fall in the three-byte range. Using Unicode throughout in this way is part of how Java supports I18N.
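If you want to check the byte counts above for yourself, here is a minimal sketch. The class name ModifiedUtf8Demo and the helper encodedLength are made up for illustration; the only API calls used are DataOutputStream.writeUTF and ByteArrayOutputStream.size. Note that writeUTF first writes a two-byte length prefix, so the helper subtracts 2 to get the bytes used by the characters themselves.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ModifiedUtf8Demo {

    // Hypothetical helper: number of bytes writeUTF uses for the characters
    // of s, not counting the 2-byte length prefix that writeUTF writes first.
    static int encodedLength(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return buffer.size() - 2;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength("A"));      // '\u0041', range \u0001-\u007F: prints 1
        System.out.println(encodedLength("\u0000")); // null character: prints 2
        System.out.println(encodedLength("\u00E9")); // e with acute accent, range \u0080-\u07FF: prints 2
        System.out.println(encodedLength("\u4E2D")); // a CJK character, range \u0800-\uFFFF: prints 3
    }
}
```

The two-byte encoding of '\u0000' is the "slight modification" relative to standard UTF-8, which would encode it as a single zero byte.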