how to get rid of leading space when i use writeUTF

patrick tang
Ranch Hand

Joined: Dec 16, 2001
Posts: 44
hi all,
i have a problem with the writeUTF method of DataOutputStream. i was trying to write the text from a textarea into a file, and later retrieve the text from that file line by line and separate each line into string tokens using StringTokenizer.
however, it seems that when i write text like "a,b,1,2" to a file, the actual file contains something like " a,b,1,2" (or "*a,b,1,2" or "->a,b,1,2" when i print it to System.out). so i always get a ClassFormatError at runtime when i try to tokenize the first character.
i don't know what causes it to act like that. is it because of the writeUTF method itself? if so, is there any way i can get rid of the leading space? (i wish i could use sed/tr, but unfortunately i can't :roll: )
Rob Ross
Bartender

Joined: Jan 07, 2002
Posts: 2205
How are you reading the data back in? Are you taking into account the file encoding? You have to read the string back in using UTF encoding, otherwise you're going to be reading garbage data.
A UTF string starts with two length bytes, which might be what you're seeing at the start of your string. From the javadocs on writeUTF:
Writes two bytes of length information to the output stream, followed by the Java modified UTF representation of every character in the string s. If s is null, a NullPointerException is thrown. Each character in the string s is converted to a group of one, two, or three bytes, depending on the value of the character.
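To make the javadoc concrete, here is a minimal sketch that dumps the bytes writeUTF produces for the string from the original question. The class and helper names (WriteUtfPrefixDemo, writeUtfBytes, toHex) are made up for illustration, and an in-memory ByteArrayOutputStream stands in for the file.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class WriteUtfPrefixDemo {
    // Returns the raw bytes that DataOutputStream.writeUTF produces for s.
    static byte[] writeUtfBytes(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as two-digit hex values separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // prints: 00 07 61 2c 62 2c 31 2c 32
        System.out.println(toHex(writeUtfBytes("a,b,1,2")));
    }
}
```

The first two bytes (00 07) are the length of the encoded string, 7 bytes here. Printed as characters they are unprintable control codes, which would explain the stray leading "space" or symbol in front of "a,b,1,2".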


Rob
SCJP 1.4
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
To clarify - it isn't that UTF strings in general start with 2 length bytes. It's that UTF strings written by the DataOutput writeUTF() method start with 2 length bytes. The length bytes have nothing to do with UTF, but with the DataOutput interface specification. Which means that when you read the data back in, you need to use the readUTF() method of a DataInput class (DataInputStream or RandomAccessFile). This is true for most of the methods in DataOutput - they're designed to be read back with DataInput, and if you use other methods, you will find an assortment of strange effects.
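A rough sketch of the matched pair, assuming the data is held in a byte array rather than a file: the same bytes come back clean through readUTF(), while decoding them as plain text keeps the two length bytes as leading junk characters. The names ReadUtfRoundTrip, write, read, misread and tokens are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class ReadUtfRoundTrip {
    // Writes one string with DataOutputStream.writeUTF.
    static byte[] write(String line) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(line);
        }
        return buffer.toByteArray();
    }

    // Matching read: readUTF consumes the length bytes and returns only the text.
    static String read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        }
    }

    // Mismatched read: treats every byte, including the length prefix, as a character.
    static String misread(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    // Splits a comma-separated line into tokens.
    static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = write("a,b,1,2");
        System.out.println(tokens(read(data)));       // prints: [a, b, 1, 2]
        System.out.println(misread(data).length());   // prints: 9 (two junk chars + 7)
    }
}
```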


"I'm not back." - Bill Harding, Twister
Rob Ross
Bartender

Joined: Jan 07, 2002
Posts: 2205
Is writeUTF using UTF-8, or is that just yet another encoding scheme? I vaguely remember something about Java not really using UTF, but using a modified version called UTF-8, but I don't recall where it's used exactly, since internally all chars and strings are stored in Unicode.
patrick tang
Ranch Hand

Joined: Dec 16, 2001
Posts: 44
thanks rob and jim for the help. so generally speaking, InputStream/OutputStream for primitive datatypes and Reader/Writer for others like String, right? i'll try both.
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Java uses several encodings close to UTF-8, in different contexts. Class files use a modified UTF-8 in which (a) a null character (\u0000) is written using a two-byte format rather than one, so that there is never a byte with value 0, and (b) no UTF-8 encodings requiring more than three bytes for a character are used. The latter is tied to the fact that Java Unicode characters never exceed 0xFFFF, while the Unicode standard does in fact define higher characters. I don't have a good understanding of just how this works, but apparently we don't really need the higher characters - or they can be represented with combinations of shorter characters.
Anyway, the writeUTF() and readUTF() methods use the same format as the class files, plus the two length bytes. (I was just lazy when I wrote "UTF" rather than "UTF-8" in the preceding post.) You can also get "real" UTF-8 by using an InputStreamReader or OutputStreamWriter and setting the encoding to "UTF-8". This is how I usually work with UTF-8 files, and it's the main reason I made the previous post. If files are written using writeUTF() and read using an InputStreamReader with real UTF-8, you can get problems.
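The two differences can be seen side by side in a small sketch. It compares the bytes from writeUTF (with the length prefix stripped) against an OutputStreamWriter set to UTF-8, for a null character and for one character above 0xFFFF, which a Java String holds as a surrogate pair. ModifiedUtf8Demo and its method names are assumptions made for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Bytes written by writeUTF, with the two length bytes stripped off.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Bytes written by an OutputStreamWriter configured for standard UTF-8.
    static byte[] standardUtf8(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(buffer, StandardCharsets.UTF_8)) {
            out.write(s);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String nul = "\0";             // the null character, U+0000
        String clef = "\uD834\uDD1E";  // U+1D11E, stored as two Java chars
        System.out.println(Arrays.toString(modifiedUtf8(nul)));  // two bytes, none of them zero
        System.out.println(Arrays.toString(standardUtf8(nul)));  // a single zero byte
        System.out.println(modifiedUtf8(clef).length + " vs " + standardUtf8(clef).length);
    }
}
```

For plain ASCII text such as "a,b,1,2" both methods return identical bytes, so in that case the only visible difference between the two file formats is the length prefix.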
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Patrick - right. Well, you can use DataInputStream and DataOutputStream for strings, but only if the strings are decoded with the matching class. And be aware that the file won't be human-readable with a text editor like Notepad or vi - it will have lots of funny undisplayable characters mixed in. Which is fine if you had other binary data to store in the same file - it wouldn't have been readable anyway - but if you have nothing but strings, you probably want Readers and Writers.
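For the strings-only case, a Writer-based version might look like the sketch below. It assumes a temporary file and UTF-8 as the encoding; TextFileWriterDemo and writeLines are hypothetical names, and Files.newBufferedWriter is just one convenient way to get a Writer with an explicit charset.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TextFileWriterDemo {
    // Writes each line as UTF-8 text followed by a newline, with no length prefix.
    static void writeLines(Path file, String... lines) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                out.write(line);
                out.write('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("textarea", ".txt");
        writeLines(file, "a,b,1,2", "c,d,3,4");
        // The file now holds exactly the characters written, readable in any text editor.
        System.out.print(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        Files.delete(file);
    }
}
```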
Rob Ross
Bartender

Joined: Jan 07, 2002
Posts: 2205
How do non-java applications handle text encodings? How does a generic text editor know the encoding? Is there a suffix convention, or is there a standard header written to the file, etc?
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
There's not really a single standard convention. Most applications just assume you're using whatever the standard encoding is on that platform (ISO-8859-1 for most of us, I think). Some applications and file types have ways to communicate other encodings - e.g. in an XML file you can put an encoding declaration at the beginning. This is possible in HTML as well, but many web pages don't do this properly. So Internet Explorer allows you to manually set the encoding to be assumed as you view a page, using the View -> Encoding option. This is also a good way to look at many other types of text files to see how they're encoded - make a copy of the file with a .html extension, then open that copy with Internet Explorer. Aside from a few weird text-to-HTML transformations (like "<b>" being interpreted as markup rather than displayed as characters), you can look at how the text appears under various encodings, to get an idea what is most appropriate. Of course, you will also first want to make sure you've downloaded a number of relevant character fonts from update.windows.com - otherwise you just see empty squares, or question marks.
patrick tang
Ranch Hand

Joined: Dec 16, 2001
Posts: 44
hi jim and rob,
just now i used Reader/Writer to replace InputStream/OutputStream. the result is very good. thanks for the help.
actually it took me some time to figure out how to read the whole stuff from BufferedReader. it'd be nice if i could find a PrintReader class...
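There is no PrintReader in java.io, but a readLine() loop on a BufferedReader covers the "read the whole stuff" case. A minimal sketch, with ReadAllLines and readAllLines as made-up names and a StringReader standing in for the file:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class ReadAllLines {
    // Reads every line from the source until readLine() returns null at end of input.
    static List<String> readAllLines(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // prints: [a,b,1,2, c,d,3,4]
        System.out.println(readAllLines(new StringReader("a,b,1,2\nc,d,3,4\n")));
    }
}
```

Each returned line has its line terminator removed, so it can be handed straight to StringTokenizer.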
 