
Reading Tabs

 
Anthony Smith
Ranch Hand
Posts: 285
I have a text file that contains the following
(<TAB> is an actual TAB keystroke):
US<TAB> United States USA
CA<TAB> Canada CAN

I just wanted to be able to access the three fields on each line, so I did the following:
import java.io.*;

public class file
{
    public static void main(String[] args)
    {
        File csv = new File("wl.txt");
        try {
            DataInputStream in = new DataInputStream(
                new FileInputStream("wl.txt"));

            DataOutputStream out = new DataOutputStream(
                new FileOutputStream("w2.txt"));
            char chr;

            while (true) {

                StringBuffer country_code = new StringBuffer(2);
                while ((chr = in.readChar()) != '\t') {
                    country_code.append(chr);
                    System.out.println(chr);
                }
                System.out.println("CC: " + country_code);

                StringBuffer country_name = new StringBuffer(20);
                while ((chr = in.readChar()) != '\t') {
                    country_code.append(chr);
                }
                System.out.println("CN: " + country_name);

                StringBuffer district = new StringBuffer(20);
                char lineSep = System.getProperty("line.separator").charAt(0);

                while ((chr = in.readChar()) != lineSep) {
                    district.append(chr);
                }
                System.out.println("D: " + district);
            }
        }
        catch (EOFException e) {
            System.out.println(e);
        }
        // System.
        catch (Exception e) {
            System.out.println(e);
        }
    }
}
*************
When I look at the output of the following line, all I see is '?': System.out.println(chr);
What am I doing wrong?
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
The readChar() method of DataInputStream reads exactly two bytes and assumes that they are a Unicode representation of a character (one big-endian 16-bit char). The problem is, most text files aren't stored that way; they're usually in your system's default encoding. On Windows in the Americas and Western Europe this is usually Cp1252, which is Microsoft's version of the Latin-1 encoding (an extension of ASCII). It's a one-byte encoding, which means that the DataInputStream is grabbing two characters in Cp1252 and reinterpreting them as one Unicode char, which results in gibberish. Instead of DataInputStream, try a FileReader wrapped in a BufferedReader.
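To see the two-byte grab concretely, here is a small sketch. The class name ReadCharDemo and the helper firstReadChar are made up for illustration, and the text is fed from an in-memory byte array rather than from wl.txt, encoded one byte per character as a stand-in for a Cp1252 file.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original post.
class ReadCharDemo {

    // Encodes the text one byte per character (Latin-1 here, as a stand-in
    // for a Cp1252 file) and returns what a single readChar() call produces.
    static char firstReadChar(String text) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readChar(); // consumes TWO bytes: (byte0 << 8) | byte1
        }
    }

    public static void main(String[] args) throws IOException {
        // 'U' is 0x55 and 'S' is 0x53, so the pair is read as the single char 0x5553.
        char c = firstReadChar("US\tUnited States");
        System.out.printf("U+%04X%n", (int) c); // prints U+5553
    }
}
```

U+5553 is a CJK ideograph, and a console that can't display it will typically show '?' instead, which matches what you are seeing.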

The FileReader takes care of translating the system default encoding into characters, and the BufferedReader takes care of reading one line at a time. What you do with each line you've read is up to you...
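A minimal sketch of the FileReader-in-a-BufferedReader approach might look like the following. The class name TabFileReader and the helpers splitFields and readRows are invented for this example, and it assumes every field on a line is separated by a TAB (the sample in the question shows only one <TAB> per line, so adjust the split if the real file differs).

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original post.
class TabFileReader {

    // Splits one line on TAB characters and trims stray spaces around each field.
    static String[] splitFields(String line) {
        String[] fields = line.split("\t");
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
        }
        return fields;
    }

    // Reads the file line by line using the platform default encoding
    // (that is what FileReader does) and returns the fields of each non-blank line.
    static List<String[]> readRows(String fileName) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = in.readLine()) != null) { // null signals end of file
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                rows.add(splitFields(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String fileName = args.length > 0 ? args[0] : "wl.txt";
        try {
            for (String[] fields : readRows(fileName)) {
                System.out.println(String.join(" | ", fields));
            }
        } catch (IOException e) {
            System.err.println("Could not read " + fileName + ": " + e);
        }
    }
}
```

Because readLine() returns null at the end of the file, there is no need to catch EOFException or to look up line.separator yourself.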
 