How can I find out the encoding of any given .txt file?
That way, instead of hardcoding UNICODE/UTF-8/ANSI, I can store the encoding in a variable and pass that variable as the second parameter to InputStreamReader.
Please help me out.
Richard Tookey
Ranch Hand
Joined: Aug 27, 2012
Posts: 361
posted
0
lee chan wrote:
How can I find out the encoding of any given .txt file?
You can't with any certainty. As an indication of this - how could one tell the difference between bytes from the ISO-8859-x family? They all use one byte per character and very very frequently use the same byte values for different characters.
lee chan
Greenhorn
Joined: Jun 14, 2011
Posts: 15
posted
0
Hi Tookey,
Thanks for your reply. When saving a file, we can select an option called Encoding. Please find attached snapshots.
So, how can I get the encoding formats of those two files in Java?
Campbell Ritchie
Sheriff
Joined: Oct 13, 2005
Posts: 32599
4
posted
0
Look at this, where you find there is no such thing as Unicode encoding. Whoever wrote that save dialogue made a mistake there. You cannot get the encoding from a file, unless you recorded it somewhere.
Richard Tookey
Ranch Hand
Joined: Aug 27, 2012
Posts: 361
posted
0
lee chan wrote:
So, how can I get the encoding formats of those two files in Java?
If you are only trying to discriminate between text files generated by Notepad with either ANSI encoding or UNICODE encoding, then the first two bytes of the UNICODE encoded file will be (0xff, 0xfe), which is a 'Byte Order Mark' or BOM. I stress that this only works when you want to discriminate between those two encodings and will not work for just any old encoding.
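A minimal sketch of that check, assuming Java 11 or later (for `InputStream.readNBytes`). The class and method names are made up for illustration, and treating Notepad's "ANSI" as windows-1252 is an assumption that depends on the Windows locale. It only separates the two Notepad cases discussed here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; names are illustrative, not from any library.
final class NotepadBomSniffer {

    // Assumption: Notepad "ANSI" means windows-1252 on the machine that wrote the file.
    static final Charset ANSI = Charset.forName("windows-1252");

    /** Picks a charset from the leading bytes: 0xFF 0xFE means Notepad "Unicode". */
    static Charset fromLeadingBytes(byte[] head) {
        // Java bytes are signed, so mask with 0xFF before comparing.
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            // Java's "UTF-16" decoder reads the BOM itself and drops it from the text.
            return StandardCharsets.UTF_16;
        }
        return ANSI;
    }

    /** Reads only the first two bytes of the file and applies the check above. */
    static Charset sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return fromLeadingBytes(in.readNBytes(2));
        }
    }
}
```

The result can then be used as the second argument, for example `new InputStreamReader(Files.newInputStream(file), NotepadBomSniffer.sniff(file))`.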
lee chan wrote: Thanks for your reply. When saving a file, we can select an option called Encoding. Please find attached snapshots.
Just because you can select an encoding while saving the file does not mean that you can find the encoding when reading the file. Text files do not explicitly store the encoding. As Richard Tookey says, there are different encodings which look a lot like each other, and there is no way to distinguish between them automatically.
There are libraries to guess the encoding, for example juniversalchardet. But these will not always guess the encoding correctly, because that's not possible in principle.
lee chan wrote: How can I find out the encoding of any given .txt file?
So, having read all the good advice so far, the next question is: Are you in control of the text files you're reading in?
If you are, the simplest thing to do would be to first change all the places that write those files to use a standardized file suffix documented by your system. E.g.:
.ansi.txt (by which, I assume you mean Windows-1252)
.utf8.txt
.ucs2.txt (your 'Unicode' format, I suspect)
There may even be an existing suffix standard that you could use; but if not, my suggestion would be to keep it as simple and visual as possible.
Also, because UTF-8 and 7-bit ASCII can both be read as "UTF-8", you could simply use it as the "default" (.txt), and use a suffix as above for anything that isn't UTF-8 or 7-bit ASCII.
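As a rough sketch of how the reading side of such a convention might look: the suffixes are the ones suggested above, but the class and method names are hypothetical. Reading ".ucs2.txt" with Java's UTF-16 charset is an assumption (it relies on the file starting with a BOM, otherwise Java assumes big-endian), as is mapping ".ansi.txt" to windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper; the suffix convention is whatever your system documents.
final class SuffixCharsets {

    /** Maps a file name to a charset by its suffix, defaulting to UTF-8. */
    static Charset forFileName(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".ansi.txt")) {
            // Assumption: "ANSI" means windows-1252.
            return Charset.forName("windows-1252");
        }
        if (lower.endsWith(".ucs2.txt")) {
            // Assumption: the file carries a BOM, which the UTF-16 decoder honours.
            return StandardCharsets.UTF_16;
        }
        // ".utf8.txt" and plain ".txt" (UTF-8 or 7-bit ASCII) both land here.
        return StandardCharsets.UTF_8;
    }
}
```

The returned value can go straight into `new InputStreamReader(stream, SuffixCharsets.forFileName(name))`, so nothing is hardcoded at the call site.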
Which brings up a final point: There is ONE format that can be distinguished, but only by reading it in its entirety: 7-bit ASCII.
If no byte in the file has a value > 127, then it must be 7-bit ASCII. It may sound crude, but if you have thousands of existing files to "determine", you may find that it culls a large proportion of them, leaving you with only a few to worry about.
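A minimal sketch of that culling pass; the class and method names are made up for illustration. One caveat worth stating: a UTF-16 file of plain English text with no BOM would also pass, since its bytes are all below 128, so treat the result as a strong hint rather than proof.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; names are illustrative only.
final class AsciiCheck {

    /** Returns true if no byte in the array has a value above 127. */
    static boolean isSevenBitAscii(byte[] data) {
        for (byte b : data) {
            // Java bytes are signed, so values 128..255 show up as negatives.
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    /** Streams the file and stops at the first byte above 127. */
    static boolean isSevenBitAscii(Path file) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int b;
            while ((b = in.read()) != -1) { // read() returns 0..255, or -1 at end of file
                if (b > 127) {
                    return false;
                }
            }
            return true;
        }
    }
}
```

Files that pass can be read as US-ASCII or UTF-8 interchangeably, which leaves only the remainder to sort out by other means.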
If you aren't in control of the files you receive, my suggestion would be to talk with your suppliers about instituting such a system. Alternatively, you could make BOMs (Byte Order Marks) mandatory; but I don't know whether they would cover all the styles you need.
Winston
Isn't it funny how there's always time and money enough to do it WRONG?
Campbell Ritchie
Sheriff
Joined: Oct 13, 2005
Posts: 32599
4
posted
0
Winston Gutkowski wrote: . . . talk with your suppliers . . .
That is a good point. It is the responsibility of the supplier of a file to make sure it is legible, not the responsibility of the recipient to work out how to read it.
lee chan
Greenhorn
Joined: Jun 14, 2011
Posts: 15
posted
0
Thanks to all.
I started work as you guys suggested. My work is going smoothly.
Campbell Ritchie
Sheriff
Joined: Oct 13, 2005
Posts: 32599
4
posted
0
Well done