How to get CharsetDecoder?

lee chan
Greenhorn

Joined: Jun 14, 2011
Posts: 15
Hi All,

I have a set of .txt files that use different encodings. For example,

1) a.txt ---> Encoding version is ANSI
2) b.txt ---> Encoding version is UNICODE
3) c.txt ---> Encoding version is UTF-8

Now,

how can I read these files in a single class?

that is,

If the file path I entered in the console is related to a.txt...
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(dataInputStream));


If the file path I entered in the console is related to b.txt...
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(dataInputStream, "UNICODE"));


If the file path I entered in the console is related to c.txt...
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(dataInputStream, "UTF-8"));

How can I find out the encoding of whichever .txt file is entered?

So that, instead of hardcoding UNICODE/UTF-8/ANSI, I can get the encoding in a variable and pass that variable as the second parameter to InputStreamReader.

Please help me out.
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1084
    

lee chan wrote:
How can I find out the encoding of whichever .txt file is entered?


You can't with any certainty. As an indication of this - how could one tell the difference between bytes from the ISO-8859-x family? They all use one byte per character and very very frequently use the same byte values for different characters.
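
To make that concrete, here is a minimal sketch (the class and method names are made up for illustration, and it assumes your JRE ships the ISO-8859-5 and ISO-8859-7 charsets, which are not among the six that every Java platform guarantees). The same single byte decodes without error in each charset, just to a different character:

```java
import java.nio.charset.Charset;

// Illustration only: the class name is hypothetical, not from the thread.
final class SameByteDifferentChar {

    // Decode one byte using the charset with the given name.
    static String decode(byte b, String charsetName) {
        return new String(new byte[] { b }, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE4;
        // Every one of these decodes "successfully", so the byte alone
        // cannot tell you which charset the writer had in mind.
        for (String name : new String[] { "ISO-8859-1", "ISO-8859-5", "ISO-8859-7" }) {
            System.out.println(name + " -> " + decode(b, name));
        }
    }
}
```

Byte 0xE4 comes out as a-umlaut (U+00E4) in ISO-8859-1, Cyrillic ef (U+0444) in ISO-8859-5 and Greek delta (U+03B4) in ISO-8859-7. None of the three decodings fails, which is why no program can pick the "right" one from the bytes alone.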
lee chan
Greenhorn

Joined: Jun 14, 2011
Posts: 15
Hi Tookey,

Thanks for your reply. When saving a file, we can select an encoding option. Please see the attached screenshots.

So, how can I get the encodings of those two files in Java?


[Thumbnail for a_ANSI.JPG]

[Thumbnail for b_UNICODE.JPG]

Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39834
    
Look at this, where you find there is no such thing as Unicode encoding. Whoever wrote that save dialogue made a mistake there. You cannot get the encoding from a file, unless you recorded it somewhere.
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1084
    

lee chan wrote:
So, how can I get the encodings of those two files in Java?


If you are only trying to discriminate between text files generated by Notepad with either ANSI encoding or UNICODE encoding, then the first two bytes of the UNICODE-encoded file will be (0xff,0xfe), which is a 'Byte Order Mark' or BOM. I stress that this only works when you want to discriminate between those two encodings and will not work for just any old encoding.

Note - Java will not always cleanly handle encoded files that have a BOM - depending on the charset name you pass, it may interpret the BOM as a character. The easiest way to deal with this is to just strip the BOM bytes when reading the file. See http://code.google.com/p/train-graph/source/browse/trunk/src/org/paradise/etrc/data/BOMStripperInputStream.java?r=31.
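
For what it's worth, a BOM check along those lines could be sketched like this. This is not the linked BOMStripperInputStream; BomSniffer and open are hypothetical names, and the fallback charset is an assumption you have to supply yourself (for Notepad's "ANSI" that would typically be windows-1252). It only recognises the UTF-8 and UTF-16 BOMs and says nothing about files without one:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK or of the linked project.
final class BomSniffer {

    // Returns a Reader positioned just after any recognised BOM.
    // If there is no BOM, nothing is skipped and 'fallback' is used.
    static Reader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // A single read() may return fewer bytes than requested, so loop.
        while (n < head.length) {
            int r = pin.read(head, n, head.length - n);
            if (r < 0) {
                break; // end of stream: file shorter than 3 bytes
            }
            n += r;
        }

        Charset charset = fallback;
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE; // what Notepad calls "Unicode"
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        }

        // Push back everything that was read beyond the BOM itself.
        if (n > bomLength) {
            pin.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(pin, charset);
    }
}
```

You would then wrap the result as usual, e.g. new BufferedReader(BomSniffer.open(dataInputStream, Charset.forName("windows-1252"))). Because the BOM bytes are consumed before the decoder sees them, it does not matter whether a particular Java charset strips the BOM on its own.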
Jesper de Jong
Java Cowboy
Saloon Keeper

Joined: Aug 16, 2005
Posts: 14347
    

lee chan wrote:Thanks for your reply. When saving a file, we can select an encoding option. Please see the attached screenshots.

Just because you can select an encoding while saving the file does not mean that you can find the encoding when reading the file. Text files do not explicitly store the encoding. As Richard Tookey says, there are different encodings which look a lot like each other, and there is no way to reliably distinguish between them automatically.

There are libraries to guess the encoding, for example juniversalchardet. But these will not always guess the encoding correctly, because that's not possible in principle.
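
If you only want a rough check with the standard library, one much cruder heuristic (this is not how juniversalchardet works; the class and method names are hypothetical) is to ask a strict CharsetDecoder whether the bytes are valid in a candidate charset at all:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: a validity check, not an encoding detector.
final class StrictDecodeCheck {

    // True if 'bytes' decode in 'charset' without any malformed or
    // unmappable input. False means the charset is definitely wrong;
    // true only means it is not ruled out.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

This is reasonably useful for ruling UTF-8 in or out, since text in a single-byte encoding that contains accented characters is rarely valid UTF-8. It is useless for telling single-byte encodings apart: ISO-8859-1, for example, accepts every possible byte sequence, so the check always returns true for it.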


Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8223
    

lee chan wrote:How can I find out the encoding of whichever .txt file is entered?

So, having read all the good advice so far, the next question is: Are you in control of the text files you're reading in?

If you are, the simplest thing to do would be to first change all the places that write those files to use a standardized file suffix documented by your system. Eg:
.ansi.txt (by which, I assume, you mean Windows-1252)
.utf8.txt
.ucs2.txt (your 'Unicode' format, I suspect)
There may even be an existing suffix standard that you could use; but if not, my suggestion would be to keep it as simple and visual as possible.
Also, because UTF-8 and 7-bit ASCII can both be read as "UTF-8", you could simply use it as the "default" (.txt), and use a suffix as above for anything that isn't UTF-8 or 7-bit ASCII.
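
As a rough sketch of that convention (SuffixCharsets and charsetFor are made-up names, the suffixes are the ones suggested above, and mapping ".ansi" to windows-1252 and ".ucs2" to Java's UTF-16 are assumptions you should check against your own files):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper implementing the suffix convention sketched above.
final class SuffixCharsets {

    // Pick a charset from the file name; plain ".txt" defaults to UTF-8,
    // which also reads 7-bit ASCII correctly.
    static Charset charsetFor(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".ansi.txt")) {
            // Assumption: "ANSI" means windows-1252 on the writing machine.
            return Charset.forName("windows-1252");
        }
        if (name.endsWith(".ucs2.txt")) {
            // Java's UTF-16 decoder uses a leading BOM to pick the byte order
            // and assumes big-endian when there is none.
            return StandardCharsets.UTF_16;
        }
        // Covers ".utf8.txt" and everything else.
        return StandardCharsets.UTF_8;
    }
}
```

The returned Charset can be passed straight to the InputStreamReader constructor as its second argument, so the reading code needs no hardcoded encoding name.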

Which brings up a final point: There is ONE format that can be distinguished, but only by reading it in its entirety: 7-bit ASCII.
If no byte in the file has a value > 127, then it must be 7-bit ASCII. It may sound crude, but if you have thousands of existing files to "determine", you may find that it culls a large proportion of them, leaving you with only a few to worry about.
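
A minimal sketch of that scan (AsciiCheck and isSevenBitAscii are hypothetical names) could look like this:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for the "cull the pure-ASCII files" idea.
final class AsciiCheck {

    // Reads the whole stream; returns false as soon as a byte > 127 is seen.
    static boolean isSevenBitAscii(InputStream in) throws IOException {
        InputStream buffered = new BufferedInputStream(in);
        int b;
        // read() returns the byte as an int in 0..255, or -1 at end of stream.
        while ((b = buffered.read()) != -1) {
            if (b > 127) {
                return false;
            }
        }
        return true;
    }
}
```

One caveat worth keeping in mind: a UTF-16 file without a BOM that contains only ASCII characters also has no byte above 127 (every second byte is zero), so you may want to treat files containing zero bytes as suspicious too.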

If you aren't in control of the files you receive, my suggestion would be to talk with your suppliers about instituting such a system. Alternatively, you could make BOMs (Byte Order Marks) mandatory; but I don't know whether they would cover all the styles you need.

Winston


Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39834
    
Winston Gutkowski wrote: . . . talk with your suppliers . . .
That is a good point. It is the responsibility of the supplier of a file to make sure it is legible, not the responsibility of the recipient to work out how to read it.
lee chan
Greenhorn

Joined: Jun 14, 2011
Posts: 15
Thanks to all.

I started work as you guys suggested. My work is going smoothly.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39834
    
Well done
 