How to get CharsetDecoder?

 
lee chan
Greenhorn
Hi All,

I have a set of .txt files which have different encodings. For example,

1) a.txt ---> encoding is ANSI
2) b.txt ---> encoding is UNICODE
3) c.txt ---> encoding is UTF-8

Now,

how can I read these files in a single class?

that is,

If the file path I enter in the console refers to a.txt...
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(dataInputStream));


If the file path I enter in the console refers to b.txt...
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(dataInputStream, "UNICODE"));


If the file path I enter in the console refers to c.txt...
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(dataInputStream, "UTF-8"));

How can I determine the encoding of any .txt file that is entered?

So that, instead of hardcoding UNICODE/UTF-8/ANSI, I can get the encoding in a variable and use that variable as the second parameter to InputStreamReader.
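
Roughly, I am after something like this sketch, where detectEncoding() is only a placeholder for whatever detection is actually possible:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class ReadWithDetectedEncoding {

    public static void main(String[] args) throws IOException {
        String path = args[0];

        // Hypothetical: some way of working out the charset name for this file,
        // e.g. "UTF-8", "UTF-16" or "windows-1252".
        String encoding = detectEncoding(path);

        BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), encoding));
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            System.out.println(line);
        }
        bufferedReader.close();
    }

    // Placeholder only -- this is the method I do not know how to write.
    private static String detectEncoding(String path) {
        return "UTF-8";
    }
}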

Please help me out.
 
Richard Tookey
Bartender

lee chan wrote:
How can I determine the encoding of any .txt file that is entered?

You can't with any certainty. As an indication of this: how could one tell the difference between bytes from the ISO-8859-x family? They all use one byte per character, and very frequently use the same byte values for different characters.
 
lee chan
Greenhorn
Hi Tookey,

Thanks for your reply. While saving a file, we can select an encoding option. Please find the attached snapshots.

So, how can I get the encodings of those two files in Java?
[Attachments: a_ANSI.JPG, b_UNICODE.JPG]
 
Campbell Ritchie
Marshal
Look at this, where you will find there is no such thing as a "Unicode" encoding. Whoever wrote that save dialogue made a mistake there. You cannot get the encoding from a file unless you recorded it somewhere.
 
Richard Tookey
Bartender

lee chan wrote:
So, how can I get the encodings of those two files in Java?

If you are only trying to discriminate between text files generated by Notepad with either ANSI encoding or UNICODE encoding, then the first two bytes of the UNICODE-encoded file will be (0xFF, 0xFE), which is a 'Byte Order Mark' or BOM. I stress that this only works when you want to discriminate between those two encodings and will not work for just any encoding.

Note: Java will not cleanly handle UNICODE-encoded files that have the BOM; it tries to interpret the (0xFF, 0xFE) BOM pair as characters. The easiest way to deal with this is to strip the first two bytes when reading the file. See http://code.google.com/p/train-graph/source/browse/trunk/src/org/paradise/etrc/data/BOMStripperInputStream.java?r=31 .
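
A rough sketch of that idea, assuming only Notepad's ANSI and UNICODE files need to be told apart (the windows-1252 name for "ANSI" is an assumption; on other locales ANSI means a different code page):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;

public class NotepadReaderSketch {

    // Opens a reader for a file saved by Notepad as either "ANSI" or
    // "Unicode" (little-endian UTF-16 with a BOM). The first two bytes are
    // inspected; a 0xFF,0xFE BOM is consumed so it never shows up as
    // characters in the stream.
    static BufferedReader openNotepadFile(File file) throws IOException {
        PushbackInputStream in =
                new PushbackInputStream(new FileInputStream(file), 2);
        byte[] start = new byte[2];
        int n = in.read(start);

        if (n == 2 && (start[0] & 0xFF) == 0xFF && (start[1] & 0xFF) == 0xFE) {
            // Notepad's "Unicode": the BOM has already been consumed,
            // so decode the rest as little-endian UTF-16.
            return new BufferedReader(new InputStreamReader(in, "UTF-16LE"));
        }

        // No UTF-16LE BOM: push the bytes back and assume "ANSI"
        // (windows-1252 here -- an assumption).
        if (n > 0) {
            in.unread(start, 0, n);
        }
        return new BufferedReader(new InputStreamReader(in, "windows-1252"));
    }
}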
 
Java Cowboy

lee chan wrote: Thanks for your reply. While saving a file, we can select an encoding option. Please find the attached snapshots.

Just because you can select an encoding while saving a file does not mean that you can find the encoding when reading the file. Text files do not explicitly store their encoding. As Richard Tookey says, there are different encodings which look a lot like each other, and there is no way to distinguish between them automatically.

There are libraries to guess the encoding, for example juniversalchardet. But these will not always guess the encoding correctly, because that's not possible in principle.
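
For what it's worth, a sketch of how juniversalchardet's UniversalDetector is typically used (treat the details as an assumption and check the project's own documentation):

import java.io.FileInputStream;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

public class GuessEncoding {

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[4096];
        UniversalDetector detector = new UniversalDetector(null);

        FileInputStream fis = new FileInputStream(args[0]);
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
        fis.close();
        detector.dataEnd();

        // May be null if the library could not make a guess.
        String encoding = detector.getDetectedCharset();
        System.out.println(encoding != null
                ? "Guessed encoding: " + encoding
                : "No encoding could be guessed");
        detector.reset();
    }
}

And even when it does return a charset name, treat it as a guess, not a fact.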
 
Winston Gutkowski
Bartender

lee chan wrote: How can I determine the encoding of any .txt file that is entered?

So, having read all the good advice so far, the next question is: Are you in control of the text files you're reading in?

If you are, the simplest thing to do would be to first change all the places that write those files to use a standardized file suffix documented by your system. E.g.:
  • .ansi.txt (by which, I assume, you mean Windows-1252)
  • .utf8.txt
  • .ucs2.txt (your 'Unicode' format, I suspect)
There may even be an existing suffix standard that you could use; but if not, my suggestion would be to keep it as simple and visual as possible.
Also, because UTF-8 and 7-bit ASCII can both be read as "UTF-8", you could simply use UTF-8 as the "default" (.txt), and use a suffix as above for anything that isn't UTF-8 or 7-bit ASCII, as in the sketch below.
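
For instance, a sketch of picking a charset from such a naming convention (the windows-1252 and UTF-16LE mappings are my assumptions, following the list above):

import java.nio.charset.Charset;

public class CharsetBySuffix {

    // Maps a file name following the suggested suffix convention to a Charset.
    static Charset charsetFor(String fileName) {
        if (fileName.endsWith(".ansi.txt")) {
            return Charset.forName("windows-1252"); // assumed meaning of "ANSI"
        } else if (fileName.endsWith(".ucs2.txt")) {
            return Charset.forName("UTF-16LE");     // Notepad's "Unicode"
        } else {
            return Charset.forName("UTF-8");        // the default for plain .txt
        }
    }
}

The returned Charset can be passed straight to InputStreamReader as its second argument.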

Which brings up a final point: there is ONE format that can be distinguished, but only by reading it in its entirety: 7-bit ASCII.
If no byte in the file has a value > 127, then it must be 7-bit ASCII. It may sound crude, but if you have thousands of existing files to "determine", you may find that it culls a large proportion of them, leaving you with only a few to worry about.
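
A minimal sketch of that check:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class AsciiCheck {

    // Returns true if no byte in the file has a value greater than 127.
    static boolean isSevenBitAscii(File file) throws IOException {
        InputStream in = new BufferedInputStream(new FileInputStream(file));
        try {
            int b;
            while ((b = in.read()) != -1) {
                if (b > 127) {
                    return false; // high bit set: not 7-bit ASCII
                }
            }
            return true;
        } finally {
            in.close();
        }
    }
}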

If you aren't in control of the files you receive, my suggestion would be to talk with your suppliers about instituting such a system. Alternatively, you could make BOMs (Byte Order Marks) mandatory; but I don't know whether they would cover all the styles you need.

Winston
 
Campbell Ritchie
Marshal

Winston Gutkowski wrote: . . . talk with your suppliers . . .

That is a good point. It is the responsibility of the supplier of a file to make sure it is legible, not the responsibility of the recipient to work out how to read it.
 
lee chan
Greenhorn
Thanks to all.

I started working as you guys suggested, and it is going smoothly.
 
Campbell Ritchie
Marshal
Well done
 