Character encoding

Thomas Mcfarrow
Ranch Hand

Joined: Jul 09, 2001
Posts: 137
How do I check which character encoding a file uses?
Thanks in advance.
John Dale
Ranch Hand

Joined: Feb 22, 2001
Posts: 399
One approach is to use what you know about which encodings might be used for the file, and then examine the content of the file to see which of these is most likely.
In most of Western Europe and the Americas, CP1252 is typical for Windows, with some Windows programs using Unicode (UTF-16), and most Unix programs using ISO 8859-1, I think. You may see CP437 or CP850 from DOS or some e-mail applications.
If you can be a bit more specific about what encodings (or platforms or countries) might be involved, people may be able to offer more specific suggestions.
Thomas Mcfarrow
Ranch Hand

Joined: Jul 09, 2001
Posts: 137
ASCII and IBM-1047 (EBCDIC)
John Dale
Ranch Hand

Joined: Feb 22, 2001
Posts: 399
One approach: Loop through the data bytes, classifying and counting bytes as probablyEbcdic if they fall in the range of values used by EBCDIC for, say, '0'-'9', 'A'-'I', 'b'-'i', and probablyISO if they fall in the range used by ASCII for those characters. Those character ranges avoid complications due to control characters and special characters like @, space, or underscore. You'll usually end up with many more bytes in one class than the other before you get very deep into the file.
You might also watch for every other byte being zero, which suggests UTF-16 data (Unicode).
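A rough sketch of that counting loop, in case it helps. The class and method names (ByteRangeVote, guess) are made up for illustration, the 80% threshold for the zero-byte check is an arbitrary assumption, and the lowercase range here starts at 'a' rather than 'b' for simplicity. The byte values are the ones ASCII and EBCDIC use for those characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: votes ASCII vs. EBCDIC by counting bytes in "safe" ranges.
final class ByteRangeVote {

    /** True for the ASCII codes of '0'-'9', 'A'-'I', and 'a'-'i'. */
    static boolean looksAscii(int b) {
        return (b >= 0x30 && b <= 0x39)    // '0'-'9'
            || (b >= 0x41 && b <= 0x49)    // 'A'-'I'
            || (b >= 0x61 && b <= 0x69);   // 'a'-'i'
    }

    /** True for the EBCDIC codes of '0'-'9', 'A'-'I', and 'a'-'i'. */
    static boolean looksEbcdic(int b) {
        return (b >= 0xF0 && b <= 0xF9)    // '0'-'9'
            || (b >= 0xC1 && b <= 0xC9)    // 'A'-'I'
            || (b >= 0x81 && b <= 0x89);   // 'a'-'i'
    }

    /** Returns "ASCII", "EBCDIC", "UTF-16?", or "unknown". */
    static String guess(byte[] data) {
        int ascii = 0, ebcdic = 0, zeroEven = 0, zeroOdd = 0;
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF;        // treat the byte as unsigned
            if (b == 0) {
                if (i % 2 == 0) zeroEven++; else zeroOdd++;
            }
            if (looksAscii(b)) ascii++;
            if (looksEbcdic(b)) ebcdic++;
        }
        // "Every other byte is zero": assumed threshold of 80% of byte pairs.
        int pairs = data.length / 2;
        if (pairs > 0 && (zeroEven * 5 >= pairs * 4 || zeroOdd * 5 >= pairs * 4)) {
            return "UTF-16?";
        }
        if (ascii == ebcdic) return "unknown";
        return ascii > ebcdic ? "ASCII" : "EBCDIC";
    }

    public static void main(String[] args) {
        byte[] sample = "Hello 12345 abcdefghi".getBytes(StandardCharsets.US_ASCII);
        System.out.println(guess(sample));
    }
}
```

For real files you would probably read only the first few hundred bytes and treat a close count as "unknown" rather than trusting a small margin.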
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Do the files with EBCDIC have any distinctive file suffix or naming convention? Or were they located in particular directories? Probably not, or you wouldn't be asking this. But if you have no way to determine the encoding other than by examining the contents of the file, you basically will have to guess. This isn't as bad as it could be, since ASCII and EBCDIC are different enough that you can make a fairly good guess based on statistical properties of the file.
Here's what I'd do to start: read the first few hundred bytes into an array of bytes. (Or the whole file, if time and/or memory aren't an issue.) Now assume that the encoding is EBCDIC, and convert these bytes to a String. You may get lucky and get some sort of encoding exception, telling you right away that EBCDIC won't work here. Otherwise collect some statistics about patterns found in the characters (see below). Then assume that the encoding was ASCII, and repeat the process. Compare the statistics from the two assumptions, and decide which statistics are more consistent with a "normal" text file. Use this to make an educated guess which encoding is more likely, and proceed with that assumption.
So, what sort of "statistics" are appropriate here? You'll have to experiment a bit to see what works well for your situation. As a first pass, I'd use methods like Character.isLetter(), Character.isDigit(), Character.isWhitespace() to determine what percentage of the chars are in each category. I'd guess that in most text files, at least 90% of all characters will be letters, numbers, or whitespace. The remaining categories are punctuation and control chars - if you see that there are a lot of these, you've probably got the wrong encoding.
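Something along these lines might do as a first pass. It is only a sketch: the names DecodeAndScore, plausibility and moreLikely are invented for the example, and the charset name "IBM1047" is an assumption that may not be available in every Java runtime, hence the Charset.isSupported check. Note also that new String(bytes, charset) does not throw on bad input; it quietly substitutes replacement characters, so if you want the exception you would need a CharsetDecoder configured with CodingErrorAction.REPORT.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decode the same bytes two ways and compare simple statistics.
final class DecodeAndScore {

    /** Fraction of chars that are letters, digits, or whitespace (0.0 for empty text). */
    static double plausibility(String text) {
        if (text.isEmpty()) return 0.0;
        int ok = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c) || Character.isDigit(c) || Character.isWhitespace(c)) {
                ok++;
            }
        }
        return (double) ok / text.length();
    }

    /** Returns whichever charset gives the more "text-like" decoding; ties go to a. */
    static Charset moreLikely(byte[] sample, Charset a, Charset b) {
        double scoreA = plausibility(new String(sample, a));
        double scoreB = plausibility(new String(sample, b));
        return scoreA >= scoreB ? a : b;
    }

    public static void main(String[] args) {
        byte[] sample = "The quick brown fox jumps over 13 lazy dogs"
                .getBytes(StandardCharsets.US_ASCII);
        if (Charset.isSupported("IBM1047")) {   // assumed name for the EBCDIC code page
            Charset ebcdic = Charset.forName("IBM1047");
            System.out.println(moreLikely(sample, StandardCharsets.US_ASCII, ebcdic));
        } else {
            System.out.println("IBM1047 is not available in this runtime");
        }
    }
}
```

A single ratio is crude; you could track letters, digits and whitespace separately and compare against the 90% rule of thumb above.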
You can also use regular expressions to look for common patterns. Try matching the following pattern as many times as you can to see how much of the text looks like words separated by whitespace:
"\s+[a-zA-Z]+"
Of course you can refine the expression to allow for words with apostrophes and hyphens, or other possibilities. This approach can work well if you know something about the expected content of the files in question. If you're expecting words, look for words; if you're expecting comma-separated fields of numbers, look for comma-separated fields of numbers. If there's a lot of variety in the files you're dealing with, it may take a bit of experimentation to find an approach that works well here. Good luck.
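Here is one way the pattern matching could be sketched, using java.util.regex. The class name WordPatternScore and the idea of measuring "coverage" (what fraction of the text the matches account for) are my own assumptions for illustration, not the only way to turn the matches into a score. In a Java string literal the backslash has to be doubled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: how much of the text looks like whitespace-separated words?
final class WordPatternScore {

    // Same pattern as above: whitespace followed by a run of ASCII letters.
    private static final Pattern WORD = Pattern.compile("\\s+[a-zA-Z]+");

    /** Fraction of the text covered by matches of WORD (0.0 for empty text). */
    static double wordCoverage(String text) {
        if (text.isEmpty()) return 0.0;
        Matcher m = WORD.matcher(text);
        int covered = 0;
        while (m.find()) {
            covered += m.end() - m.start();   // length of this match
        }
        return (double) covered / text.length();
    }

    public static void main(String[] args) {
        System.out.println(wordCoverage(" the quick brown fox"));
        System.out.println(wordCoverage("@@@@ #### $$$$"));
    }
}
```

Because the pattern requires leading whitespace, the very first word of a text is not counted unless something precedes it. You could run this on the String produced under each encoding assumption and prefer the one with the higher coverage.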


"I'm not back." - Bill Harding, Twister
Mapraputa Is
Leverager of our synergies
Sheriff

Joined: Aug 26, 2000
Posts: 10065
another idea


Uncontrolled vocabularies
"I try my best to make *all* my posts nice, even when I feel upset" -- Philippe Maquet