One approach is to start from what you know about which encodings might have been used for the file, and then examine the content of the file to see which of these is most likely. In most of Western Europe and the Americas, CP1252 is typical for Windows, with some Windows programs using Unicode (UTF-16), and most Unix programs using ISO 8859-1, I think. You may see CP437 or CP850 from DOS or some e-mail applications. If you can be a bit more specific about what encodings (or platforms or countries) might be involved, people may be able to offer more specific suggestions.
Joined: Jul 09, 2001
ASCII and IBM-1047 (EBCDIC)
Joined: Feb 22, 2001
One approach: Loop through the data bytes, classifying and counting bytes as probablyEbcdic if they fall in the range of values used by EBCDIC for, say, '0'-'9', 'A'-'I', 'a'-'i', and probablyISO if they fall in the range used by ASCII for those characters. Those character ranges avoid complications due to control characters and special characters like @, space, or underscore. You'll usually end up with many more bytes in one class than the other before you get very deep into the file. You might also watch for every other byte being zero, which suggests UTF-16 data (Unicode).
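Here is a rough sketch of that counting loop, in case it helps. The class and method names are made up for illustration, and the EBCDIC ranges (0xF0-0xF9 for digits, 0xC1-0xC9 for 'A'-'I', 0x81-0x89 for 'a'-'i') are the usual ones for code pages such as IBM-1047. The UTF-16 check and its threshold are my own crude assumption, so tune them against your real files.

```java
class ByteRangeGuess {
    /** EBCDIC codes for '0'-'9', 'A'-'I' and 'a'-'i'. */
    static boolean probablyEbcdic(int b) {
        return (b >= 0xF0 && b <= 0xF9)
            || (b >= 0xC1 && b <= 0xC9)
            || (b >= 0x81 && b <= 0x89);
    }

    /** ASCII codes for the same characters. */
    static boolean probablyAscii(int b) {
        return (b >= '0' && b <= '9')
            || (b >= 'A' && b <= 'I')
            || (b >= 'a' && b <= 'i');
    }

    /** Returns "UTF-16?", "EBCDIC", "ASCII" or "unknown" for a sample of bytes. */
    static String guess(byte[] data) {
        int ebcdic = 0, ascii = 0, evenZeros = 0, oddZeros = 0;
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF;          // treat the byte as unsigned
            if (b == 0) {
                if (i % 2 == 0) evenZeros++; else oddZeros++;
            } else if (probablyEbcdic(b)) {
                ebcdic++;
            } else if (probablyAscii(b)) {
                ascii++;
            }
        }
        // Assumed threshold: at least half of the even (or odd) positions are zero.
        if (data.length >= 4 && Math.max(evenZeros, oddZeros) * 4 >= data.length) {
            return "UTF-16?";
        }
        if (ebcdic > ascii) return "EBCDIC";
        if (ascii > ebcdic) return "ASCII";
        return "unknown";
    }
}
```

Call `ByteRangeGuess.guess(...)` on the first few hundred bytes of the file. If the answer is "unknown", read more bytes or fall back to another test.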
Do the files with EBCDIC have any distinctive file suffix or naming convention? Or were they located in particular directories? Probably not, or you wouldn't be asking this. But if you have no way to determine the encoding other than by examining the contents of the file, you basically will have to guess. This isn't as bad as it could be, since ASCII and EBCDIC are different enough that you can make a fairly good guess based on statistical properties of the file. Here's what I'd do to start: Read the first few hundred bytes into an array of bytes. (Or the whole file, if time and/or memory aren't an issue.) Now assume that the encoding is EBCDIC, and convert these bytes to a String. You may get lucky and get some sort of encoding exception, telling you right away that EBCDIC won't work here (though note that the String constructors silently substitute a replacement character for bad bytes; you need a strict CharsetDecoder to get an exception). Otherwise collect some statistics about patterns found in the characters (see below). Then assume that the encoding was ASCII, and repeat the process. Compare the statistics from the two assumptions, and decide which statistics are more consistent with a "normal" text file. Use this to make an educated guess which encoding is more likely, and proceed with that assumption. So, what sort of "statistics" are appropriate here? You'll have to experiment a bit to see what works well for your situation. As a first pass, I'd use methods like Character.isLetter(), Character.isDigit(), Character.isWhitespace() to determine what percentage of the chars are in each category. I'd guess that in most text files, at least 90% of all characters will be letters, numbers, or whitespace. The remaining categories are punctuation and control chars. If you see that there are a lot of these, you've probably got the wrong encoding. You can also use regular expressions to look for common patterns.
Try matching the following pattern as many times as you can to see how much of the text looks like words separated by whitespace: "\s+[a-zA-Z]+" Of course you can refine the expression to allow for words with apostrophes and hyphens, or other possibilities. This approach can work well if you know something about the expected content of the files in question. If you're expecting words, look for words; if you're expecting comma-separated fields of numbers, look for comma-separated fields of numbers. If there's a lot of variety in the files you're dealing with, it may take a bit of experimentation to find an approach that works well here. Good luck.
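To make the decode-and-compare idea a bit more concrete, here is a minimal sketch. The class name `EncodingStats` and its method names are invented for this example, and picking the winner by the letter/digit/whitespace fraction alone is just one possible rule. It uses `new String(bytes, charset)`, which replaces undecodable bytes with U+FFFD instead of throwing, so a wrong guess shows up as a low score rather than an exception.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EncodingStats {
    // The pattern from the post: whitespace followed by a run of letters.
    private static final Pattern WORD = Pattern.compile("\\s+[a-zA-Z]+");

    /** Fraction of chars that are letters, digits or whitespace (0.0 for empty text). */
    static double plainTextFraction(String text) {
        if (text.isEmpty()) return 0.0;
        int plain = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c) || Character.isDigit(c) || Character.isWhitespace(c)) {
                plain++;
            }
        }
        return (double) plain / text.length();
    }

    /** Number of non-overlapping matches of the word pattern. */
    static int wordCount(String text) {
        Matcher m = WORD.matcher(text);
        int count = 0;
        while (m.find()) count++;
        return count;
    }

    /** Decodes the sample both ways and returns the charset with the higher score. */
    static Charset guess(byte[] sample, Charset first, Charset second) {
        double a = plainTextFraction(new String(sample, first));
        double b = plainTextFraction(new String(sample, second));
        return a >= b ? first : second;   // ties go to the first candidate
    }
}
```

You might call it as `EncodingStats.guess(sample, StandardCharsets.US_ASCII, Charset.forName("IBM1047"))`, assuming your JRE ships the IBM1047 charset (check with `Charset.isSupported`). Since Character.isLetter() also accepts accented letters, ASCII text misread as EBCDIC can still score moderately well, so it may be worth comparing `wordCount` for the two decodings as a second opinion.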