
How to find out the encoding of a text file in Unix Solaris

Mike Yu
Ranch Hand

Joined: Nov 17, 2001
Posts: 175
Hello,

How can I find out the encoding of a text file in Unix Solaris, that is, the encoding that was used when the file was created?

Regards,
Mike


Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    

Ask the person who was responsible for creating it.
Mike Yu
Ranch Hand

Joined: Nov 17, 2001
Posts: 175
Thanks, Paul, for taking the time to respond to my question.

But that is not what I am looking for. I am asking a technical question: I want to know whether Unix Solaris provides a utility or command for this job, or whether there are any scripts that can do it.

Regards
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

You normally can't tell if there is no meta-information, and plain text files normally don't have any.

I once wrote a script which relies on the reader knowing at least one word of the text; that known word helps decide which encoding it is.

It depends on sed, grep and iconv, which may all be available on Solaris as well.


For example, you take a quick look into the file and see "Begr??ung", and from the context you work out that it has to be "Begrüßung" (you will have very different examples in your language, I guess). Then you start the tool with that word as an argument; see the sketch below.

Contrary to what the usage message says, it needn't be an umlaut (mutated vowel).
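
Roughly, the idea looks like this (a minimal sketch, not the original script; the script name and the list of candidate encodings are illustrative, and iconv's encoding names vary between systems):

    #!/bin/sh
    # guessenc.sh -- guess the encoding of FILE from one known word.
    if [ $# -ne 2 ]; then
        echo "usage: $0 UMLAUT-WORD FILE" >&2
        exit 1
    fi
    word=$1
    file=$2
    # Try each candidate encoding: convert the file to UTF-8 and
    # check whether the known word appears in the converted text.
    for enc in UTF-8 ISO8859-1 ISO8859-15 CP1252 CP850; do
        if iconv -f "$enc" -t UTF-8 "$file" 2>/dev/null | grep -q "$word"; then
            echo "candidate encoding: $enc"
        fi
    done

Called as guessenc.sh Begrüßung report.txt, it prints the short list of encodings under which the known word actually shows up; usually that narrows things down to one or two candidates.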


Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 16065
    

In other words, you cannot determine the encoding, only deduce it. That's because there's considerable overlap on most code pages, so the only way to figure out if a specific encoding was used is to find a usage that only makes sense for that encoding.
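
You can at least rule encodings out cheaply, though: iconv fails on byte sequences that are invalid in the source encoding, so a round trip is a quick validity test (the file name here is just a placeholder, and this only proves the file could be UTF-8, not that it is):

    # Exit status 0 means every byte sequence in the file is
    # valid UTF-8; a non-zero status rules UTF-8 out entirely.
    iconv -f UTF-8 -t UTF-8 myfile.txt > /dev/null 2>&1 \
        && echo "could be UTF-8" \
        || echo "definitely not UTF-8"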

Note that we're using the word "encoding" here to mean character set encoding. If what you really meant was that you want to determine what type of file you're inspecting, there's a mechanism called "magic" that can be used to scan files for signatures and deduce the file type from that. For example, the hex sequence 0xCAFEBABE at the head of a file indicates a Java class file.
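
You can check such signatures yourself, or let the file(1) utility, which consults a whole database of them, do the work (the class file name is just an example):

    # Dump the first bytes in hex; a Java class file starts ca fe ba be.
    od -A x -t x1 Foo.class | head -1
    # file(1) matches the same bytes against its magic database.
    file Foo.class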


Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

Tim Holloway wrote:... there's considerable overlap on most code pages,

Yes, but maybe then it isn't a problem.

When I get Windows text files, there is normally a whole bunch of source encodings that would all produce the desired output, but if it doesn't make a difference, it doesn't make a difference (tautological proof).
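
For example, ISO8859-1 and CP1252 only disagree in the 0x80-0x9F byte range, so a file that avoids those bytes decodes identically under either (a quick demonstration; encoding names vary between iconv implementations, and 0xE9 is "é" in both tables):

    # The byte 0xE9 maps to 'é' under both encodings,
    # so both conversions print the same text: café
    printf 'caf\351\n' > sample.txt
    iconv -f ISO8859-1 -t UTF-8 sample.txt
    iconv -f CP1252 -t UTF-8 sample.txt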
Tim Holloway wrote:
Note that we're using the word "encoding" here to mean character set encoding.

From the original question, I have little doubt that we're talking about character encoding.
Tim Holloway
Saloon Keeper

Joined: Jun 25, 2001
Posts: 16065
    

Stefan Wagner wrote:
When I get Windows text files, there is normally a whole bunch of source encodings that would all produce the desired output, but if it doesn't make a difference, it doesn't make a difference (tautological proof).


Well, not necessarily. I used to work with a system that had originally used IBM mainframe data terminals (green screens) for data entry. The application was COBOL-based, and there was an inherent assumption that the data being entered and stored was going to be US-EBCDIC. Over the years, the green screens got replaced with IRMA (Windows 3270 emulation software) as PCs replaced the old mainframe "dumb terminals", and people started typing in characters that weren't available on the US 3270 terminal models (where even lower case was often an extra-cost option). Names like Alberto Peña, for example.

Then they started shipping the mainframe data to Java apps running on servers. Due to code page mismatches, the foreign character codes got translated into multi-character sequences by the Java converters, and we ended up with things like "Alberto Pen~a". That was just the start of our troubles, because this stuff was coming down in rows of fixed-length columns without delimiters, and suddenly some of the fields were no longer the expected size. Neither were the records themselves.

So suddenly it did make a difference, and a significant one at that.
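
The underlying mechanism is easy to demonstrate (a contrived sketch; encoding names vary between iconv implementations): the same byte reads as a different character under a different code page:

    # 0xF1 is 'ñ' in ISO8859-1 but 'ń' in ISO8859-2:
    # identical bytes, decoded with the wrong table.
    printf 'Pe\361a\n' | iconv -f ISO8859-1 -t UTF-8   # Peña
    printf 'Pe\361a\n' | iconv -f ISO8859-2 -t UTF-8   # Peńa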
Stefan Wagner
Ranch Hand

Joined: Jun 02, 2003
Posts: 1923

Tim Holloway wrote:... things like "Alberto Pen~a" ... rows of fixed-length columns

Ouch! I can feel the pain! ;)

 