test file encoding mess ?

gurpeet singh
Ranch Hand

Joined: Apr 04, 2012
Posts: 924

I have a text file containing semicolon-separated string values. When I checked the file's encoding, it turned out to be ANSI. Some of the strings contain foreign characters, which I think are French; for example, there is a 'u' with two dots over it. I'm using MySQL's LOAD DATA INFILE command to populate my database. After the load, some of the strings have question marks (?) in them.

I know this is an encoding issue. From what I read on the internet, everything should be in UTF-8, so I converted my database and table to UTF-8 and also saved the file in UTF-8 format. When I ran the LOAD DATA LOCAL INFILE query again, the problem got worse: the 'u with two dots' I mentioned earlier now has weird characters in its place. In short, the problem was not resolved. I read Joel's "absolute minimum every software developer should know" article, but I still do not know what to do.

I'm using JDBC for the database connection. The problem also happens when I run LOAD DATA LOCAL INFILE directly in the mysql client, without JDBC.

Please help: what can I do so that exactly the same data as in the text file ends up in MySQL?
Campbell Ritchie

Joined: Oct 13, 2005
Posts: 46437
Is the character you found ü? If so, that’s not French. German, more likely.
What’s ANSI encoding? I haven’t come across it. Did you mean ASCII? That isn’t ASCII because ü isn’t an ASCII character.
How do you know that a file sent across the net is in UTF-8? I agree with people who say to put everything on the net into UTF-8, but that doesn’t mean everybody else has seen that recommendation.
Joel Spolsky’s article is useful for reminding you that encodings cause problems and that you need to know which encoding to use. What he doesn’t tell you is that it is the responsibility of the provider of a file to ensure it is legible to users, not for users to work out how to read it.
  • 1: Find who provided the file and ask them for details.
  • 2: Try opening the file with a word processor. Many will try different encodings, or even give you a list of encodings to try.
  • 3: Write a little Java program which reads the text file and prints it, taking different encodings.
  • In the case of 2 and 3, see which encoding gives you a sensible output. It helps if you know what the file says before you try.
    Beware: I once tried reading some UTF-8 files as ISO-8859-1 and found no difference in the result. Characters in the ASCII range come out the same in both encodings, so a file containing only those will look right either way.
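For option 3, a minimal sketch of such a program might look like the following. It decodes the same bytes under a few candidate encodings so the results can be compared by eye. The class name `EncodingProbe`, the candidate list, and the default file name `data.txt` are illustrative assumptions, not anything taken from the original file.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

class EncodingProbe {

    // Candidate encodings to try. This list is an assumption; extend it as needed.
    static final String[] CANDIDATES = {"UTF-8", "ISO-8859-1", "windows-1252", "UTF-16"};

    /** Decodes the same raw bytes under each candidate encoding. */
    static Map<String, String> decodeAll(byte[] raw) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : CANDIDATES) {
            // new String(bytes, charset) does not throw on bad input; malformed
            // sequences are replaced with the charset's replacement character
            // (typically U+FFFD).
            results.put(name, new String(raw, Charset.forName(name)));
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        // "data.txt" is a hypothetical default; pass the real path as an argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "data.txt");
        byte[] raw = Files.readAllBytes(file);
        for (Map.Entry<String, String> entry : decodeAll(raw).entrySet()) {
            System.out.println("=== " + entry.getKey() + " ===");
            System.out.println(entry.getValue());
        }
    }
}
```

On Java 11 or later this can be run directly with `java EncodingProbe.java yourfile.txt`. Bear in mind that the console has an encoding of its own, so a character can print wrongly even when it was decoded correctly.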
gurpeet singh
Ranch Hand

Joined: Apr 04, 2012
Posts: 924

When I open the file in Notepad and try to save it, the encoding dialog box defaults to ANSI, which suggests the original encoding is ANSI.

Also, as you suggested, I tried opening the file in Microsoft Word. It gave me a dialog box for choosing the encoding, with a file preview. The default option, already selected for me, was Western European. The preview showed what I wanted, i.e. ü was shown as ü.

When I changed the encoding to UTF-8, it gave me ? instead of ü.

Doesn't UTF-8 contain all the characters in the universe? UTF-8 should show ü as ü, right?
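A small sketch may help with this question. UTF-8 can represent ü, but it uses different bytes for it than a Western European (windows-1252) file does, so the outcome depends on which encoding the bytes were written in and which one is used to read them back. The class and method names below are made up for illustration, and it is an assumption that "ANSI" here means windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UmlautBytes {

    // Assumption: the "ANSI" / Western European file is windows-1252.
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Formats bytes as space-separated uppercase hex, e.g. "C3 BC". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes text in one charset, then decodes the resulting bytes in another. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        String u = "\u00FC"; // the character ü

        System.out.println(hex(u.getBytes(CP1252)));                 // FC
        System.out.println(hex(u.getBytes(StandardCharsets.UTF_8))); // C3 BC

        // Written as windows-1252, read as UTF-8: a lone 0xFC byte is not valid
        // UTF-8, so the decoder substitutes the replacement character U+FFFD.
        String asUtf8 = reinterpret(u, CP1252, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(asUtf8.charAt(0)));   // fffd

        // Written as UTF-8, read as windows-1252: the two bytes C3 BC become
        // two separate characters, U+00C3 and U+00BC.
        String asCp1252 = reinterpret(u, StandardCharsets.UTF_8, CP1252);
        System.out.println(asCp1252.length());                       // 2
    }
}
```

The first mismatch would be consistent with the ? seen in the UTF-8 preview, and the second with the weird characters that appeared after the file was saved as UTF-8, although the sketch cannot confirm what actually happened on the server.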

Is this somehow related to the MySQL charset and collation settings? I do not know what they are, and I'm doing a bit of googling on that. So far I have found that many users are affected by this encoding mess, but I haven't found a proper solution yet. I have tried all the possible combinations on my MySQL server. I have changed the charset and collation settings to UTF-8 on my database and tables. However, I'm not able to change the charset setting for the server; it is still showing latin_swedish. Can that be the cause of the problem?
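On the charset question, one thing that may be worth trying: MySQL's LOAD DATA statement accepts a CHARACTER SET clause that tells the server how to interpret the bytes in the file, independently of the table's own charset. The sketch below builds such a statement and runs it over JDBC. The class name, the table name `people`, and the file name are hypothetical, and `latin1` is used on the assumption that the file really is Western European (MySQL's `latin1` corresponds closely to windows-1252).

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class LoadWithCharset {

    /**
     * Builds a LOAD DATA LOCAL INFILE statement with an explicit CHARACTER SET
     * clause. The path, table name, and charset are caller-supplied placeholders.
     */
    static String buildLoadSql(String filePath, String table, String mysqlCharset) {
        // Naive quoting, good enough for a sketch: escape backslashes and
        // single quotes so the path survives as a MySQL string literal.
        String quotedPath = filePath.replace("\\", "\\\\").replace("'", "\\'");
        return "LOAD DATA LOCAL INFILE '" + quotedPath + "'"
                + " INTO TABLE " + table
                + " CHARACTER SET " + mysqlCharset
                + " FIELDS TERMINATED BY ';'";
    }

    /** Runs the load on an existing connection and returns the driver's update count. */
    static int load(Connection conn, String filePath, String table, String mysqlCharset)
            throws SQLException {
        try (Statement st = conn.createStatement()) {
            return st.executeUpdate(buildLoadSql(filePath, table, mysqlCharset));
        }
    }

    public static void main(String[] args) {
        // Prints the statement only; "names.txt" and "people" are hypothetical.
        System.out.println(buildLoadSql("names.txt", "people", "latin1"));
    }
}
```

If the file has been re-saved as UTF-8, the clause would name `utf8mb4` (or `utf8`) instead. Depending on the driver and server versions, LOCAL loading may also have to be enabled, for example through Connector/J's `allowLoadLocalInfile` connection property and the server's `local_infile` variable.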