Difference in reading character encoding from command line and eclipse

 
imran sujoy
Greenhorn
Posts: 21
Hi,
I have a situation like this:
I am trying to read a properties file and convert it into XML. The properties file contains Unicode characters outside the ASCII range. My properties file contains:


I am using ANTLR4 to parse the properties file (I need the offsets of the values and the line numbers).

When I run the application by calling the Main class from Eclipse, all the Unicode characters (e.g. ü ص © ® °) are interpreted and written to the XML properly. But when I run the application from the Windows command line as a jar, ANTLR throws an error like:



Could anybody please help me out here? Is there any difference, with respect to character encoding, between calling a class directly and running the jar from the command line?
 
Campbell Ritchie
Marshal
Posts: 79177
Which encoding are you using for your file?
 
imran sujoy
Greenhorn
Posts: 21
utf-8
 
Campbell Ritchie
Marshal
Posts: 79177
And I presume you are specifying UTF-8 for your reader in the Java® code?
 
imran sujoy
Greenhorn
Posts: 21
Yes.
 
Campbell Ritchie
Marshal
Posts: 79177
Rather than printing to the console, show the text with the old-fashioned technique of JOptionPane#showMessageDialog. It is possible it has something to do with the command line, which on Windows has a restricted range of characters available. If that doesn't help, I don't know.

Anybody else?
 
imran sujoy
Greenhorn
Posts: 21
It's not about showing the characters in the console. The problem is that ANTLR cannot recognize the Unicode characters when the application is run from the console.

Just now I saw an interesting thing: if the same Unicode characters are placed in a JSON file and parsed using ANTLR4, there is no problem and all the Unicode characters are recognized.
 
Campbell Ritchie
Marshal
Posts: 79177
So you are not running ANTLR from Eclipse? Sorry, I thought that is what you are doing. I was getting confused.

Does ANTLR have any restriction on characters it recognises? Does the grammar you gave it permit those characters?
 
imran sujoy
Greenhorn
Posts: 21
Um, an ANTLR grammar is written to parse a certain kind of file. From the grammar file, the parser, lexer and listener files are generated. Those files are used to read the tokens and extract offsets, text etc. So ANTLR has nothing to do with Eclipse; it generates Java files from a grammar file, and those files can be used anywhere.

And no, ANTLR does not have any restriction in identifying Unicode characters.
 
Campbell Ritchie
Marshal
Posts: 79177
Assuming that means those characters are permitted by the lexer and parser files, sorry, don't know any more.

Anybody else?
 
Paul Clapham
Marshal
Posts: 28193
You said something about specifying UTF-8 in a Reader in some Java code. But now you're asking about ANTLR, so I'm confused. How does ANTLR know that the encoding of your file is UTF-8?
 
imran sujoy
Greenhorn
Posts: 21
OK, let me describe it further. In ANTLR I describe a grammar; for a properties file it can be something like: key, separator, value. Then I define what the key, separator and value are: they could be letters, numbers or special characters, and the separator could be a space, a colon or an equals sign. The more rigorous and exhaustive my grammar is, the more successful my parsing will be.

Now suppose the properties file contains a Greek letter. If I haven't defined Unicode characters as part of my tokens, ANTLR will fail to recognize the letter and throw an error. So I have included the Unicode ranges in what a key or a value may contain. Now I am reading a properties file which contains that Greek character. When doing that from Eclipse I am not facing any problem to "read" the file, but when I call the application from the console, those Greek characters are not read as UTF-8. I am not sure, but I suspect the characters are already passed to ANTLR as meaningless symbols, and hence the problem.
I don't know if I have made this clear enough.
 
Paul Clapham
Marshal
Posts: 28193

imran sujoy wrote: Now I am reading a properties file which contains that Greek character.



I'm going to assume that "I am reading..." in that statement means that some Java code that you wrote is reading the file. (In other words, yes, you could have made it clearer.) And if that works differently in different environments with respect to the encoding being used, then that's most likely because you are using the system's default encoding. Which in turn is most likely because you're not specifying any encoding, and allowing the system to use its default. Seeing some code would be very helpful in this case (and yes, that would make the question clearer too.)

However you seem to think that ANTLR has something to do with this problem, so maybe I'm wrong in my guess.
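
To see whether the default encoding really differs between the two environments, a small check along these lines could be run both from Eclipse and from the command line. This is only a sketch: the class name `EncodingCheck` and the helper `readAll` are made up for illustration and are not taken from the code discussed in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original poster's code.
class EncodingCheck {

    // Decodes the stream with the given charset instead of relying on the platform default.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // This value can differ between an IDE launch and a plain "java -jar" launch.
        System.out.println("Default charset: " + Charset.defaultCharset());

        String expected = "key=\u00fc"; // "key=ü"
        byte[] utf8Bytes = expected.getBytes(StandardCharsets.UTF_8);

        String explicit = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String implicit = readAll(new ByteArrayInputStream(utf8Bytes), Charset.defaultCharset());

        // Booleans are printed so the result does not depend on the console's own encoding.
        System.out.println("Explicit UTF-8 decodes correctly: " + explicit.equals(expected));
        System.out.println("Default charset decodes correctly: " + implicit.equals(expected));
    }
}
```

If the second line prints `false` only on the command line, the default encoding is the likely culprit. Starting the JVM with `-Dfile.encoding=UTF-8` is a common workaround, but passing the charset explicitly in the code is the more robust fix.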
 
imran sujoy
Greenhorn
Posts: 21
Sorry. I thought putting "reading" inside quotes would make it clear that it's really the code reading and not me as a person. Probably a non-native-speaker issue. And I would ask you to please read my first post, where the difference between the console and Eclipse is described; it's not an ANTLR issue.
 
Paul Clapham
Marshal
Posts: 28193
Okay, I re-read your first post which said

But when I run the application from the Windows command line as a jar, ANTLR throws an error like:



So ANTLR is throwing an exception. But it's not an ANTLR issue?

Okay, let's go on. It appears it's your code which is reading the properties file. I guess I wasn't clear enough in my last post where I was unsure about that. Sorry about that. Let me be clearer: Is your code reading the properties file? If so, let's see the code which does that.
 
imran sujoy
Greenhorn
Posts: 21
The code to read the file is:

The "file" is a java.io.File.

The charset passed to the method is read from the BOM or, if no BOM is present, defaults to UTF-8.

So ANTLR is throwing an exception. But it's not an ANTLR issue?



See, if I pass ANTLR a valid Unicode character, it can recognize it. But if I pass it garbage characters produced while reading the file, ANTLR can't help. So I believe it's not an ANTLR issue but rather how I am reading the file; somewhere I am losing the character encoding. As I mentioned, the same Unicode characters are recognized when I (the code) read the same file by running the Main class from Eclipse. Also, JSON does not give me trouble even when running from the terminal.
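
For readers following along: the charset selection described above (use the BOM if present, otherwise fall back to UTF-8) might look roughly like the sketch below. The framework code itself was not posted in this thread, so `BomSniffer` and `detect` are hypothetical names and the set of recognized BOMs is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; the real framework method was not shown in the thread.
class BomSniffer {

    // Returns the charset indicated by a leading byte order mark, or UTF-8 if none is found.
    static Charset detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;    // EF BB BF
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE; // FE FF
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE; // FF FE (a UTF-32LE BOM would also match here)
        }
        return StandardCharsets.UTF_8;        // no BOM: default to UTF-8
    }

    public static void main(String[] args) {
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'k', '=', 'v' };
        byte[] withoutBom = { 'k', '=', 'v' };
        System.out.println(detect(withBom));
        System.out.println(detect(withoutBom));
    }
}
```

Note that detecting the charset only helps if every later conversion between bytes and characters uses that same charset; a single call that falls back to the platform default undoes it.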
 
Paul Clapham
Marshal
Posts: 28193
It would be surprising (to me) if a properties file had a BOM, so assuming the code you haven't posted works the way you say it does, then (based on the name of the method) presumably the result of that method should be a String containing the data from the properties file. That's also assuming that the BOM you referred to is actually part of the properties file and not part of the XML file you alluded to, where it would be much more reasonable to expect a BOM. That's a whole lot of assumptions which may or may not be correct; it would be a lot easier to discuss this problem if you posted the code.

Have you examined that String to see if it works the way you think it should? (As already suggested earlier.)

And according to your original post, you have ANTLR parsing the properties file... does that mean that you pass the resulting String to ANTLR somehow, resulting in the error you described? Or if not, how does ANTLR get hold of the properties?

 
imran sujoy
Greenhorn
Posts: 21
I am sorry, which code I have not posted? I have posted the code showing how I read the file. Then I pass the content of the file to ANTLRInputStream. The BOM belongs to the file the code is reading; the XML is the output, and I am not reading a BOM from the XML.

This is the ANTLR code, if it helps, which reads character by character and tries to identify the tokens:

 
Paul Clapham
Marshal
Posts: 28193

imran sujoy wrote: I am sorry, which code I have not posted?



Among other code, the code which determines which encoding to use. However it's not necessary now that you've posted the code which uses the system default encoding:



Here you convert the String to bytes using the system default encoding, which appears to be UTF-8 in Eclipse and something else in the command line. There's a version of the getBytes() method where you can specify the encoding.

I have to say I don't understand why you create a (correct) Reader over the properties file, then create a String from that, then convert that String back to an array of bytes, then create a second Reader over that array of bytes. Why can't ANTLR just be given the first Reader?
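
The effect of that String-to-bytes-to-chars round trip can be reproduced in isolation. The sketch below is an illustration only: `RoundTrip` is a made-up class, and ISO-8859-1 stands in for whatever non-UTF-8 default the Windows command line happened to use, which the thread does not state.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class, not the original poster's code.
class RoundTrip {

    // Encodes the text to bytes with one charset and decodes those bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith); // explicit-charset overload of getBytes()
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u00fc \u00a9 \u00ae \u00b0"; // "ü © ® °"

        // Mismatch: bytes produced with a non-UTF-8 charset, then decoded as UTF-8.
        String damaged = roundTrip(original, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        // Consistent: the same charset on both sides.
        String intact = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        System.out.println("Mismatched charsets preserve the text: " + damaged.equals(original));
        System.out.println("Matching charsets preserve the text: " + intact.equals(original));
    }
}
```

A lexer fed the damaged text sees characters that were never in the file, which would explain token recognition errors that appear only in one environment.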
 
imran sujoy
Greenhorn
Posts: 21

I have to say I don't understand why you create a (correct) Reader over the properties file, then create a String from that, then convert that String back to an array of bytes, then create a second Reader over that array of bytes. Why can't ANTLR just be given the first Reader?



Because, first of all, the file reading is in an existing framework, and the output content is used by different clients. So I don't have much control over that. The file reading is part of the framework and the ANTLR job is part of a client.
Secondly, I could have passed the content directly to the ANTLRInputStream like this:


And that was what I was doing. Later I thought of identifying the character encoding from the input stream, and so I used the reader.

But one mysterious thing: just this morning, when I tried the code from the command line again, there were no more ANTLR token recognition errors. I really don't know how it got solved, but for now it's working fine.
 
Paul Clapham
Marshal
Posts: 28193
Here's another thing: ANTLR wants a Reader; you have a String; the simplest thing would be to use a StringReader rather than converting the String to bytes and then back to chars again.
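
A minimal sketch of the StringReader suggestion, using only the JDK so that it runs without the ANTLR jar. `StringReaderDemo` and `drain` are illustrative names. To my knowledge ANTLR 4's `ANTLRInputStream` has constructors that accept a `Reader` or a `String` directly, but check that against the runtime version in use.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demonstration class; stands in for whatever consumes the Reader.
class StringReaderDemo {

    // Reads every character from the reader, the way a Reader-based consumer would.
    static String drain(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        int n;
        try {
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String content = "name=\u00fc \u0635"; // "name=ü ص", already decoded characters

        // No bytes and no charset are involved, so nothing can be mis-decoded here.
        try (Reader reader = new StringReader(content)) {
            System.out.println("Characters survive unchanged: " + drain(reader).equals(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because a String already holds characters rather than bytes, this path never touches the platform default encoding.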