Passing UTF-8 strings to Jython PythonInterpreter exec function is not working
posted 3 years ago
I am trying to use Jython's org.python.util.PythonInterpreter to execute some Python code from within Java.
The input Python string comes from an external source and so may contain non-ASCII text, for example Japanese that arrived as UTF-8 or SJIS.
For Japanese input, the output from Python is always ??? instead of any meaningful characters. It is not a problem with print: I wrote Python code to print the exact hex dump, and it was 0x3F 0x3F 0x3F.
Printing the String in Java gives correct output.
The Python code also works correctly.
In short, I need to get the Japanese characters to print in Python when passed in from Java via PythonInterpreter.
Warning - this response presents a frig that should not be used unless absolutely necessary, and I hate it. Those of you of a sickly or nervous disposition, please stop reading now.
The problem seems to be that exec() passes the content of the Japanese string to the Python interpreter as bytes created using one of the single-byte character encodings, but then uses the bytes as if they were UTF-8 bytes. The frig is to re-encode the string on the Java side before handing it to exec().
What this does is get the bytes of the string using UTF-8 and then treat them as the bytes of a string encoded as ISO-8859-1!!! From experience I know that ISO-8859-1 maps all 256 byte values to and from characters without loss.
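The round trip described above can be sketched in plain Java as follows. This is only a minimal illustration and not necessarily the code originally posted: the class name Utf8AsLatin1 and its helper methods are hypothetical, and Jython itself is left out so that the sketch runs with nothing but the JDK.

```java
import java.nio.charset.StandardCharsets;

public class Utf8AsLatin1 {

    // Hypothetical helper: encode the text as UTF-8, then build a String with
    // one char (0x00-0xFF) per byte by decoding those bytes as ISO-8859-1.
    static String utf8BytesAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Inverse of the helper above, included only to show that nothing is lost.
    static String latin1AsUtf8(String smuggled) {
        byte[] raw = smuggled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Render bytes as space-separated upper-case hex for inspection.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes so that the encoding
        // of this source file does not matter.
        String japanese = "\u65e5\u672c\u8a9e";

        // Encoding directly to a single-byte charset replaces each
        // unmappable character with '?' (0x3F).
        System.out.println(hex(japanese.getBytes(StandardCharsets.ISO_8859_1))); // 3F 3F 3F

        // The frig: every char of the result carries exactly one UTF-8 byte.
        String smuggled = utf8BytesAsLatin1(japanese);
        System.out.println(smuggled.length()); // 9 (three bytes per character)

        // Reversing the steps recovers the original text.
        System.out.println(latin1AsUtf8(smuggled).equals(japanese)); // true

        // Assumption: `smuggled` is the string that would be placed in the
        // Python source passed to PythonInterpreter.exec(String).
    }
}
```

The first line of output also shows where the 0x3F 0x3F 0x3F in the question is likely to come from: a single-byte encoding has no mapping for the Japanese characters, so each one is replaced by '?'. Whether the Python side then needs an explicit decode of the UTF-8 bytes probably depends on the Jython version in use.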
I don't know enough about the PythonInterpreter class, but on the surface it seems flawed when it comes to character encoding. There has to be a better way of dealing with this.