
Broken diacritics reading Clob from OracleDB via JDBC

 
Jiri Nejedly
Ranch Hand
Posts: 127
Hello,
I have this problem:

- I have an Oracle DB (12c, Release 12.2.0.1.0 - 64bit)
- I have a table with a Clob column
- There's a problem reading data from this Clob column via JDBC. If the data contains characters with diacritics, the chars are read wrong (cannot get ascii code; I see only a rectangle or a question mark as the char). I tried other tables; the Clob type is the problem, not a particular table.
- The problem is only in the read operation; the data is written correctly. I've proven this by reading it via ODBC - the diacritics are OK there.
- We encountered this problem first in Oracle's SQL Developer, which is based on JDBC too, and reproduced it with a small piece of code written in Java (SQL insert & select)
- The problem is with Clob only (other tables' varchar2 columns with diacritics are read OK) and on that particular database only. We tried other 12c databases without problems.
- The Java version is 1.8.0_271, the JDBC driver is the latest ojdbc8.jar

I'm stuck. It seems to me that the DB must be the problem, but I don't know where to go from here. Thanks
 
Mike Simmons
Master Rancher
Posts: 5060

Jirka Nejedly wrote: If the data contains characters with diacritics, the chars are read wrong (cannot get ascii code, ...)


This sounds suspicious - if there are diacritics, "ASCII" is probably not what you should be using.  Are you using the getAsciiStream() method?  That would be a bad idea - definitely try using getCharacterStream() instead.  That's true for any character data really.  
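For illustration, a minimal sketch of the character-stream route (connection details are placeholders; the table and column names are the ones from the test table shown later in this thread):

import java.io.Reader;
import java.sql.Clob;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ClobReadSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT CLB FROM TESTC")) {
            while (rs.next()) {
                Clob clob = rs.getClob(1);
                // getAsciiStream() would narrow every char to 7-bit ASCII -- diacritics are destroyed
                StringBuilder sb = new StringBuilder();
                try (Reader r = clob.getCharacterStream()) {  // char-based access, no narrowing
                    char[] buf = new char[4096];
                    for (int n; (n = r.read(buf)) != -1; ) {
                        sb.append(buf, 0, n);
                    }
                }
                System.out.println(sb);
            }
        }
    }
}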

If that comment doesn't solve it for you, please show what code you're using to read the CLOB.  Good luck...
 
Tim Holloway
Saloon Keeper
Posts: 28319
It's important to realize that Java is not an "ASCII"-based language. Objects of type java.lang.String are Unicode. ASCII was a 7-bit code and did not support any characters with diacritics. Or more precisely, USASCII didn't, and that's what most of us mean when we say "ASCII". When stored/transmitted in 8-bit form, the 8th bit was either used for parity checking or ignored, depending on the hardware.

With the advent of the IBM PC, 8-bit "extended ASCII" variants appeared, and the 8th bit allowed for characters with diacritics.

But again, Java works with Unicode, and a Java character maps into a much larger code space. Big enough to define all sorts of characters, including Russian (Cyrillic), Arabic, Chinese, Japanese, even hieroglyphics and cuneiform. Also, alas, emojis.

So far, so good, but when storing extended character set strings in a database, then retrieving and displaying them, there are additional considerations.

The first is what code page the string or CLOB is defined under. That can affect the collating (sorting) order of strings. For example, for much of its lifespan, the MySQL database defaulted to a Latin-1 character set with a Swedish collation. I have a recipe database app that got bitten by that, since my recipes aren't in Swedish!

But matching up code pages is only part of the battle. To display the proper characters, you also need a corresponding display font. If you're not using a font that maps to your code page, then you'll either get the wrong character displayed or get the "non-character" character (usually a box, "?" or "." or sometimes just blank or missing).

So everything needs to match up.

These days most systems, both Windows and Unix/Linux, default to UTF-8 as their standard encoding. UTF-8 is a compact encoding of Unicode.
 
Jiri Nejedly
Ranch Hand
Posts: 127
I could paste the Java code for reading and writing the Clob here, but Oracle's SQL Developer has the same issue, while an app with an ODBC connection does not. Therefore I think the read operation between the database and JDBC is the problem. I prepared a simple picture describing the problem. The table reproducing this is simple:

CREATE TABLE TESTC (CLB CLOB);

...and every database we use has the same encoding: NLS_CHARACTERSET = EE8MSWIN1250
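For reference, the write side of such an experiment might look like this (a sketch; connection details are placeholders and the test string is just an arbitrary Czech phrase with diacritics):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ClobInsertSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password")) {
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO TESTC (CLB) VALUES (?)")) {
                // setString lets the driver handle conversion to the DB character set
                ps.setString(1, "příliš žluťoučký kůň");  // sample Czech text with diacritics
                ps.executeUpdate();
            }
            con.commit();
        }
    }
}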

 
Tim Holloway
Saloon Keeper
Posts: 28319
More likely, you're using different apps and they are using different display fonts.

Before everyone moved to UTF-8, Windows used code pages like CP-1250, which is what EE8MSWIN1250 corresponds to.

Of course, the other issue here is that JDBC-ODBC hasn't been a valid JDBC driver for a very, very long time, so I think you're working with an out-of-date setup.
 
Jiri Nejedly
Ranch Hand
Posts: 127
It is not the font, I'm sure - see the proof below. Here in the Czech Republic we always install Windows with Czech language support, which is cp1250; therefore we install the database at our customers as EE8MSWIN1250. Our application doesn't need any foreign non-Czech characters, so there's no need for UTF-8. I know that Java uses UTF8 internally, but it shouldn't matter what the DB encoding is - JDBC should extract the data regardless of the DB encoding. It is time to show how I read the Clob in Java:
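(The original code screenshot is not reproduced here; the following is a reconstructed sketch of that kind of read loop - an assumption, consistent with the later observation in this thread that the code drops line separators.)

import java.io.BufferedReader;
import java.sql.Clob;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ClobReadViaReadLine {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT CLB FROM TESTC")) {
            while (rs.next()) {
                Clob clob = rs.getClob(1);
                StringBuilder sb = new StringBuilder();
                try (BufferedReader br = new BufferedReader(clob.getCharacterStream())) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        sb.append(line);  // note: line separators are silently dropped here
                    }
                }
                System.out.println(sb);
            }
        }
    }
}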


And this is the proof that the font is OK: I added a varchar2 column to the test table and put the same string with diacritics into it; it is read OK, whereas the clob is not. I repeat - SQL Developer is Oracle's product and connects to the DB via JDBC.



 
Tim Holloway
Saloon Keeper
Posts: 28319
Java does not use UTF-8 internally. It uses Unicode and only Unicode for Strings. UTF-8 is an external representation of Unicode that allows for eliminating redundant bytes.

I don't think Windows 10 or 11 operates in CP1250, regardless of what language you use. CP1252 was the corresponding standard for USA Windows, but newer versions of Windows all use UTF-8. And code pages refer to the binary values of character codes ("code points"), not graphical character rendering (fonts).

To be certain, I would do a hex dump of these "Strings" that you're getting from JDBC and see if their code points match Unicode encoding or CP1250 encoding.
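A minimal sketch of such a dump (a hypothetical helper; pass it the String read from the CLOB):

public class HexDump {
    // prints each char of the string as a four-digit hex code unit
    static void dump(String s) {
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("%04X ", (int) s.charAt(i));
        }
        System.out.println();
    }

    public static void main(String[] args) {
        // a proper Unicode String prints 017E for 'ž'; if CP1250 bytes were jammed
        // in unconverted, the same position would likely show 009E instead
        dump("žluťoučký");
    }
}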

Once you know that, check the app you're using to make sure it has been set to use a font compatible with your code page.

Although, again, Java Strings SHOULD be in Unicode form, so if the database isn't returning Unicode, then you'll need to read its string values as byte arrays and use a code-translating String constructor to correct that.
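A sketch of that kind of repair, assuming the raw bytes really are CP1250 (Java's name for that code page is "windows-1250"):

import java.nio.charset.Charset;

public class Cp1250Repair {
    public static void main(String[] args) {
        // sample raw bytes as CP1250 encodes "žlu" (an assumption for illustration)
        byte[] raw = { (byte) 0x9E, 0x6C, 0x75 };
        // new String(raw) alone would use the platform default charset and may mangle the text
        String repaired = new String(raw, Charset.forName("windows-1250"));
        System.out.println(repaired);  // prints: žlu
    }
}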
 
Mike Simmons
Master Rancher
Posts: 5060

Tim Holloway wrote:Java does not use UTF-8 internally. It uses Unicode and only Unicode for Strings. UTF-8 is an external representation of Unicode that allows for eliminating redundant bytes.


This gets a bit messy.  UTF-8 is part of Unicode, as is UTF-16 which is the one used by String, according to its own JavaDoc.  Historically and in JDK 8 (which Jiri is using), this is implemented with a char[] array to store the data inside the String class.  But if you look at a more modern JDK you will see that internal to the String class it uses a byte[] array, encoded using either UTF-16 or Latin-1 - the latter being more compact as long as you don't need any more exotic characters.  Who knows what else they might do in the future.  Additionally, Java has always used UTF-8 as part of its class file format.  So it depends where you look.  For purposes of this discussion though, yes in the Strings in memory in the JVM it's using UTF-16, a char[] array.

Other than that, I fully agree with the rest of the post, particularly:

Tim Holloway wrote:To be certain, I would do a hex dump of these "Strings" that you're getting from JDBC and see if their code points match Unicode encoding or CP1250 encoding.



I was hoping to see a bug in the code that reads the CLOBs, but now that Jiri has shown it, it looks good. Well, it does have the effect of removing any and all line separators from the string - a minor bug, but one that doesn't account for the encoding issues Jiri describes.
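Incidentally, for CLOBs that fit in memory there is a simpler, separator-preserving alternative (a sketch using the standard java.sql.Clob API):

import java.sql.Clob;
import java.sql.SQLException;

public class ClobWhole {
    // pulls the entire CLOB in one call; JDBC Lob positions are 1-based
    static String readAll(Clob clob) throws SQLException {
        return clob.getSubString(1, (int) clob.length());  // keeps line separators intact
    }
}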

Jiri, you might want to make sure you have the latest compatible JDBC driver:

https://www.oracle.com/th/database/technologies/appdev/jdbc-downloads.html

For your version, I think that's here:

https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/12.2.0.1/
 
Tim Holloway
Saloon Keeper
Posts: 28319

Mike Simmons wrote:
This gets a bit messy.  UTF-8 is part of Unicode, as is UTF-16 which is the one used by String, according to its own JavaDoc.  Historically and in JDK 8 (which Jiri is using), this is implemented with a char[] array to store the data inside the String class.  But if you look at a more modern JDK you will see that internal to the String class it uses a byte[] array, encoded using either UTF-16 or Latin-1 - the latter being more compact as long as you don't need any more exotic characters.  



That's of no consequence here. Yes, the original implementations of String were not the most capable, but regardless of how they store data internally - whether as byte sequences or as Lempel-Ziv encoded binary data - String is a "black box" that always presents Unicode characters. When you fetch a character (charAt()) from a java.lang.String, you receive a char value, which is a Unicode (UTF-16) code unit. Again, how the internal code point values are encoded is opaque to the application programmer. And the internal changes to String broke no existing Java code, which was a critical constraint on the internal design.

In short, Java can only accurately hold CP1250 strings in byte arrays, never in any Java char or String form. Treating them as strings or chars requires code page conversion; jamming them in by brute force will have undesirable consequences.
 
Tim Holloway
Saloon Keeper
Posts: 28319
Also, as I've mentioned (as have others), the JDBC-ODBC bridge driver is deprecated. I'm actually surprised that it had been updated as recently as 2020. It was never a good option, though, as it isn't thread-safe, and that's an essential requirement for multi-threaded systems such as web applications. That's also the reason why FoxPro and MS-Access can't be used to back webapps - they're designed to be single-user databases.
 
Mike Simmons
Master Rancher
Posts: 5060
I agree that the internal details don't really matter for this; not sure why you brought them up.  

Oracle deprecated the JDBC-ODBC driver that was part of the JDK as of JDK 8, I believe. It looks like they continued to maintain one elsewhere, though, for years after that. Yes, the whole stack is quite old and untrustworthy at this point. It still seems it could be worth trying the latest supported version that is available.
 
Jiri Nejedly
Ranch Hand
Posts: 127
This is another development in this case:

- We discovered a new customer with this problem, so now we have two.
- We tried two versions of SQL Developer and found out that it is distributed as a directory containing JDBC drivers and a complete Java JRE & JDK. Both versions, including the newest, have the same (incorrect) behaviour.

PS: in my Java experiments I used Java 8 and ojdbc8.jar - no change.

I tried to extract the clob as binary. The result does not match the correct output obtained from the non-problematic databases - and even the non-UTF-8 reading is not correct; see the 0D char at the beginning and at the end.
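One way to inspect the stored bytes server-side is Oracle's DUMP function (a sketch, an assumption about the kind of check involved; format 1016 means hex plus the character set name, and DBMS_LOB.SUBSTR first converts a slice of the CLOB to VARCHAR2):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ClobDumpSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT DUMP(DBMS_LOB.SUBSTR(CLB, 2000, 1), 1016) FROM TESTC")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));  // e.g. "Typ=1 Len=3 CharacterSet=...: 9e,6c,75"
            }
        }
    }
}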

I think all this has narrowed the problem down to the database only. We intend to file a Technical Service Request with Oracle support.

We are also thinking of some workaround using the dbms_lob package.
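For instance, a sketch of such a workaround (DBMS_LOB.SUBSTR returns the slice as VARCHAR2, which this thread has shown the driver handles correctly; note that plain SQL caps it at 4000 bytes per call):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DbmsLobWorkaround {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             Statement st = con.createStatement();
             // DBMS_LOB.SUBSTR(lob, amount, offset) -- the value then travels through
             // the ordinary varchar2 path instead of the broken Clob path
             ResultSet rs = st.executeQuery(
                     "SELECT DBMS_LOB.SUBSTR(CLB, 4000, 1) FROM TESTC")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}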

Regards, Jiri



 
Tim Holloway
Saloon Keeper
Posts: 28319
Try temporarily reading the object as a Blob. You can use Blob's getBytes() to get a byte array that holds the data exactly as it came from the database, so you won't have to worry about Java doing unexpected transformations to it.
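A sketch of that attempt (Blob.getBytes() positions are 1-based; as the next post shows, the Oracle driver rejects this particular conversion):

import java.sql.Blob;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ClobAsBlobAttempt {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT CLB FROM TESTC")) {
            while (rs.next()) {
                Blob blob = rs.getBlob(1);  // with ojdbc this throws: getBLOB not implemented
                byte[] raw = blob.getBytes(1, (int) blob.length());
                for (byte b : raw) {
                    System.out.printf("%02X ", b);  // hex dump of the raw database bytes
                }
                System.out.println();
            }
        }
    }
}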

I tried your "correct" hex and couldn't get a workable rendition from any code page; that includes UTF, ISO8859, and CP125x. About the only thing I didn't try was IBM EBCDIC, which would have been fairly obvious to me.

"0D" if ASCII or Unicode would be the Carriage Return control code (CR). But I don't think that's what you have here.
 
Jiri Nejedly
Ranch Hand
Posts: 127
Cannot read the Clob as a Blob - it causes "Invalid column type: getBLOB not implemented for class oracle.jdbc.driver.T4CClobAccessor".
 
Jiri Nejedly
Ranch Hand
Posts: 127
Finally found a solution - we googled and found that it is an Oracle database bug. Patch 27002286 must be applied.
 
Tim Holloway
Saloon Keeper
Posts: 28319
So Oracle was broken.

It's always hardest when the tools you trust turn out to be the problem.

Incidentally, I'm pretty sure that there is a way to read that Clob as a Blob, but it has been a very, very long time since I worked with Oracle. It obviously doesn't matter here, though.

Thanks for sharing the solution. And for your hard work and patience, have a cow!
 