Parsing a BLOB XML object

 
Greenhorn
Posts: 4
Hi,
I am having a really strange problem. I am fetching a database BLOB containing XML documents and then parsing them. The XML contains some UTF-8 encoded characters, and when I read the XML from the BLOB, these characters lose their encoding. I have tried several things, but I have not been able to retain the UTF-8 encoding. The characters causing the problem are mainly double quotes, inverted commas, and apostrophes. I am attaching the code below, and you can see some of the things I ended up trying. What else can I try? I am using the JAXP parser, but I don't think changing the parser will help, because here I am storing the XML file exactly as I get it from the database, and it is already corrupted at that very first stage, where I have to retain the UTF-8 encoding. I tried to get the encoding info from the XML and it tells me cp1252. Where did that come into the picture? I couldn't convert it back to UTF-8.
Even temp.xml itself ends up corrupted. I have spent about 3 days on this issue. Help needed!


ResultSet rs = null;
Statement stmt = null;
Connection connection = null;
InputStream inputStream = null;

long cifElementId = -1;
//Blob xmlData = null;
BLOB xmlData = null;
String xmlText = null;

RubricBean rubricBean = null;
ArrayList arrayBean = new ArrayList();

rs = stmt.executeQuery(strQuery);

// Iterate while the result set has data
while (rs.next()) {

    rubricBean = new RubricBean();

    cifElementId = rs.getLong("CIF_ELEMENT_ID");

    // Get the XML data, which is stored as a BLOB
    xmlData = (oracle.sql.BLOB) rs.getBlob("XML");

    // Get an input stream over the BLOB data
    inputStream = (InputStream) xmlData.getBinaryStream();

    // Read the input stream into an array of bytes
    byte[] bytes = new byte[(int) xmlData.length()];
    inputStream.read(bytes);

    // Build a String from the byte array
    xmlText = new String(bytes);
    // xmlText = new String(szTemp.getBytes("UTF-8"));
    // xmlText = convertToUTF(xmlText);

    // Write to a temp file
    File file = new File("C:\\temp.xml");
    file.createNewFile();
    java.io.BufferedWriter out = new java.io.BufferedWriter(new java.io.FileWriter(file));
    out.write(xmlText);
    out.close();
}
 
Author and all-around good cowpoke
Posts: 13078

double quotes, inverted commas, and apostrophe



If those characters come from MS Word files, they are NOT legal Unicode and will always cause XML parser errors. You will need to filter them out and replace them with legal characters. MS Word "smart" punctuation has caused a lot of people a lot of trouble in XML.
Bill
 
Ranch Hand
Posts: 221

Originally posted by William Brogden:


If those characters come from MS Word files, they are NOT legal Unicode and will always cause XML parser errors. You will need to filter them out and replace them with legal characters. MS Word "smart" punctuation has caused a lot of people a lot of trouble in XML.
Bill



I can vouch for that
 
achal madkan
Greenhorn
Posts: 4
No, I am not reading it from any Word document.
My goal here is not to save the XML from the BLOB to a file. I have to fetch the XML from the BLOB and parse it, and while parsing it I have to modify some of its elements. Before parsing, I just wanted to test whether the XML's integrity and encoding are retained. Even before operating on it, I am getting XML that has lost its encoding. If it is getting corrupted at that very first stage, there is no use in parsing such an XML, modifying it, and uploading it back to the database.
When uploading it back to the database, I need the XML to be the same as I downloaded it, but with one element node removed.
I am using the JAXP parser.
 
Wanderer
Posts: 18671
From what you've said, what's in the database is encoded in Cp1252, and the XML identifies it as such. How did it get to be Cp1252? I don't know, but it was done by whoever generated the content in the first place. Presumably they were using a Microsoft product, and that's the encoding it used by default. Now the problem is, when you read this stuff out of the database, you're (implicitly or explicitly) assuming it's in a different encoding.

The call new String(bytes) in your code uses your platform default encoding. I don't know what that is, but based on your problems, it's probably not Cp1252.
One solution is to explicitly use Cp1252 when you build the String, e.g. new String(bytes, "Cp1252").
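A minimal sketch of that explicit decoding, assuming the BLOB really does hold Cp1252 bytes. The class and method names are made up for illustration, and the sample bytes stand in for what would come out of the database:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252Decode {
    // "Cp1252" is the historical Java name; "windows-1252" is the IANA name
    // for the same charset. It is not one of the guaranteed StandardCharsets,
    // but common JDKs ship it.
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Decodes the bytes as Cp1252 instead of relying on the platform default. */
    static String decode(byte[] bytes) {
        return new String(bytes, CP1252);
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the "smart" double quotes in Cp1252.
        byte[] raw = { (byte) 0x93, 'h', 'i', (byte) 0x94 };

        String asCp1252 = decode(raw);
        String asUtf8 = new String(raw, StandardCharsets.UTF_8);

        // Decoded as Cp1252, the quotes become U+201C and U+201D.
        System.out.println(asCp1252.equals("\u201Chi\u201D"));
        // Decoded as UTF-8, the same bytes are malformed and become U+FFFD.
        System.out.println(asUtf8.indexOf('\uFFFD') >= 0);
    }
}
```

The same Charset object can also be handed to an InputStreamReader if you would rather read characters than build one large String.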

A better solution is to let the parser do this for you. The XML identifies what encoding it's using right at the beginning, and a good parser interprets this info and uses it to parse the rest of the document. Note that SAXParser has several parse() methods that take an InputStream as a parameter. This is exactly what getBinaryStream() is giving you. Just pass that stream to the parse() method, and don't worry about trying to convert it to a byte[] or String. The parser does that for you. And since it looks up the encoding from the XML, it should use the correct encoding. (Assuming the XML is valid to begin with, but from what you've said so far there's no reason to think that's not the case.)
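As a rough sketch of handing the stream straight to the parser: the class name, the sample document, and the "windows-1252" declaration are assumptions for illustration, and the ByteArrayInputStream stands in for what xmlData.getBinaryStream() would return.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class SaxFromStream {
    /**
     * Parses the stream and returns all character data it contains.
     * No charset is passed in: the parser reads it from the XML declaration.
     */
    static String textOf(InputStream in) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document, encoded the way its declaration claims.
        String xml = "<?xml version=\"1.0\" encoding=\"windows-1252\"?>"
                + "<note>\u201Chi\u201D</note>";
        byte[] raw = xml.getBytes(Charset.forName("windows-1252"));

        // Stand-in for the BLOB's binary stream.
        try (InputStream in = new ByteArrayInputStream(raw)) {
            System.out.println(textOf(in).equals("\u201Chi\u201D"));
        }
    }
}
```

Because the bytes never pass through a String or a Reader on the way in, the platform default encoding does not get a chance to interfere.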

Incidentally, if you ever need to read from a stream again, don't rely on a single call like inputStream.read(bytes) to fill the array.

If you read the API for read(), you see there's no guarantee that all the bytes will be read at once. It might only partly fill the array. This can happen because, for one reason or another, the system couldn't efficiently get all the bytes at once, and the API is designed to let you get a partial result quickly (if that's what you want) rather than forcing you to wait for the whole thing. The problem is, often we really want the whole thing. So you need some sort of loop that keeps calling read() until the byte[] is completely filled or the stream ends.
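One way to write such a loop, as a minimal sketch. The class names are made up, and Trickle is a hypothetical test stream that returns at most three bytes per call to imitate a slow source:

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class ReadFully {
    /** Fills buf completely from in, or throws if the stream ends early. */
    static void readFully(InputStream in, byte[] buf) throws IOException {
        int offset = 0;
        while (offset < buf.length) {
            // read() may return fewer bytes than requested, so keep asking
            // for the remainder until the array is full.
            int n = in.read(buf, offset, buf.length - offset);
            if (n < 0) {
                throw new EOFException("stream ended after " + offset
                        + " of " + buf.length + " bytes");
            }
            offset += n;
        }
    }

    /** Hypothetical stream that hands out at most 3 bytes per read call. */
    static class Trickle extends ByteArrayInputStream {
        Trickle(byte[] data) {
            super(data);
        }

        @Override
        public synchronized int read(byte[] b, int off, int len) {
            return super.read(b, off, Math.min(len, 3));
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] src = new byte[10];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) i;
        }

        // A single read() only partly fills the array.
        byte[] once = new byte[10];
        System.out.println("single read returned " + new Trickle(src).read(once));

        // The loop fills it completely.
        byte[] all = new byte[10];
        readFully(new Trickle(src), all);
        System.out.println(Arrays.equals(src, all));
    }
}
```

On Java 9 and later, InputStream.readAllBytes() and readNBytes() do this looping for you.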

Now if you use the SAXParser as I suggested, you won't need to do this anyway. But for the future, it's important to note that many of the read() methods do not guarantee they will read the contents completely, and so it's usually necessary to put them in a loop of some kind.
[ March 19, 2005: Message edited by: Jim Yingst ]
 
achal madkan
Greenhorn
Posts: 4
Thanks a lot Jim,
I am presently using the DOM parser, and what you told me about reading the bytes was new to me. I will try working in the direction you suggested and let you know if I run into any difficulties.
Thanks again!
-achal
 
Jim Yingst
Wanderer
Posts: 18671
Well, a DocumentBuilder also has parse() methods that accept an InputStream. So you can do the same sort of thing with DOM that I described for SAX.
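A rough sketch of the DOM version, covering the round trip described earlier in the thread (parse, remove one element, write the bytes back). The class name, the element names rubric, obsolete and title, and the sample document are all hypothetical. Reusing the declared encoding on output is one possible choice, not something the thread prescribes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class DomRoundTrip {
    /** Parses straight from the stream; the parser picks up the declared encoding. */
    static Document parse(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(in);
    }

    /** Removes the first element with the given tag name, if there is one. */
    static void removeFirst(Document doc, String tagName) {
        Node node = doc.getElementsByTagName(tagName).item(0);
        if (node != null) {
            node.getParentNode().removeChild(node);
        }
    }

    /** Serializes to bytes, reusing the encoding the document declared (UTF-8 if none). */
    static byte[] serialize(Document doc) throws Exception {
        String encoding = doc.getXmlEncoding() != null ? doc.getXmlEncoding() : "UTF-8";
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"windows-1252\"?>"
                + "<rubric><obsolete>x</obsolete><title>\u201Chi\u201D</title></rubric>";
        byte[] raw = xml.getBytes(Charset.forName("windows-1252"));

        Document doc = parse(new ByteArrayInputStream(raw));
        removeFirst(doc, "obsolete");

        // Re-parse the serialized bytes to check that the text survived.
        Document again = parse(new ByteArrayInputStream(serialize(doc)));
        System.out.println(again.getDocumentElement().getTextContent().equals("\u201Chi\u201D"));
    }
}
```

The byte[] returned by serialize() is what would be written back to the BLOB. Writing bytes rather than a String keeps the platform default encoding out of the picture on the way out as well.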
[ March 20, 2005: Message edited by: Jim Yingst ]
 
achal madkan
Greenhorn
Posts: 4
Thanks again Jim,
Well, I have changed my parser from Oracle's parser to XercesJ, and it retains the UTF-8 encoding. Now I am able to successfully upload the XMLs back again.
-achal
 