
extraneous characters getting introduced in XML

Nishant Kashyap
Greenhorn

Joined: Jun 25, 2009
Posts: 4
Hi,
I am facing an issue where extraneous characters are being introduced into the XML that I create through Java code.

In the logs I can see this:

<hrxml:Resume xmlns:hrxml="http://ns.hr-xml.org">
............................................................................................
......................................................................................
<hrxml:Description>Currently serve as a team lead on a multiple phased implementation.Organize training in all modules and business processes to facilitate knowledge transfer and usage of applications. Create security matrices as well as build and test security as per client’s requirements. Provide in-depth analysis and diagnosis of application deficiencies to create opportunities for product enhancements.
</hrxml:Description>


The encoding used is UTF-8.
I cannot find anything in the code that would insert these characters, so my guess is that it is an encoding or XML-related issue.
As far as I can tell, it sometimes inserts extra characters in place of apostrophes (') and bullets.


Please help me out guys.

Thanks,
Nishant
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
    
What is the source of the text that appears in the Text node inside the <hrxml.... Element?

Kindly show the code that creates that Text node.

If this were my problem, the first thing I would do is examine the text with a hex editor such as UltraEdit.
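
If a hex editor is not at hand, a similar inspection can be done from Java itself. The following is a minimal sketch, not part of the application under discussion: the class name `TextDump` and the method `dump` are hypothetical. It lists each code point of a string together with its UTF-8 bytes, so an unexpected character stands out.

```java
import java.nio.charset.StandardCharsets;

public class TextDump {
    /** Returns one line per code point: "U+XXXX" followed by its UTF-8 bytes in hex. */
    public static String dump(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            out.append(String.format("U+%04X", cp));
            for (byte b : ch.getBytes(StandardCharsets.UTF_8)) {
                out.append(String.format(" %02X", b & 0xFF));
            }
            out.append('\n');
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // A curly apostrophe (U+2019) shows up as the three bytes E2 80 99.
        System.out.print(dump("client\u2019s"));
    }
}
```

Calling `dump` on the text just before it is passed to `setText` would show whether the odd characters are already in the string at that point.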

Bill

(Is Microsoft Word involved at any point in this?)
Nishant Kashyap
Greenhorn

Joined: Jun 25, 2009
Posts: 4
Thanks for the reply.

My answers to your questions:


1) Source: an MS Word file (.doc), a resume.

2) I cannot show you the whole code as it is against company policy, but a small part of it is shown below:

// Creates a child element under parent and sets its text.
// Returns null if any argument is missing or empty.
public Element createAndAddElementText(Element parent, String childName, String text, Namespace localNamespace) {
    Element child = null;
    if (parent != null && childName != null && childName.length() > 0 && text != null && text.length() > 0) {
        child = createAndAddElement(parent, childName, localNamespace);
        child.setText(text);
    }
    return child;
}

// Convenience overload that uses the default namespace field (declared elsewhere, not shown).
public Element createAndAddElement(Element parent, String childName) {
    return createAndAddElement(parent, childName, namespace);
}

// Creates a child element in the given namespace and attaches it via addElement (not shown).
public Element createAndAddElement(Element parent, String childName, Namespace localNamespace) {
    Element child = null;
    if (parent != null && childName != null && childName.length() > 0) {
        child = new Element(childName, localNamespace);
        addElement(parent, child);
    }
    return child;
}



3) UltraEdit: I was not able to understand the tool or use it effectively.

4) Yes, MS Word is involved.


My guess: is it something to do with CDATA?
This issue does not happen consistently.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
    
MS Word documents may be loaded with "smart punctuation" characters that are not legal UNICODE. This causes all sorts of grief when trying to put text from Word into XML or encode it into strings.
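
As a hedged illustration of how a single curly apostrophe can turn into several stray characters: assume the text is written as UTF-8 and then read back somewhere with a different charset such as windows-1252. The thread does not confirm that this is what happens in this application, and the class name `MojibakeDemo` and method `misdecode` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    /** Encodes text as UTF-8, then decodes those bytes with the wrong charset. */
    public static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // The apostrophe U+2019 is replaced by three characters:
        // U+00E2, U+20AC and U+2122.
        System.out.println(misdecode("client\u2019s", cp1252));
    }
}
```

If the servers and the development machine use different default charsets, a mismatch like this could appear on one and not the other, which is worth checking wherever bytes are converted to strings without an explicit charset.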

Here is code I use to fix "smart" punctuation in a String coming from MS Word:
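
(Bill's snippet is not shown here. What follows is a hypothetical sketch of that kind of cleanup, not his code: the class name `SmartPunctuation`, the method `clean`, and the choice of characters and replacements are all assumptions.)

```java
public class SmartPunctuation {
    /** Replaces common Word "smart" punctuation with plain ASCII equivalents. */
    public static String clean(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '\u2018': // left single quote
                case '\u2019': // right single quote / apostrophe
                    out.append('\'');
                    break;
                case '\u201C': // left double quote
                case '\u201D': // right double quote
                    out.append('"');
                    break;
                case '\u2013': // en dash
                case '\u2014': // em dash
                    out.append('-');
                    break;
                case '\u2026': // ellipsis
                    out.append("...");
                    break;
                case '\u2022': // bullet
                    out.append('*');
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

In the code posted earlier in the thread, a call such as `child.setText(SmartPunctuation.clean(text))` would be one place to apply it. Note that this only helps if the string still contains the real punctuation characters at that point.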





Bill
Nishant Kashyap
Greenhorn

Joined: Jun 25, 2009
Posts: 4
Thanks, William.

I will try the fix you suggested, but the problem I am facing is that I cannot reproduce the issue in my dev environment; it happens only on production servers (and even there very occasionally).

From the logs I can see that it originates in our own code, but only when we build an XML document from a candidate resume (a Word file) that has already been parsed by the resume parser. Most of the time it appears inside the description tag of a field.

It is very annoying since the issue is not reproducible locally.

Are the "smart punctuation" characters you mentioned the only ones, or are there more?

Anyway, thanks a lot for your help.


Regards,
Nishant
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
    
Are the "smart punctuation" characters you mentioned the only ones, or are there more?


Those are just the ones I had to fix for a particular project.

If this were my problem, I would try to track down the original Word documents that cause the problem. Perhaps they have other Microsoft "features" that are sabotaging your application. Embedded images, font changes? People get so used to thinking of Word documents as text that they forget all the junk that is in there.

Bill
Nishant Kashyap
Greenhorn

Joined: Jun 25, 2009
Posts: 4
I have sent the sample resume that caused the issue to your personal address: wbrogden@bga.com.

I cannot post the resume here on the forum.

Right now I am trying everything to reproduce the issue in my development environment.


Regards,
Nishant
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
    
This document has some sort of embedded element, probably an image, which my copy of Open Office can't show. I suspect you are going to have to have such documents cleaned up by saving them as pure text, without ANY formatting or embedded images, before attempting to insert them into a database or XML file. It is possible to automate such conversions with OpenOffice, but I don't know the details.

The Apache Software Foundation "POI" project at

http://poi.apache.org/

may provide a Java toolkit that could be used for extracting the desired text.

Bill

 