GeeCON Prague 2014*
XML replacing char

Frank van Roekel
Greenhorn

Joined: Jun 10, 2014
Posts: 3
First of all, I'm new to programming, so don't be too hard on me.
Second, my English is not very good.

Now let's go to the question.
I have an XML file that looks like:



As you can see, <result$> and <test$> are not valid.
What I need is a piece of Java code that removes those elements.

Below you can find the code that I have so far, but I'm getting a SAXException (caused by the $ sign).
Do you guys have any idea how I can remove the invalid elements from the XML file and create a new, valid XML file?

Thanks in advance!

Jaikiran Pai
Marshal

Joined: Jul 20, 2005
Posts: 10141
    

Frank, welcome to CodeRanch!

Where is that XML content in the file coming from? Ideally, whatever is writing out that content should be fixed.

If that's not possible, then in the code where you are trying to fix it, instead of reading it as XML, I would suggest that you read it as a plain file (using the File APIs) and do a simple replace on that particular element name (using the String APIs).
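A minimal sketch of that plain-text approach, under a few assumptions: the class name, the method names and the replacement name "resultDollar" are made up for illustration, and the tags are assumed to carry no attributes or extra whitespace. It treats the content as text and never hands it to an XML parser.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not code from this thread.
class PlainTextTagRename {

    // Renames the opening and closing tag of one element by literal text replacement.
    // String.replace works on literal text, so the '$' needs no escaping here.
    static String renameTag(String xml, String badName, String newName) {
        return xml.replace("<" + badName + ">", "<" + newName + ">")
                  .replace("</" + badName + ">", "</" + newName + ">");
    }

    // Reads the file as plain text in the given charset, fixes one tag name,
    // and writes the result to a new file in the same charset.
    static void fixFile(Path in, Path out, Charset cs) throws IOException {
        String xml = new String(Files.readAllBytes(in), cs);
        String fixed = renameTag(xml, "result$", "resultDollar"); // names are assumptions
        Files.write(out, fixed.getBytes(cs));
    }
}
```

The new name has to be one that no valid element in the document already uses, otherwise two different elements end up looking the same.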

Furthermore, I think you could even do all of this in a simple scripting language instead of using Java to do this.



Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1064
    

Jaikiran Pai wrote:Frank, welcome to CodeRanch!
I would suggest that you read it as a plain file (using the File APIs) and do a simple replace on that particular element name (using the String APIs).


Of course, if one uses this approach, one must take into account the character encoding declared in the first line of the file, or use UTF-8 if no encoding is declared.
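As a rough sketch of that encoding check (the class and method names are hypothetical): it looks for an encoding="..." attribute in an XML declaration at the very start of the data and falls back to UTF-8 otherwise. It assumes an ASCII-compatible encoding, so it does not handle UTF-16 input or byte order marks.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not code from this thread.
class XmlEncodingSniffer {

    // Matches an XML declaration at the start and captures the declared encoding name.
    private static final Pattern DECL = Pattern.compile(
            "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // 'head' is the first chunk of the file; only the first 200 bytes are inspected.
    static Charset detect(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail.
        String start = new String(head, 0, Math.min(head.length, 200),
                StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(start);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return StandardCharsets.UTF_8; // default when nothing usable is declared
    }
}
```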

If the OP has just one file to process, then using a text editor with the appropriate encoding would be the easiest approach. If the OP has multiple files then, as Jaikiran suggests, writing a script is in order.

One approach I have used in the past when processing corrupt XML files is to write a filter that is applied to the input stream before it is passed to the XML parser, but this may be overkill if the OP is writing the XML straight back out. I use the Knuth-Morris-Pratt algorithm since it does not require any backtracking through the input.
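A sketch of what such a filter could look like; this is an illustration under assumptions and not the filter described in the post. The class name and the "_invalid>" replacement used in the usage line are hypothetical. It copies characters from a Reader to a Writer, replacing each occurrence of a pattern, and uses the Knuth-Morris-Pratt failure table so that every input character is read exactly once.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper, not code from this thread.
class KmpStreamFilter {

    // fail[i] = length of the longest proper prefix of p[0..i] that is also its suffix.
    static int[] failureTable(String p) {
        int[] fail = new int[p.length()];
        int k = 0;
        for (int i = 1; i < p.length(); i++) {
            while (k > 0 && p.charAt(i) != p.charAt(k)) k = fail[k - 1];
            if (p.charAt(i) == p.charAt(k)) k++;
            fail[i] = k;
        }
        return fail;
    }

    // Streams 'in' to 'out', replacing every occurrence of 'pattern'.
    static void replaceAll(Reader in, Writer out, String pattern, String replacement)
            throws IOException {
        if (pattern.isEmpty()) throw new IllegalArgumentException("empty pattern");
        int[] fail = failureTable(pattern);
        int matched = 0; // pattern chars matched so far; these are held back, not yet written
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            while (matched > 0 && ch != pattern.charAt(matched)) {
                int next = fail[matched - 1];
                // The leading held-back chars can no longer start a match: emit them.
                out.write(pattern, 0, matched - next);
                matched = next;
            }
            if (ch == pattern.charAt(matched)) {
                if (++matched == pattern.length()) { // full match: emit the replacement
                    out.write(replacement);
                    matched = 0;
                }
            } else {
                out.write(ch);
            }
        }
        out.write(pattern, 0, matched); // flush a partial match left at end of input
    }

    // Convenience wrapper for in-memory strings.
    static String replaceAll(String input, String pattern, String replacement) {
        StringWriter out = new StringWriter();
        try {
            replaceAll(new StringReader(input), out, pattern, replacement);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

For example, KmpStreamFilter.replaceAll(xml, "$>", "_invalid>") would rewrite only a '$' that sits directly before '>', and the filtered output could then be handed to the parser.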
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    

Jaikiran Pai wrote: Where is that XML content in the file coming from? Ideally, whatever is writing out that content should be fixed.


Yes, you should really just be rejecting that document. It isn't well-formed XML and the people who sent it to you should be made to stop doing that and to start sending well-formed XML. You should really insist on that quite firmly.
Frank van Roekel
Greenhorn

Joined: Jun 10, 2014
Posts: 3
Jaikiran Pai wrote:

If that's not possible, then in the code where you are trying to fix it, instead of reading it as XML, I would suggest that you read it as a plain file (using the File APIs) and do a simple replace on that particular element name (using the String APIs).



Thanks for your answer. I wrote code that used the File API, but then they told me (way too late) that I will get the XML as a String instead of a file.
So now I have written a piece of code that changes the dollar sign into a unique string; after that, I iterate over the XML and remove the elements containing that unique value.
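The actual code was not posted, so the following is only a guess at how such a two-step approach might look. The marker "__DOLLAR__", the class name and the regular expression are all assumptions. The sketch replaces '$' inside tag names with the marker, parses the result with a DOM parser, removes every element whose name contains the marker, and serializes what is left.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not the code described in the post.
class DollarElementRemover {

    // Assumed marker: must be legal in an XML name and absent from the real data.
    static final String MARKER = "__DOLLAR__";

    // A start or end tag whose name contains '$' (text and attribute values are not matched).
    private static final Pattern TAG = Pattern.compile("<(/?)([^\\s<>/!?]*\\$[^\\s<>/]*)");

    static String clean(String xml) {
        // Step 1: make the tag names parseable by swapping '$' for the marker.
        Matcher m = TAG.matcher(xml);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String name = m.group(2).replace("$", MARKER);
            m.appendReplacement(sb, Matcher.quoteReplacement("<" + m.group(1) + name));
        }
        m.appendTail(sb);
        try {
            // Step 2: parse the now well-formed text.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(sb.toString())));
            // Step 3: collect first, then remove, because the NodeList is live.
            NodeList all = doc.getElementsByTagName("*");
            List<Node> doomed = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                if (all.item(i).getNodeName().contains(MARKER)) doomed.add(all.item(i));
            }
            for (Node n : doomed) {
                if (n.getParentNode() != null) n.getParentNode().removeChild(n);
            }
            // Step 4: write the remaining document back out as a String.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter w = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(w));
            return w.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is still not parseable", e);
        }
    }
}
```

Because the marker is only ever inserted where a '$' was, an invalid element cannot end up with the same name as an existing valid one. CDATA sections or comments that happen to contain text like "<a$" would also be rewritten by the regular expression, which this sketch does not guard against.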

I think it does what it needs to do ;).
I hope my client thinks the same.

Anyway, thank you all!
Richard Tookey
Ranch Hand

Joined: Aug 27, 2012
Posts: 1064
    

Of course, you have made sure that you replace only a '$' that is followed by '>', and that you don't modify the invalid elements in a way that gives them the same name as existing valid elements!
Frank van Roekel
Greenhorn

Joined: Jun 10, 2014
Posts: 3
Richard Tookey wrote: Of course, you have made sure that you replace only a '$' that is followed by '>', and that you don't modify the invalid elements in a way that gives them the same name as existing valid elements!


A dollar sign may never appear between the chevrons, right? So something like <te$t>1234</te$t> must also be removed?

And yes, I'm sure that I don't remove valid elements ;).

Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42031
    
I would go with Paul's advice and reject the document. It's not XML, and any attempt to pretend that it is will likely end in tears sooner or later. For starters, "<result>" and "<result$>" seem to be different things - otherwise, why would they not both be named "<result>"? If you remove the "$", you're making them into the same thing, which may not be the right thing to do.

Or, if it is not actually meant to be XML, then you can't use the JAXP APIs on it. Text search and replace would be the more appropriate tool in that case. The String class has some methods that make this easy, assuming that this is the only transformation that you need to implement.


sai rama krishna
Ranch Hand

Joined: May 29, 2009
Posts: 265
I think it is better to approach the team that is supplying the corrupt XML and get it fixed.
 