
Remove multiple occurrences of XML nodes

 
Tausif Farooqi
Greenhorn
Posts: 10
I need to write a Java method that removes every occurrence of a node (and its contents) from an XML document supplied as a String.

Here's a sample of the XML



I need to remove all occurrences of the element "OLifEExtension" and its contents. I've written the fairly simple method given below. It works, but it is very inefficient and takes a long time if the XML is large (>= 10 MB).



I've also tried regular expressions but can't figure out one that works. I've tried the following:

1. <OLifEExtension[^>]+>.+?</OLifEExtension>
2. <OLifEExtension .*?>.*?</OLifEExtension>
3. <OLifEExtension[^>]+/>|<OLifEExtension[^>]+>.+</OLifEExtension>

None of the above regular expressions works. Instead of matching only the first "OLifEExtension" element, they match everything between the first opening "OLifEExtension" tag and the last closing "OLifEExtension" tag.
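Two regex behaviours are relevant here: a greedy `.+` (as in pattern 3) runs on to the last closing tag, and without `Pattern.DOTALL` the `.` does not cross line breaks, so an element spread over several lines is never matched by patterns 1 and 2. A minimal sketch of a pattern that avoids both, assuming "OLifEExtension" elements are never nested inside one another and no attribute value contains ">" (the class and method names are hypothetical, not from the thread):

```java
import java.util.regex.Pattern;

class RegexStripSketch {
    // First alternative: a self-closing <OLifEExtension .../> element.
    // Second alternative: an opening tag, the shortest possible body, and the closing tag.
    // DOTALL lets '.' match line terminators; '.*?' is reluctant, so each match
    // stops at the nearest closing tag instead of the last one.
    // Assumption: the element is never nested inside itself.
    private static final Pattern EXTENSION = Pattern.compile(
            "<OLifEExtension\\b[^>]*/>"
            + "|<OLifEExtension\\b[^>]*>.*?</OLifEExtension\\s*>",
            Pattern.DOTALL);

    // Returns the input with every matched element removed.
    static String stripExtensions(String xml) {
        return EXTENSION.matcher(xml).replaceAll("");
    }
}
```

Calling `RegexStripSketch.stripExtensions(xml)` returns a new string and leaves the input untouched. A regex is still not an XML parser, so this is only reasonable when the input is known to satisfy the assumptions above.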

Can anyone please tell me a more efficient way of doing this or kindly provide me with a regular expression that will do the job for me?

Many many thanks in advance.
[ December 14, 2008: Message edited by: Tausif Farooqi ]
 
William Brogden
Author and all-around good cowpoke
Rancher
Posts: 13056
It is probably slow because you modify the whole string on every cycle, which creates a lot of large objects.

If the XML is really formatted that regularly, you could read it line by line (see java.io.BufferedReader and StringReader), writing to an output java.io.StringWriter but skipping the lines between the start and end tags.
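A minimal sketch of that line-by-line idea, assuming every "OLifEExtension" start tag and end tag sits on a line of its own and the element is never nested inside itself (the class and method names are hypothetical):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;

class LineFilterSketch {
    // Copies the XML line by line, dropping every line from an
    // <OLifEExtension ...> start tag through its </OLifEExtension> end tag.
    // Assumption: each start and end tag is on its own line.
    // Note: line endings are normalised to '\n' and the output ends with one.
    static String stripExtensions(String xml) {
        StringWriter out = new StringWriter(xml.length());
        try (BufferedReader in = new BufferedReader(new StringReader(xml))) {
            boolean skipping = false;
            String line;
            while ((line = in.readLine()) != null) {
                String trimmed = line.trim();
                if (!skipping && trimmed.startsWith("<OLifEExtension")) {
                    // A self-closing tag, or an element closed on the same line,
                    // removes only this line; otherwise skip until the end tag.
                    skipping = !(trimmed.endsWith("/>")
                            || trimmed.endsWith("</OLifEExtension>"));
                    continue;
                }
                if (skipping) {
                    if (trimmed.startsWith("</OLifEExtension")) {
                        skipping = false;
                    }
                    continue;
                }
                out.write(line);
                out.write('\n');
            }
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Each kept line is written once to the StringWriter, so no large intermediate strings are created along the way.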

Bill
 
Tausif Farooqi
Greenhorn
Posts: 10
Thanks for the suggestion, Bill, but the problem is that I can't assume the XML will be properly formatted, as it's coming from an external source. I can try putting line breaks between every adjacent ">" and "<", then try what you've suggested and see if it makes a difference.
 
Tausif Farooqi
Greenhorn
Posts: 10
Hi Bill, you were right about the String concatenation part! I changed the method to this:

And it runs nearly 400 times faster than the previous method! Thanks for the help!
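The revised method is not preserved in this copy of the thread. A minimal sketch of the kind of change Bill's advice points to, copying the kept text into a single StringBuilder in one pass instead of rebuilding the whole string on each cycle, might look like the following. It assumes the element is never nested inside itself and that no other element name begins with "OLifEExtension"; the class and method names are hypothetical:

```java
class SinglePassSketch {
    private static final String OPEN = "<OLifEExtension";
    private static final String CLOSE = "</OLifEExtension>";

    // Scans the input once, appending everything outside OLifEExtension
    // elements to one StringBuilder. Assumptions: no nesting of the element,
    // and no other element name starts with "OLifEExtension".
    static String stripExtensions(String xml) {
        StringBuilder out = new StringBuilder(xml.length());
        int pos = 0;
        while (true) {
            int start = xml.indexOf(OPEN, pos);
            if (start < 0) {
                break;                        // no more elements to remove
            }
            out.append(xml, pos, start);      // keep the text before the element
            int tagEnd = xml.indexOf('>', start);
            if (tagEnd < 0) {
                pos = start;                  // unterminated tag: keep the rest as is
                break;
            }
            if (xml.charAt(tagEnd - 1) == '/') {
                pos = tagEnd + 1;             // self-closing element
            } else {
                int end = xml.indexOf(CLOSE, tagEnd);
                if (end < 0) {
                    pos = start;              // no closing tag: keep the rest as is
                    break;
                }
                pos = end + CLOSE.length();   // resume after the closing tag
            }
        }
        out.append(xml, pos, xml.length());   // keep the remaining tail
        return out.toString();
    }
}
```

Because each character of the input is copied at most once, the work grows roughly linearly with the document size, whereas deleting from the full string on every cycle re-copies the whole document each time.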
[ December 14, 2008: Message edited by: Tausif Farooqi ]
 