Regex for adding CDATA to XML nodes

Tiago Fernandez
Ranch Hand

Joined: May 16, 2003
Posts: 167
Hello folks,

Say I have an XML like this:

<Node>Foo</Node><Node>Bar</Node>

I need to include a CDATA prefix/suffix for every <Node/> element. The desired output would be:

<Node><![CDATA[Foo]]></Node><Node><![CDATA[Bar]]></Node>

Do you think it's possible to do this using a regular expression, without splitting the nodes and applying the transformation to each substring?

Thanks in advance,
Tiago


Tiago Fernandez
http://www.tiago182.spyw.com/
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
It should be possible using String.replaceAll. Is that what you were looking for?


Bill Cruise
Ranch Hand

Joined: Jun 01, 2007
Posts: 148
It would be tricky to try and do this with regular expression backreferences to replace the entire <Node>.*</Node> string with <Node><![CDATA[.*]]></Node>. You can get around this by doing it in two passes. First replace the opening tag, then the closing tag.
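
A minimal sketch of the two-pass idea described above, assuming <Node> elements have no attributes, are never nested, and hold text that needs no unescaping. The class and method names are hypothetical and only for illustration.

```java
// Hypothetical helper; names are illustrative, not from the thread.
final class TwoPassCdata {
    // Pass 1 rewrites every opening tag, pass 2 every closing tag.
    // String.replace works on literal text, so no regex escaping is needed.
    static String wrap(String xml) {
        return xml.replace("<Node>", "<Node><![CDATA[")
                  .replace("</Node>", "]]></Node>");
    }

    public static void main(String[] args) {
        // prints <Node><![CDATA[Foo]]></Node><Node><![CDATA[Bar]]></Node>
        System.out.println(wrap("<Node>Foo</Node><Node>Bar</Node>"));
    }
}
```

Because neither pass uses a regular expression, greedy matching and backreferences do not come into play here.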

Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Why would that be tricky? I haven't tried, but shouldn't something like the following work?

str.replaceAll("<Node>(.*?)</Node>", "<Node><![CDATA[\\1]]></Node>");
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
Be careful of greedy matching (the default matching mode) in the previous example: otherwise the .* will match everything between the very first <Node> and the very last </Node> in the document!


Charles Lyons (SCJP 1.4, April 2003; SCJP 5, Dec 2006; SCWCD 1.4b, April 2004)
Author of OCEJWCD Study Companion for Oracle Exam 1Z0-899 (ISBN 0955160340)
Bill Cruise
Ranch Hand

Joined: Jun 01, 2007
Posts: 148
That's the first thing I tried too, but the backreference doesn't work with String's replaceAll method because the enclosing parentheses are in the regex and the backreference to the group is in the replacement String.
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
You need to use dollar $ signs for captured sub-sequences... See: Matcher.replaceAll(String)
[ July 03, 2008: Message edited by: Charles Lyons ]
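
A short sketch of the dollar-sign form, assuming flat (non-nested) <Node> elements whose text contains no line breaks. The class name is hypothetical. String.replaceAll delegates to Matcher.replaceAll, so the same replacement syntax applies.

```java
// Hypothetical helper; names are illustrative, not from the thread.
final class CdataBackref {
    static String wrap(String xml) {
        // (.*?) is reluctant, so each match stops at the nearest </Node>.
        // In the replacement string, $1 refers to the first capturing group;
        // a backslash there only escapes the character that follows it.
        // Note: '.' does not match line terminators unless DOTALL is enabled.
        return xml.replaceAll("<Node>(.*?)</Node>", "<Node><![CDATA[$1]]></Node>");
    }

    public static void main(String[] args) {
        // prints <Node><![CDATA[Foo]]></Node><Node><![CDATA[Bar]]></Node>
        System.out.println(wrap("<Node>Foo</Node><Node>Bar</Node>"));
    }
}
```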
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18991

Regex problems aside, you would also have to unescape things like ampersands and less-than symbols. For example you would want to convert to
[ July 03, 2008: Message edited by: Paul Clapham ]
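
One way the unescaping step could look, as a sketch only: it assumes flat <Node> elements, handles just the five predefined XML entities, and ignores numeric character references and any "]]>" in the text. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative, not from the thread.
final class CdataUnescape {
    private static final Pattern NODE =
            Pattern.compile("<Node>(.*?)</Node>", Pattern.DOTALL);

    // Only the five predefined entities are handled in this sketch.
    static String unescape(String text) {
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&apos;", "'")
                   .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;", not "<"
    }

    static String wrap(String xml) {
        Matcher m = NODE.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = "<Node><![CDATA[" + unescape(m.group(1)) + "]]></Node>";
            // quoteReplacement keeps $ and \ in the text from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(body));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <Node><![CDATA[a & b < c]]></Node>
        System.out.println(wrap("<Node>a &amp; b &lt; c</Node>"));
    }
}
```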
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Originally posted by Charles Lyons:
You need to use dollar $ signs for captured sub-sequences... See: Matcher.replaceAll(String)

[ July 03, 2008: Message edited by: Charles Lyons ]


Good point!
Ilja Preuss
author
Sheriff

Joined: Jul 11, 2001
Posts: 14112
Originally posted by Charles Lyons:
Be careful of greedy matching (the default matching mode) in the previous example: otherwise the .* will match everything between the very first <Node> and the very last </Node> in the document!


If you are talking about my example, those are exactly *not* greedy.
Charles Lyons
Author
Ranch Hand

Joined: Mar 27, 2003
Posts: 836
Originally posted by Ilja Preuss:
If you are talking about my example, those are exactly *not* greedy.

Nope - I did notice the extra "?" in your example. It's just something that is very easy to overlook and generally not well understood, so I thought it a good idea to point out!
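
To make the difference concrete, here is a small sketch comparing the greedy (.*) and reluctant (.*?) forms on the sample input from the first post. Square brackets stand in for the CDATA markers to keep the output readable; the class and method names are hypothetical.

```java
// Hypothetical demo class; names are illustrative, not from the thread.
final class GreedyVsReluctant {
    static final String INPUT = "<Node>Foo</Node><Node>Bar</Node>";

    // Greedy: .* runs to the LAST </Node>, so the whole input is one match.
    static String greedy() {
        return INPUT.replaceAll("<Node>(.*)</Node>", "<Node>[$1]</Node>");
    }

    // Reluctant: .*? stops at the NEAREST </Node>, giving one match per element.
    static String reluctant() {
        return INPUT.replaceAll("<Node>(.*?)</Node>", "<Node>[$1]</Node>");
    }

    public static void main(String[] args) {
        System.out.println(greedy());    // <Node>[Foo</Node><Node>Bar]</Node>
        System.out.println(reluctant()); // <Node>[Foo]</Node><Node>[Bar]</Node>
    }
}
```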
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3018
Another complication: can node elements be nested inside other Nodes? That could get ugly:



That might become:



The above is almost certainly not what you want here, but I don't know what you do want. With luck, nesting never occurs, and this will be easier.
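
As an illustration of the nesting problem, this sketch runs the reluctant pattern discussed earlier in the thread over one made-up nested input. The input and the class name are assumptions for the example, not taken from the post above.

```java
// Hypothetical demo class; the nested input is invented for illustration.
final class NestedNodes {
    static String wrap(String xml) {
        return xml.replaceAll("<Node>(.*?)</Node>", "<Node><![CDATA[$1]]></Node>");
    }

    public static void main(String[] args) {
        // The match starts at the outer <Node> but ends at the inner </Node>,
        // so the outer closing tag is left dangling after the replacement.
        // prints <Node><![CDATA[Foo<Node>Bar]]></Node></Node>
        System.out.println(wrap("<Node>Foo<Node>Bar</Node></Node>"));
    }
}
```

The result is no longer well-formed XML, which is one reason an XML parser is usually a safer tool than a regex once nesting is possible.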
Carey Evans
Ranch Hand

Joined: May 27, 2008
Posts: 225

Paul Clapham:
... you would also have to unescape things like ampersands and less-than symbols.

While looking out for ]]>, which can't be part of a CDATA section at all.
[ July 05, 2008: Message edited by: Carey Evans ]
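
A common workaround, sketched here under the assumption that the text is already unescaped: split every "]]>" across two adjacent CDATA sections so the terminator never appears inside one. The class and method names are hypothetical.

```java
// Hypothetical helper; names are illustrative, not from the thread.
final class CdataSafe {
    // Each "]]>" in the text becomes "]]" + end of section + new section + ">".
    static String toCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        // prints <![CDATA[a]]]]><![CDATA[>b]]>
        System.out.println(toCdata("a]]>b"));
    }
}
```

A parser reads the two adjacent sections back as the single text a]]>b.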
Tiago Fernandez
Ranch Hand

Joined: May 16, 2003
Posts: 167
Thanks everyone! I've taken the following solution for the problem I was having anyways:



Cheers!
 