
Regex for adding CDATA to XML nodes

 
Tiago Fernandez
Ranch Hand
Posts: 167
Hello folks,

Say I have an XML like this:

<Node>Foo</Node><Node>Bar</Node>

I need to add a CDATA prefix and suffix around the content of every <Node> element. The desired output would be:

<Node><![CDATA[Foo]]></Node><Node><![CDATA[Bar]]></Node>

Do you think it's possible to do this with a regular expression, without splitting the nodes and applying the transformation to each substring?

Thanks in advance,
Tiago
 
Ilja Preuss
author
Sheriff
Posts: 14112
It should be possible using String.replaceAll. Is that what you were looking for?
 
Bill Cruise
Ranch Hand
Posts: 148
It would be tricky to do this with regular expression backreferences, replacing the entire <Node>.*</Node> string with <Node><![CDATA[.*]]></Node>. You can get around this by doing it in two passes: first replace the opening tag, then the closing tag.
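
For illustration, here is a minimal sketch of the two-pass idea described above. It assumes the tags appear exactly as <Node> and </Node> (no attributes, no whitespace inside the tags); the class and method names are made up for this example.

```java
class TwoPassCdata {
    // Pass 1 rewrites every opening tag, pass 2 every closing tag.
    // String.replace does literal (non-regex) replacement, so no escaping is needed.
    static String wrap(String xml) {
        return xml.replace("<Node>", "<Node><![CDATA[")
                  .replace("</Node>", "]]></Node>");
    }

    public static void main(String[] args) {
        // prints <Node><![CDATA[Foo]]></Node><Node><![CDATA[Bar]]></Node>
        System.out.println(wrap("<Node>Foo</Node><Node>Bar</Node>"));
    }
}
```

Because neither pass looks at the element content, this does nothing about escaped characters or nested elements.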

 
Ilja Preuss
author
Sheriff
Posts: 14112
Why would that be tricky? I haven't tried, but shouldn't something like the following work?

str.replaceAll("<Node>(.*?)</Node>", "<Node><![CDATA[\\1]]></Node>");
 
Charles Lyons
Author
Ranch Hand
Posts: 836
Be careful of greedy matching (the default matching mode) in the previous example: otherwise the .* will match everything between the very first <Node> and the very last </Node> in the document!
 
Bill Cruise
Ranch Hand
Posts: 148
That's the first thing I tried too, but the backreference doesn't work with String's replaceAll method, because the enclosing parentheses are in the regex while the backreference to the group is in the replacement String.
 
Charles Lyons
Author
Ranch Hand
Posts: 836
You need to use dollar $ signs for captured sub-sequences... See: Matcher.replaceAll(String)
[ July 03, 2008: Message edited by: Charles Lyons ]
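
A minimal sketch of the single-pass version using $1 in the replacement string, as suggested above. It assumes the same simple input as the original question; the class and method names are hypothetical. In a Java replacement string a backslash only escapes the next character, so "\\1" yields a literal 1 rather than the captured group.

```java
class DollarGroupCdata {
    // (.*?) is a reluctant capturing group; $1 in the replacement refers to it.
    static String wrap(String xml) {
        return xml.replaceAll("<Node>(.*?)</Node>", "<Node><![CDATA[$1]]></Node>");
    }

    // Same call with \1 instead of $1: the backslash just escapes the digit,
    // so every element ends up containing the literal character 1.
    static String wrapWithBackslash(String xml) {
        return xml.replaceAll("<Node>(.*?)</Node>", "<Node><![CDATA[\\1]]></Node>");
    }

    public static void main(String[] args) {
        String xml = "<Node>Foo</Node><Node>Bar</Node>";
        // prints <Node><![CDATA[Foo]]></Node><Node><![CDATA[Bar]]></Node>
        System.out.println(wrap(xml));
        // prints <Node><![CDATA[1]]></Node><Node><![CDATA[1]]></Node>
        System.out.println(wrapWithBackslash(xml));
    }
}
```

Note that . does not match line terminators by default, so content spanning several lines would need the DOTALL flag, for example by prefixing the pattern with (?s).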
 
Paul Clapham
Sheriff
Posts: 20725
Regex problems aside, you would also have to unescape things like ampersands and less-than symbols, because entity references such as &amp; and &lt; are not expanded inside a CDATA section.
[ July 03, 2008: Message edited by: Paul Clapham ]
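
A rough sketch of how the unescaping could be combined with the replacement. This is only one possible approach: it handles just the five predefined XML entities (not numeric character references), and the class name, method names, and sample input are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnescapeIntoCdata {
    private static final Pattern NODE =
            Pattern.compile("<Node>(.*?)</Node>", Pattern.DOTALL);

    // Unescape the five predefined XML entities. &amp; goes last so that
    // input like "&amp;lt;" becomes "&lt;" and is not unescaped twice.
    static String unescape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");
    }

    static String wrap(String xml) {
        Matcher m = NODE.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String cdata = "<Node><![CDATA[" + unescape(m.group(1)) + "]]></Node>";
            // quoteReplacement keeps any $ or \ in the content from being
            // interpreted as group references or escapes.
            m.appendReplacement(out, Matcher.quoteReplacement(cdata));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <Node><![CDATA[A & B < C]]></Node>
        System.out.println(wrap("<Node>A &amp; B &lt; C</Node>"));
    }
}
```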
 
Ilja Preuss
author
Sheriff
Posts: 14112
Originally posted by Charles Lyons:
You need to use dollar $ signs for captured sub-sequences... See: Matcher.replaceAll(String)

Good point!
 
Ilja Preuss
author
Sheriff
Posts: 14112
Originally posted by Charles Lyons:
Be careful of greedy matching (the default matching mode) in the previous example: otherwise the .* will match everything between the very first <Node> and the very last </Node> in the document!


If you are talking about my example, those are exactly *not* greedy.
 
Charles Lyons
Author
Ranch Hand
Posts: 836
Originally posted by Ilja Preuss:
If you are talking about my example, those are exactly *not* greedy.

Nope - I did notice the extra "?" in your example. It's just something that is very easy to overlook and generally not well understood, so I thought it a good idea to point out!
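
To make the greedy/reluctant difference concrete, here is a small comparison on the input from the original question. The class and method names are hypothetical.

```java
class GreedyVsReluctant {
    static final String REPLACEMENT = "<Node><![CDATA[$1]]></Node>";

    // Greedy .* grabs as much as possible: from the first <Node> to the last </Node>.
    static String greedy(String xml) {
        return xml.replaceAll("<Node>(.*)</Node>", REPLACEMENT);
    }

    // Reluctant .*? stops at the first </Node> it can reach.
    static String reluctant(String xml) {
        return xml.replaceAll("<Node>(.*?)</Node>", REPLACEMENT);
    }

    public static void main(String[] args) {
        String xml = "<Node>Foo</Node><Node>Bar</Node>";
        // prints <Node><![CDATA[Foo</Node><Node>Bar]]></Node>
        System.out.println(greedy(xml));
        // prints <Node><![CDATA[Foo]]></Node><Node><![CDATA[Bar]]></Node>
        System.out.println(reluctant(xml));
    }
}
```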
 
Mike Simmons
Ranch Hand
Posts: 3028
Another complication: can Node elements be nested inside other Nodes? That could get ugly:



That might become:



The above is almost certainly not what you want here, but I don't know what you do want. With luck, nesting never occurs, and this will be easier.
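
As an illustration of the nesting problem, the sketch below runs the reluctant pattern from earlier in the thread over a made-up nested input (not the example from the post above). The class and method names are hypothetical.

```java
class NestedNodes {
    // Same single-pass replacement discussed earlier in the thread.
    static String wrap(String xml) {
        return xml.replaceAll("<Node>(.*?)</Node>", "<Node><![CDATA[$1]]></Node>");
    }

    public static void main(String[] args) {
        // The match runs from the outer <Node> to the inner </Node>, so the
        // inner opening tag is swallowed into the CDATA section and the
        // trailing "Baz</Node>" is left without a matching opening tag.
        // prints <Node><![CDATA[Foo<Node>Bar]]></Node>Baz</Node>
        System.out.println(wrap("<Node>Foo<Node>Bar</Node>Baz</Node>"));
    }
}
```

A regular expression cannot balance arbitrarily nested tags, so if nesting can occur, an XML parser is probably the safer tool.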
 
Carey Evans
Ranch Hand
Posts: 225
Paul Clapham:
... you would also have to unescape things like ampersands and less-than symbols.

While looking out for ]]>, which can't be part of a CDATA section at all.
[ July 05, 2008: Message edited by: Carey Evans ]
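
One common workaround for content that contains ]]> is to end the CDATA section between the ]] and the >, then start a new one. A minimal sketch, with hypothetical class and method names:

```java
class SafeCdata {
    // Wraps text in CDATA, splitting every "]]>" across two adjacent sections:
    // the first ends with "]]", the next begins with ">".
    static String toCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        // prints <![CDATA[a]]]]><![CDATA[>b]]>
        System.out.println(toCdata("a]]>b"));
    }
}
```

A parser reads the two adjacent sections back as the original text a]]>b.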
 
Tiago Fernandez
Ranch Hand
Posts: 167
Thanks everyone! I've taken the following solution for the problem I was having anyways:



Cheers!
 