condensing xml

 
author & internet detective
I'm writing a unit test that compares two XML files to see if they are logically the same. (They are formatted differently, as shown in this example.) I really want to be able to call assertEquals so the JUnit integration shows the differences between the files, which means I'm trying to find a solution other than "strip out all the line breaks." That would work, but the whole file would become one long single-line string, which isn't very readable.

What I really want is to strip the whitespace from the CDATA section. Or call a method that formats the XML in some way so both strings are formatted the same way.

Is there an easy way to do this? Here's a self-contained example. I don't have to use DOM if there is a way to do what I want more easily with a different parser.






 
Marshal
Well, you know, that white space isn't just a minor formatting issue. It is significant white space, so those two XML documents are actually not the same at all. Which means that if you want to treat them as if they were the same, you can't expect any help from any parser.
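
To make the point about significant white space concrete, here is a minimal sketch using the JDK's built-in DOM parser. The element name `note` and the sample strings are made up for illustration; they are not the files from the original question. It shows that the parser keeps the surrounding white space as part of the text content, so a structural comparison treats the two documents as different.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SignificantWhitespaceDemo {

    // Parses an XML string into a DOM Document; checked exceptions are wrapped for brevity.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns the text content of the root element exactly as the parser reports it.
    static String rootText(String xml) {
        return parse(xml).getDocumentElement().getTextContent();
    }

    // Compares the two root elements structurally with Node.isEqualNode.
    static boolean sameStructure(String xmlA, String xmlB) {
        return parse(xmlA).getDocumentElement()
                .isEqualNode(parse(xmlB).getDocumentElement());
    }

    public static void main(String[] args) {
        // Hypothetical sample data: the same "logical" content, formatted differently.
        String raw = "<note>hello</note>";
        String reformatted = "<note>\n  hello\n</note>";

        // Brackets make the preserved line breaks and indentation visible.
        System.out.println("[" + rootText(raw) + "]");
        System.out.println("[" + rootText(reformatted) + "]");

        // The extra white space is part of the data, so the comparison fails.
        System.out.println("same structure? " + sameStructure(raw, reformatted));
    }
}
```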
 
Jeanne Boyarsky
author & internet detective

Paul Clapham wrote:Well, you know, that white space isn't just a minor formatting issue. It is significant white space, so those two XML documents are actually not the same at all. Which means that if you want to treat them as if they were the same, you can't expect any help from any parser.


Nuts! Ok. I've rolled my own then - code below.

They are the same (to me) because one is raw data and one is the browser "helping" by changing the formatting. But you are correct that they aren't really the same. Knowing I should stop looking for something that doesn't exist (and start coding) was helpful.

 
Jeanne Boyarsky
author & internet detective
While the above code worked for my example, it didn't work for the real data. I wound up with this algorithm:

- loop through all the elements and look for ones that contain only Text children
- merge the text content of all those children into a StringBuilder
- trim the contents of the StringBuilder
- remove all but one of the Text children
- set the trimmed String as the value of that remaining child

I wrote it more efficiently (I don't loop through multiple times), but this is the idea.
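
The steps above can be sketched roughly as follows. This is not the code from the post (which isn't shown here), just one straightforward reading of the list using the JDK's DOM API. The class and method names are hypothetical, and treating CDATA sections as text (`CDATASection` extends `Text`) is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class TextTrimmer {

    // Parses an XML string into a DOM Document; checked exceptions are wrapped for brevity.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // True if the element has at least one child and every child is a Text node.
    // Assumption: CDATA sections count as text, since CDATASection extends Text.
    static boolean hasOnlyTextChildren(Element element) {
        Node child = element.getFirstChild();
        if (child == null) {
            return false;
        }
        for (; child != null; child = child.getNextSibling()) {
            if (!(child instanceof Text)) {
                return false;
            }
        }
        return true;
    }

    // For each text-only element: merge its text, trim it, and keep a single child.
    static void trimTextOnlyElements(Document doc) {
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            if (!hasOnlyTextChildren(element)) {
                continue;
            }
            StringBuilder merged = new StringBuilder();
            for (Node c = element.getFirstChild(); c != null; c = c.getNextSibling()) {
                merged.append(c.getNodeValue());
            }
            // Remove every child except the first one.
            Node first = element.getFirstChild();
            while (first.getNextSibling() != null) {
                element.removeChild(first.getNextSibling());
            }
            first.setNodeValue(merged.toString().trim());
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not the files from the thread.
        Document raw = parse("<note>hello</note>");
        Document reformatted = parse("<note>\n  hello\n</note>");
        trimTextOnlyElements(raw);
        trimTextOnlyElements(reformatted);
        System.out.println(raw.getDocumentElement().isEqualNode(reformatted.getDocumentElement()));
    }
}
```

Note that this sketch leaves white-space-only text between sibling elements untouched, because such a parent is not "text only". Real data that differs in indentation between elements would need extra handling.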
 
Bartender
I'd have used XMLUnit. It's pretty good for this sort of thing.
 
Paul Clapham
Marshal
If you're using the DOM parser (which it looks like you are) then calling the normalize() method on the root will clean up any instances where you have adjacent Text nodes. That would simplify the first two steps of your algorithm.
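
A small sketch of what `normalize()` does, built by hand so that the adjacent Text nodes are guaranteed to exist (the element name and strings are made up). It merges adjacent Text nodes into one but does not trim anything, so the trimming step is still needed. Per the DOM specification, CDATA sections are not merged with neighbouring Text nodes, so data containing CDATA may still need the manual merge.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NormalizeDemo {

    // Builds <note> with two adjacent Text children, as can happen after DOM edits.
    static Element buildSplitText() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element root = doc.createElement("note");
            doc.appendChild(root);
            root.appendChild(doc.createTextNode("  hello "));
            root.appendChild(doc.createTextNode("world  "));
            return root;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a DocumentBuilder", e);
        }
    }

    public static void main(String[] args) {
        Element root = buildSplitText();
        System.out.println("children before: " + root.getChildNodes().getLength());

        // Merges adjacent Text nodes in this subtree and drops empty ones.
        root.normalize();
        System.out.println("children after:  " + root.getChildNodes().getLength());

        // Leading and trailing white space is still there; normalize() does not trim.
        System.out.println("[" + root.getTextContent() + "]");
    }
}
```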
 
Jeanne Boyarsky
author & internet detective

Jelle Klap wrote:I'd have used XMLUnit. It's pretty good for this sort of thing.


I wish I knew about that yesterday!

I had a new requirement today. I need to ignore a certain part of the tree. So I would have needed to either parse it or run it through a regexp to remove that part.

I'm glad the program works now. I'm going to add a comment about XMLUnit in case we need to modify the program.
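
For the "ignore a certain part of the tree" requirement, one option with plain DOM is to remove the unwanted elements from both documents before comparing. The sketch below assumes the part to ignore can be identified by tag name. The name `timestamp` and the sample documents are hypothetical, since the thread doesn't say what was actually ignored.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SubtreeIgnorer {

    // Parses an XML string into a DOM Document; checked exceptions are wrapped for brevity.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Removes every element with the given tag name, including its whole subtree.
    static void removeElements(Document doc, String tagName) {
        NodeList matches = doc.getElementsByTagName(tagName);
        // The NodeList is live, so walk it backwards while removing nodes.
        for (int i = matches.getLength() - 1; i >= 0; i--) {
            Node match = matches.item(i);
            match.getParentNode().removeChild(match);
        }
    }

    public static void main(String[] args) {
        // Hypothetical documents that differ only inside <timestamp>.
        Document left = parse("<report><timestamp>1</timestamp><value>42</value></report>");
        Document right = parse("<report><timestamp>2</timestamp><value>42</value></report>");

        removeElements(left, "timestamp");
        removeElements(right, "timestamp");

        System.out.println(left.getDocumentElement().isEqualNode(right.getDocumentElement()));
    }
}
```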
 
Ranch Hand
Oracle XDK offers a package, oracle.xml.diff, for determining the differences between XML documents. It has existed since 10g and is now at 12c. This guide covers its use from several angles:
12 Determining XML Differences Using Java

It is obvious that "difference" is a tangled and difficult subject that depends on how one looks and what one looks for. Take words, for instance (much like a free group of association): one can imagine that it is already rather non-trivial to determine the difference between the two words "abcdef" and "bcdaef". Compared letter by letter along the sequence, the two can look hugely different. From another perspective, the one we normally take, the difference is only a change in the position of the letter a. Or one could count how many differences there are between the two-letter fragments "bc", "cd", etc. of the two words, and so on. In the general setting the problem might be unsolvable.

The approach used there tries to strike a balance between how to look and what to look for, always anchored in XML technologies, which is necessary to keep the problem manageable. I would think the package still has considerable potential and room to evolve over time if there proves to be a need for it.
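
The "abcdef" versus "bcdaef" example can be made concrete with two common measures. The sketch below counts position-by-position mismatches and computes the classic Levenshtein edit distance. Both measures are picked here for illustration only. Nothing in the thread says that oracle.xml.diff works this way.

```java
class WordDifference {

    // Counts the positions at which two equal-length strings differ.
    static int positionalMismatches(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Strings must have the same length");
        }
        int mismatches = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    static int editDistance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                current[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    public static void main(String[] args) {
        // Four of the six positions differ ...
        System.out.println(positionalMismatches("abcdef", "bcdaef")); // prints 4
        // ... yet moving the letter a costs only one deletion plus one insertion.
        System.out.println(editDistance("abcdef", "bcdaef"));         // prints 2
    }
}
```

The same pair of words scores 4 under one measure and 2 under the other, which is the poster's point: the answer depends on what kind of difference is being looked for.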
 
Paul Clapham
Marshal
An interesting document! I notice that there's a text node normalization option and the document says:

Text node normalization involves coalescing adjacent text nodes, followed by stripping leading and trailing whitespace from the coalesced nodes. Single text nodes have their leading and trailing whitespace stripped. Whitespace only text nodes are eliminated.



So clearly that's a feature which is much in demand.
 