
XPATH to REGEX?

Kelly Dolan
Ranch Hand

Joined: Jan 08, 2002
Posts: 109
I have an interesting situation. I need to read, modify, and write an XML document without using an XML parser. My approach so far treats the XML document as a regular text document: I read it into a single string, use regular expressions to find what I need, make the modifications, and then write the string back out to the file system.

I have found that the regular expressions get complex very quickly and, in some situations (e.g., inserting an element at a specific position), are nearly impossible to write. I know that XPath does a much better job of finding a node in an XML document, so I'm now looking into whether I can represent my locations as XPaths (instead of regexes) and then convert them into a set of regular expressions that can be applied in sequence.
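
For illustration, the idea of applying one regular expression per path step could be sketched roughly as follows. This is a minimal sketch under heavy assumptions, not a real XPath converter: the class name `PathLocator`, the sample path `/doc/meta/revision`, and the sample document are all hypothetical. It only handles plain element names, it matches the first occurrence found inside the current region (which may be a descendant rather than a direct child), and it ignores comments, CDATA sections, self-closing tags, and nested elements of the same name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathLocator {

    /**
     * Returns {start, end} offsets of the text between the open and close tags
     * of the element addressed by a simple path such as "/doc/meta/revision",
     * or null if any step fails to match.
     */
    static int[] locate(String xml, String path) {
        int start = 0;
        int end = xml.length();
        for (String name : path.substring(1).split("/")) {
            // One regex per step: open tag (attributes allowed), lazy body, close tag.
            Pattern step = Pattern.compile(
                    "<" + Pattern.quote(name) + "(?:\\s[^>]*)?>(.*?)</"
                            + Pattern.quote(name) + "\\s*>",
                    Pattern.DOTALL);
            // Search only inside the region matched by the previous step.
            Matcher m = step.matcher(xml).region(start, end);
            if (!m.find()) {
                return null;
            }
            start = m.start(1);
            end = m.end(1);
        }
        return new int[] {start, end};
    }

    /** Replaces only the located content; every other character is left untouched. */
    static String replaceContent(String xml, String path, String newText) {
        int[] span = locate(xml, path);
        if (span == null) {
            return xml;
        }
        return xml.substring(0, span[0]) + newText + xml.substring(span[1]);
    }

    public static void main(String[] args) {
        String xml = "<doc>\n  <meta><revision>41</revision></meta>\n  <body>&copy; text</body>\n</doc>";
        System.out.println(replaceContent(xml, "/doc/meta/revision", "42"));
    }
}
```

Because the edit is done by string offsets, text outside the located span (including the `&copy;` reference in the sample) is carried over unchanged. Predicates, axes, and positional inserts would each need considerably more machinery than this.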

Any thoughts or ideas? Has anyone seen a converter-type utility?

Thanks!
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

If I were you I would review the requirements. Perhaps the part of the requirement that forbids using a parser can be disposed of. That would be the best outcome for all concerned: you, who have to write the unnecessary regex; the users of the program, who have to wait while you figure out how to do it; and the future maintainers of the code. Really. Use a parser. Why would you not?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41599
    
+1 on what Paul said. Also, as you found out already, regexps for this get tricky quickly, especially once you take all the edge cases into account, like comments, processing instructions and CDATA sections.
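
To make two of those edge cases concrete, here is a small sketch of how a naive element regex can match text that is not an element at all. The class name `CommentPitfall`, the `<revision>` element, and the sample documents are hypothetical and chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentPitfall {

    // Naive pattern: knows nothing about comments or CDATA sections.
    static final Pattern NAIVE =
            Pattern.compile("<revision>(.*?)</revision>", Pattern.DOTALL);

    /** Returns the content of the first textual match, or null if there is none. */
    static String firstRevision(String xml) {
        Matcher m = NAIVE.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String commented =
                "<doc><!-- <revision>old</revision> --><revision>41</revision></doc>";
        String cdata =
                "<doc><example><![CDATA[<revision>sample</revision>]]></example>"
                        + "<revision>41</revision></doc>";

        System.out.println(firstRevision(commented)); // prints "old" (text inside the comment)
        System.out.println(firstRevision(cdata));     // prints "sample" (text inside the CDATA section)
    }
}
```

In both documents the real `<revision>` element holds `41`, but the regex finds the look-alike text first. A regex-only approach would have to recognize and skip these regions before matching.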


Kelly Dolan
Ranch Hand

Joined: Jan 08, 2002
Posts: 109
I really appreciate the feedback. Trust me...if I could avoid this, I would. But...

The users of the application are "authors" who create XML documents. The application is somewhat of a version control system that allows an author to check documents in and out. The XML documents themselves contain metadata, and when a document is checked in (or out), certain metadata must be updated by the application. [This is similar to source code files with $keywords embedded in a comment that get expanded when the files are checked into their version control system.] Other than this metadata, the content of the document that is checked out needs to be identical to what was last checked in. There is no getting rid of this requirement.

I played around with XML parsers for a pretty long time and could not find any solution that let me read in the XML document and, even without making changes, write it back out and be guaranteed the exact same document. If the input document contained entity (or character?) references in the body, they were always resolved in the output document. This is unacceptable to an author: authored entity references in the content are lost. Also, if the schema to which the document adheres declares default attributes, those not specified in the input document show up in the output document. I realize this shouldn't be tragic, but it annoys them to no end...i.e., they didn't put it there...they don't want it there.
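
The round-trip problem described above can be reproduced with the JDK's built-in DOM parser and identity transformer. The following is a minimal sketch, not a statement about every parser or serializer: the class name `RoundTripCheck` and the sample document are hypothetical, and the exact serialized output depends on the XML implementation in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class RoundTripCheck {

    /** Parses the XML into a DOM and serializes it again without modifying anything. */
    static String roundTrip(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // A Transformer created without a stylesheet performs an identity copy.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        String input = "<doc id='a'><p>Copyright &#169; 2002</p></doc>";
        String output = roundTrip(input);
        System.out.println(output);
        System.out.println("identical: " + input.equals(output));
    }
}
```

The DOM does not record which quote character an attribute used or whether a character was written as a numeric reference, so the serializer has to pick its own form. With the JDK's default implementation the single-quoted attribute comes back double-quoted, and the `&#169;` reference is typically written out as the literal character.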

Thanks so much!
 
 