I have an interesting situation: I need to read, modify, and write an XML document without using an XML parser. My approach so far is to treat the XML document as plain text: read it into a single string, use regular expressions to find what I need, make the modifications, and then write the string back out to the file system.
I have found that the regular expressions get complex very quickly, and in some situations (e.g., inserting an element at a specific position) they become nearly impossible to write. I know that XPath does a much better job of finding a node in an XML document, so I'm now looking into whether I can represent my locations as XPaths (instead of regexes) and then convert them into a set of regular expressions that can be applied in sequence.
Any thoughts or ideas? Has anyone seen a converter-type utility?
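To make the idea concrete, here is a rough sketch in Python of what such a conversion might look like for the simplest case: an absolute path made only of child steps with plain element names. The helper `path_to_regex` and the sample element names are made up for illustration. It does not check real nesting and knows nothing about comments, CDATA sections, predicates, attributes, or namespaces, so treat it as a starting point rather than a converter.

```python
import re

def path_to_regex(path):
    """Translate a simple absolute path such as /doc/meta/revision into a regex.

    Hypothetical helper: supports only child steps with plain element names.
    It does not verify nesting and ignores comments, CDATA, and namespaces.
    """
    steps = [s for s in path.split("/") if s]
    *ancestors, leaf = steps
    # For each ancestor step: match its start tag, then skip ahead lazily.
    prefix = "".join(r"<%s\b[^>]*>.*?" % re.escape(s) for s in ancestors)
    leaf = re.escape(leaf)
    # Capture the leaf's content; assumes no nested element of the same name.
    pattern = r"%s<%s\b[^>]*>(?P<content>.*?)</%s\s*>" % (prefix, leaf, leaf)
    return re.compile(pattern, re.DOTALL)

# Made-up sample document.
doc = "<doc>\n  <meta><revision>41</revision></meta>\n  <body>some text</body>\n</doc>\n"

m = path_to_regex("/doc/meta/revision").search(doc)
start, end = m.span("content")
# Splice in the new value by offset so every other character is untouched.
updated = doc[:start] + "42" + doc[end:]
```

Splicing by offset, rather than substituting the whole match, keeps the rest of the string exactly as it was.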
If I were you I would review the requirements. Perhaps the part of the requirement that forbids using a parser can be disposed of. That would be the best outcome for all concerned: you, who have to write the unnecessary regexes; the users of the program, who have to wait while you figure out how to do it; and the future maintainers of the code. Really. Use a parser. Why would you not?
+1 on what Paul said. Also, as you have already found out, regexes for this get tricky quickly, especially once you take all the edge cases into account, such as comments, processing instructions, and CDATA sections.
I really appreciate the feedback. Trust me... if I could avoid this, I would. But...
The users of the application are "authors" who create XML documents. The application is something of a version control system that lets an author check documents in and out. The XML documents themselves contain metadata, and when a document is checked in (or out), some of that metadata must be updated by the application. [This is similar to source code files with $keywords embedded in a comment that get expanded when the files are checked into their version control system.] Other than this metadata, the content of a checked-out document needs to be identical to what was last checked in. There is no getting rid of this requirement.
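As an illustration of the kind of update I mean, here is a minimal Python sketch. The `<version>` element and the function `set_version` are invented for the example; the real metadata looks different. The point is that only the metadata value changes and every other character, including quote style and references, stays exactly as the author wrote it.

```python
import re

# Hypothetical metadata element; the real documents may use different names.
VERSION_RE = re.compile(r"(<version\b[^>]*>)(.*?)(</version\s*>)", re.DOTALL)

def set_version(text, new_value):
    """Replace the content of the first <version> element only."""
    # A function replacement avoids backslash and group-reference
    # surprises if new_value happens to contain such characters.
    return VERSION_RE.sub(
        lambda m: m.group(1) + new_value + m.group(3), text, count=1
    )

# Made-up sample document with single-quoted attributes and references.
original = (
    "<doc>\n"
    "  <meta><version>1.3</version></meta>\n"
    "  <p att='x'>Caf&eacute; &#169; 2009</p>\n"
    "</doc>\n"
)

checked_in = set_version(original, "1.4")
```

When reading from and writing to disk, opening the file in binary mode (or in text mode with `newline=""`) avoids newline translation, which would otherwise be another source of differences.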
I played around with XML parsers for quite a long time and could not find any solution that let me read in an XML document and, even without making changes, write it back out with a guarantee of getting exactly the same document. If the input document contained entity (or character?) references in the body, they were always resolved in the output document. This is unacceptable to an author: authored entity references in the content must not be lost. Also, if the schema to which the document adheres declares default attributes, those not specified in the input document show up in the output document. I realize this shouldn't be tragic, but it annoys them to no end... i.e., they didn't put it there, so they don't want it there.
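The effect is easy to reproduce. The sketch below uses Python's standard `xml.etree.ElementTree` purely as an example; it is not necessarily one of the parsers I tried, and the sample document is made up. It parses a small document containing character references and serializes it again without making any changes.

```python
import xml.etree.ElementTree as ET

# Made-up sample: single-quoted attribute, hex and decimal character
# references, and an empty element written without a space.
source = "<doc><p class='note'>&#xA9; 2009 &#169;</p><empty/></doc>"

root = ET.fromstring(source)
# encoding="unicode" returns a str instead of bytes.
round_tripped = ET.tostring(root, encoding="unicode")

print(source)
print(round_tripped)
print("identical:", round_tripped == source)
```

The character references come back as literal characters and the lexical details of the markup are normalized, so the output is an equivalent document but not the same text.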