I'm trying to learn Java, XML and regex at the same time. I need all three to write a program that parses a mixed-content XML file and produces another. The input files are big documents, and the output files are the same documents with markup rewritten based on their textual content. My biggest challenge seems to be identifying textual patterns that may cross node boundaries. A node could be an element, a comment, a processing instruction and so on. Of course, not all elements are created equal, and most can have attributes that I need to retain and possibly add to newly created elements.
Most likely I will use DOM, since I need to do some look-ahead and perhaps also look-behind to recognize patterns and where they start and end. DOM also seems to be a good fit for mixed content (an element can contain text and child elements in any order, recursively). Feel free to try to convince me there is a better alternative to DOM!
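To make the cross-node problem concrete, one possible approach could be sketched as follows. It is only a sketch under assumptions: it assumes that concatenating all text nodes in document order, while remembering each node's start offset, is enough to map a regex match back to the nodes it touches. The class name `CrossNodeMatcher`, its methods and the sample XML are made up for illustration; only the standard JAXP, DOM and `java.util.regex` calls are real.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical helper: flattens the text of a DOM subtree so that a regex
// can match across element, comment and processing-instruction boundaries.
class CrossNodeMatcher {
    final List<Text> nodes = new ArrayList<>();      // text nodes in document order
    final List<Integer> starts = new ArrayList<>();  // offset of each node in 'flat'
    final StringBuilder flat = new StringBuilder();  // concatenated text content

    CrossNodeMatcher(Node root) {
        collect(root);
    }

    // Depth-first walk; comments and processing instructions have no
    // children, so they contribute nothing to the flattened text.
    private void collect(Node n) {
        short type = n.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            nodes.add((Text) n);
            starts.add(flat.length());
            flat.append(n.getNodeValue());
            return;
        }
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c);
        }
    }

    // Index of the text node that contains the given offset in 'flat'.
    int nodeIndexAt(int offset) {
        int i = 0;
        while (i + 1 < starts.size() && starts.get(i + 1) <= offset) {
            i++;
        }
        return i;
    }

    // Each hit is {matchStart, matchEnd, firstNodeIndex, lastNodeIndex}.
    List<int[]> find(Pattern p) {
        List<int[]> hits = new ArrayList<>();
        Matcher m = p.matcher(flat);
        while (m.find()) {
            if (m.end() == m.start()) {
                continue; // skip empty matches, they touch no node
            }
            hits.add(new int[] {
                m.start(), m.end(), nodeIndexAt(m.start()), nodeIndexAt(m.end() - 1)
            });
        }
        return hits;
    }

    // Parses an XML string into a DOM document (checked exceptions wrapped).
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample: "Figure 12" is split by an element and a comment.
        Document doc = parse("<p>See <b>Fig</b><!-- note -->ure 12 for details.</p>");
        CrossNodeMatcher m = new CrossNodeMatcher(doc.getDocumentElement());
        for (int[] h : m.find(Pattern.compile("Figure \\d+"))) {
            System.out.println(m.flat.substring(h[0], h[1])
                + " spans text nodes " + h[2] + ".." + h[3]);
        }
    }
}
```

With the first and last node indices known, the text nodes at both ends could be cut with `Text.splitText` and the nodes in between wrapped in a new element. That rewriting step is deliberately left out here, because how to do it well when the match crosses element boundaries is exactly the part I am unsure about.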
I have also looked at XPath. I can see that it is powerful, but I don't see how it could help me with this problem.
I have found some examples and a little tutorial information, but most tackle rather simple problems. What I would like are pointers to XML parsing and construction examples that could give me more ideas and inspiration, so I can learn good techniques for handling semi-complex cases.