
Parsing XML using SAX without DTD

 
Markus Kahl
Greenhorn
Posts: 5
Hey,

I parse an XML document using SAX with an "empty resolver":
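
A minimal sketch of what such an "empty resolver" usually looks like, assuming the common idiom of answering every external-entity lookup with an empty `InputSource` so the parser never fetches the DTD. The class name, the `parse` helper and the `lookups` counter are hypothetical and only there for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class EmptyResolverSketch {
    // Counts how often the parser asked for an external entity (illustration only).
    static int lookups = 0;

    // Every lookup, including the external DTD subset, gets an empty stream,
    // so nothing is read from disk or the network.
    static final EntityResolver EMPTY = (publicId, systemId) -> {
        lookups++;
        return new InputSource(new StringReader(""));
    };

    // Parses the given XML text with the empty resolver attached.
    static void parse(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setEntityResolver(EMPTY);
            reader.setContentHandler(new DefaultHandler());
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the DTD is replaced by nothing, no entity declarations are ever seen, which is why references such as `&copy;` cannot be expanded.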



I read the XML document and create another file based on that input.
Everything works perfectly fine except for the entities (&copy; etc.).
Because I use that specific EntityResolver the entities aren't recognized
and so I will never know that there even are such entities in the input file.
Can't I somehow tell the parser to simply treat them as ordinary text?

Of course you now want to know why I do this.
Well, all of this has to be fast.
With a "proper" EntityResolver that uses a DTD, one pass takes about 10 times longer
than with the "empty resolver". I don't know why, but that's how it is.

Any ideas?
 
Paul Clapham
Sheriff
Posts: 20769
So you have an XML document, but it also includes some non-XML entities which require a DTD to describe them? Then yes, the easiest way to deal with this is to make sure the parser knows where the DTD is, so that it expands the entities into XML characters properly. That in fact is how you tell the parser to "treat them as ordinary text".

The next easiest way is to attach an EntityResolver which expands those entities. It appears that you have tried that and it works.
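
One way to read the resolver suggestion above, as a rough sketch rather than anything posted in the thread: the resolver hands the parser a tiny in-memory DTD that declares only the entities the document uses, so the parser expands them itself and nothing is loaded from disk or the network. The class name, the `text` helper and the one-line DTD are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class InlineDtdSketch {
    // Hypothetical stand-in DTD: declares just the entity used in the input.
    static final String DTD = "<!ENTITY copy \"&#169;\">";

    // Parses the XML text and returns all character data, with entities expanded.
    static String text(String xml) {
        StringBuilder out = new StringBuilder();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Serve the in-memory DTD for any external entity request.
            reader.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader(DTD)));
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    out.append(ch, start, length);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Whether this is fast enough depends on the parser and on how large the real entity set is; it only avoids the cost of locating and reading the full DTD.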

So you have two good solutions, but you're looking for a third? The next easiest way is to write your own XML parser which hard-codes the expansion of those entities. Obviously I don't recommend that. What I would really recommend is not to use the entities in the first place -- there's a perfectly good Unicode character for the © symbol for example -- but if you're stuck with them then you know how to deal with them.

Or is your problem that after the parser does its work, you don't know that some text node was actually an expanded entity? There's no reason why you should care about that, is there?
 
Markus Kahl
Greenhorn
Posts: 5
Thanks for your reply.
The problem is solved now.

The root element of the parsed XML file did not declare an XML namespace,
and the file had no doctype either, which seemed to cause the problem with the entities
while everything else worked fine.

Now I added the namespace and the doctype and the entities are still not recognized,
but now, unlike before, the callback #skippedEntity(String name) is called
with those entities whereupon I can react accordingly.

So finally everything works, without any DTD.
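
A minimal sketch of the approach described above, assuming the JDK's built-in SAX parser: the document carries a DOCTYPE, the resolver returns an empty DTD, and the handler turns each `skippedEntity` callback into replacement text. The class name, the `expand` and `replacement` helpers and the single `copy` mapping are hypothetical; the actual code from this thread was not posted.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SkippedEntitySketch {
    // Hypothetical mapping from entity name to replacement text.
    // Unknown names are written back as the original reference.
    static String replacement(String name) {
        return "copy".equals(name) ? "\u00A9" : "&" + name + ";";
    }

    // Parses the XML text and returns its character data, substituting
    // skipped entities via replacement().
    static String expand(String xml) {
        StringBuilder out = new StringBuilder();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // "Empty resolver": the DTD named in the DOCTYPE is never loaded.
            reader.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    out.append(ch, start, length);
                }

                @Override
                public void skippedEntity(String name) {
                    // Called for references the parser could not expand.
                    out.append(replacement(name));
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

The DOCTYPE matters here: without an external subset, an undeclared entity is a well-formedness error, so a non-validating parser reports a fatal error instead of calling `skippedEntity`.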
 