Parsing XML using SAX without DTD

Markus Kahl

Joined: Apr 05, 2010
Posts: 5

I parse an XML document using SAX with an "empty resolver":
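The code itself is not shown in the post. As a minimal sketch only, an "empty resolver" of this kind is usually an EntityResolver that answers every request (including the one for the DTD) with an empty InputSource, so the parser never fetches anything. The class name EmptyResolverSketch and the helper parseText are hypothetical names made up for illustration, not taken from the original code.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; the names are assumptions.
class EmptyResolverSketch {

    // Answers every external-entity lookup, including the DTD, with empty content.
    static final EntityResolver EMPTY_RESOLVER =
            (publicId, systemId) -> new InputSource(new StringReader(""));

    // Parses the given XML string and returns the concatenated character data.
    static String parseText(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(EMPTY_RESOLVER);

        StringBuilder text = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });

        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }
}
```

With a resolver like this the DTD referenced by the document is never downloaded or read, which is presumably where the speed-up comes from.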

I read the XML document and create another file based on that input.
Everything works perfectly fine except for the entities (&copy; etc.).
Because I use that specific EntityResolver the entities aren't recognized
and so I will never know that there even are such entities in the input file.
Can't I somehow tell the parser to simply treat them as ordinary text?

Of course you now want to know why I do this.
Well, all of this has to be fast.
With a "proper" EntityResolver that uses a DTD, one run takes about 10 times longer
than with the "empty resolver". I don't know why, but that's how it is.

Any ideas?
Paul Clapham

Joined: Oct 14, 2005
Posts: 19973

So you have an XML document, but it also includes some non-XML entities which require a DTD to describe them? Then yes, the easiest way to deal with this is to make sure the parser knows where the DTD is, so that it expands the entities into XML characters properly. That in fact is how you tell the parser to "treat them as ordinary text".

The next easiest way is to attach an EntityResolver which expands those entities. It appears that you have tried that and it works.

So you have two good solutions, but you're looking for a third? The next easiest way is to write your own XML parser which hard-codes the expansion of those entities. Obviously I don't recommend that. What I would really recommend is not to use the entities in the first place -- there's a perfectly good Unicode character for the © symbol for example -- but if you're stuck with them then you know how to deal with them.

Or is your problem that after the parser does its work, you don't know that some text node was actually an expanded entity? There's no reason why you should care about that, is there?
Markus Kahl

Joined: Apr 05, 2010
Posts: 5
Thanks for your reply.
The problem is solved now.

The root element of the parsed XML file did not declare any XML namespace,
and the document had no doctype either, which seemed to cause the problem
with the entities while everything else worked fine.

Now I added the namespace and the doctype and the entities are still not recognized,
but now, unlike before, the callback #skippedEntity(String name) is called
with those entities whereupon I can react accordingly.

So finally everything works, without any DTD.
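
For anyone who lands here later, a rough sketch of reacting in skippedEntity. It assumes the JDK's built-in SAX parser, a document whose doctype has a SYSTEM identifier, and a resolver that returns empty content. The class name SkippedEntitySketch, the helper parseText and the small KNOWN table are hypothetical and only for illustration; a real table would cover whatever entities the input actually uses.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; the names and the entity table are assumptions.
class SkippedEntitySketch {

    // Illustrative lookup table from entity name to replacement text.
    static final Map<String, String> KNOWN = new HashMap<>();
    static {
        KNOWN.put("copy", "\u00A9");
        KNOWN.put("nbsp", "\u00A0");
    }

    // Parses the XML string and returns its text, substituting skipped entities.
    static String parseText(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // "Empty resolver": the DTD is answered with empty content, so nothing is declared.
        reader.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));

        StringBuilder text = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void skippedEntity(String name) {
                // Ignore parameter entities ("%...") and the external subset ("[dtd]").
                if (name.startsWith("%") || name.startsWith("[")) {
                    return;
                }
                // Known names are replaced; unknown ones are written back as a reference.
                String replacement = KNOWN.get(name);
                text.append(replacement != null ? replacement : "&" + name + ";");
            }
        });

        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }
}
```

Calling parseText on a document such as `<!DOCTYPE doc SYSTEM "doc.dtd"><doc>&copy; 2010</doc>` should then route `copy` through skippedEntity instead of failing on the undeclared entity.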