I am new to this forum and would be very grateful if anyone could help me solve a problem with HTML files that I have to translate. I have searched many forums, for many hours over many days, but I still haven't found a solution.
This is regarding entities which are "referenced but not declared". I know this question has been asked many times, and I understand the usual fix, for example replacing &eacute; with &#233;. But I am sure there is another way around it. I have to translate more than a hundred files, all containing French entities (&eacute;, &egrave;, &agrave;...), and I cannot afford to search and replace every entity in every file; it would take me days.
The files are encoded in UTF-8 and here are the lines I have been trying to add
Could you show us a small example of one of these documents? Right now your post isn't entirely clear to me -- these documents are HTML documents and not XML documents, am I right?
Thank you for your reply.
Yes, the documents are HTML documents. I tried to attach one, but apparently there's no way to attach a file with a .htm extension...
Another way is to declare the entities in an external DTD (or an internal one, which is what I have been trying to do...), but I still cannot get it to work.
Can a DTD (the list of entities to declare) be a .txt file?
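For reference, here is a minimal sketch of the internal-DTD approach, assuming the file is well-formed XML (XHTML); a plain HTML file with unclosed tags would still be rejected by an XML parser. The class name InternalDtdDemo, the sample markup, and the three declared entities are made up for illustration and are not taken from the original files.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class InternalDtdDemo {
    // Hypothetical sample: the DOCTYPE's internal subset declares the
    // French entities that the body references.
    static final String XHTML =
        "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE html [\n"
      + "  <!ENTITY eacute \"&#233;\">\n"
      + "  <!ENTITY egrave \"&#232;\">\n"
      + "  <!ENTITY agrave \"&#224;\">\n"
      + "]>\n"
      + "<html><body><p>caf&eacute; d&eacute;j&agrave;</p></body></html>";

    // Parses the XML and returns the text of the root element, with the
    // declared entities expanded by the parser.
    static String parseText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }
}
```

Calling InternalDtdDemo.parseText(InternalDtdDemo.XHTML) should parse without the "referenced but not declared" error, because every entity used in the body is declared in the DOCTYPE. As far as I know, an external DTD is located through its system identifier rather than its file extension, so the name of the file should not matter to the parser.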
But it looks like you're trying to use XML software to do the translation? Actually I don't see where you described what you were doing at all.
Anyway, what I would suggest is to use an HTML parser, one which can read an HTML document into a DOM structure (preferably an org.w3c.dom.Document). Then serialize that DOM into XML.
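A rough sketch of the serialization half of that suggestion, using only the JDK. The HTML-parsing step is not shown, since the JDK has no HTML-to-DOM parser and a third-party one would be needed; the hand-built DOM in buildSampleDom is a stand-in for whatever the parser would return. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DomToXml {
    // Stand-in for the HTML parser's output: <html><body><p>text</p></body></html>.
    static Document buildSampleDom(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element p = doc.createElement("p");
            p.setTextContent(text);
            body.appendChild(p);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build DOM", e);
        }
    }

    // Serializes the DOM as XML. With a US-ASCII output encoding, characters
    // outside ASCII are written as numeric character references, not as
    // named HTML entities.
    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize DOM", e);
        }
    }
}
```

Setting the encoding to UTF-8 instead should write the accented characters themselves, which matches how the files are said to be encoded.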
I'm not sure why you must convert the HTML entities to XML character entities -- why can't you just convert them to the characters themselves? In other words, instead of converting "&eacute;" to "&#233;", why not just convert it to "é"?
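To illustrate, a minimal sketch of that direct replacement, assuming the files really are UTF-8 so the literal characters can be stored as they are. The class name EntityReplacer and the five-entry table are hypothetical; a real run would need the full list of HTML named entities.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityReplacer {
    // Illustrative subset only: entity name -> literal character.
    private static final Map<String, String> ENTITIES = Map.of(
        "eacute", "\u00e9",
        "egrave", "\u00e8",
        "agrave", "\u00e0",
        "ecirc", "\u00ea",
        "ccedil", "\u00e7"
    );

    private static final Pattern NAMED =
        Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Replaces known named entities with their characters. Entities that are
    // not in the table (including XML's own, such as &amp;) are left untouched.
    static String replaceEntities(String html) {
        Matcher m = NAMED.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            String result = (replacement != null) ? replacement : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(result));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Run over each file's contents, this would handle all hundred-odd files in one pass, with no manual search and replace. Leaving &amp; and &lt; alone is deliberate, since replacing them with the bare characters would break the markup.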