Problem with dom4j "Getting started" example

D Swart
Greenhorn

Joined: Nov 07, 2008
Posts: 12
The following code is taken from the dom4j website, http://dom4j.sourceforge.net/download.html

It gives an org.dom4j.DocumentException, and I have no clue why. The code is taken from their example: http://dom4j.sourceforge.net/dom4j-1.6.1/guide.html
and is the first "This is easy to do" example.

Any help most appreciated.

The full error is: org.dom4j.DocumentException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd Nested exception: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

Wouter Oet
Saloon Keeper

Joined: Oct 25, 2008
Posts: 2700

http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
10.5.4 503 Service Unavailable

The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.

Note: The existence of the 503 status code does not imply that a
server must use it when becoming overloaded. Some servers may wish
to simply refuse the connection.


"Any fool can write code that a computer can understand. Good programmers write code that humans can understand." --- Martin Fowler
Please correct my English.
D Swart
Greenhorn

Joined: Nov 07, 2008
Posts: 12
Thanks Wouter,

The problem is that it reports this error no matter what I set the URL to (I have tried several). Also, the error seems to be about some URL which is not www.apache.org.

So how do I make it look at www.apache.org?

Cheers.
Wouter Oet
Saloon Keeper

Joined: Oct 25, 2008
Posts: 2700

I'm not sure why it doesn't work for apache.org.
I've tried Google and that works fine. Remember that HTML is not valid XML,
so the SAX parser throws exceptions while trying to parse the file.
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 38466
Moving thread.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541

What do you mean, the URL with the problem is an apache.org URL? Read the error message again:
The full error is: org.dom4j.DocumentException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd Nested exception: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd


Anyway the answer to this particular error message is in this mail list entry:

http://lists.w3.org/Archives/Public/site-comments/2009Jun/0009.html

In other words, the W3C (the people who defined the standards for XHTML) decided that every XHTML document would refer to a DTD which they specified. They also (probably without thinking too hard about the consequences) decided that the DTD would be hosted on their site. That meant that every single application in the world which parsed an XHTML document would have to go to their site to get that DTD.

Of course a responsible application (like your browser for example) will only get the DTD once, then it will cache it for future use and not go to the W3C site again. But your application isn't a responsible one, as you can see from the remarks in that link about "Java". So the W3C is basically telling your application to get lost, it doesn't have time for you.

Of course you wouldn't think it's every single Java application's responsibility to cache URLs properly. You would think it's the JVM's responsibility to do that on behalf of the applications. But apparently it doesn't do that.

So if you're just doing this to get experience with XML, I recommend you stay away from XHTML pages until you have enough experience to set up an XML catalog or a caching proxy.
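A lighter-weight relative of the catalog idea is to give the parser an EntityResolver, so that it never asks the W3C site for the DTD at all. The following is only a minimal sketch using the JDK's own parser classes; the class name OfflineDtdParsing and the choice to answer every ".dtd" request with an empty document are assumptions made for illustration, not something taken from the dom4j guide. An empty DTD means named entities such as &nbsp; will not be defined, so a real setup would more likely return a locally stored copy of the DTD instead.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, named here for illustration only.
final class OfflineDtdParsing {

    // Answers requests for any external DTD with an empty document, so the
    // parser does not open a network connection to fetch it. Returning null
    // for everything else leaves the parser's default behaviour in place.
    static final EntityResolver EMPTY_DTD = (publicId, systemId) ->
            (systemId != null && systemId.endsWith(".dtd"))
                    ? new InputSource(new StringReader(""))
                    : null;

    // Parses an XML string with the resolver installed.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(EMPTY_DTD);
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }
}
```

dom4j's SAXReader also has a setEntityResolver method that takes the same org.xml.sax.EntityResolver interface, so the same resolver should be usable there before calling read.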
D Swart
Greenhorn

Joined: Nov 07, 2008
Posts: 12
Thank you - a very useful answer.

If I do want experience with just DOM parsing / HTML page traversal, can you recommend a tool?
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541

What do you mean by "a tool"?
D Swart
Greenhorn

Joined: Nov 07, 2008
Posts: 12
Hmmm .... good question. I mean something which enables me to think at a higher level of abstraction.

I really just want to get the job done, where "the job" is to read in and parse web pages. Anything that helps me do so is good - the more easily it lets me do this, the better. Does that make sense?
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541

If you're still thinking of something written in Java code for this tool, then you still have the same problem. However, the page I linked to has a little hint at a workaround:
"request the DTD resources through it from a user-agent other than one that vaguely identifies itself as Java"

So if you use something which requests the page and sets the User-Agent header to something which, say, claims to be Firefox, you might get away with it. I generally use Apache HttpClient to access data over HTTP, as it means I don't have to learn as many details of the HTTP protocol as I would if I tried to code the access myself.
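For anyone who wants to try the User-Agent idea without pulling in HttpClient, here is a rough sketch using only java.net classes. The class name BrowserLikeRequest and the User-Agent string are made up for illustration, and whether the W3C server accepts such a request is not something this thread confirms. Note also that this only affects requests your own code makes; when an XML parser fetches a DTD it opens its own connection, so the header would have to be set wherever that request is actually issued.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper class, named here for illustration only.
final class BrowserLikeRequest {

    // Assumed value; any string that does not identify itself as "Java" would do.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleFetcher/1.0)";

    // Prepares a connection with the custom User-Agent header. Nothing is sent
    // over the network until the caller asks for the input stream.
    static URLConnection prepare(String address) {
        try {
            URLConnection connection = URI.create(address).toURL().openConnection();
            connection.setRequestProperty("User-Agent", USER_AGENT);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling getInputStream() on the returned connection performs the request, and that stream can then be handed to whichever parser you are using.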

Edit: I just noticed that the page I linked to also says
"and apparently if using Apache libraries there is a catalog solution in it"

which, if it means what I think it means (that the Apache code caches those DTDs for you), would be even better. But I'm just guessing about that.
 