Hi, I'm trying to think of a way to set up a search engine that "stays" inside the J2EE container; more specifically, one that spiders its own web application without going through a firewall set up "in front of" the server. All our HTML goes through one servlet (older J2EE though, no Filters...), so I was thinking of going straight for the HttpServletRequest and Response.

Scenario:
- create a request inside the container (no actual web client involved)
- pass it on to doGet
- "capture" the response (again in-container) and run it through Lucene
- rip URL links from the output
- create a new request from each link
- rinse, repeat

Issues:
- How to create/spoof a proper Request and/or capture a Response? For now, I was thinking of subclassing these and only semi-implementing what I need.
- How to create a reasonable representation of the output? Ideal would be what HttpUnit does: break the page up into Title, Body Text, Links, Forms (all in nice Java objects). That way, I could get a very nice classification scheme going for Lucene (e.g. your own metadata indexed separately). The problem is that HttpUnit acts as a client, so it will use network functionality (URLConnections), which is exactly what I don't want.
- How to convert (HttpUnit/?) links back into meaningful Requests? Most HTML parsers I've seen have their own native Link representation, but I need it in J2EE Request form, preferably with state info as well (Cookies, ...).

Advantages of this approach?
- security: most servers don't need port 80 connectivity from the outside, so it can be firewalled. (I've actually seen this with one of our clients.) Since spidering uses (and should use) actual server names, it automatically passes through this.
- performance: since you're so close to the web application, I expect a huge performance gain
- the spider doesn't show up in the logs, so you get more realistic use-profiling

Ideas? Suggestions? Anybody who wants to join in?
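To make the "capture a Response" part concrete, here is a minimal, stdlib-only sketch of the buffering pattern I have in mind. In the real thing this would subclass (or implement) javax.servlet.http.HttpServletResponse and override getWriter() so the servlet writes into memory instead of a socket; the class below is a hypothetical stand-in that only demonstrates that capture idea, not the full servlet interface.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Sketch only: a real in-container spider would implement
// HttpServletResponse and back getWriter() with this buffer
// instead of the client's output stream.
public class CapturingResponse {
    private final StringWriter buffer = new StringWriter();
    private final PrintWriter writer = new PrintWriter(buffer);
    private String contentType = "text/html";

    // In the real subclass, this overrides HttpServletResponse.getWriter()
    public PrintWriter getWriter() {
        return writer;
    }

    public void setContentType(String type) {
        this.contentType = type;
    }

    public String getContentType() {
        return contentType;
    }

    // After servlet.doGet(fakeRequest, this), the whole page is in here,
    // ready to be fed to Lucene
    public String getCapturedOutput() {
        writer.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Stand-in for what doGet would do with the response
        CapturingResponse resp = new CapturingResponse();
        resp.getWriter().println("<html><title>Hello</title></html>");
        System.out.println(resp.getCapturedOutput().trim());
    }
}
```

The spoofed Request would follow the same pattern: subclass, implement only getRequestURI(), getParameter(), getCookies() and whatever else the servlet actually touches, and throw UnsupportedOperationException from the rest so missing pieces surface immediately.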
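For the "rip URL links from the output" step, a rough sketch of what the loop needs is below. I'm using a quick regex over the captured HTML purely for illustration; a real spider should use a proper HTML parser, and the sample page in main() is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: pull href values out of captured servlet output.
public class LinkRipper {
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> ripLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical captured output
        String page = "<a href=\"/app/page1\">one</a> <a href=\"/app/page2\">two</a>";
        System.out.println(ripLinks(page));
    }
}
```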
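On converting ripped links back into Request form: since the spider already knows the URI of the request it just ran, relative links can be resolved against it with java.net.URI, leaving exactly the path/query split a spoofed request needs to expose through getRequestURI() and getQueryString(). A sketch, assuming a hypothetical "/app" context path:

```java
import java.net.URI;

// Sketch: resolve a ripped link against the current request's URI and
// split it into the pieces a spoofed HttpServletRequest would return.
public class LinkToRequestPath {
    public static String toRequestPath(String currentRequestUri, String link) {
        URI base = URI.create(currentRequestUri);
        URI resolved = base.resolve(link);
        // getPath() is what getRequestURI() would return; the query string
        // would back getQueryString()/getParameter() on the fake request
        String path = resolved.getPath();
        return resolved.getQuery() == null ? path : path + "?" + resolved.getQuery();
    }

    public static void main(String[] args) {
        // relative link resolved against the page it came from
        System.out.println(toRequestPath("/app/search/index.html", "results.html?page=2"));
        // absolute-path link passes through unchanged
        System.out.println(toRequestPath("/app/search/index.html", "/app/other.html"));
    }
}
```

Cookies are the part this doesn't cover: since there is no real client, the spider itself would have to play cookie jar, remembering Set-Cookie values from each captured response and handing them back from the fake request's getCookies().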