in-container search engine

Thomas HJ Goorden
Greenhorn

Joined: Nov 29, 2002
Posts: 6
Hi,
I'm trying to think of a way to set up a search engine that "stays" inside the J2EE container, more specifically one that spiders its own web application without going through a firewall set up "in front of" the server.
All our HTML goes through one servlet (older J2EE though, so no Filters...), so I was thinking of going straight for the HttpServletRequest and Response.
Scenario:
- create a request inside the container (no actual web client involved)
- pass it on to doGet for processing
- "capture" the response (again in-container) and run it through Lucene
- rip the URL links from the output
- create a new request from each link
- rinse, repeat
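
Roughly, the driver loop I have in mind looks like this. It's only a sketch: InContainerRequest and CapturingResponse are the spoofed request/response classes from the Issues below, and indexPage/extractLinks/toRequest are sketched there too. One detail: doGet() is protected, so the loop goes through the public service(), which dispatches to doGet() anyway.

import java.util.*;
import javax.servlet.http.HttpServlet;

// In-container spider driver: no web client, no network, just direct
// calls into the servlet. All helper names are hypothetical.
abstract class InContainerSpider {
    abstract void indexPage(String uri, String html) throws Exception;
    abstract List extractLinks(String html) throws Exception;
    abstract InContainerRequest toRequest(String href, InContainerRequest current);

    void crawl(HttpServlet servlet, InContainerRequest start) throws Exception {
        Set visited = new HashSet();          // URIs already indexed
        LinkedList queue = new LinkedList();  // requests still to process
        queue.add(start);
        while (!queue.isEmpty()) {
            InContainerRequest req = (InContainerRequest) queue.removeFirst();
            if (!visited.add(req.getRequestURI())) continue;  // already seen
            CapturingResponse resp = new CapturingResponse();
            servlet.service(req, resp);              // ends up in doGet()
            String html = resp.getCapturedOutput();
            indexPage(req.getRequestURI(), html);    // run through Lucene
            // rip the links from the output, make a new request from each
            for (Iterator i = extractLinks(html).iterator(); i.hasNext(); ) {
                InContainerRequest next = toRequest((String) i.next(), req);
                if (next != null) queue.add(next);
            }
        }
    }
}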
Issues:
- How to create/spoof a proper Request and/or capture a Response?
For now, I was thinking of subclassing these and just semi-implementing what I need.
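
To make the discussion concrete, this is the kind of semi-implementation I mean. Heavily abridged: on our older J2EE there are no wrapper classes (those arrived with Servlet 2.3), so you implement the interfaces directly and stub out everything the servlet never calls.

import java.io.*;
import javax.servlet.http.*;

// Spoofed request: just enough for a servlet that reads the URI, query
// string and cookies. The ~40 other HttpServletRequest methods still need
// stubs (return null / throw UnsupportedOperationException); elided here.
class InContainerRequest implements HttpServletRequest {
    private final String uri;
    private final String queryString;
    private final Cookie[] cookies;

    InContainerRequest(String uri, String queryString, Cookie[] cookies) {
        this.uri = uri;
        this.queryString = queryString;
        this.cookies = cookies;
    }
    public String getMethod()      { return "GET"; }
    public String getRequestURI()  { return uri; }
    public String getQueryString() { return queryString; }
    public Cookie[] getCookies()   { return cookies; }
    // ... remaining interface methods stubbed out ...
}

// Capturing response: buffers whatever the servlet writes so the spider
// can read the page back afterwards. Same caveat about elided stubs.
class CapturingResponse implements HttpServletResponse {
    private final StringWriter buffer = new StringWriter();
    private final PrintWriter writer = new PrintWriter(buffer);

    public PrintWriter getWriter() { return writer; }

    /** The rendered page, ready for parsing and indexing. */
    public String getCapturedOutput() {
        writer.flush();
        return buffer.toString();
    }
    // ... remaining interface methods stubbed out ...
}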
- How to create a reasonable representation of the output?
Ideal would be what HttpUnit does: break the page up into Title, Body Text, Links, Forms (all as nice Java objects). That way, I could get a very nice classifying scheme going for Lucene (e.g. your own metadata indexed separately). The problem is that HttpUnit acts as a client, so it uses network functionality (URLConnections), which is exactly what I don't want.
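
The indexing side seems straightforward once the page is broken up; assuming the current Lucene 1.x API, separate fields per page part would look something like this (field names are just examples):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

class PageIndexer {
    // One Lucene Document per page, with the page parts as separate
    // fields, so searches can target title, body and metadata separately.
    void indexPage(IndexWriter writer, String uri, String title,
                   String bodyText, String metaKeywords) throws IOException {
        Document doc = new Document();
        doc.add(Field.UnIndexed("url", uri));             // stored for display only
        doc.add(Field.Text("title", title));              // tokenized, indexed, stored
        doc.add(Field.UnStored("body", bodyText));        // tokenized, indexed, not stored
        doc.add(Field.Keyword("keywords", metaKeywords)); // indexed untokenized
        writer.addDocument(doc);
    }
}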
- How to convert (HttpUnit/?) links back into meaningful Requests?
Most HTML parsers I've seen have their own native Link representation, but I need it in J2EE Request form, preferably with state info as well (Cookies, ...).
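
One thought: to stay off the network entirely, maybe the HTML parser that ships with the JDK (javax.swing.text.html) is good enough to rip out the hrefs, and a small helper can map each one back onto a Request, carrying the Cookies (and with them the session) along. Untested sketch:

import java.io.*;
import java.util.*;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkRipper {
    // Pull every <a href="..."> out of a captured page; no network involved.
    List extractLinks(String html) throws IOException {
        final List links = new ArrayList();
        new ParserDelegator().parse(new StringReader(html),
            new HTMLEditorKit.ParserCallback() {
                public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                    if (t == HTML.Tag.A) {
                        String href = (String) a.getAttribute(HTML.Attribute.HREF);
                        if (href != null) links.add(href);
                    }
                }
            }, true);
        return links;
    }

    // Turn an extracted href back into an in-container request, reusing the
    // cookies from the request that produced the page (state carries over).
    InContainerRequest toRequest(String href, InContainerRequest current) {
        if (href.startsWith("http:") || href.startsWith("mailto:"))
            return null;  // only spider our own web application
        String path = href, query = null;
        int q = href.indexOf('?');
        if (q >= 0) { path = href.substring(0, q); query = href.substring(q + 1); }
        if (!path.startsWith("/")) {  // resolve relative to the current URI
            String base = current.getRequestURI();
            path = base.substring(0, base.lastIndexOf('/') + 1) + path;
        }
        return new InContainerRequest(path, query, current.getCookies());
    }
}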
Advantages of this approach?
- security: most servers don't need port 80 connectivity from the outside, so it can be firewalled off (I've actually seen this with one of our clients). Since spidering uses (and should use) actual server names, a conventional spider automatically has to pass through that firewall; an in-container one doesn't.
- performance: since you're so close to the web application (no network stack in the way), I expect a huge performance gain
- the spider doesn't show up in the access logs, so use-profiling stays realistic
Ideas? Suggestions? Anybody who wants to join in?
 