Nutch -> Report all domain links but follow just a sublist

Gabriel Solano
Greenhorn

Joined: Jul 19, 2010
Posts: 5
I just started exploring Nutch to crawl a certain list of domains. What I want to do is follow all the links from one specific domain: "domainx.com".
That is easy to configure in the:



But when I run the command to dump the link database:

bin/nutch readlinkdb crawl/linkdb -dump links

I realized that I only get the links that pass the domain filter. I want the crawler to report every link found on the pages of the domain I configured, including links that point outside the domain, but without following those external links. So if www.coderanch.com is linked from domainx.com/index.html, I want that link to be reported but not crawled. I hope that makes sense.
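
To make the intent concrete, here is a minimal sketch of the behavior I am after. It is plain Python, not Nutch code: the function name `split_outlinks`, its parameters, and the example URL `http://domainx.com/about.html` are made up for illustration and do not correspond to any Nutch API.

```python
from urllib.parse import urlparse


def split_outlinks(outlinks, followed_domains):
    """Report every outlink, but mark only in-domain links for crawling.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    reported, to_follow = [], []
    for url in outlinks:
        # Every link found on the page is reported, regardless of its domain.
        reported.append(url)
        host = (urlparse(url).hostname or "").lower()
        # Only links on a configured domain (or one of its subdomains) are followed.
        if any(host == d or host.endswith("." + d) for d in followed_domains):
            to_follow.append(url)
    return reported, to_follow


# Links assumed to be found on domainx.com/index.html
links = ["http://domainx.com/about.html", "http://www.coderanch.com/"]
reported, to_follow = split_outlinks(links, {"domainx.com"})
print(reported)   # both links are reported
print(to_follow)  # only the domainx.com link is queued for crawling
```

In other words, the link dump should correspond to `reported`, while the set of pages that actually get fetched should correspond to `to_follow`.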

Thanks!
 