Extract data from Hadoop File system using nutch

syruss kumar
Ranch Hand

Joined: Jul 23, 2009
Posts: 93
Hi,

I'm a newbie to Nutch. I have installed and configured Nutch to crawl the site. I want to extract the data from the crawl db. Is there any way to get the data programmatically?

Thanks in advance,

Every search starts with beginner's luck, and every search ends with the victor being severely tested.
syruss kumar
Ranch Hand

Joined: Jul 23, 2009
Posts: 93
Hi all,

Here is the solution. Use the Nutch API to extract the data. Under the crawl/segments folder, Nutch places the fetched content, parsed text, parsed data, etc.
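
To make the folder layout concrete, here is a minimal sketch in plain Java (no Nutch or Hadoop classes) that only builds the expected path of a data file inside a segment. It assumes the usual Nutch 1.x layout, where each of these sub-directories holds part directories with a "data" and an "index" file. The class name SegmentPaths, the segment name, and the part name "part-00000" are illustrative assumptions and may differ in your crawl.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class SegmentPaths {
    // The segment sub-directories mentioned in the post: raw content, parsed data, parsed text.
    public static final List<String> DATA_DIRS =
            Arrays.asList("content", "parse_data", "parse_text");

    // Builds <segmentsDir>/<segment>/<subDir>/<part>/data without touching the file system.
    public static Path dataFile(Path segmentsDir, String segment, String subDir, String part) {
        if (!DATA_DIRS.contains(subDir)) {
            throw new IllegalArgumentException("Unexpected sub-directory: " + subDir);
        }
        return segmentsDir.resolve(segment).resolve(subDir).resolve(part).resolve("data");
    }

    public static void main(String[] args) {
        Path segmentsDir = Paths.get("crawl", "segments");
        String segment = "20130406123456"; // hypothetical segment name (a timestamp)
        for (String subDir : DATA_DIRS) {
            // Assumption: a single part directory named part-00000.
            System.out.println(dataFile(segmentsDir, segment, subDir, "part-00000"));
        }
    }
}
```

The resulting paths are what you would hand to a Hadoop reader; on a real cluster they would be Hadoop paths on HDFS rather than local ones.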

Sample code to read data from the Hadoop file system using the Nutch 1.6 API



parin jogani
Greenhorn

Joined: Apr 06, 2013
Posts: 1
Thank you! This was a great help.
Is there any way to extract only a particular file format (e.g. PDF)?
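
One possible approach is to filter the records by their MIME type once they have been read from the segment. The sketch below is plain Java and only shows the filtering step. It assumes you have already collected a map from URL to content type (in Nutch this would typically come from the Content record's getContentType() value, which is not shown here). The class name ContentTypeFilter and the sample URLs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ContentTypeFilter {
    // Returns the URLs whose content type equals `wanted`, ignoring case
    // and any parameters such as "; charset=UTF-8".
    public static List<String> urlsOfType(Map<String, String> contentTypeByUrl, String wanted) {
        List<String> urls = new ArrayList<>();
        for (Map.Entry<String, String> entry : contentTypeByUrl.entrySet()) {
            if (matches(entry.getValue(), wanted)) {
                urls.add(entry.getKey());
            }
        }
        return urls;
    }

    // Compares only the base type, e.g. "application/pdf" in "application/pdf; charset=binary".
    static boolean matches(String contentType, String wanted) {
        if (contentType == null) {
            return false;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().equalsIgnoreCase(wanted);
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for records read from a segment.
        Map<String, String> fetched = new LinkedHashMap<>();
        fetched.put("http://example.com/index.html", "text/html; charset=UTF-8");
        fetched.put("http://example.com/report.pdf", "application/pdf");
        fetched.put("http://example.com/manual.PDF", "Application/PDF; charset=binary");

        System.out.println(urlsOfType(fetched, "application/pdf"));
    }
}
```

Matching on the reported content type rather than the file extension also catches PDFs served from URLs that do not end in .pdf, though it relies on the type having been detected correctly during the fetch.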