
Extract data from Hadoop file system using Nutch

 
syruss kumar
Ranch Hand
Posts: 104
Hi,

I'm new to Nutch. I have installed and configured Nutch to crawl a site, and I want to extract the data from the crawl db. Is there any way to get the data programmatically?

Thanks in advance,
 
syruss kumar
Ranch Hand
Posts: 104
Hi all,

Here is the solution: use the Nutch API to extract the data. Under the crawl/segments folder, Nutch stores the content, parsed text, parsed data, etc.

Sample code to read data from the Hadoop file system using the Nutch 1.6 API
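The code originally posted here did not survive, so what follows is only a minimal sketch of the approach described above, not the poster's original code. It assumes a Nutch 1.x segment in which the raw content sits in content/part-00000/data as a Hadoop SequenceFile with Text keys (URLs) and Content values. The class name SegmentContentReader, the helper contentDataPath, and the single part-00000 file are assumptions made for illustration; a real crawl may have several part files.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentContentReader {

    // Hypothetical helper: path of the first content data file inside a segment.
    // Assumes the segment has a part-00000 directory (an assumption, not guaranteed).
    static String contentDataPath(String segmentDir) {
        return segmentDir + "/" + Content.DIR_NAME + "/part-00000/data";
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: SegmentContentReader <segment dir>");
            return;
        }

        // Loads nutch-default.xml / nutch-site.xml plus the Hadoop settings.
        Configuration conf = NutchConfiguration.create();
        Path dataFile = new Path(contentDataPath(args[0]));
        FileSystem fs = dataFile.getFileSystem(conf);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, dataFile, conf);
        try {
            Text url = new Text();
            Content content = new Content();
            // Each record is one fetched page: key = URL, value = raw content.
            while (reader.next(url, content)) {
                System.out.println(url + "\t"
                        + content.getContentType() + "\t"
                        + content.getContent().length + " bytes");
            }
        } finally {
            reader.close();
        }
    }
}
```

Run it with the Nutch and Hadoop jars on the classpath, passing one segment directory (for example a timestamped folder under crawl/segments) as the argument. Since each record carries its content type, the loop could in principle skip records whose type does not match the one you want.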



 
parin jogani
Greenhorn
Posts: 1
Thank you! This was a great help.
Is there any way to extract only a particular file format (e.g. PDF)?
 