
Extract data from the Hadoop file system using Nutch

 
syruss kumar
Ranch Hand
Posts: 105
Hi,

I'm new to Nutch. I have installed and configured Nutch to crawl a site, and I want to extract the data from the crawl db. Is there any way to get the data programmatically?

Thanks in advance,
 
syruss kumar
Ranch Hand
Posts: 105
Hi all,

Here is the solution: use the Nutch API to extract the data. Under the crawl/segments folder, Nutch stores the content, parsed text, parsed data, etc.

Sample code to read data from the Hadoop file system using the Nutch 1.6 API
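
A minimal sketch of one way to do this, not necessarily the code originally posted. It assumes Nutch 1.6 and its bundled Hadoop 1.x jars are on the classpath, and that the crawl ran locally with a single output partition, so each segment subdirectory holds a part-00000 MapFile whose data file can be read as a SequenceFile. The class name SegmentTextDump, the helper dataFile, and the segment path passed on the command line are hypothetical names chosen for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentTextDump {

    // Hypothetical helper: builds the path of the data file for one segment
    // subdirectory, assuming a single partition named part-00000.
    static String dataFile(String segmentDir, String subdir) {
        return segmentDir + "/" + subdir + "/part-00000/data";
    }

    public static void main(String[] args) throws Exception {
        // args[0] is a segment directory, e.g. one folder under crawl/segments
        String segmentDir = args[0];

        Configuration conf = NutchConfiguration.create();
        Path data = new Path(dataFile(segmentDir, "parse_text"));
        FileSystem fs = data.getFileSystem(conf);

        // parse_text maps each URL (Text) to its extracted text (ParseText)
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        try {
            Text url = new Text();
            ParseText text = new ParseText();
            while (reader.next(url, text)) {
                System.out.println(url + "\t" + text.getText().length() + " chars");
            }
        } finally {
            reader.close();
        }
    }
}
```

The same loop should work for the content subdirectory by swapping ParseText for org.apache.nutch.protocol.Content. For a quick look without writing any code, bin/nutch readseg -dump on a segment directory produces a plain-text dump.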



 
parin jogani
Greenhorn
Posts: 1
Thank you! This was a great help.
Is there any way to extract only a particular file format (e.g. PDF)?
 