posted 2 years ago
We have a crawl process that runs without reporting any errors, but the data it creates is not put into a datastore (in our case Elasticsearch). I am trying to find out where it goes wrong. I am a novice, but I think the crawl process fills an object called crawldb. Is it correct that crawldb should be filled, and how can I check that this is happening correctly? Or, more generally, when using the crawler, Apache, and Nutch, what should be written where?
So far I have found this in a log file:
INFO crawl.Injector: Injector: crawlDb: /user/crawler/crawl/nutch-data/NL/crawldb
And in hadoop-hdfs I can see this: