Hello fellows,
My company (an insurance firm) wants to stay informed about insurance-related news from around the world. They want me to set up a system that gathers, on a daily basis, links to news articles from online newspapers worldwide that match keywords such as "social insurance", "public health", "cancer" and a few others. We already have a 30-node Hadoop cluster set up. So far I have looked at Nutch and Solr. My first question: do you think I can achieve this with these tools?
https://wiki.apache.org/nutch/FrontPage
Also, when the system fetches a link, how will I know whether it is today's news? My boss wants fresh news put in front of him and published daily, so I need a way to tell yesterday's articles apart from today's.
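To make the freshness question concrete, here is a rough sketch of the kind of filter I have in mind. It is only an illustration in Python, not something Nutch or Solr provides: the record shape `(url, title, published)`, the function names, and the idea that each fetched item comes with an ISO 8601 publication timestamp (for example from a feed or from page metadata) are all my own assumptions. Items without a usable date are dropped, since I cannot prove they are fresh.

```python
from datetime import date, datetime, timezone

# Hypothetical record shape: (url, title, published), where `published` is an
# ISO 8601 timestamp taken from the page or feed, or None if none was found.
KEYWORDS = ("social insurance", "public health", "cancer")


def published_on(published, day):
    """Return True if the ISO 8601 timestamp falls on `day`, compared in UTC."""
    if not published:
        return False  # no date available: cannot prove the item is fresh
    try:
        stamp = datetime.fromisoformat(published)
    except ValueError:
        return False  # unparseable date: treat as not fresh
    if stamp.tzinfo is None:
        # Assumption: a timestamp without an offset is already in UTC.
        stamp = stamp.replace(tzinfo=timezone.utc)
    return stamp.astimezone(timezone.utc).date() == day


def fresh_links(records, day):
    """Keep URLs whose title mentions a keyword and whose date is `day`."""
    return [
        url
        for url, title, published in records
        if published_on(published, day)
        and any(keyword in title.lower() for keyword in KEYWORDS)
    ]


if __name__ == "__main__":
    sample = [
        ("http://example.com/a", "Cancer screening expands", "2024-05-02T08:00:00+00:00"),
        ("http://example.com/b", "Public health budget", "2024-05-01T23:00:00+00:00"),
    ]
    print(fresh_links(sample, date(2024, 5, 2)))
```

The part I am unsure about is where a reliable publication date would come from for arbitrary newspaper sites, which is really what I am asking.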
Can you point me in the right direction? Thanks in advance.