As far as I understand, there are at least 3 cluster computing frameworks Apache has released - Hadoop, Spark and Storm.
Could you please help understand which use cases would better fit in Storm comparing to Hadoop and Spark ?
There is another one, "Giraph", but per my understanding it is best for Graph processing ( never used it though).
Hadoop is oriented towards working with batches of data.
Spark is oriented towards working with either batches of data like Hadoop or towards "micro batching" which is basically smaller batches of data that starts to approximate what a streaming solution is like.
Storm is oriented towards working on a never ending stream of data where you are constantly calculating and there is no start or end. Whenever data arrives, it is processed. Storm via Trident can also do microbatching.
Think batch processing system when you are crunching a large amount of data and don't need an answer right now. For example, you can process your website's log files to look for trends every day and extract value from them, then a batch framework like Hadoop is perfect. However, if you are analyzing those logs in order to detect intrusion attempts against your system, then you want to know as soon as possible. For this, you would want a system like Storm where each event within your system is shipped as a stream to Storm as soon as it happens so you can analyze it immediately.
Do the next thing next. That’s a pretty good rule. Read the tiny ad, that’s a pretty good rule, too.
Devious Experiments for a Truly Passive Greenhouse!