
Do I need to learn Hadoop first to learn Apache Spark?

Which one is better to learn: Hadoop alone, or Hadoop together with Spark, Scala, and Storm? I'm confused; please advise.
Let's get Scala out of the way first. Unlike the other three, which are data processing technologies, Scala is a general-purpose programming language.
You don't need to learn Scala to work with any of those technologies.
That said, Scala is still an interesting language with unique concepts and approaches, and I think you should learn it just to expand your mind to different possibilities.

Storm is also a bit of a different beast. Its focus is on near-real-time, low-latency processing of streaming data as it arrives, because certain kinds of data must be processed immediately.
This is in contrast to batch processing where data is first stored somewhere and then processed later in bulk.

For example, a meteorologist wants to know right now the probability of a storm (no pun intended!) coming in based on current weather sensor readings.
She can't wait 24 hours to collect data and then bulk process it, because then it'll be too late.
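The contrast between the two styles can be sketched in plain Java. This is illustrative only, with no Storm or Hadoop API involved; the sensor feed, the pressure threshold, and the method names are all made up for the example:

```java
import java.util.ArrayList;
import java.util.List;

public class StreamVsBatch {
    // Streaming style: act on each reading the moment it arrives.
    static String onReading(double pressureHpa) {
        return pressureHpa < 980.0 ? "ALERT: storm likely" : "ok";
    }

    // Batch style: store everything first, then analyze in bulk later.
    static double batchAverage(List<Double> stored) {
        double sum = 0;
        for (double p : stored) sum += p;
        return sum / stored.size();
    }

    public static void main(String[] args) {
        List<Double> store = new ArrayList<>();
        double[] feed = {1012.0, 1005.5, 978.2}; // simulated sensor readings

        for (double reading : feed) {
            // Streaming: an immediate decision per reading.
            System.out.println(onReading(reading));
            // Batch: just accumulate the reading for later.
            store.add(reading);
        }
        // Much later: bulk processing over everything stored.
        System.out.printf("average pressure: %.1f%n", batchAverage(store));
    }
}
```

The meteorologist needs the per-reading alert path; a monthly climate report is fine with the stored-then-averaged path.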

That leaves Hadoop vs. Spark as batch processing solutions. Both also have near-real-time streaming solutions of their own (Spark's being the stronger), but that's not their focus.
The thing is, there is no "vs." here at all: Hadoop is a big ecosystem whose components, like HDFS for storage and YARN for cluster resource allocation, are used by Spark deployments.
So in practice you can't use Spark in the enterprise without also using some components of the Hadoop ecosystem.
Where Spark excels is performance, largely because it keeps intermediate results in memory instead of writing them to disk as Hadoop MapReduce does.
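That difference can be caricatured with a toy two-stage word count. This is neither framework's real API, just the shape of the data flow; the temp-file round trip stands in for MapReduce's on-disk intermediate output:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class IntermediateHop {
    // Stage 1: split lines into words.
    static List<String> tokenize(List<String> lines) {
        return lines.stream()
                .flatMap(l -> Arrays.stream(l.split("\\s+")))
                .collect(Collectors.toList());
    }

    // Stage 2: count each word.
    static Map<String, Long> count(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) throws Exception {
        List<String> input = List.of("spark hadoop spark");

        // MapReduce-like: the intermediate result takes a round trip
        // through disk between the two stages.
        Path tmp = Files.createTempFile("intermediate", ".txt");
        Files.write(tmp, tokenize(input));
        Map<String, Long> viaDisk = count(Files.readAllLines(tmp));
        Files.delete(tmp);

        // Spark-like: the intermediate stays in memory and the stages
        // chain directly, skipping the write/read hop entirely.
        Map<String, Long> inMemory = count(tokenize(input));

        System.out.println(viaDisk.equals(inMemory)); // same answer, different cost
    }
}
```

Both paths produce identical counts; multiply the disk hop by dozens of stages over terabytes of data and the performance gap becomes obvious.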

I've explained what their individual focus is. Now it's up to you to decide what your data processing goals are and then choose the tools accordingly.
Much more important than learning the tools is learning data processing algorithms like clustering and prediction.
If you have no specific goal and you have enough time, learn all of them.