I am an academic studying Hadoop for my class presentations. As I am new to Hadoop, I would appreciate your expert opinions on the following two points:
1. Which languages are most commonly used in industry to implement Hadoop jobs: Python or Java?
2. Are there any sources where I can find real business scenarios/examples of Hadoop being used in industry? Also, where are the corresponding data sets available?
I need this information for my course presentation to motivate my students. Kindly help me.
Hadoop itself is implemented mainly in Java, as far as I know, and there is a fairly low-level Java API which a lot of people have used for Hadoop programming. However, it is often easier to use higher-level APIs such as Cascading (for Java) or alternative languages such as Pig and Hive's SQL dialect. Pig is a Hadoop-based scripting language, and its scripts are converted internally into a series of MapReduce tasks. Hive is a way to manage your data in HDFS as if it were held in relational database tables, so you can use SQL to manipulate your data, which is much easier than doing the same thing in Java/MapReduce. As with Pig, the SQL is converted into MapReduce tasks underneath. Hadoop is also the foundation for other tools such as the NoSQL database HBase.
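To make the MapReduce model concrete, here is the classic word-count job sketched in plain Python (no Hadoop involved, just an illustration): the map, shuffle and reduce phases below correspond to the stages that a Pig script or Hive query is ultimately compiled into.

```python
from collections import defaultdict

# Map phase: turn each input line into (word, 1) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle phase: group values by key (Hadoop does this for you
# between the map and reduce stages).
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine each key's values into a final count.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hello hadoop", "hello world"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'hello': 2, 'hadoop': 1, 'world': 1}
```

A real Hadoop job would package the map and reduce functions as Mapper/Reducer classes (in Java) and let the framework distribute the phases across the cluster; the logic, however, is exactly this shape.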
However, Hadoop v2+ now provides the YARN resource manager, which allows you to plug in alternative processing engines, e.g. Tez or Spark, instead of the older MapReduce engine. Using these engines can speed up your Hive SQL or Pig jobs significantly. Apache Spark is a distributed processing engine that can run independently or on top of Hadoop's YARN. Spark has APIs for Scala, Python and Java, and provides a powerful high-level coding paradigm that many people now see as an alternative to traditional Java/MapReduce, with or without Hadoop. One of the nice things about Spark is that you can code your whole data-processing pipeline using the same language/API and a consistent programming model, instead of having to switch between e.g. Java, Pig and Hive SQL to complete different stages of the processing.
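To give a flavour of that single-language pipeline style, here is a dependency-free Python sketch that mimics the chained transformations of Spark's RDD API. The `ToyRDD` class is purely illustrative (an assumption for this example); a real PySpark job would obtain a genuine RDD via `SparkContext.parallelize` and chain the same kinds of calls.

```python
from functools import reduce as _reduce

# Toy stand-in for Spark's RDD, just to show the chained,
# single-language style -- not a real distributed collection.
class ToyRDD:
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        return ToyRDD(fn(x) for x in self.data)

    def filter(self, pred):
        return ToyRDD(x for x in self.data if pred(x))

    def reduce(self, fn):
        return _reduce(fn, self.data)

# The whole pipeline in one language and one programming model:
# square the numbers, keep the even squares, and sum them.
result = (ToyRDD(range(1, 6))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .reduce(lambda a, b: a + b))
print(result)  # 4 + 16 = 20
```

With Spark, each stage that might otherwise be a separate Pig script or Hive query becomes one more method call in the same chain, which is exactly the convenience described above.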
Many other languages and tools (e.g. ETL and BI tools) provide interfaces of various kinds to Hadoop, and it seems to be getting easier to use Hadoop as a distributed data store while using many other tools to access and manipulate the data held there, even if your code does not execute directly on Hadoop's processing engines.