Win a copy of TDD for a Shopping Website LiveProject this week in the Testing forum!
  • Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
  • Campbell Ritchie
  • Paul Clapham
  • Ron McLeod
  • Jeanne Boyarsky
  • Tim Cooke
  • Liutauras Vilda
  • paul wheaton
  • Henry Wong
Saloon Keepers:
  • Tim Moores
  • Tim Holloway
  • Stephan van Hulst
  • Carey Brown
  • Frits Walraven
  • Piet Souris
  • Himai Minh

6+ Hadoop interview Q&As for Java developers who want a career change into BigData

Posts: 3461
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Q1. What is Hadoop?
A1. Hadoop is an open-source software framework for storing large amounts of data and processing/querying those data on a cluster with multiple nodes of commodity hardware (i.e. low cost hardware). In short, Hadoop consists of

1. HDFS (Hadoop Distributed File System): HDFS allows you to store huge amounts of data in a distributed and a redundant manner. For example, a 1 GB (i.e 1024 MB) text file can be split into 16 * 128MB files and stored on 8 different nodes in a Hadoop cluster. Each split can be replicated 3 times for fault tolerance so that if 1 node goes down, you have backups. HDFS is good for sequential write-once-and-read-many times type access.

2. MapReduce: A computational framework. This processes large amounts of data in a distributed and parallel manner. When you do a query on the above 1 GB file for all users with age > 18, there will be say “8 map” functions running in parallel to extract users with age > 18 within its 128MB split file, and then the “reduce” function will run to combine all the individual outputs into a single final result.

3. YARN (Yet Another Resource Nagotiator): A framework for job scheduling and cluster resource management.

4. Hadoop eco system, with 15+ frameworks & tools like Sqoop, Flume, Kafka, Pig, Hive, Spark, Impala, etc to ingest data into HDFS, to wrangle data (i.e. transform, enrich, aggregate, etc) within HDFS, and to query data from HDFS for business intelligence & analytics. Some tools like Pig & Hive are abstraction layers on top of MapReduce, whilst the other tools like Spark & Impala are improved architecture/design from MapReduce for much improved latencies to support near real-time (i.e. NRT) & real-time processing.

Q2. Why are organizations moving from traditional data warehouse tools to smarter data hubs based on Hadoop eco systems?
A2. Organizations are investing on enhancing their

Existing data infrastructure:

   predominantly using “structured data” stored in high-end & expensive hardwares
   predominantly processed as ETL batch jobs for ingesting data into RDBMS and data warehouse systems for data mining, analysis & reporting to make key business decisions.
   predominantly handle data volumes in gigabytes to terabytes


Smarter data infrastructure based on Hadoop where

   structured (e.g. RDBMS), unstructured (e.g, images, PDFs, docs ), & semi-structured (e.g. logs, XMLs) data can be stored in cheaper commodity machines in a scalable and fault tolerant manner.
   data can be ingested via batch jobs and near real time (i.e. NRT, 200ms to 2 seconds) streaming (e.g. Flume & Kafka).
   data can be queried with low latency (i.e under 100ms) capabilities with tools like Spark & Impala.
   larger data volumes in terabytes to petabytes can be stored.

which empowers organizations to make better business decisions with smarter & bigger data with more powerful tools to ingest data, to wrangle stored data (e.g. aggregate, enrich, transform, etc), and to query the wrangled data with low-latency capabilities for reporting & business intelligence.

Q3. How does a smarter & bigger data hub architectures differ from a traditional data warehouse architectures?

Traditional Enterprise Data Warehouse Architecture:

Hadoop based Data Hub Architecture

More Hadoop interview Questions & Answers with diagrams

Police line, do not cross. Well, this tiny ad can go through:
Free, earth friendly heat - from the CodeRanch trailboss
    Bookmark Topic Watch Topic
  • New Topic