What is the difference between homogeneous and heterogeneous data sets, and why is simple MapReduce not suitable for relational algebra? Also, why has Map-Reduce-Merge evolved? A proper explanation would be highly appreciated. Thanks in advance, as no one was able to answer it on Quora.
Not sure I understand what you're asking, but here are a few thoughts.
Homogeneous data-sets would probably be the kind of thing that fits nicely into a relational schema i.e. "rectangular" data with a common format, predictable columns, where the data in each column is all the same data-type, and so on.
Heterogeneous data is likely to be data where the structure is unpredictable so you can't (easily) enforce a relational schema, and/or where the data itself might be of different types e.g. text, images, etc. Various NoSQL databases offer alternatives here e.g. MongoDB stores JSON documents with no fixed schema.
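To make the distinction concrete, here is a minimal sketch in plain Python. The sample records and the `is_rectangular` helper are made up for illustration; they are not part of any Hadoop or MongoDB API.

```python
# Homogeneous: every record has the same fields with the same types,
# so the data fits a fixed "rectangular" schema.
homogeneous = [
    {"id": 1, "name": "Ann", "age": 34},
    {"id": 2, "name": "Bob", "age": 41},
]

# Heterogeneous: records vary in shape and in the type of each value
# (hypothetical document-style records, like JSON with no fixed schema).
heterogeneous = [
    {"id": 1, "name": "Ann", "tags": ["admin", "ops"]},
    {"id": "b-2", "photo": b"\x89PNG", "caption": "holiday"},
]

def is_rectangular(records):
    """Return True if all records share the same field names and value types."""
    shapes = {
        tuple(sorted((key, type(value).__name__) for key, value in record.items()))
        for record in records
    }
    return len(shapes) <= 1

print(is_rectangular(homogeneous))    # True
print(is_rectangular(heterogeneous))  # False
```

The first list could be loaded straight into a relational table; the second could not without deciding what to do about the missing and mismatched fields.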
Hadoop's Distributed File System (HDFS) can handle all of this data, because it's just a file system and doesn't care what's in each file. However, most real applications need to work with structured data of some kind, and you need at least a key and a value in order to run MapReduce after all. Hadoop's Hive database allows you to define a rectangular table-like structure for files (e.g. CSV) that you have loaded into HDFS, and you can then run SQL queries (no updates) against these tables. The SQL commands are translated into MapReduce steps by the Hive query engine. Alternatively, HBase is a column-family database that sits on top of HDFS, so you have other ways to organise your data, depending on your requirements.
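The idea of laying a table structure over a plain file can be sketched roughly like this. This is plain Python rather than Hive itself; the file contents, the `schema` list and the `read_table` helper are all assumptions for illustration.

```python
import csv
import io

# The "file" is just text, as it would be in HDFS; it carries no schema itself.
raw_file = "1,Ann,34\n2,Bob,41\n3,Cat,29\n"

# Hypothetical table definition: column names and types applied at read time.
schema = [("id", int), ("name", str), ("age", int)]

def read_table(text, schema):
    """Parse each CSV line and cast its fields according to the schema."""
    for row in csv.reader(io.StringIO(text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}

# Roughly equivalent to: SELECT name FROM people WHERE age > 30
result = [r["name"] for r in read_table(raw_file, schema) if r["age"] > 30]
print(result)  # ['Ann', 'Bob']
```

Note that the query still reads every line of the file; the schema only tells the reader how to interpret each one.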
Most relational operations are based on some kind of key, e.g. PK look-ups, joins on foreign keys, etc. Unlike an RDBMS, though, Hadoop isn't optimised for random-access reads, i.e. it typically has to do a scan (map/filter) of all your data in a given file or Hive table to find particular records. Also, a general relational join in MapReduce requires the records from both inputs to be shuffled and sorted by the join key before the matching records can be merged, and this is another expensive operation if you are dealing with large volumes of data.
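Here is a rough single-machine sketch of how such a join is typically expressed in MapReduce terms (often called a reduce-side join). It is plain Python, not Hadoop code; the two small tables and the function names are invented for the example.

```python
from collections import defaultdict

customers = [(1, "Ann"), (2, "Bob")]                       # (customer_id, name)
orders = [(101, 1, 25.0), (102, 2, 10.0), (103, 1, 7.5)]   # (order_id, customer_id, amount)

def map_phase():
    """Scan BOTH inputs in full, emitting (join_key, tagged_record) pairs."""
    for cid, name in customers:
        yield cid, ("C", name)
    for oid, cid, amount in orders:
        yield cid, ("O", (oid, amount))

def shuffle(pairs):
    """Group values by key and sort by key; on a cluster this moves data over the network."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Merge the customer and order records that share the same join key."""
    names = [v for tag, v in values if tag == "C"]
    order_rows = [v for tag, v in values if tag == "O"]
    for name in names:
        for oid, amount in order_rows:
            yield key, name, oid, amount

joined = [row for key, values in shuffle(map_phase())
          for row in reduce_phase(key, values)]
print(joined)
# [(1, 'Ann', 101, 25.0), (1, 'Ann', 103, 7.5), (2, 'Bob', 102, 10.0)]
```

Every record from both tables passes through the map and shuffle steps, even if the query only wanted one customer's orders.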
This means a simple MapReduce approach is inefficient for most relational operations, unless you are simply reading all the data from a file in no particular order. Of course, you can still implement these SQL-style operations in MapReduce (as in Hive, for example), but it tends to be quite slow. Most serious tasks will require more than a single MapReduce phase, which is also slow because Hadoop's default MapReduce engine writes the intermediate data out to files between MapReduce phases. Various options are available to speed this up e.g. the alternative Tez processing engine, or the Impala SQL engine, which do a lot more in-memory processing to speed up your task execution. Hive SQL is slow, but Impala is pretty fast and may be a reasonable option for interactive SQL queries.
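The cost of chaining phases can be illustrated with a small sketch, again in plain Python rather than Hadoop. The `run_phase` and `via_disk` helpers and the sample data are hypothetical; `via_disk` just mimics writing the intermediate result to a file and reading it back before the next phase starts.

```python
import json
import os
import tempfile
from collections import defaultdict

def run_phase(records, mapper, reducer):
    """One map/shuffle/reduce pass over the records."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

def via_disk(records):
    """Write records to a temporary file and read them back, as between phases."""
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    try:
        with os.fdopen(fd, "w") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
        with open(path) as f:
            return [json.loads(line) for line in f]
    finally:
        os.remove(path)

sales = [("north", 5), ("south", 3), ("north", 2), ("east", 3)]

# Phase 1: total sales per region.
totals = run_phase(sales, lambda r: [(r[0], r[1])], lambda k, vs: [k, sum(vs)])
totals = via_disk(totals)  # intermediate data goes out to a file and back

# Phase 2: group regions by their total.
by_total = run_phase(totals, lambda r: [(r[1], r[0])], lambda k, vs: [k, sorted(vs)])
print(by_total)  # [[3, ['east', 'south']], [7, ['north']]]
```

With large data sets, that write-then-read step between phases is a big part of the slowdown; engines that keep more of the intermediate data in memory avoid much of it.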
Another option is Apache Spark, which is a general purpose processing engine for distributed systems, and offers a rapidly growing set of tools for reading/writing/transforming data from a variety of data sources using SQL and data-frames (Spark SQL) and/or functional programming (map/reduce etc). Spark turns your process into a DAG of operations which it optimises before execution, and it runs the task in memory as far as possible. The Spark API (Scala, Python, Java) encapsulates all this behind a nice coherent abstraction layer that means you can achieve your goals with far less code in Spark than with traditional MapReduce in Java.
No more Blub for me, thank you, Vicar.
posted 5 years ago