I need some guidelines on how to get started with Hadoop. I need two things: a book and an installation, and their versions should match. Can you give me some advice on this?
I have 32-bit Windows 7 with Debian 7 inside VirtualBox.
Is there a big difference between Hadoop releases? I mean 1.x and 2.x.
There is a Hadoop download on the Apache site, but I have seen many opinions saying that direct installation is a big pain. Is this correct? I just want to set up a single-node cluster to play with.
There are Cloudera packages, but unfortunately they are for 64-bit machines, as far as I understand.
There is a Hortonworks sandbox I have been downloading for the last 30 minutes.
What can you recommend?
And I also need a book that describes, more or less precisely, the version I am going to install. Hadoop: The Definitive Guide is from 2012 - is it still up to date? I cannot figure out which version of Hadoop it describes.
The Hortonworks Sandbox is the best place to start, as it gives you a pre-packaged installation with all the core Hadoop tools running in a CentOS VM, as well as a set of useful tutorials to get you started. But you'll have problems if you're on a 32-bit machine, as I'm not sure there are any 32-bit Hadoop platforms these days. Hortonworks Sandbox requires 64-bit, and you'll need plenty of RAM if you're running it inside a VM.
Manual installation is a nightmare, and you need several related tools to get a useful installation working - and there are lots of different (and mutually incompatible) versions of all these tools. If your main goal is to find out what you can do with Hadoop, you'll waste a lot of time on installation and configuration if you try to do this manually.
Hadoop 2.x is different from Hadoop 1.x as it uses a different model for managing and distributing processes. The newer Hadoop v.2 (YARN) allows you to use other processing engines instead of MapReduce, but this also means you need to make sure you use compatible versions of all your Hadoop libraries and clients etc.
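To make the MapReduce model mentioned above concrete, here is a minimal sketch of the classic word-count job in the style of Hadoop Streaming, where the mapper and reducer are ordinary scripts reading stdin and writing tab-separated key/value pairs. The demo at the bottom simulates the shuffle phase with a plain sort, so it runs without any cluster; on a real Hadoop installation you would submit the mapper and reducer via the streaming jar instead. The sample text and names here are illustrative, not from any particular tutorial.

```python
# Word-count sketch in the Hadoop Streaming style: a mapper emits
# (word, 1) pairs, the framework sorts them by key (the "shuffle"),
# and a reducer sums the counts for each word.
from itertools import groupby

def mapper(lines):
    """Emit (word, 1) for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    """Sum the counts per word. Input must be sorted by key,
    which is what Hadoop's shuffle/sort phase guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog", "the fox"]
    shuffled = sorted(mapper(text))   # simulate shuffle + sort locally
    counts = dict(reducer(shuffled))
    print(counts["the"])  # 3
    print(counts["fox"])  # 2
```

The same mapper/reducer split is what every MapReduce engine runs, whether on Hadoop 1.x or under YARN in 2.x - the 2.x change is in how those processes are scheduled, not in the programming model.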
If I were you I would just work through some of the Sandbox tutorials on HDFS, Pig and Hive first, so you can get a feel for what Hadoop does, before you start worrying about configuration, manual installation etc. Also, things change fast so any books from 2+ years ago were probably written 3+ years ago and may be out of date by now.
Is the sandbox installation like an OS installation, or does it sit inside an already installed OS?
I understand it all like this: I install Windows on my machine, on top of that I install VirtualBox, and when I boot the VM I get into VirtualBox. Do I need to install another OS in VirtualBox before installing the sandbox? I see that there are instructions for the Sandbox for Mac and Windows.
The sandbox includes the OS (CentOS, I think), so you just need to download the appropriate VM file, i.e. for VirtualBox or VMware Player. Then follow the instructions to load the VM, e.g. in VirtualBox. You need a 64-bit host OS and plenty of RAM, e.g. 8GB or more.
It may take several minutes to start up all the services, but eventually you should be able to connect to the Hortonworks server via your browser - the VM window should display the IP address to do this. The browser Hue interface allows you to work with HDFS, Pig, Hive, HBase etc, but you can also connect to the VM Linux shell via SSH from your host operating system.
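For reference, a typical first session inside the Sandbox might look something like this. The SSH port and login are assumptions - check the Sandbox startup screen and docs for the actual values on your version - so treat it as a sketch, not a recipe:

```shell
# Assumed defaults -- verify against the Sandbox startup screen.
# With NAT networking, SSH is typically forwarded to a local port:
ssh root@127.0.0.1 -p 2222         # log in to the VM's Linux shell

# Once inside, a few basic HDFS commands to try:
hadoop fs -mkdir -p input           # create a directory in your HDFS home
hadoop fs -put /etc/hosts input/    # copy a local file into HDFS
hadoop fs -ls input                 # list it
hadoop fs -cat input/hosts          # print its contents back
```

The same `hadoop fs` commands are what the Hue file browser is doing for you behind the scenes, so it's worth trying both.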
chris webster wrote:The Hortonworks Sandbox is the best place to start, as it gives you a pre-packaged installation with all the core Hadoop tools running in a CentOS VM, as well as a set of useful tutorials to get you started. But you'll have problems if you're on a 32-bit machine, as I'm not sure there are any 32-bit Hadoop platforms these days. Hortonworks Sandbox requires 64-bit, and you'll need plenty of RAM if you're running it inside a VM.
Hello Chris, I did some of the sandbox tutorials, but I didn't find any tutorial for MapReduce. Are there any for MapR?
Also, the sandbox tutorials are too basic and I want to go a level up - can you suggest a path to get more knowledge of Hadoop?