I have an existing Hadoop cluster and would like to install Accumulo, Mahout, and some other tools from a separate machine and integrate them into this environment. I can probably stand up some ZooKeeper VMs if necessary. Also, when I try to install ZooKeeper by itself (RHEL 6.4, `yum install zookeeper`), it pulls in a copy of Hadoop and seems to want that running on the same box (even though I already have namenodes/datanodes on another set of boxen). Installing everything on a single machine is cake; integrating the pieces/parts across machines, however, seems to be quite an undertaking.
Here is what I have gleaned thus far:
1. Accumulo NEEDS ZooKeeper, correct?
2. ZooKeeper seems to keep its data (znodes) in memory (does it EVER write them to HDFS?).
3. Using MapReduce/Hadoop works great in batch mode.
4. I have considered/tried installing Cloudera or Hortonworks in this environment ... Cloudera only supports RHEL 6.2; Hortonworks seems to work OK so far.
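For context on point 1/2, here is roughly what I believe a standalone three-node ensemble's `zoo.cfg` would look like (the `zk1`–`zk3` hostnames are placeholders for whatever VMs I'd stand up, so this is a sketch, not my actual config):

```
# zoo.cfg — sketch of a 3-node ZooKeeper ensemble (hostnames hypothetical)
tickTime=2000
initLimit=10
syncLimit=5
# ZooKeeper persists snapshots/logs to local disk here, not to HDFS
dataDir=/var/lib/zookeeper
clientPort=2181
# one server.N line per ensemble member (peer port : leader-election port)
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
```

Each node would also need a `myid` file in `dataDir` containing its own `N` (1, 2, or 3), if I understand the docs correctly.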
I am thinking of installing three ZooKeeper VMs that point to the Hadoop cluster, and then having my Accumulo/Mahout VM point to the ZooKeeper ensemble. Is this the best way to do it? Will this setup ultimately store its data on the Hadoop cluster? Do I need a base Hadoop service running on all of these boxes for everything to communicate?
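My assumption is that the Accumulo VM would be wired to both the ensemble and the existing cluster through something like the following `accumulo-site.xml` fragment (a sketch only: property names vary by Accumulo version, and the `zk*`/`namenode-host` names are placeholders):

```xml
<!-- accumulo-site.xml fragment (sketch; hostnames are hypothetical) -->
<property>
  <name>instance.zookeeper.host</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
<property>
  <!-- points Accumulo at the existing HDFS namenode -->
  <name>instance.dfs.uri</name>
  <value>hdfs://namenode-host:8020</value>
</property>
```

Is that the right general shape, or does Accumulo need more than client-side Hadoop configs/jars on its own box?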
Any/all help in this matter is greatly appreciated.
Environment: High Performance Computing infrastructure, VMs/Boxes running RHEL 6.4, all using a private network
-Bob-