I have an existing Hadoop cluster and would like to install Accumulo, Mahout, and some other tools on a separate machine and integrate them into this environment. I can probably stand up some ZooKeeper VMs if necessary. Also, when I install ZooKeeper by itself (RHEL 6.4, yum install zookeeper), it pulls in a copy of Hadoop and seems to want it running on the box, even though I already have namenodes/datanodes on another set of boxen. Installing on a single machine is cake; integrating the pieces/parts, however, seems to be quite an undertaking.
Here is what I have gleaned thus far:
1. Accumulo NEEDS ZooKeeper?
2. ZooKeeper seems to want to keep data in memory on a znode (does it EVER write it to HDFS?).
3. Using MapReduce/Hadoop works great in batch mode.
4. I have thought about/tried installing Cloudera/Hortonworks in this environment ... Cloudera only supports RHEL 6.2; Hortonworks seems to work OK so far.
I am thinking of installing three ZooKeeper VMs and having them point to the Hadoop cluster, and then having my Accumulo/Mahout VM point to the ZooKeeper ensemble. Is this the best way? Will this ultimately use the Hadoop cluster? Do I need to run a base Hadoop service on all of these boxes to make it all communicate?
Any/all help in this matter is greatly appreciated.
Environment: High Performance Computing infrastructure, VMs/Boxes running RHEL 6.4, all using a private network
1. Yup, ZooKeeper is essential: it keeps bootstrapping state for Accumulo, and Accumulo relies heavily on its locking functionality to coordinate distributed events.
2. As you noted, ZooKeeper runs in memory; for persistence it uses the local filesystem, not HDFS. ZooKeeper does not depend on Apache Hadoop's DFS. When more than one ZooKeeper server runs together as an ensemble, the servers replicate state among themselves, so they are redundant without an external distributed filesystem.
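To put a number on that redundancy: a ZooKeeper ensemble stays available as long as a majority of its servers is up. The sketch below is only that arithmetic; the function names are made up for illustration and are not part of any ZooKeeper API.

```python
def quorum_size(servers: int) -> int:
    """Smallest majority of an ensemble with `servers` members."""
    if servers < 1:
        raise ValueError("an ensemble needs at least one server")
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """How many servers can be down while a majority is still up."""
    return servers - quorum_size(servers)


# Print ensemble size, quorum size, and tolerated failures for a few sizes.
for n in (1, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Note that four servers tolerate no more failures than three, which is why ensembles are usually sized with an odd number of servers.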
You can certainly use a single ZooKeeper server, but the level of redundancy and availability your application requires is up to you. ZooKeeper isn't a very heavy service, so if you have separate nodes, it would be good to run 3 servers. You can easily run it alongside nodes that are also tasktrackers and/or datanodes. As for where each service lives: as long as Accumulo can reach the ZooKeepers, namenode, and datanodes over the same network, you should be fine.
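A rough sketch of what a three-server setup might look like follows. The hostnames, data directory, and helper names are assumptions for illustration only; the zoo.cfg keys and the 2181/2888/3888 ports are the usual ZooKeeper defaults, and instance.zookeeper.host is the property in accumulo-site.xml that takes the comma-separated ZooKeeper list. Adjust everything to your own environment.

```python
# Hypothetical hostnames for the three ZooKeeper VMs (assumption).
ZK_HOSTS = ["zk1.example.internal", "zk2.example.internal", "zk3.example.internal"]


def render_zoo_cfg(hosts, data_dir="/var/lib/zookeeper", client_port=2181):
    """Build the text of a zoo.cfg shared by every server in the ensemble."""
    lines = [
        "tickTime=2000",
        "initLimit=10",
        "syncLimit=5",
        f"dataDir={data_dir}",        # local filesystem path, not HDFS
        f"clientPort={client_port}",
    ]
    # One server.N line per ensemble member: peer port 2888, election port 3888.
    for server_id, host in enumerate(hosts, start=1):
        lines.append(f"server.{server_id}={host}:2888:3888")
    return "\n".join(lines) + "\n"


def accumulo_zookeeper_hosts(hosts, client_port=2181):
    """Value for instance.zookeeper.host in accumulo-site.xml."""
    return ",".join(f"{host}:{client_port}" for host in hosts)


print(render_zoo_cfg(ZK_HOSTS))
print(accumulo_zookeeper_hosts(ZK_HOSTS))
```

Each ZooKeeper server also needs a myid file in its dataDir containing its own number (1, 2, or 3 here) so it knows which server.N line refers to itself. None of the ZooKeeper boxes needs a Hadoop daemon for this to work.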
Also, you don't need to run a datanode and tasktracker process on every node; however, you'll most often see this, sans a node or two to run the jobtracker and namenode. It heavily depends on the kind of workload you intend to process.
A word to the wise if you do run Accumulo in VMs: Accumulo is very sensitive to time. Virtualization can skew clocks, so be cognizant of the actual system resources underneath your VM.
subject: Accumulo, Zookeeper and Hadoop Integration