Apache Accumulo is a highly scalable NoSQL datastore originally developed by the U.S. National Security Agency. The project was released into the Apache open source community in 2011 as an incubator project and graduated to a top-level project in 2012. The Hortonworks Data Platform (HDP) 1.3 is a Hadoop distribution containing Hadoop 1.2 and other tools in the Hadoop ecosystem. While HDP 1.3 does include the NoSQL datastore HBase, you may also install and configure Accumulo to operate with the platform. This post will demonstrate how to install and configure Accumulo 1.5 to work with an existing HDP 1.3 cluster.
There are a couple of other prerequisites as well. A functional ZooKeeper quorum must be installed and available for Accumulo to operate. ZooKeeper is included with HDP 1.3 and may be installed and configured via Ambari. You must also have sufficient system privileges (e.g. root) for the installation. The discussion below assumes this is not an issue.
Configuration and Installation
Download the generic binary tarball of Apache Accumulo 1.5 at http://accumulo.apache.org/downloads/ and get ready for the fun.
Extract the downloaded tarball to /usr/lib on the servers hosting the Accumulo master and tablet servers. The README file in the root of the tarball contains instructions on configuration and installation. The steps in this post fill in the gaps for HDP 1.3. When you see $ACCUMULO_HOME below, expand that to the installation directory you chose.
Create an account with an ID of ‘accumulo’ and a group of ‘hadoop’ on each Accumulo server. HDP creates the ‘hadoop’ group and places the ‘hdfs’ user in this group during installation. Creating a separate account for Accumulo clearly separates the roles, while sharing a common group makes things a bit less restrictive.
useradd -d $ACCUMULO_HOME -g hadoop accumulo
Update the ownership of the Accumulo home directory
chown -R accumulo:hadoop $ACCUMULO_HOME
On each accumulo server, cd to $ACCUMULO_HOME/server/src/main/c++ and type ‘make’. This will build the native libraries for the proper architecture.
Set up passwordless SSH for the accumulo user from the Accumulo Master to each of the tablet servers. You can do this by executing ssh-keygen as the accumulo user and then appending the public key you generate to the authorized_keys file on each server.
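The key setup described above can be sketched as follows. This is only a minimal sketch, assuming it is run as the accumulo user on the Accumulo master. The host names are placeholders, and the DRY_RUN variable and the run helper are my own additions: by default the script only prints the commands it would execute, so set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Placeholder tablet server hosts; replace with your own FQDNs.
TABLET_SERVERS="tserver1.example.com tserver2.example.com"
# Default to a dry run that only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_passwordless_ssh() {
    # Generate an RSA key pair with an empty passphrase unless one already exists.
    if [ ! -f "$HOME/.ssh/id_rsa" ]; then
        run ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa"
    fi
    # Append the public key to authorized_keys on each tablet server.
    for host in $TABLET_SERVERS; do
        run ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "accumulo@$host"
    done
}

setup_passwordless_ssh
```

ssh-copy-id will ask for the accumulo password on each host once; after that, ssh from the master should no longer prompt.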
Now create a couple of HDFS directories. First, create the HDFS /accumulo directory and change the owner to accumulo with a group of hdfs. Then create the /user/accumulo directory in HDFS and set its ownership to accumulo:hdfs as well.
hadoop dfs -mkdir /accumulo
hadoop dfs -chown -R accumulo:hdfs /accumulo
hadoop dfs -mkdir /user/accumulo
hadoop dfs -chown -R accumulo:hdfs /user/accumulo
Update the hdfs-site.xml settings using Ambari. Add the property ‘dfs.durable.sync’ with a value of ‘true’. You will need to stop the MapReduce and HDFS services before making this change.
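Once Ambari has pushed the change, you may want to confirm that the property really landed in the file on each node. The snippet below is a rough sketch of such a check. It assumes the configuration lives in /etc/hadoop/conf/hdfs-site.xml (override with the HDFS_SITE variable, which is my own invention) and that the value element directly follows the name element. It is a plain text match, not a real XML parser.

```shell
#!/bin/sh
# Assumed location of the HDFS configuration; adjust if yours differs.
HDFS_SITE="${HDFS_SITE:-/etc/hadoop/conf/hdfs-site.xml}"

# Succeed if the file sets dfs.durable.sync to true.
# Newlines are stripped first so the name and value may sit on separate lines.
check_durable_sync() {
    conf_file="$1"
    tr -d '\n' < "$conf_file" |
        grep -Eq '<name>dfs\.durable\.sync</name>[[:space:]]*<value>true</value>'
}

if [ -f "$HDFS_SITE" ]; then
    if check_durable_sync "$HDFS_SITE"; then
        echo "dfs.durable.sync is enabled in $HDFS_SITE"
    else
        echo "dfs.durable.sync is NOT set to true in $HDFS_SITE"
    fi
fi
```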
Depending on the amount of memory you can allocate to Accumulo, select a preconfigured set of example configuration files from the conf/examples directory. The example below will use 3 GB of RAM for the Accumulo processes.
cp $ACCUMULO_HOME/conf/examples/3GB/native-standalone/* $ACCUMULO_HOME/conf
Edit the slaves file in $ACCUMULO_HOME/conf/ and enter in the fully qualified domain name of each tablet server host.
Edit the masters file in $ACCUMULO_HOME/conf/ and enter in the fully qualified domain name of the accumulo master host.
Modify $ACCUMULO_HOME/conf/accumulo-env.sh to these values.
Modify the gc, monitor and tracers files to use the FQDN of the Accumulo master. The default setting is localhost.
Create the $ACCUMULO_LOG_DIR on every machine you entered into the slaves file.
mkdir -p /var/log/accumulo
Set up the property for the ZooKeeper quorum now. Edit $ACCUMULO_HOME/conf/accumulo-site.xml and change the value of instance.zookeeper.host to a comma-separated list of the quorum hosts. Each entry has the form host:port, with no spaces after the commas, e.g. host1:2181,host2:2181,host3:2181. The default port for ZooKeeper is 2181; if you are using that port, you can get by with just the host names.
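If you have more than a handful of ZooKeeper hosts, it is easy to slip a space or a wrong port into that list. Here is a small helper that builds the string for you. The function name zk_quorum is just something I made up for illustration; the output is what you would paste into instance.zookeeper.host.

```shell
#!/bin/sh
# Join host names into a ZooKeeper connection string of the form
# host:port,host:port with no spaces. The first argument is the port.
zk_quorum() {
    port="$1"
    shift
    quorum=""
    for host in "$@"; do
        # Add a comma only when the string already has an entry.
        quorum="${quorum:+$quorum,}$host:$port"
    done
    echo "$quorum"
}

zk_quorum 2181 host1 host2 host3
# prints host1:2181,host2:2181,host3:2181
```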
Make accumulo-site.xml readable and writable only by its owner (the accumulo user) by running
chmod 600 accumulo-site.xml
The Accumulo configuration files should be identical across all the Accumulo servers, so you will need to push the updates out to each Accumulo master and tablet server.
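One way to push the files out is a simple loop over the masters and slaves files, relying on the passwordless SSH set up earlier. Treat this as a sketch under a few assumptions: Accumulo is installed at the same path on every host, /usr/lib/accumulo is only a stand-in default for $ACCUMULO_HOME, and the push_conf function and DRY_RUN switch are my own additions. With the default DRY_RUN=1 it only prints the scp commands.

```shell
#!/bin/sh
# Assumed install location; adjust to wherever you extracted the tarball.
ACCUMULO_HOME="${ACCUMULO_HOME:-/usr/lib/accumulo}"
# Default to a dry run that only prints the copy commands.
DRY_RUN="${DRY_RUN:-1}"

# Copy the local conf directory to every host listed in a hosts file
# (one host per line; blank lines and # comments are skipped).
push_conf() {
    hosts_file="$1"
    for host in $(grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$hosts_file"); do
        if [ "$DRY_RUN" = "1" ]; then
            echo "scp -rp $ACCUMULO_HOME/conf accumulo@$host:$ACCUMULO_HOME/"
        else
            scp -rp "$ACCUMULO_HOME/conf" "accumulo@$host:$ACCUMULO_HOME/"
        fi
    done
}

# Push to every host named in the masters and slaves files, if present.
for f in masters slaves; do
    if [ -f "$ACCUMULO_HOME/conf/$f" ]; then
        push_conf "$ACCUMULO_HOME/conf/$f"
    fi
done
```

Run it as the accumulo user on the master after every configuration change, so the servers do not drift apart.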
Become the accumulo user now.
su - accumulo
Run $ACCUMULO_HOME/bin/accumulo init to create the HDFS directory structure (hdfs:///accumulo/*) and the initial ZooKeeper settings. This will also let you set the initial Accumulo root password. This is not the system root password; you choose this one. Only do this once. You are prompted for the following.
Confirm root password:
Once you have finished the initialization, you are ready to start Accumulo. Execute
$ACCUMULO_HOME/bin/start-all.sh to get things going.
If things went well, you can navigate your browser to http://<accumulo master>:50095 for the Accumulo status page.
Other Potential Tweaks
If you see a complaint in the Accumulo logs about a low open file limit, you can update the number using the following commands.
Edit /etc/security/limits.conf and add these entries:
accumulo hard nofile 65536
accumulo soft nofile 65536
And if you see an odd message about “swappiness”, you can execute
sysctl -w vm.swappiness=0
as root on each Accumulo server.
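Keep in mind that a value set with sysctl -w lasts only until the next reboot. If you want it to stick, the usual approach is to add it to /etc/sysctl.conf. The helper below is a small sketch of that: persist_swappiness is a name I made up, and it only appends the setting when the file does not already define vm.swappiness, so it is safe to run more than once.

```shell
#!/bin/sh
# Append vm.swappiness = 0 to a sysctl configuration file unless the
# file already contains a vm.swappiness setting.
persist_swappiness() {
    sysctl_conf="$1"
    if grep -Eq '^[[:space:]]*vm\.swappiness[[:space:]]*=' "$sysctl_conf"; then
        echo "vm.swappiness is already set in $sysctl_conf; edit it by hand"
    else
        echo "vm.swappiness = 0" >> "$sysctl_conf"
        echo "added vm.swappiness = 0 to $sysctl_conf"
    fi
}

# Example (as root): persist_swappiness /etc/sysctl.conf
```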