
Sqoop and the misleading Oracle errors

I ran into a misleading error from the Oracle JDBC driver while using Sqoop. The configuration was Sqoop 1.4.6 using ojdbc7.jar to connect to Oracle, and JDK 8 support was a must. I wasn’t sure of the Oracle database version, so I used the most current JDBC driver version. The command:

sqoop list-tables --connect jdbc:oracle:thin:@oracle_host_name:1521:oracle_SID --username xxx --password

would fail with an odd error.

ERROR manager.SqlManager: Error reading database metadata: java.sql.SQLException: ORA-00604: error occurred at recursive SQL level 1 ORA-01882: timezone region not found

Adding the following options for setting the timezone properties, prior to --connect, did not help.
-Duser.timezone="EST" -Doracle.session.TimeZone='America/New York'

The next variable to consider was the database driver. I downloaded a different version of the driver, the one for Oracle 11g, removed the options for the timezone properties, and ran the test again. Success! It was all about the Oracle version AND the driver.

Backward compatibility for the Oracle JDBC driver seems to be a bit overrated ¯\_(ツ)_/¯

Oracle 11g R2: ojdbc6.jar – Supports JDK 6, 7, and 8 (JDK 8 support is kinda recent)
Oracle 12c R1: ojdbc7.jar – Supports JDK 7 and 8
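For reference, here is roughly what the working setup looked like. This is just a sketch; the Sqoop lib directory path is an assumption and your install may differ.

# swap the 12c driver for the 11g R2 driver (lib path is an assumption)
rm /usr/lib/sqoop/lib/ojdbc7.jar
cp ojdbc6.jar /usr/lib/sqoop/lib/
# same command as before, minus the timezone flags (-P prompts for the password)
sqoop list-tables --connect jdbc:oracle:thin:@oracle_host_name:1521:oracle_SID --username xxx -P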

Ambari Blueprints

Ambari is a provisioning and management tool for Hadoop clusters, and Hortonworks, Pivotal and IBM are among the committers and contributors to the Apache Ambari project. One of the newer and more powerful features of Ambari is the Blueprint. An Ambari blueprint provides the layout and configuration of a cluster, much like a building architecture blueprint. Instantiating a cluster from a blueprint also requires a cluster template, which associates hostnames with the hostgroup placeholders in the blueprint. Below are the basic REST API commands for extracting a blueprint from an existing Ambari-managed cluster.

The form of the REST command when using curl:
curl -H "X-Requested-By: ambari" -X GET -u : ://:/api/v1/clusters/?format=blueprint

Example blueprint request:
curl -H "X-Requested-By: ambari" -X GET -u admin:admin http://ambari.client.com:8080/api/v1/clusters/prod1?format=blueprint

Use the “-k” option for the HTTPS protocol:
curl -k -H "X-Requested-By: ambari" -X GET -u admin:admin https://ambari.client.com:8443/api/v1/clusters/prod1?format=blueprint

Groupnames with pdsh

pdsh is one of my favorite utilities for fiddling with Hadoop clusters. The parallel distributed shell fans commands out to the machines you name on the command line with the -w option, e.g. pdsh -w server1,server2,server3 "ls -l". There is also support for wildcarding, which allows you to refer to the machines in the example with the shorthand syntax server[1-3]. You can even exclude machines by using a -x "servername" option.

These features are great, but typing all those server names over and over gets a bit tedious even in the short form. That is when you should start using groupnames. You can create a file in the ~/.dsh/group directory, or in the /etc/dsh/group directory. Name the file after the groupname you want to create and place a newline-separated list of machine names in the file. For example, the file ~/.dsh/group/all could contain a list of all the machines in your cluster, and you would invoke it as pdsh -g all ls to run an ls command on each server in the group. You can still exclude some machines with the -x option, or an entire groupname with the -X option.
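As a quick sketch, with made-up host names:

mkdir -p ~/.dsh/group
printf "server1\nserver2\nserver3\n" > ~/.dsh/group/all
pdsh -g all "uptime"
pdsh -g all -x server2 "uptime"    # everything in the group except server2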

Handy SSH command line options

Updating your ~/.ssh/known_hosts file each time you SSH into a new machine can be a hassle. The following is a way to SSH into a machine without receiving a bothersome prompt to add the new host to your known_hosts file. Invoke the command like this:

ssh -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null hostname

The first option allows connections to machines that are not in the known_hosts file. The second option then pipes all that good information about the new host into oblivion and squelches another bothersome prompt. This combination is particularly useful when you are executing an SSH command in parallel across a large number of servers using pdsh or a custom script. The "Are you sure you want to continue connecting (yes/no)?" prompt will be suppressed and the command action will continue.

If you want to persist these options without needing, or being allowed, to edit /etc/ssh/ssh_config, you can create a ~/.ssh/config file. Enter the settings into the text file without the leading “-o”. This file is read each time you start an SSH session, so the settings will be applied by simply typing ssh.
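For example, a minimal ~/.ssh/config applying the two options above to every host would look something like this:

Host *
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null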

Passwordless SSH

Passwordless SSH is a must in Hadoop and I’ve used a tried and true method for some time. First you generate a key pair using ssh-keygen, then push the public portion of the key to the target host with this command: cat .ssh/id_rsa.pub | ssh hostname 'cat >> .ssh/authorized_keys'. You can see that the last part of the process is a bit opaque.

But today the command was not working for me. [NOTE: It was probably me who wasn’t working correctly. Eh?] A few web searches turned up an Ubuntu forum thread where bodhi.zazen posted a different method. This newly found gem is ssh-copy-id, which replaces the cat/pipe/redirect with a more succinct command. You must have the OpenSSH package installed though, which will bum out the Mac OS X users. The crisp command will execute like so:
ssh-copy-id -i .ssh/id_rsa.pub hostname

That’s it. Nice and clean. As a bonus, the command also makes the .ssh directory on the target if it is not there. So if you are in the mood to save a few keystrokes, this command is for you.
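Putting it all together, the whole passwordless setup is a short sequence like this (user and host names are placeholders):

ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub user@hostname
ssh user@hostname    # should log in without a password prompt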

Installing Accumulo 1.5 with HDP 1.3

Apache Accumulo is a highly scalable NoSQL datastore originally developed by the U.S. National Security Agency. The project was released to the Apache open source community in 2011 as an incubator project and graduated to a top-level project in 2012. The Hortonworks Data Platform 1.3 is a Hadoop distribution containing Hadoop 1.2 and other tools in the Hadoop ecosystem. While HDP 1.3 does include the NoSQL datastore HBase, you may also install and configure Accumulo to operate with the platform. This post will demonstrate how to install and configure Accumulo 1.5 to work with an existing HDP 1.3 cluster.

There are a couple of other prerequisites as well. A functional ZooKeeper quorum must be installed and available for Accumulo to operate. ZooKeeper is included with HDP 1.3 and may be installed and configured via Ambari. You must also have sufficient system privileges (e.g. root) for the installation. The discussion below assumes this is not an issue.

Configuration and Installation

Download the generic binary tarball of Apache Accumulo 1.5 at http://accumulo.apache.org/downloads/ and get ready for the fun.

Extract the downloaded tarball to /usr/lib on the servers hosting the Accumulo master and tablet servers. The README file in the root of the tarball contains instructions on configuration and installation; the steps in this post fill in the gaps for HDP 1.3. When you see $ACCUMULO_HOME below, expand that to the installation directory you chose.
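Something along these lines, assuming the tarball is named accumulo-1.5.0-bin.tar.gz:

cd /usr/lib
tar xzf /path/to/accumulo-1.5.0-bin.tar.gz
export ACCUMULO_HOME=/usr/lib/accumulo-1.5.0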

Create an account with an ID of ‘accumulo’ and a group of ‘hadoop’ on each Accumulo server. HDP creates the ‘hadoop’ group and places the ‘hdfs’ user in this group during installation. Creating a separate account for accumulo clearly separates the roles, while having a common group makes things a bit less restrictive.
useradd -d $ACCUMULO_HOME -g hadoop accumulo

Update the ownership of the Accumulo home directory
chown -R accumulo:hadoop $ACCUMULO_HOME

On each accumulo server, cd to $ACCUMULO_HOME/server/src/main/c++ and type ‘make’. This will build the native libraries for the proper architecture.
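In other words, on each server:

cd $ACCUMULO_HOME/server/src/main/c++
make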

Set up passwordless SSH for the accumulo user from the Accumulo Master to each of the tablet servers. You can do this by executing ssh-keygen as the accumulo user and then appending the public key you generate to the authorized_keys file on each server.
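If the accumulo account has a password set, the ssh-copy-id approach from the post earlier in this archive works nicely here; otherwise append the key by hand as described above. The tablet server names below are placeholders.

su - accumulo
ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub accumulo@tablet_server1
ssh-copy-id -i ~/.ssh/id_rsa.pub accumulo@tablet_server2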

Now create a couple of HDFS directories. First, create the HDFS /accumulo directory and change the owner to accumulo with a group of hdfs. Then create the /user/accumulo directory in HDFS and also set the ownership to accumulo:hdfs.
hadoop dfs -mkdir /accumulo
hadoop dfs -chown -R accumulo:hdfs /accumulo
hadoop dfs -mkdir /user/accumulo
hadoop dfs -chown -R accumulo:hdfs /user/accumulo

Update the hdfs-site.xml settings using Ambari. Add the property ‘dfs.durable.sync‘ with a value of ‘true‘. You will need to stop the MapReduce and HDFS services before making this change.

Depending on the amount of memory you can allocate to Accumulo, copy one of the pre-built sets of example configuration files from the conf/examples directory. The example below will use 3 GB of RAM for the Accumulo processes.
cp $ACCUMULO_HOME/conf/examples/3GB/native-standalone/*.conf  $ACCUMULO_HOME/conf

Edit the slaves file in $ACCUMULO_HOME/conf/ and enter in the fully qualified domain name of each tablet server host.

Edit the masters file in $ACCUMULO_HOME/conf/ and enter in the fully qualified domain name of the accumulo master host.

Modify $ACCUMULO_HOME/conf/accumulo-env.sh to set these values.
JAVA_HOME=/usr/java/default
HADOOP_HOME=/usr/lib/hadoop
ZOOKEEPER_HOME=/usr/lib/zookeeper
ACCUMULO_LOG_DIR=/var/log/accumulo

Modify the gc, monitor and tracers files in $ACCUMULO_HOME/conf/ to use the FQDN of the Accumulo master. The default setting is localhost.

Create the $ACCUMULO_LOG_DIR on every machine you entered into the slaves file.
mkdir -p /var/log/accumulo
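If you already have a pdsh group set up for the cluster (see the pdsh post above), this is a one-liner; the group name accumulo is just an assumption:

pdsh -g accumulo "mkdir -p /var/log/accumulo && chown accumulo:hadoop /var/log/accumulo"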

Set up the property for the ZooKeeper quorum now. Edit $ACCUMULO_HOME/conf/accumulo-site.xml and change the value for instance.zookeeper.host to contain a comma-separated list of the hosts. The host names are in the format host:port, with no spaces after the commas separating the hosts, e.g. host1:2181,host2:2182,host3:2181. The default port for ZooKeeper is 2181, and if you are using that port number you can slide by with just the host name.
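The resulting property in accumulo-site.xml would look something like this (host names are placeholders):

<property>
  <name>instance.zookeeper.host</name>
  <value>zk1.client.com:2181,zk2.client.com:2181,zk3.client.com:2181</value>
</property>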

Restrict accumulo-site.xml so that only the accumulo user can read it by performing a chmod 600 accumulo-site.xml

The Accumulo configuration files should be identical across all the Accumulo servers, so you will need to push the updates out to each Accumulo master and tablet server.
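A quick sketch of one way to push them, assuming passwordless SSH for the accumulo user is already in place:

for host in $(cat $ACCUMULO_HOME/conf/slaves); do
  scp $ACCUMULO_HOME/conf/* accumulo@$host:$ACCUMULO_HOME/conf/
done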

Initialize Accumulo

Become the accumulo user now.
su - accumulo

Run $ACCUMULO_HOME/bin/accumulo init to create the HDFS directory structure (hdfs:///accumulo/*) and the initial ZooKeeper settings. This will also let you configure the initial root password. Only do this once. [Not the system root password; you choose this one.] You are prompted for the following.

Instance name:
Root password:
Confirm root password:

Run Accumulo

Once you have finished the initialization, you are ready to start Accumulo. Execute $ACCUMULO_HOME/bin/start-all.sh to get things going.

If things went well, you can navigate your browser to http://<accumulo master>:50095 for the Accumulo status page.

Other Potential Tweaks
If you see a complaint in the Accumulo logs about a low open file limit, you can raise the limit as follows.

Edit /etc/security/limits.conf and add these entries:
accumulo  hard  nofile  65536
accumulo  soft  nofile  65536

And if you see an odd message about "swappiness" you can execute "sysctl -w vm.swappiness=0" as root on each Accumulo server.
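To make that setting survive a reboot, the usual approach is to add it to /etc/sysctl.conf as well:

sysctl -w vm.swappiness=0
echo "vm.swappiness = 0" >> /etc/sysctl.conf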