Sqoop and the misleading Oracle errors

I ran into a misleading error from the Oracle JDBC driver while using Sqoop. The configuration was Sqoop 1.4.6 with ojdbc7.jar, connecting to Oracle, and JDK 8 support was a must. I wasn’t sure of the Oracle version number, so I used the most current JDBC driver version. The command:

sqoop list-tables --connect jdbc:oracle:thin:@oracle_host_name:1521:oracle_SID --username xxx --password

would fail with an odd error.

ERROR manager.SqlManager: Error reading database metadata: java.sql.SQLException: ORA-00604: error occurred at recursive SQL level 1 ORA-01882: timezone region not found

Adding the following options to set the timezone properties, prior to --connect, did not help.
-Duser.timezone="EST" -Doracle.session.TimeZone='America/New_York'

The next variable to consider was the database driver. I downloaded the older driver built for Oracle 11g (ojdbc6.jar), removed the timezone options, and ran the test again. Success! It was all about the Oracle version AND the driver.

Backward compatibility for the Oracle JDBC driver seems to be a bit overrated ¯\_(ツ)_/¯

Oracle 11g R2: ojdbc6.jar – Supports JDK 6, 7, and 8 (JDK 8 support is kinda recent)
Oracle 12c R1: ojdbc7.jar – Supports JDK 7 and 8
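
For reference, the fix looked roughly like this. A sketch only: the Sqoop lib path is an assumption, so adjust it for your install. The -P flag prompts for the password instead of putting it on the command line.

# put the 11g driver on Sqoop's classpath (path is an assumption)
cp ojdbc6.jar /usr/lib/sqoop/lib/
sqoop list-tables --connect jdbc:oracle:thin:@oracle_host_name:1521:oracle_SID --username xxx -P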

Ambari Blueprints

Ambari is a provisioning and management tool for Hadoop clusters; Hortonworks, Pivotal, and IBM are among the committers and contributors to the Apache Ambari project. One of its newer and more powerful features is the blueprint. An Ambari blueprint provides the layout and configuration of a cluster, much like a building architect’s blueprint. Instantiating a cluster from a blueprint also requires a cluster template, which associates hostnames with the host group placeholders in the blueprint. Below are the basic REST API commands for extracting a blueprint from an existing Ambari-managed cluster.

The general form of the REST command when using curl, with angle brackets marking placeholders:
curl -H "X-Requested-By: ambari" -X GET -u <user>:<password> <protocol>://<ambari-host>:<port>/api/v1/clusters/<cluster-name>?format=blueprint

Example blueprint request:
curl -H "X-Requested-By: ambari" -X GET -u admin:admin http://ambari.client.com:8080/api/v1/clusters/prod1?format=blueprint

Use the “-k” option for the HTTPS protocol:
curl -k -H "X-Requested-By: ambari" -X GET -u admin:admin https://ambari.client.com:8443/api/v1/clusters/prod1?format=blueprint
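
The same API works in the other direction too. A minimal sketch, assuming the blueprint was saved to blueprint.json and a matching cluster template to clustertemplate.json; the blueprint name "mybp" and the cluster name "prod2" are made up:

# register the blueprint under the name mybp
curl -k -H "X-Requested-By: ambari" -X POST -u admin:admin -d @blueprint.json https://ambari.client.com:8443/api/v1/blueprints/mybp

# instantiate a new cluster from the blueprint and the cluster template
curl -k -H "X-Requested-By: ambari" -X POST -u admin:admin -d @clustertemplate.json https://ambari.client.com:8443/api/v1/clusters/prod2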

Installing Phoenix into the HDP 2.2 Sandbox

Perform the following as root
yum install phoenix

If the HDP repo times out with yum, copy the Phoenix RPM to the Sandbox and install it with rpm -i,
OR
modify /etc/yum.conf to add the line "timeout=999" and then perform a "yum clean all"
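
The yum.conf route, as a sketch:

# bump the yum timeout, clear cached metadata, and retry the install
echo "timeout=999" >> /etc/yum.conf
yum clean all
yum install phoenix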

You will need to update some HBase settings now. These properties go in /etc/hbase/conf/hbase-site.xml, and HBase must be restarted afterward.

Set/modify the following properties:

<property>
   <name>hbase.regionserver.wal.codec</name>
   <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
</property>

<property>
   <name>hbase.region.server.rpc.scheduler.factory.class</name>
   <value>org.apache.phoenix.hbase.index.ipc.PhoenixIndexRpcSchedulerFactory</value>
</property>

<property>
   <name>phoenix.query.timeoutMs</name>
   <value>600000</value>
   <!-- this is a really big value -->
</property>

<property>
   <name>hbase.rpc.timeout</name>
   <value>360000000</value>
   <!-- this is a really big value -->
</property>

<property>
   <name>hbase.client.scanner.timeout.period</name>
   <value>6000000</value>
   <!-- this is a really big value -->
</property>

<property>
   <name>hbase.regionserver.lease.period</name>
   <value>6000000</value>
   <!-- this is a really big value -->
</property>

Running the Phoenix JDBC-based CLI (sqlline.py)

First, find out the value for the HBase Zookeeper root. Look in /etc/hbase/conf/hbase-site.xml for the zookeeper.znode.parent value.

<property>
   <name>zookeeper.znode.parent</name>
   <value>/hbase-unsecure</value>
</property>

prompt> su - hbase
prompt> cd /usr/hdp/current/hbase-client
prompt> bin/sqlline.py localhost:2181:/hbase-unsecure
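
Once connected, a quick smoke test will confirm Phoenix is wired up. sqlline.py also accepts a SQL file as a second argument, which is what the sketch below uses; the table name smoke_test is made up:

# write a throwaway test script and run it through sqlline
cat > /tmp/smoke_test.sql <<'EOF'
CREATE TABLE IF NOT EXISTS smoke_test (id BIGINT PRIMARY KEY, name VARCHAR);
UPSERT INTO smoke_test VALUES (1, 'hello phoenix');
SELECT * FROM smoke_test;
DROP TABLE smoke_test;
EOF
bin/sqlline.py localhost:2181:/hbase-unsecure /tmp/smoke_test.sql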

Optional: Create a user to execute the Phoenix commands so you don’t have to be the hbase user

useradd phuser -g users #the group users already existed

Update the new user’s environment so they can execute the Phoenix JDBC client.

Change /home/phuser/.bash_profile to add this line
PHOENIX_PATH=/usr/hdp/current/phoenix-client/bin
and update the path to this
PATH=$PATH:$HOME/bin:$PHOENIX_PATH
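
A quick check that the new user is set up correctly:

# log in as the new user; sqlline.py should now resolve via $PHOENIX_PATH
su - phuser
sqlline.py localhost:2181:/hbase-unsecure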

But I wanted to use the bash shell!

Pissed at your Linux administrator because they gave you the Bourne shell as your default? Tired of typing “bash” after every login? Use this technique to change your shell to Bash.

Modify, or create, a .profile file in your home directory

prompt> touch ~/.profile

And place this statement as the last line in your .profile

SHELL=/bin/bash exec /bin/bash --login

You should be able to bash with no effort at all now.

The info in this post is pulled from this page.

P.S. This should apply to any other shell supported on your system as well

Groupnames with pdsh

pdsh is one of my favorite utilities for fiddling with Hadoop clusters. The parallel distributed shell fans commands out to the machines you name on the command line with the -w option, e.g. pdsh -w "server1,server2,server3" "ls -l" (note the comma-separated list). There is also support for wildcarding, which lets you refer to the machines in the example with the shorthand server[1-3]. You can even exclude machines by using a -x "servername" option.

These features are great, but typing all those server names over and over gets a bit tedious even in the short form. That is when you should start using groupnames. You can create a file in the ~/.dsh/group directory, or in the /etc/dsh/group directory. Name the file after the groupname you want to create and place a newline-separated list of machine names in it. For example, the file ~/.dsh/group/all could contain a list of all the machines in your cluster, and you would invoke it as pdsh -g all ls to run an ls command on each server in the group. You can still exclude some machines with the -x option, or an entire groupname with the -X option.
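
A minimal sketch; the group name "workers" and the hostnames are made up:

# create a group file and fan a command out to it
mkdir -p ~/.dsh/group
cat > ~/.dsh/group/workers <<'EOF'
server1
server2
server3
EOF
pdsh -g workers uptime              # run uptime on every host in the group
pdsh -g workers -x server2 uptime   # same, but skip server2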

Handy SSH command line options

Updating your ~/.ssh/known_hosts file each time you SSH into a new machine can be a hassle. The following is a way to ssh into a machine without receiving a bothersome prompt to add a new host to your known_hosts file. Invoke the command like this:

ssh -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null <user>@<hostname>

The first option allows connections to machines that are not in the known_hosts file. The second option then pipes all that good information about the new host into oblivion and squelches another bothersome prompt. This combination is particularly useful when you are executing an SSH command in parallel across a large number of servers using PDSH or a custom script. The “Are you sure you want to continue connecting (yes/no)?” prompt will be suppressed and the command will continue.

If you want to persist these options without needing, or being allowed, to edit /etc/ssh/ssh_config, you can create a ~/.ssh/config file. Enter the settings into the text file without the leading “-o”. This file is read each time the user starts an SSH session, and the settings will be applied by simply typing ssh.
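
For example, a ~/.ssh/config carrying the two options above could look like this; applying them to every host with a wildcard is a choice, not a requirement:

# ~/.ssh/config -- same options as above, no leading -o
Host *
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null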

Change JDK in /etc/alternatives

I installed JDK 1.6.0_31 on a Linux server with an existing OpenJDK 1.6 install. The version from Oracle is preferred, and more stable, so I wanted to switch versions. Since the installation is managed, for lack of a better term, in /etc/alternatives, changing the binary path wasn’t the way to go. Here are the commands to change the entries in /etc/alternatives.

First, add the reference to the new java via the --install option
alternatives --install /usr/bin/java java /usr/jdk64/jdk1.6.0_31/bin/java 3
The "3" at the end of the command is the priority of the entry in the list maintained by alternatives. For a new install, this is set to 1.

Then run the following command and select the version you want to use. In our case, this was the third option.
alternatives --config java

I then updated /etc/alternatives for javac and jar as well.
alternatives --install /usr/bin/javac javac /usr/jdk64/jdk1.6.0_31/bin/javac 1
alternatives --install /usr/bin/jar jar /usr/jdk64/jdk1.6.0_31/bin/jar 1
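
A quick sanity check that the switch took; readlink resolves the /etc/alternatives symlink chain:

java -version                 # should report the 1.6.0_31 build
readlink -f /usr/bin/java     # should end in /usr/jdk64/jdk1.6.0_31/bin/java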

Sed capturing group

Ever wanted to replace the text in the middle of a string based upon a pattern match? Yeah, me too.

You can define capturing groups in a sed expression which will help with the match but not get in the way of the substitution. This page from StackOverflow provides a great example of how this is done.

I wanted to update the configurations for some of the disks defined in /etc/fstab to use the noatime and nodiratime directives. The defaults setting was already in place and the new directives would be appended to the current setting. This needed to be done across 32 servers, so manual editing was not going to happen. I used the statement below to perform the update.

sed -i 's/\(^.*data.*\)defaults/\1defaults,noatime,nodiratime/' /etc/fstab

In this case, I wanted to find all lines that contained both the word “data” and the word “defaults”, with any number of characters between them. The portion of the statement in parentheses defines the capturing group, which holds the text matched by that part of the pattern. The parens had to be escaped with a “\” character or an error would be returned.

The replacement pattern is defined next, and it begins with the backreference \1, which stands in for whatever the first group captured. You can use multiple capturing groups in an expression; they are referenced numerically in their order of appearance, so a second capturing group would be referred to as \2. The reference number must also be escaped or an error will occur.

When the command was executed, the fstab entries went from this:

UUID=abcd /data/01 ext4 defaults 1 2
UUID=efgh /data/02 ext4 defaults 1 2
UUID=hijk /data/03 ext4 defaults 1 2
UUID=lmno /data/04 ext4 defaults 1 2

to this:

UUID=abcd /data/01 ext4 defaults,noatime,nodiratime 1 2
UUID=efgh /data/02 ext4 defaults,noatime,nodiratime 1 2
UUID=hijk /data/03 ext4 defaults,noatime,nodiratime 1 2
UUID=lmno /data/04 ext4 defaults,noatime,nodiratime 1 2

P.S. The UUIDs have been changed to protect the innocent
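
Since the same edit had to land on all 32 servers, this pairs naturally with pdsh from earlier in the post. A sketch, assuming a group file named all that lists every server:

pdsh -g all "sed -i 's/\(^.*data.*\)defaults/\1defaults,noatime,nodiratime/' /etc/fstab"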

Passwordless SSH

Passwordless SSH is a must in Hadoop, and I’ve used a tried and true method for some time. First you generate a key-pair using ssh-keygen, then you push the public portion of the key to the target host with this command: cat .ssh/id_rsa.pub | ssh hostname 'cat >> .ssh/authorized_keys'. You can see that the last part of the process is a bit opaque.

But today the command was not working for me. [NOTE: It was probably me who wasn’t working correctly. Eh?] A few web searches turned up an Ubuntu forum thread where bodhi.zazen posted a different method. This newly found gem is ssh-copy-id, which replaces the cat/pipe/redirect with a more succinct command. You must have the OpenSSH package installed though, which will bum out the Mac OS X users. The crisp command executes like so:
ssh-copy-id -i .ssh/id_rsa.pub hostname

That’s it. Nice and clean. As a bonus, the command also makes the .ssh directory on the target if it is not there. So if you are in the mood to save a few keystrokes, this command is for you.
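
Putting it together, the whole round trip is short. A sketch; hostname is a stand-in for the target machine:

ssh-keygen -t rsa                          # accept the defaults; empty passphrase for passwordless use
ssh-copy-id -i ~/.ssh/id_rsa.pub hostname
ssh hostname                               # should log in without a password prompt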