Category Archives: Linux

Groupnames with pdsh

pdsh is one of my favorite utilities for fiddling with Hadoop clusters. The parallel distributed shell fans commands out to the machines you name on the command line with the -w options, e.g. pdsh -w “server1 server2 server3″ “ls -l”. There is also support for wildcarding and this allows you to refer to the machines in the example with a shorthand syntax of server[1-3]. You can even exclude machines by using a -x “servername” option.

These features are great but typing all those server names over and over gets a bit tedious even in the short form. That is when you should you start using groupnames. You can create a file in the ~/.dsh/group directory, or in the /etc/dsh/group directory. You will name the file as the groupname you want to create and place a newline separated list of machine names in the file. For example, the file ~/.dsh/group/all could contain a list of all the files in your cluster and you would invoke it as pdsh -g all ls to run an ls command on each server in the group. You can still exclude some machines with the -x option, or an entire groupname with the -X option.

Handy SSH command line options

Updating your ~/.ssh/known_hosts file each time you SSH into a new machine can be a hassle. The following is way to ssh into a machine without receiving a bothersome prompt to add a new host to your known_hosts file. Invoke the command like this:

ssh -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null

The setting for the first option will allow connections to machines that are not in the known_hosts file. Your second option will then pipe all that good information about the new host into oblivion and squelch another bothersome prompt. This combination is particularly useful when you are executing an SSH command in parallel across a large number of servers utilizing PDSH or a custom script. The “Are you sure you want to continue connecting (yes/no)?” will be suppressed and the command action will continue.

If you want to persist these options without needing, or being allowed, to edit /etc/ssh/ssh_config you can create a ~/.ssh/ssh_config file. Enter the settings into the text file without a leading “-o”. This file is read each time the user starts an SSH session and the settings will be applied by simply typing ssh.

Change JDK in /etc/alternatives

I installed JDK 1.6.0_31 on a Linux server with an existing OpenJDK 1.6 install. The version from Oracle is preferred, and more stable, so I wanted to switch versions. Since the installation is managed, for lack of a better term, in /etc/alternatives changing the binary path wasn’t the way to go. Here are the commands to change the entries in /etc/alternatives.

First, add the reference to the new java via the –install option
alternatives --install /usr/bin/java java /usr/jdk64/jdk1.6.0_31/bin/java 3
The “3″ at the end of the command represents the priority of the command in the list maintained by alternatives. For a new install, this is set to 1.

Then run the following command and select the version you want to use. In our case, this was the third option.
alternatives --config java

I then updated /etc/alternatives for javac and jar as well.
alternatives --install /usr/bin/javac javac /usr/jdk64/jdk1.6.0_31/bin/javac 1
alternatives --install /usr/bin/jar jar /usr/jdk64/jdk1.6.0_31/bin/jar 1

Sed capturing group

Ever wanted to replace the text in the middle of a string based upon a pattern match? Yeah, me too.

You can define capturing groups in a sed expression which will help with the match but not get in the way of the substitution. This page from StackOverflow provides a great example of how this is done.

I wanted to update the configurations for some of the disks defined in /etc/fstab to use the noatime and nodiratime directives. The defaults setting was already in place and the new directives would be appended to the current setting. This needed to be done across 32 servers, so manual editing was not going to happen. I used the statement below to perform the update.

sed -i 's/\(^.*data.*\)defaults/\1defaults,noatime,nodiratime/' /etc/fstab

In this case, I wanted to find all lines that contained the word “data” and “defaults”. The space between the two words could be any number of characters. The portion of the statement in parentheses defines the capturing group which represented the matching pattern. The parens had to be escaped with a “\” character or an error would be returned.

The replacement pattern is defined next and it is proceeded by a 1. You can use multiple capturing groups in an expression and they are referenced numerically in their order of appearance. So if you had another capturing group following the first, you would refer to it with a 2. You must also escape the reference number or an error will occur.

When the command was executed, the data file went from this:

UUID=abcd /data/01 ext4 defaults 1 2
UUID=efgh /data/02 ext4 defaults 1 2
UUID=hijk /data/03 ext4 defaults 1 2
UUID=lmno /data/04 ext4 defaults 1 2

to this:

UUID=abcd /data/01 ext4 defaults,noatime,nodiratime 1 2
UUID=efgh /data/02 ext4 defaults,noatime,nodiratime 1 2
UUID=hijk /data/03 ext4 defaults,noatime,nodiratime 1 2
UUID=lmno /data/04 ext4 defaults,noatime,nodiratime 1 2

P.S. The UUIDs have been changed to protect the innocent

Passwordless SSH

Passwordless SSH is a must in Hadoop and I’ve used a tried and true method for some time. First you generate a key-pair using ssh-keygen, then push the public portion of the key to the target hosts with this command: cat .ssh/id_rsa.pub | ssh hostname 'cat >> .ssh/authorized_keys' You can see that the last part of the process is a bit opaque.

But today the command was not working for me [NOTE: It was probably me who wasn’t working correctly. Eh?] A few web searches turned up an Ubuntu forum thread where bodhi.zazen posted a different method. This newly found gem is ssh-copy-id which replaces the cat/pipe/redirect with a more succinct command. You must have the OpenSSH package installed though, which will bum out the Mac OS X users. The crisp command will execute like so:
ssh-copy-id -i .ssh/id_rsa.pub

That’s it. Nice and clean. As a bonus, the command also makes the .ssh directory on the target if it is not there. So if you are in the mood to save a few keystrokes, this command is for you.