Installing Hadoop 0.21.0 on Windows – Setting up an SSH server

One area I’m planning to spend time investigating is Hadoop, a MapReduce and distributed computing framework under the Apache Software Foundation banner. Over the next few blog posts I’ll explain how I installed Hadoop 0.21.0 on Windows. This blog post covers the first part – the SSH server.

1) Install Cygwin

Hadoop uses SSH to control jobs running on nodes (machines capable of running Hadoop jobs). Each node, including the master, needs to have an SSH server running. Given the lack of native SSH support in Windows, you will need to install Cygwin.

Pitfall one: The bash prompt installed as part of msysgit doesn’t include everything you need. You will need to run the main Cygwin setup.exe.

Once you have downloaded setup.exe, install the openssh package and its dependencies. I also recommend installing vim, as you will need to make config changes at some point.
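If you want to script this step, Cygwin’s setup.exe can also be run unattended from a Windows command prompt. This is only a minimal sketch, assuming a recent setup.exe where -q runs quietly and -P selects extra packages:

setup.exe -q -P openssh,vim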

2) Configure

With the package installed, execute the following command to configure the SSH server (sshd).

$ ssh-host-config 

This step is explained in more detail at http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-windows/

During this config stage, I was asked two questions.

*** Query: Should privilege separation be used? (yes/no)

*** Query: (Say “no” if it is already installed as a service) (yes/no)

I said no to both. On my first attempt I said yes to the privilege separation question, which is what I would do on a production system, but as this is just for development and learning I kept it simple.

3) Starting

Once installed, you need to start the sshd server with the following command:

$ /usr/sbin/sshd

This doesn’t give any output, so to find out if it’s running, execute:

$ ps | grep sshd

This should result in output similar to below.

7416       1    7416       7416    ? 1001 16:08:57 /usr/sbin/sshd 

Pitfall two: If you already have a server running on port 22, sshd won’t report an error, but nothing will be listed in the ps output. Check for other servers listening on the port if this happens.
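To diagnose this, check what is already listening on port 22 (the Windows netstat and findstr tools work from the Cygwin prompt), or run sshd in debug mode so it stays in the foreground and reports any bind failure:

$ netstat -ano | findstr :22
$ /usr/sbin/sshd -d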

4) Logging in

Once installed, it’s time to login.

$ ssh localhost
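The very first time you connect, OpenSSH will most likely ask you to confirm the host key; answer yes and localhost is added to ~/.ssh/known_hosts. The prompt looks roughly like this:

Are you sure you want to continue connecting (yes/no)? yes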

If it asks you for a passphrase, you need to run the following commands. As a result, Hadoop will be able to execute without interruption and without you having to enter a password.

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
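If sshd still prompts for a password after this, file permissions are a common culprit: by default sshd refuses to use an authorized_keys file that other users can write to. Tightening the permissions on the Cygwin side usually fixes it:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys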

Note: these SSH keys live under your Cygwin home directory (~), which is different from your normal Windows one. For example, when using Cygwin bash my home directory is:

$ pwd
/home/Ben Hall

This physically maps to:

C:\cygwin\home\Ben Hall\.ssh

Using msysgit, my home directory is:

$ pwd
/c/Users/Ben Hall

As such, generating SSH keys here shouldn’t affect other keys relating to systems such as GitHub. If you already have a key set up, you only need to execute the command that copies your .pub key into authorized_keys, as shown below.
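For example, assuming an existing RSA key pair in your Cygwin ~/.ssh directory (substitute your own key file name if it differs):

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys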

With this done, we can move onto the more interesting task of installing Hadoop.