Installing Hadoop 0.21.0 on Windows – Spaces in username gotcha

A quick side note about installing Hadoop on Windows. My Windows username is “Ben Hall”, and hadoop picks this up in two places: as ${user.name} in its configuration defaults and as $USER in its shell scripts.

As hadoop is cross platform, its scripts and configuration don’t expect spaces in path names, which leads to some fairly random-looking errors.

The first one I received was while creating the DFS directory.

11/01/16 17:54:04 WARN common.Util: Path file:///tmp/hadoop-Ben Hall/dfs/name should be specified as a URI in 
configuration files. Please update hdfs configuration.
11/01/16 17:54:04 ERROR common.Util: Error while processing URI: file:///tmp/hadoop-Ben Hall/dfs/name
java.io.IOException: The filename, directory name, or volume label syntax is incorrect
        at java.io.WinNTFileSystem.canonicalize0(Native Method)
        at java.io.Win32FileSystem.canonicalize(Win32FileSystem.java:396)
        at java.io.File.getCanonicalPath(File.java:559)
        at java.io.File.getCanonicalFile(File.java:583)
        at org.apache.hadoop.hdfs.server.common.Util.fileAsURI(Util.java:78)
        at org.apache.hadoop.hdfs.server.common.Util.stringAsURI(Util.java:65)
        at org.apache.hadoop.hdfs.server.common.Util.stringCollectionAsURIs(Util.java:91)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getStorageDirs(FSNamesystem.java:378)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNamespaceDirs(FSNamesystem.java:349)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1223)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1348)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368)

Not the most helpful error message. After some pondering and looking at the hadoop code-base, I realised it was due to the space in my username. To solve the problem, I overrode the default tmp path with a directory containing no spaces. Within core-site.xml, I added the following property node:


<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-BenHall</value>
</property>

This allowed me to proceed until I hit the following error when starting the nodes:

C:\hadoop\hadoop-0.21.0/bin/hadoop-daemon.sh: line 111: [: /tmp/hadoop-Ben: binary operator expected
C:\hadoop\hadoop-0.21.0/bin/hadoop-daemon.sh: line 67: [: Hall-namenode-BigBlue7.out: integer expression expected
starting namenode, logging to C:\hadoop\hadoop-0.21.0\logs/hadoop-Ben
C:\hadoop\hadoop-0.21.0/bin/hadoop-daemon.sh: line 127: $pid: ambiguous redirect
localhost: /cygdrive/c/hadoop/hadoop-0.21.0/bin/hadoop-daemon.sh: line 111: [: /tmp/hadoop-Ben: binary operator expected

Again – not an ideal error. Looking inside the hadoop-daemon.sh script, I found that $USER is used to build up the log output path, which once again put a space in the path and caused the failure. To fix it, instead of using the $USER variable I hard-coded the value to BENH.

If you would prefer not to hard-code the value, you could instead use ‘sed’ to strip the spaces dynamically, as described here – http://mydebian.blogdns.org/?p=132
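As a rough sketch of both approaches (USER_IDENT is just an illustrative name here – in practice you edit whichever variable your copy of hadoop-daemon.sh derives from $USER):

# Option 1: hard-code a space-free identifier
USER_IDENT="BENH"

# Option 2: strip the spaces dynamically with sed
USER_IDENT=$(echo "$USER" | sed 's/ //g')   # "Ben Hall" becomes "BenHall"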

After fixing those two errors, I was able to run hadoop without any errors.

Installing Hadoop 0.21.0 on Windows – Installing Hadoop MapReduce

After configuring your SSH server, you should be ready to install Hadoop core.

1) Download

Firstly, you will need to download the package. The following page will help you find your nearest mirror.

http://www.apache.org/dyn/closer.cgi/hadoop/core/

You should then be able to navigate to and download Hadoop 0.21.0:

hadoop-0.21.0.tar.gz      17-Aug-2010 06:10   71M  

2) Unpack

Once downloaded, simply unpack the tar into a directory of your choice. As Hadoop is cross platform, make sure you extract it to a directory without spaces (so not Program Files), otherwise you may hit the same random-looking errors described above.

The location I picked was C:\hadoop. To extract, run the following command from cygwin:

$ tar xvfz hadoop-0.21.0.tar.gz
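For example, assuming the tarball was saved to your cygwin home directory (adjust the source path to wherever you actually downloaded it):

$ cd /cygdrive/c/hadoop                  # C:\hadoop as seen from cygwin
$ tar xvfz ~/hadoop-0.21.0.tar.gz        # unpacks into hadoop-0.21.0/
$ cd hadoop-0.21.0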

3) Configuration

Once extracted, you need to customise three configuration files – core-site.xml, hdfs-site.xml and mapred-site.xml. I used the settings from the quick start guide below; they specify which ports the different subsystems should run on.

http://hadoop.apache.org/common/docs/current/single_node_setup.html#Configuration
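For reference, the classic pseudo-distributed values from the quick start look roughly like the following. The property names shown are the older, 0.20-era ones; 0.21 still accepts them but may document newer equivalents (such as fs.defaultFS), so verify against the linked page:

$ cat conf/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

$ cat conf/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

$ cat conf/mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>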

We can now start to test our setup. Personally, with any new tool I attempt to output the version first to check the basic system is happy.

In this example, execute:

$ bin/hadoop version

In typical fashion, my machine wasn’t configured correctly and I received the following error:
Error: JAVA_HOME is not set.

This indicates that I had to set JAVA_HOME to my JDK installation.

$ export JAVA_HOME=/cygdrive/c/PROGRA~1/Java/jdk1.6.0_23/ 
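Note the 8.3 short name PROGRA~1, which avoids yet another path-with-spaces problem. If you would rather not re-export this in every new cygwin session, the same value can also be set in conf/hadoop-env.sh, which in the versions I have seen already contains a commented-out JAVA_HOME line:

$ echo 'export JAVA_HOME=/cygdrive/c/PROGRA~1/Java/jdk1.6.0_23/' >> conf/hadoop-env.sh   # or edit the file directly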

When I tried to output the version again I received a nice java stacktrace – at least it meant something had been executed. Solving it requires some Windows-only modifications.

Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/hadoop/util/PlatformName
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.util.PlatformName
       at java.net.URLClassLoader$1.run(Unknown Source)
       at java.security.AccessController.doPrivileged(Native Method)
       at java.net.URLClassLoader.findClass(Unknown Source)
       at java.lang.ClassLoader.loadClass(Unknown Source)
       at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
       at java.lang.ClassLoader.loadClass(Unknown Source)
       at java.lang.ClassLoader.loadClassInternal(Unknown Source)

4) Modifications for Windows

Personally, this was the most frustrating section, but it is key to solving the above exception. Thankfully I found the answer in this blog post:

http://juliensimon.blogspot.com/2011/01/installing-hadoop-on-windows-cygwin.html

The problem is that the java classpath can’t find the required Hadoop libs/jars due to the way cygwin handles Windows file paths.
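You can see the idea behind the fix by running cygpath by hand: -w converts a POSIX path to a Windows one, and -p converts a whole colon-separated path list into the semicolon-separated form the JVM expects. For example:

$ cygpath -wp "/cygdrive/c/hadoop/hadoop-0.21.0:/cygdrive/c/hadoop/hadoop-0.21.0/lib"
C:\hadoop\hadoop-0.21.0;C:\hadoop\hadoop-0.21.0\lib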

As described in the post, you need to modify hadoop-config.sh. At around line 181, there is an if statement with the comment “# cygwin path translation”.

Within this block, add

CLASSPATH=`cygpath -wp "$CLASSPATH"`

Pitfall: It won’t have any effect if you don’t add it within the if block. This took me a good hour to work out before finding the post.
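To make the placement explicit, the end result should look something like this – the surrounding lines are paraphrased, and only the CLASSPATH line is the actual addition:

# cygwin path translation
if $cygwin; then
  # ...existing path translation lines...
  CLASSPATH=`cygpath -wp "$CLASSPATH"`
fi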

Attempting the command again, you should hopefully see version output similar to below.

$ hadoop version
Hadoop 0.21.0
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.21 -r 985326
Compiled by tomwhite on Tue Aug 17 01:02:28 EDT 2010
From source with checksum a1aeb15b4854808d152989ba76f90fac

5) Starting Hadoop

Now is the time to start running Hadoop.

Firstly, we need to set up a temporary directory on our hard drive for hadoop to use while running jobs. By default this will be /tmp/hadoop-${user.name}, which maps to C:\tmp\hadoop-${your login name}. If you wish to use another directory, modify the hadoop.tmp.dir property in your core-site.xml.

To set up the directory, execute the following command:

$ bin/hadoop namenode -format

You can then start hadoop.

$ ./start-all.sh

This will output some information about which nodes have been started. From a developer’s viewpoint, the most important part is the two new websites that become accessible.

HDFS Information http://localhost:50070 – This exposes details about the directory we just created

JobTracker http://localhost:50030 – This exposes details about any running or finished jobs
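If you want to confirm the daemons actually started before opening a browser, the JDK’s jps tool (or plain ps from cygwin) is a quick check – in a pseudo-distributed setup you would expect to see processes such as NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker:

$ jps             # lists running JVMs by class name
$ ps | grep java  # cygwin alternative if jps isn't on your PATH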

That’s it. Hadoop is now ready to start executing jobs which I’ll discuss in my next post.

Installing Hadoop 0.21.0 on Windows – Setting a SSH server

One area I’m planning to spend time investigating is Hadoop, a MapReduce and distributed computing framework under the Apache Software Foundation banner. Over the next few blog posts I’ll explain how I installed Hadoop 0.21.0 on Windows. This blog post covers the first part – the SSH server.

1) Install Cygwin

Hadoop uses SSH to control jobs running on nodes (machines capable of running hadoop jobs). Each node, including the master, needs to have an SSH server running. Given the lack of SSH support in Windows, you will need to install cygwin.

Pitfall one: The cygwin bash prompt installed as part of msysgit doesn’t include everything. You will need to run the main setup.exe.

Once you have downloaded setup.exe, install the ssh package and its dependencies. I also recommend installing vim, as you will need to make config changes at some point.
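As an aside, setup.exe can also be driven from the command line if you prefer a scripted install. The -q (unattended) and -P (package list) switches below work with the Cygwin installer versions I have used, and the ssh server package is named openssh:

setup.exe -q -P openssh,vim     # run from the directory containing setup.exe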

2) Configure

With the package installed, you can execute the following command to configure the host server.

$ ssh-host-config 

This step is explained in more detail at http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-windows/

During this config stage, I was asked two questions.

*** Query: Should privilege separation be used? (yes/no)

*** Query: (Say “no” if it is already installed as a service) (yes/no)

I said no to both. On my first attempt I answered yes to the first question, which is what I would do on a production system; as this is just for development and learning, I kept it simple.

3) Starting

Once installed, you need to start the sshd server with the following command:

$ /usr/sbin/sshd

This doesn’t give any output, so to find out if it’s running, execute:

$ ps | grep sshd

This should result in output similar to below.

7416       1    7416       7416    ? 1001 16:08:57 /usr/sbin/sshd 

Pitfall two: If you already have a server running on port 22, sshd won’t report an error, but nothing will be listed in the ps output. Check for other servers on the port if this happens.
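A quick way to check whether something else has already claimed port 22 (run from the cygwin prompt, using the standard Windows netstat):

$ netstat -ano | grep ":22 "   # any LISTENING entry here means the port is already in use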

4) Logging in

Once installed, it’s time to log in.

$ ssh localhost

If it asks you for a passphrase, then you need to run the following commands. As a result, hadoop will be able to execute without interruption and without you needing to enter a password.

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
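If ssh localhost still prompts for a password after this, file permissions on the key files are a common culprit; a generic fix (not Hadoop-specific) is:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys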

Note that these ssh keys live under cygwin’s home directory, which is different from your normal Windows home. For example, when using cygwin bash my home directory is:

$ pwd
/home/Ben Hall

This physically maps to:

C:\cygwin\home\Ben Hall\.ssh

Using msysgit, my home directory is:

$ pwd
/c/Users/Ben Hall

As such, generating ssh keys here shouldn’t affect other keys relating to systems such as GitHub. If you already have a key set up, you only need to execute the command that copies your .pub key into authorized_keys.
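For example, assuming your existing key lives under your Windows profile (the path below is illustrative – adjust it to wherever your key actually is):

$ cat "/cygdrive/c/Users/Ben Hall/.ssh/id_rsa.pub" >> ~/.ssh/authorized_keys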

With this done, we can move onto the more interesting task of installing Hadoop.