Installing Hadoop 0.21.0 on Windows – Installing Hadoop MapReduce

After configuring your SSH server, you should be ready to install Hadoop core.

1) Download

Firstly, you will need to download the package. The following page will help you find your nearest mirror:

http://www.apache.org/dyn/closer.cgi/hadoop/core/

You should then be able to navigate to and download Hadoop 0.21.0:

hadoop-0.21.0.tar.gz      17-Aug-2010 06:10   71M  
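For example, from Cygwin you can fetch the tarball directly with wget (if you have the wget package installed). The URL below points at the Apache archive and is just an illustration – substitute the mirror that the page above gives you:

$ wget http://archive.apache.org/dist/hadoop/core/hadoop-0.21.0/hadoop-0.21.0.tar.gz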

2) Unpack

Once downloaded, simply unpack the tar into a directory of your choice. Although Hadoop is cross-platform, its shell scripts cope badly with spaces in paths, so ensure you extract it to a directory without spaces (not Program Files); otherwise you may experience seemingly random errors.

The location I picked was C:\hadoop. To extract, run the following command from Cygwin:

$ tar xvfz hadoop-0.21.0.tar.gz
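All the commands in the rest of this post assume you are inside the extracted directory:

$ cd hadoop-0.21.0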

3) Configuration

Once extracted, you need to customise three configuration files – core-site.xml, hdfs-site.xml and mapred-site.xml. I used the settings from the quick start guide below; these settings specify which ports the different subsystems should run on.

http://hadoop.apache.org/common/docs/current/single_node_setup.html#Configuration
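All three files live in the conf directory of the extracted tree. For reference, the pseudo-distributed settings from that guide are along these lines – fs.default.name and mapred.job.tracker pin HDFS and the JobTracker to local ports 9000 and 9001, and a dfs.replication of 1 suits a single node (check the guide itself for the current values):

conf/core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>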

We can now start to test the setup. Personally, with any new tool I try outputting the version first, to check the basic system is happy.

In this example, execute:

$ bin/hadoop version

In typical fashion, my machine wasn’t configured correctly and I received the following error:
Error: JAVA_HOME is not set.

This indicated that I had to point JAVA_HOME at my JDK installation:

$ export JAVA_HOME=/cygdrive/c/PROGRA~1/Java/jdk1.6.0_23/ 
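Note that PROGRA~1 is the Windows 8.3 short name for Program Files – using it avoids the space in the path, which would otherwise trip up the shell scripts. To avoid retyping the export in every new Cygwin session, you can append it to your ~/.bashrc (adjust the JDK path to match your own installation):

$ echo 'export JAVA_HOME=/cygdrive/c/PROGRA~1/Java/jdk1.6.0_23/' >> ~/.bashrc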

When I tried to output the version again I received a Java stack trace – at least it meant something had been executed. Solving this requires some Windows-only modifications.

Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/hadoop/util/PlatformName
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.util.PlatformName
       at java.net.URLClassLoader$1.run(Unknown Source)
       at java.security.AccessController.doPrivileged(Native Method)
       at java.net.URLClassLoader.findClass(Unknown Source)
       at java.lang.ClassLoader.loadClass(Unknown Source)
       at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
       at java.lang.ClassLoader.loadClass(Unknown Source)
       at java.lang.ClassLoader.loadClassInternal(Unknown Source)

4) Modifications for Windows

Personally, this was the most frustrating step, but it is key to solving the exception above. Thankfully I found the answer in this blog post:

http://juliensimon.blogspot.com/2011/01/installing-hadoop-on-windows-cygwin.html

The problem is that the Java classpath can’t find the required Hadoop jars, due to the way Cygwin handles Windows file paths.

As described in the post, you need to modify hadoop-config.sh. At around line 181, there is an if statement with the comment “# cygwin path translation”.

Within this block, add the following line:

CLASSPATH=`cygpath -wp "$CLASSPATH"`

Pitfall: It won’t have any effect if you don’t add it within the if block. This took me a good hour to work out before finding the post.
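For orientation, after the edit the block should look roughly like this – the surrounding lines are a sketch from memory, so match them against your actual copy of hadoop-config.sh rather than pasting verbatim:

# cygwin path translation
if $cygwin; then
  CLASSPATH=`cygpath -wp "$CLASSPATH"`
fi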

Attempting the command again, you should hopefully see version output similar to the following.

$ hadoop version
Hadoop 0.21.0
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.21 -r 985326
Compiled by tomwhite on Tue Aug 17 01:02:28 EDT 2010
From source with checksum a1aeb15b4854808d152989ba76f90fac

5) Starting Hadoop

Now is the time to start running Hadoop.

Firstly, we need to set up a temporary directory on the hard drive for Hadoop to use while running jobs. By default this will be /tmp/hadoop-${user.name}, which maps to C:\tmp\hadoop-${your login name}. If you wish to use another directory, modify the hadoop.tmp.dir property in your core-site.xml.
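For example, to keep the temporary files under the install directory, you could add something like this to core-site.xml (the /hadoop/tmp value here is purely illustrative):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/hadoop/tmp</value>
</property>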

To set up the directory, execute the following command:

$ bin/hadoop namenode -format

You can then start Hadoop:

$ bin/start-all.sh

This will output some information about which daemons have been started. From a developer’s viewpoint, the most important result is two newly accessible web interfaces.

HDFS information – http://localhost:50070 – exposes details about the HDFS filesystem we have just formatted

JobTracker – http://localhost:50030 – exposes details about any running or finished jobs
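If either page fails to load, the JDK’s jps tool is a quick way to check which daemons actually came up – in a working single-node setup you would expect to see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker listed:

$ jps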

That’s it. Hadoop is now ready to start executing jobs, which I’ll discuss in my next post.
