Friday, September 6, 2013

Adventures in Hadoop: #2 Starting from Scratch

As I mentioned in my previous post, I want to collect my experience from various sources. This post continues the series in which I share my learning experience with Hadoop.

Disclaimer: I have used Michael Noll’s article as the base for my own exercise. I deviate slightly from his instructions and will expand on those deviations in later posts.

Add Software Repositories

maazanjum@hadoop:~$ sudo apt-get install python-software-properties
[sudo] password for maazanjum: 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  python-software-properties
0 upgraded, 1 newly installed, 0 to remove and 212 not upgraded.
Need to get 19.1 kB of archives.
After this operation, 132 kB of additional disk space will be used.
Get:1 http://us.archive.ubuntu.com/ubuntu/ raring-updates/universe python-software-properties all 0.92.17.1 [19.1 kB]
Fetched 19.1 kB in 0s (78.6 kB/s)               
Selecting previously unselected package python-software-properties.
(Reading database ... 155358 files and directories currently installed.)
Unpacking python-software-properties (from .../python-software-properties_0.92.17.1_all.deb) ...
Setting up python-software-properties (0.92.17.1) ...

maazanjum@hadoop:~$ sudo add-apt-repository ppa:ferramroberto/java
You are about to add the following PPA to your system:
 PPA for the latest version of JAVA

by LffL http://www.lffl.org

 More info: https://launchpad.net/~ferramroberto/+archive/java
Press [ENTER] to continue or ctrl-c to cancel adding it

gpg: keyring `/tmp/tmpslw4kg/secring.gpg' created
gpg: keyring `/tmp/tmpslw4kg/pubring.gpg' created
gpg: requesting key 3ACC3965 from hkp server keyserver.ubuntu.com
gpg: /tmp/tmpslw4kg/trustdb.gpg: trustdb created
gpg: key 3ACC3965: public key "Launchpad lffl" imported
gpg: no ultimately trusted keys found
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)
OK

Update Source List

maazanjum@hadoop:~$ sudo apt-get update
[sudo] password for maazanjum:
Ign http://ppa.launchpad.net raring Release.gpg
Ign http://ppa.launchpad.net raring Release  
                                                               

Install Sun Java 6 JDK

maazanjum@hadoop:~$ sudo apt-get install sun-java6-jdk
Reading package lists... Done
Building dependency tree      
Reading state information... Done
Package sun-java6-jdk is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'sun-java6-jdk' has no installation candidate

According to Happy Coding, Sun JDK has been removed from the partner archives; therefore I’ll use the OpenJDK version instead.

maazanjum@hadoop:~$ sudo apt-get install openjdk-6-jdk
Reading package lists... Done
Building dependency tree      
Reading state information... Done
The following extra packages will be installed:
  ca-certificates-java icedtea-6-jre-cacao icedtea-6-jre-jamvm icedtea-netx icedtea-netx-common java-common
  libatk-wrapper-java libatk-wrapper-java-jni libgif4 libice-dev libnss3-1d libpthread-stubs0 libpthread-stubs0-dev libsm-dev
  libx11-6 libx11-dev libx11-doc libxau-dev libxcb1 libxcb1-dev libxdmcp-dev libxt-dev libxt6 openjdk-6-jre
Unpacking openjdk-6-jdk:amd64 (from .../openjdk-6-jdk_6b27-1.12.6-1ubuntu0.13.04.2_amd64.deb) ...
Processing triggers for ca-certificates ...
Updating certificates in /etc/ssl/certs... 0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d....done.
Processing triggers for doc-base ...
Processing 32 changed doc-base files, 2 added doc-base files...
Processing triggers for man-db ...
Processing triggers for bamfdaemon ...
done.
done.
maazanjum@hadoop:~$

Add a Hadoop system user

maazanjum@hadoop:~$ sudo addgroup hadoop
Adding group `hadoop' (GID 1001) ...
Done.
maazanjum@hadoop:~$ id hadoop
id: hadoop: no such user
maazanjum@hadoop:~$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
      Full Name []: Hadoop User
      Room Number []:
      Work Phone []:
      Home Phone []:
      Other []:
Is the information correct? [Y/n] y

Configure SSH

maazanjum@hadoop:~$ su - hduser
Password:
hduser@hadoop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
c6:e3:f1:a4:43:02:21:20:96:7f:29:ce:16:8d:ef:6a hduser@hadoop
The key's randomart image is:
+--[ RSA 2048]----+
|ooo .            |
|o. . .           |
|  . + .          |
|   = = .         |
|  o = . S .      |
|   + . = *       |
|  . .   + .      |
|   E .   .       |
|  ...            |
+-----------------+

hduser@hadoop:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
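One gotcha worth knowing here: sshd (with StrictModes, its default) silently ignores the key if ~/.ssh or authorized_keys is group- or world-accessible. A quick hardening sketch, using the standard layout rather than anything from the transcript above:

```shell
# Tighten permissions so sshd accepts the key under StrictModes
mkdir -p "$HOME/.ssh"
touch "$HOME/.ssh/authorized_keys"
chmod 700 "$HOME/.ssh"              # only the owner may enter the directory
chmod 600 "$HOME/.ssh/authorized_keys"  # only the owner may read/write the key list
```

If passwordless login mysteriously fails later, these permissions are the first thing to check.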

I had to install and configure SSH before the next few steps worked.

hduser@hadoop:~$ sudo apt-get install openssh-server

I also granted hduser sudo access.

maazanjum@hadoop:~$ sudo adduser hduser sudo
[sudo] password for maazanjum:
Adding user `hduser' to group `sudo' ...
Adding user hduser to group sudo
Done.

hduser@hadoop:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is 27:8e:e7:97:72:a2:08:5e:b2:4e:95:91:61:34:72:3a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 13.04 (GNU/Linux 3.8.0-19-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

Disable IPv6

hduser@hadoop:~$ sudo vi /etc/sysctl.conf
[sudo] password for hduser:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Reboot the machine and check whether it's actually disabled.

hduser@hadoop:~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1
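If you would rather not disable IPv6 system-wide, Michael Noll's article mentions an alternative: leave IPv6 enabled on the host and have only Hadoop's JVMs prefer IPv4, by adding this line to conf/hadoop-env.sh:

```shell
# conf/hadoop-env.sh: make Hadoop's JVMs prefer IPv4 without touching sysctl
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
```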

Install Hadoop


hduser@hadoop:~$ cd /usr/local
hduser@hadoop:/usr/local$ ls -lhtr
total 32K
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 src
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 sbin
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 include
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 games
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 etc
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 bin
drwxr-xr-x 4 root root 4.0K Apr 24 10:04 lib
drwxr-xr-x 7 root root 4.0K Apr 24 10:05 share
lrwxrwxrwx 1 root root    9 Sep  2 22:31 man -> share/man

I use the -C option to extract the tar archive into a specific directory.

hduser@hadoop:/usr/local$ sudo tar -zxvf /tmp/hadoop-1.2.1.tar.gz -C /usr/local
hadoop-1.2.1/
hadoop-1.2.1/.eclipse.templates/
hadoop-1.2.1/.eclipse.templates/.externalToolBuilders/
hadoop-1.2.1/.eclipse.templates/.launches/
hadoop-1.2.1/bin/
hadoop-1.2.1/c++/
hadoop-1.2.1/c++/Linux-amd64-64/

I personally prefer to link the binaries to a generic “hadoop” folder.

hduser@hadoop:/usr/local$ sudo ln -s hadoop-1.2.1/ hadoop
hduser@hadoop:/usr/local$ ls -lhtr
total 36K
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 src
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 sbin
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 include
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 games
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 etc
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 bin
drwxr-xr-x 4 root      root      4.0K Apr 24 10:04 lib
drwxr-xr-x 7 root      root      4.0K Apr 24 10:05 share
drwxr-xr-x 9 maazanjum maazanjum 4.0K Aug 15 22:15 hadoop-1.2.1
lrwxrwxrwx 1 root      root         9 Sep  2 22:31 man -> share/man
lrwxrwxrwx 1 root      root        19 Sep  3 12:24 hadoop -> hadoop-1.2.1/
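The payoff of the generic link comes at upgrade time: paths like /usr/local/hadoop stay valid and only the link moves. A small sketch in /tmp (the 1.3.0 directory is hypothetical, purely to illustrate the repointing):

```shell
# Simulate an upgrade by repointing the generic link at a newer release directory
mkdir -p /tmp/hadoop-demo/hadoop-1.2.1 /tmp/hadoop-demo/hadoop-1.3.0
cd /tmp/hadoop-demo
ln -sfn hadoop-1.2.1 hadoop
readlink hadoop               # hadoop-1.2.1
ln -sfn hadoop-1.3.0 hadoop   # -n replaces the link instead of descending into the old target
readlink hadoop               # hadoop-1.3.0
```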

Change ownership to hduser.

hduser@hadoop:/usr/local$ sudo chown -R hduser:hadoop hadoop*
hduser@hadoop:/usr/local$ ls -lhtr
total 36K
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 src
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 sbin
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 include
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 games
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 etc
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 bin
drwxr-xr-x  4 root   root   4.0K Apr 24 10:04 lib
drwxr-xr-x  7 root   root   4.0K Apr 24 10:05 share
drwxr-xr-x 15 hduser hadoop 4.0K Jul 22 15:26 hadoop-1.2.1
lrwxrwxrwx  1 root   root      9 Sep  2 22:31 man -> share/man
lrwxrwxrwx  1 hduser hadoop   13 Sep  3 13:14 hadoop -> hadoop-1.2.1/

Configure .bashrc file

I appended this to my .profile file rather than .bashrc (one of my small deviations from the article).

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
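One nit with appending to PATH in a profile file: sourcing it twice appends the directory twice. A guard like the following (my own sketch, not part of the original snippet) keeps the append idempotent:

```shell
# Append $HADOOP_HOME/bin to PATH only if it is not already present
HADOOP_HOME=/usr/local/hadoop
case ":$PATH:" in
  *":$HADOOP_HOME/bin:"*) ;;                    # already on PATH: do nothing
  *) export PATH="$PATH:$HADOOP_HOME/bin" ;;    # otherwise append exactly once
esac
```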

Since the article I followed details a single-node setup, I will stick with that for now. Later posts will detail a multi-node setup.

Configuring Hadoop


hadoop-env.sh

Edit the string below

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

Change it to

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64

conf/*-site.xml

I’ll leave the defaults and create the working temporary directory for Hadoop.

hduser@hadoop:/usr/local/hadoop/conf$ sudo mkdir -p /app/hadoop/tmp
hduser@hadoop:/usr/local/hadoop/conf$ sudo chown hduser:hadoop /app/hadoop/tmp
hduser@hadoop:/usr/local/hadoop/conf$ sudo chmod 750 /app/hadoop/tmp

Follow instructions in this article to configure the *-site.xml files.
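For reference, the minimal property set from that article looks like the fragment below; each property goes inside the <configuration> element of the named file. Ports 54310/54311 are the values the article uses, and hadoop.tmp.dir points at the directory created above.

```xml
<!-- conf/core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>

<!-- conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>

<!-- conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```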

Getting Started


Format the HDFS from the NameNode

hduser@hadoop:~$ /usr/local/hadoop/bin/hadoop namenode -format
Warning: $HADOOP_HOME is deprecated.

13/09/03 13:36:28 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop/192.168.182.133
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG:   java = 1.6.0_27
************************************************************/
Re-format filesystem in /app/hadoop/tmp/dfs/name ? (Y or N) y
Format aborted in /app/hadoop/tmp/dfs/name
13/09/03 13:36:29 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop/192.168.182.133
************************************************************/

Note that the confirmation prompt is case-sensitive: a lowercase "y" aborts the format, as it did above, so answer with an uppercase "Y" to actually format HDFS.

Startup

hduser@hadoop:~$ /usr/local/hadoop/bin/start-all.sh
Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-namenode-hadoop.out
localhost: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-hadoop.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-secondarynamenode-hadoop.out
starting jobtracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-jobtracker-hadoop.out
localhost: starting tasktracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-hadoop.out

Check whether Hadoop is up and running.

hduser@hadoop:~$ jps
2565 DataNode
3218 Jps
2311 NameNode
2797 SecondaryNameNode
3102 TaskTracker
2873 JobTracker

Shutdown

hduser@hadoop:/usr/local/hadoop$ bin/stop-all.sh
Warning: $HADOOP_HOME is deprecated.

stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

In the next article, I will go over some examples I’ve found useful for understanding the kinds of questions Hadoop can help answer.

Cheers!



