Friday, September 6, 2013
Adventures in Hadoop: #2 Starting from Scratch
As I mentioned in my previous post, I wanted to collect my experience from various sources. This post continues a series in which I'd like to share my learning experience with Hadoop.
Disclaimer: I have used Michael Noll's article as the base for my own exercise. I deviate slightly from his instructions and will expand on those differences in later posts.
Add Software Repositories
maazanjum@hadoop:~$ sudo apt-get install python-software-properties
[sudo] password for maazanjum:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
  python-software-properties
0 upgraded, 1 newly installed, 0 to remove and 212 not upgraded.
Need to get 19.1 kB of archives.
After this operation, 132 kB of additional disk space will be used.
Get:1 http://us.archive.ubuntu.com/ubuntu/ raring-updates/universe python-software-properties all 0.92.17.1 [19.1 kB]
Fetched 19.1 kB in 0s (78.6 kB/s)
Selecting previously unselected package python-software-properties.
(Reading database ... 155358 files and directories currently installed.)
Unpacking python-software-properties (from .../python-software-properties_0.92.17.1_all.deb) ...
Setting up python-software-properties (0.92.17.1) ...
maazanjum@hadoop:~$ sudo add-apt-repository ppa:ferramroberto/java
You are about to add the following PPA to your system:
 PPA for the latest version of JAVA (the banner repeats this in Italian, German, Spanish, and French)
 by LffL http://www.lffl.org
More info: https://launchpad.net/~ferramroberto/+archive/java
Press [ENTER] to continue or ctrl-c to cancel adding it
gpg: keyring `/tmp/tmpslw4kg/secring.gpg' created
gpg: keyring `/tmp/tmpslw4kg/pubring.gpg' created
gpg: requesting key 3ACC3965 from hkp server keyserver.ubuntu.com
gpg: /tmp/tmpslw4kg/trustdb.gpg: trustdb created
gpg: key 3ACC3965: public key "Launchpad lffl" imported
gpg: no ultimately trusted keys found
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)
OK
Update Source List
maazanjum@hadoop:~$ sudo apt-get update
[sudo] password for maazanjum:
Ign http://ppa.launchpad.net raring Release.gpg
Ign http://ppa.launchpad.net raring Release
…
Install Sun Java 6 JDK
maazanjum@hadoop:~$ sudo apt-get install sun-java6-jdk
Reading package lists... Done
Building dependency tree
Reading state information... Done
Package sun-java6-jdk is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
E: Package 'sun-java6-jdk' has no installation candidate
According to Happy Coding, Sun JDK has been removed from the partner archives; therefore I'll use the OpenJDK version instead.
maazanjum@hadoop:~$ sudo apt-get install openjdk-6-jdk
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
  ca-certificates-java icedtea-6-jre-cacao icedtea-6-jre-jamvm icedtea-netx icedtea-netx-common java-common
  libatk-wrapper-java libatk-wrapper-java-jni libgif4 libice-dev libnss3-1d libpthread-stubs0 libpthread-stubs0-dev libsm-dev
  libx11-6 libx11-dev libx11-doc libxau-dev libxcb1 libxcb1-dev libxdmcp-dev libxt-dev libxt6 openjdk-6-jre
…
Unpacking openjdk-6-jdk:amd64 (from .../openjdk-6-jdk_6b27-1.12.6-1ubuntu0.13.04.2_amd64.deb) ...
Processing triggers for ca-certificates ...
Updating certificates in /etc/ssl/certs... 0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d....done.
Processing triggers for doc-base ...
Processing 32 changed doc-base files, 2 added doc-base files...
Processing triggers for man-db ...
Processing triggers for bamfdaemon ...
…
done.
done.
maazanjum@hadoop:~$
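Before moving on, it's worth confirming which Java ended up on the machine; on this setup it should report an OpenJDK 1.6 runtime, though the exact build string will vary:
# Confirm the active JDK; expect an OpenJDK 1.6 version string here.
java -version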
Add a Hadoop system user
maazanjum@hadoop:~$ sudo addgroup hadoop
Adding group `hadoop' (GID 1001) ...
Done.
maazanjum@hadoop:~$ id hadoop
id: hadoop: no such user
maazanjum@hadoop:~$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
Full Name []: Hadoop User
Room Number []:
Work Phone []:
Home Phone []:
Other []:
Is the information correct? [Y/n] y
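The earlier id check failed because only the group existed at that point; now that the user has been created, the same check is a quick way to confirm the account and its primary group:
# Verify the new account; the UID/GID values may differ on your machine.
id hduser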
Configure SSH
maazanjum@hadoop:~$ su - hduser
Password:
hduser@hadoop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
c6:e3:f1:a4:43:02:21:20:96:7f:29:ce:16:8d:ef:6a hduser@hadoop
The key's randomart image is:
+--[ RSA 2048]----+
|ooo .            |
|o. . .           |
|  . + .          |
|   = = .         |
|  o = . S .      |
|   + . = *       |
|  . . + .        |
|   E . .         |
|      ...        |
+-----------------+
hduser@hadoop:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
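One caveat that isn't in the transcript above but trips people up: sshd silently ignores authorized_keys if the file or the .ssh directory is group- or world-writable, so it's worth tightening the permissions just in case:
# sshd ignores keys with loose permissions; lock down the directory and file.
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys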
I had to install and configure SSH before the next few steps worked.
hduser@hadoop:~$ sudo apt-get install openssh-server
I also granted hduser sudo access:
maazanjum@hadoop:~$ sudo adduser hduser sudo
[sudo] password for maazanjum:
Adding user `hduser' to group `sudo' ...
Adding user hduser to group sudo
Done.
hduser@hadoop:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is 27:8e:e7:97:72:a2:08:5e:b2:4e:95:91:61:34:72:3a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 13.04 (GNU/Linux 3.8.0-19-generic x86_64)
* Documentation: https://help.ubuntu.com/
Disable IPv6
hduser@hadoop:~$ sudo vi /etc/sysctl.conf
[sudo] password for hduser:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Reboot the machine and check whether it's actually disabled.
hduser@hadoop:~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1
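As a side note, a reboot isn't strictly required to pick up these settings; sysctl can reload the file in place, and the reboot then just confirms they persist across a restart:
# Apply the sysctl.conf changes immediately, without a reboot.
sudo sysctl -p /etc/sysctl.conf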
Install Hadoop
hduser@hadoop:~$ cd /usr/local
hduser@hadoop:/usr/local$ ls -lhtr
total 32K
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 src
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 sbin
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 include
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 games
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 etc
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 bin
drwxr-xr-x 4 root root 4.0K Apr 24 10:04 lib
drwxr-xr-x 7 root root 4.0K Apr 24 10:05 share
lrwxrwxrwx 1 root root    9 Sep  2 22:31 man -> share/man
I use the -C option to extract the tar archive into a specific directory.
hduser@hadoop:/usr/local$ sudo tar -zxvf /tmp/hadoop-1.2.1.tar.gz -C /usr/local
hadoop-1.2.1/
hadoop-1.2.1/.eclipse.templates/
hadoop-1.2.1/.eclipse.templates/.externalToolBuilders/
hadoop-1.2.1/.eclipse.templates/.launches/
hadoop-1.2.1/bin/
hadoop-1.2.1/c++/
hadoop-1.2.1/c++/Linux-amd64-64/
…
I personally prefer to link the binaries to a generic “hadoop” folder.
hduser@hadoop:/usr/local$ sudo ln -s hadoop-1.2.1/ hadoop
hduser@hadoop:/usr/local$ ls -lhtr
total 36K
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 src
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 sbin
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 include
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 games
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 etc
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 bin
drwxr-xr-x 4 root      root      4.0K Apr 24 10:04 lib
drwxr-xr-x 7 root      root      4.0K Apr 24 10:05 share
drwxr-xr-x 9 maazanjum maazanjum 4.0K Aug 15 22:15 hadoop-1.2.1
lrwxrwxrwx 1 root      root         9 Sep  2 22:31 man -> share/man
lrwxrwxrwx 1 root      root        19 Sep  3 12:24 hadoop -> hadoop-1.2.1/
Change ownership to hduser.
hduser@hadoop:/usr/local$ sudo chown -R hduser:hadoop hadoop*
hduser@hadoop:/usr/local$ ls -lhtr
total 36K
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 src
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 sbin
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 include
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 games
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 etc
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 bin
drwxr-xr-x  4 root   root   4.0K Apr 24 10:04 lib
drwxr-xr-x  7 root   root   4.0K Apr 24 10:05 share
drwxr-xr-x 15 hduser hadoop 4.0K Jul 22 15:26 hadoop-1.2.1
lrwxrwxrwx  1 root   root      9 Sep  2 22:31 man -> share/man
lrwxrwxrwx  1 hduser hadoop   13 Sep  3 13:14 hadoop -> hadoop-1.2.1/
Configure .bashrc file
I appended the following to my .profile file instead of .bashrc.
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
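To pick up these variables without logging out and back in, source the file and check that the hadoop binary resolves:
# Reload the profile in the current shell and confirm hadoop is on the PATH.
source ~/.profile
which hadoop
hadoop version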
Since the article I followed details a single-node setup, I will stick with that for now. Later posts will cover a multi-node setup.
Configuring Hadoop
hadoop-env.sh
Edit the string below
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
Change it to
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64
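If you'd rather script this edit than open the file by hand, a sed one-liner can make the same change; this assumes the stock commented-out line from the tarball and the OpenJDK path used above:
# Replace the commented-out JAVA_HOME line in hadoop-env.sh in place.
sudo sed -i 's|^# export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64|' /usr/local/hadoop/conf/hadoop-env.sh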
conf/*-site.xml
I'll leave the defaults and create the working temporary directory for Hadoop.
hduser@hadoop:/usr/local/hadoop/conf$ sudo mkdir -p /app/hadoop/tmp
hduser@hadoop:/usr/local/hadoop/conf$ sudo chown hduser:hadoop /app/hadoop/tmp
hduser@hadoop:/usr/local/hadoop/conf$ sudo chmod 750 /app/hadoop/tmp
Follow the instructions in this article to configure the *-site.xml files; a minimal sketch follows below.
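For reference, here is a minimal sketch of what the three files boil down to for a single-node setup. It assumes the /app/hadoop/tmp directory created above and the localhost ports Michael Noll's article uses (54310 for HDFS, 54311 for the JobTracker); treat it as a starting point rather than the definitive configuration:
# Minimal single-node configuration, written from /usr/local/hadoop/conf.
# Assumes hadoop.tmp.dir=/app/hadoop/tmp and the ports from Noll's article.
cat > core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
EOF

cat > mapred-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>
EOF

cat > hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF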
Getting Started
Format the HDFS from the NameNode
hduser@hadoop:~$ /usr/local/hadoop/bin/hadoop namenode -format
Warning: $HADOOP_HOME is deprecated.
13/09/03 13:36:28 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop/192.168.182.133
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG:   java = 1.6.0_27
************************************************************/
Re-format filesystem in /app/hadoop/tmp/dfs/name ? (Y or N) y
Format aborted in /app/hadoop/tmp/dfs/name
13/09/03 13:36:29 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop/192.168.182.133
************************************************************/
Note that the re-format prompt is case-sensitive: the lowercase "y" above was treated as a "no" and the format was aborted. Run the command again and answer with an uppercase "Y" before starting the cluster.
Startup
hduser@hadoop:~$ /usr/local/hadoop/bin/start-all.sh
Warning: $HADOOP_HOME is deprecated.
starting namenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-namenode-hadoop.out
localhost: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-hadoop.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-secondarynamenode-hadoop.out
starting jobtracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-jobtracker-hadoop.out
localhost: starting tasktracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-hadoop.out
Check whether Hadoop is up and running.
hduser@hadoop:~$ jps
2565 DataNode
3218 Jps
2311 NameNode
2797 SecondaryNameNode
3102 TaskTracker
2873 JobTracker
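Besides jps, the Hadoop 1.x daemons serve status pages on their default web ports, 50070 for the NameNode and 50030 for the JobTracker, so a quick curl (assuming curl is installed) is another sanity check:
# Poke the default web UIs; an HTTP response means the daemon is answering.
curl -sI http://localhost:50070/ | head -1   # NameNode status page
curl -sI http://localhost:50030/ | head -1   # JobTracker status page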
Shutdown
hduser@hadoop:/usr/local/hadoop$ bin/stop-all.sh
Warning: $HADOOP_HOME is deprecated.
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
In the next article, I will go over some of the examples I've found useful for understanding the kinds of questions Hadoop can help answer.
Cheers!