Friday, September 6, 2013
Adventures in Hadoop: #2 Starting from Scratch
As I mentioned in my previous post, I want to collect my experience from various sources. This post continues the series in which I share my learning experience with Hadoop.
Disclaimer: I have used Michael Noll's article as the base for my own exercise. I deviate slightly from his instructions and will expand on those differences in later posts.
Add Software Repositories
maazanjum@hadoop:~$ sudo apt-get install python-software-properties
[sudo] password for maazanjum:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
python-software-properties
0 upgraded, 1 newly installed, 0 to remove and 212 not upgraded.
Need to get 19.1 kB of archives.
After this operation, 132 kB of additional disk space will be used.
Get:1 http://us.archive.ubuntu.com/ubuntu/ raring-updates/universe python-software-properties all 0.92.17.1 [19.1 kB]
Fetched 19.1 kB in 0s (78.6 kB/s)
Selecting previously unselected package python-software-properties.
(Reading database ... 155358 files and directories currently installed.)
Unpacking python-software-properties (from .../python-software-properties_0.92.17.1_all.deb) ...
Setting up python-software-properties (0.92.17.1) ...
maazanjum@hadoop:~$ sudo add-apt-repository ppa:ferramroberto/java
You are about to add the following PPA to your system:
PPA for the latest version of JAVA
by LffL http://www.lffl.org
More info:
https://launchpad.net/~ferramroberto/+archive/java
Press [ENTER] to continue or ctrl-c to cancel adding it
gpg: keyring `/tmp/tmpslw4kg/secring.gpg' created
gpg: keyring `/tmp/tmpslw4kg/pubring.gpg' created
gpg: requesting key 3ACC3965 from hkp server keyserver.ubuntu.com
gpg: /tmp/tmpslw4kg/trustdb.gpg: trustdb created
gpg: key 3ACC3965: public key "Launchpad lffl" imported
gpg: no ultimately trusted keys found
gpg: Total number processed: 1
gpg: imported: 1 (RSA: 1)
OK
Update Source List
maazanjum@hadoop:~$ sudo apt-get update
[sudo] password for maazanjum:
Ign http://ppa.launchpad.net raring Release.gpg
Ign http://ppa.launchpad.net raring Release
…
Install Sun Java 6 JDK
maazanjum@hadoop:~$ sudo apt-get install sun-java6-jdk
Reading package lists... Done
Building dependency tree
Reading state information... Done
Package sun-java6-jdk is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or is only available from another source
E: Package 'sun-java6-jdk' has no installation candidate
According to Happy Coding, Sun JDK has been removed from the partner archives; therefore, I'll use the OpenJDK version instead.
maazanjum@hadoop:~$ sudo apt-get install openjdk-6-jdk
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
  ca-certificates-java icedtea-6-jre-cacao icedtea-6-jre-jamvm icedtea-netx icedtea-netx-common java-common libatk-wrapper-java libatk-wrapper-java-jni libgif4 libice-dev libnss3-1d libpthread-stubs0 libpthread-stubs0-dev libsm-dev libx11-6 libx11-dev libx11-doc libxau-dev libxcb1 libxcb1-dev libxdmcp-dev libxt-dev libxt6 openjdk-6-jre
…
Unpacking openjdk-6-jdk:amd64 (from .../openjdk-6-jdk_6b27-1.12.6-1ubuntu0.13.04.2_amd64.deb) ...
Processing triggers for ca-certificates ...
Updating certificates in /etc/ssl/certs... 0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d....done.
Processing triggers for doc-base ...
Processing 32 changed doc-base files, 2 added doc-base files...
Processing triggers for man-db ...
Processing triggers for bamfdaemon ...
…
done.
done.
maazanjum@hadoop:~$
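Before moving on, it's worth a quick sanity check that the JDK is in place (your exact version string may differ):
maazanjum@hadoop:~$ java -version
The output should report an OpenJDK Runtime Environment based on Java 1.6.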
Add a Hadoop system user
maazanjum@hadoop:~$ sudo addgroup hadoop
Adding group `hadoop' (GID 1001) ...
Done.
maazanjum@hadoop:~$ id hadoop
id: hadoop: no such user
maazanjum@hadoop:~$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
Full Name []: Hadoop User
Room Number []:
Work Phone []:
Home Phone []:
Other []:
Is the information correct? [Y/n] y
Configure SSH
maazanjum@hadoop:~$ su - hduser
Password:
hduser@hadoop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
c6:e3:f1:a4:43:02:21:20:96:7f:29:ce:16:8d:ef:6a hduser@hadoop
The key's randomart image is:
+--[ RSA 2048]----+
|ooo .            |
|o. . .           |
| . + .           |
|  = = .          |
| o = . S .       |
|  + . = *        |
|   . . + .       |
|    E . .        |
|     ...         |
+-----------------+
hduser@hadoop:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
I had to install and configure SSH before the next few steps worked.
hduser@hadoop:~$ sudo apt-get install openssh-server
I also granted hduser sudo access:
maazanjum@hadoop:~$ sudo adduser hduser sudo
[sudo] password for maazanjum:
Adding user `hduser' to group `sudo' ...
Adding user hduser to group sudo
Done.
hduser@hadoop:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is 27:8e:e7:97:72:a2:08:5e:b2:4e:95:91:61:34:72:3a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 13.04 (GNU/Linux 3.8.0-19-generic x86_64)
* Documentation: https://help.ubuntu.com/
Disable IPv6
hduser@hadoop:~$ sudo vi /etc/sysctl.conf
[sudo] password for hduser:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Reboot the machine and check whether it's actually disabled (a value of 1 means IPv6 is off).
hduser@hadoop:~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1
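Alternatively, Michael Noll's article notes that you can disable IPv6 for Hadoop only, rather than system-wide, by adding the following line to conf/hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true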
Install Hadoop
hduser@hadoop:~$ cd /usr/local
hduser@hadoop:/usr/local$ ls -lhtr
total 32K
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 src
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 sbin
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 include
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 games
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 etc
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 bin
drwxr-xr-x 4 root root 4.0K Apr 24 10:04 lib
drwxr-xr-x 7 root root 4.0K Apr 24 10:05 share
lrwxrwxrwx 1 root root    9 Sep  2 22:31 man -> share/man
I use the -C option to extract the tar archive into a specific directory.
hduser@hadoop:/usr/local$ sudo tar -zxvf /tmp/hadoop-1.2.1.tar.gz -C /usr/local
hadoop-1.2.1/
hadoop-1.2.1/.eclipse.templates/
hadoop-1.2.1/.eclipse.templates/.externalToolBuilders/
hadoop-1.2.1/.eclipse.templates/.launches/
hadoop-1.2.1/bin/
hadoop-1.2.1/c++/
hadoop-1.2.1/c++/Linux-amd64-64/
…
I personally prefer to link the binaries to a generic "hadoop" folder.
hduser@hadoop:/usr/local$ sudo ln -s hadoop-1.2.1/ hadoop
hduser@hadoop:/usr/local$ ls -lhtr
total 36K
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 src
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 sbin
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 include
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 games
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 etc
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 bin
drwxr-xr-x 4 root      root      4.0K Apr 24 10:04 lib
drwxr-xr-x 7 root      root      4.0K Apr 24 10:05 share
drwxr-xr-x 9 maazanjum maazanjum 4.0K Aug 15 22:15 hadoop-1.2.1
lrwxrwxrwx 1 root      root         9 Sep  2 22:31 man -> share/man
lrwxrwxrwx 1 root      root        19 Sep  3 12:24 hadoop -> hadoop-1.2.1/
Change ownership to hduser.
hduser@hadoop:/usr/local$ sudo chown -R hduser:hadoop hadoop*
hduser@hadoop:/usr/local$ ls -lhtr
total 36K
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 src
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 sbin
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 include
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 games
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 etc
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 bin
drwxr-xr-x  4 root   root   4.0K Apr 24 10:04 lib
drwxr-xr-x  7 root   root   4.0K Apr 24 10:05 share
drwxr-xr-x 15 hduser hadoop 4.0K Jul 22 15:26 hadoop-1.2.1
lrwxrwxrwx  1 root   root      9 Sep  2 22:31 man -> share/man
lrwxrwxrwx  1 hduser hadoop   13 Sep  3 13:14 hadoop -> hadoop-1.2.1/
Configure .bashrc file
I appended the following to my .profile file instead of .bashrc:
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
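To pick up these settings without logging out and back in, source the file and confirm the hadoop binary resolves; hadoop version should report 1.2.1 for this install:
hduser@hadoop:~$ source ~/.profile
hduser@hadoop:~$ hadoop version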
Since the article I followed details a single-node setup, I will stick with that for now. Later posts will cover a multi-node setup.
Configuring Hadoop
hadoop-env.sh
Edit the string below
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
Change it to
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64
conf/*-site.xml
I'll leave the defaults and create the working temporary directory for Hadoop.
hduser@hadoop:/usr/local/hadoop/conf$ sudo mkdir -p /app/hadoop/tmp
hduser@hadoop:/usr/local/hadoop/conf$ sudo chown hduser:hadoop /app/hadoop/tmp
hduser@hadoop:/usr/local/hadoop/conf$ sudo chmod 750 /app/hadoop/tmp
Follow the instructions in this article to configure the *-site.xml files.
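For reference, the single-node values from that article look roughly like this; each property block goes inside the <configuration> element of its respective file, and the 54310/54311 ports are the article's conventions rather than requirements.
conf/core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>
conf/mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>
conf/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>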
Getting Started
Format the HDFS from the NameNode
hduser@hadoop:~$ /usr/local/hadoop/bin/hadoop namenode -format
Warning: $HADOOP_HOME is deprecated.
13/09/03 13:36:28 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop/192.168.182.133
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG:   java = 1.6.0_27
************************************************************/
Re-format filesystem in /app/hadoop/tmp/dfs/name ? (Y or N) y
Format aborted in /app/hadoop/tmp/dfs/name
13/09/03 13:36:29 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop/192.168.182.133
************************************************************/
Note the "Format aborted" line above: the prompt is case-sensitive, so a lowercase "y" aborts the format. Run the command again and answer with an uppercase "Y".
Startup
hduser@hadoop:~$ /usr/local/hadoop/bin/start-all.sh
Warning: $HADOOP_HOME is deprecated.
starting namenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-namenode-hadoop.out
localhost: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-hadoop.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-secondarynamenode-hadoop.out
starting jobtracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-jobtracker-hadoop.out
localhost: starting tasktracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-hadoop.out
Check whether Hadoop is up and running.
hduser@hadoop:~$ jps
2565 DataNode
3218 Jps
2311 NameNode
2797 SecondaryNameNode
3102 TaskTracker
2873 JobTracker
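With all five daemons up, the Hadoop 1.x web interfaces should also be reachable on their default ports: the NameNode at http://localhost:50070, the JobTracker at http://localhost:50030, and the TaskTracker at http://localhost:50060.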
Shutdown
hduser@hadoop:/usr/local/hadoop$ bin/stop-all.sh
Warning: $HADOOP_HOME is deprecated.
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
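Running jps again at this point should list only the Jps process itself, confirming all five daemons have stopped:
hduser@hadoop:~$ jps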
In the next article, I will go over some of the examples I've found useful for understanding the kinds of questions Hadoop can help answer.
Cheers!
Saturday, August 31, 2013
Adventures in Hadoop: #1 The First Step is the Most Important
I consider myself a tenaciously curious person. In the spirit of "discovery", I've embarked on learning Hadoop and, subsequently, the various bits and pieces associated with it. Since I am now "into" blogging, it occurred to me that there might be others like myself who are keen on learning about Hadoop. Below is a list of useful blogs I visited in my quest for knowledge.
This picture (from my daughter's room) aptly represents me vs. BIG DATA :)
Note: This blog is a Work In-Progress (WIP). Please revisit it frequently for updated content :)
Hadoop
Without a doubt, Hadoop is a useful technology when applied to the correct use-case. I think it all boils down to "What is your question?". But before I got too philosophical, the more relevant question was "How does it work?". I stumbled onto Michael Noll's tutorial on configuring a Single Node Cluster. He did an amazing job creating step-by-step documentation of the setup. It was easy enough to configure it and test with the Gutenberg examples.
Don't forget to check out the Web Interface for the NameNode, JobTracker, and TaskTracker.
Pig
At this point, I was thinking "Great, I have a Hadoop install, but how do I easily get it to do my work?". I mean, I can program in Java, but I'm no ace! Enter Pig Latin.
Once again, I found an excellent article by Wayne Adams which outlined how to leverage Pig to "ask" the question. He used the data dumps available for New Issues Pool Statistics to illustrate how Pig Latin is utilized on Hadoop.
Hive
Again, as I mentioned above, I'm from a DBA background, so queries are familiar to me. Hive is a great add-on to Hadoop that allows for a SQL-style interface on top of NoSQL data. I was reminded of External Tables in Oracle when I created the tables from Ben Hidalgo's example.
Conclusion
I tend to drift towards over-simplification at times, and since I come from a DBA background with development roots, I like to use the "you get what you ask for" analogy when dealing with an instance. For example, if you ask for a lot of data, well, you're going to get it, and - unless you're on something like an Exadata machine - it might take a while. You know, the ask-a-stupid-question, get-a-stupid-answer type of deal. The point of my rant is: from what I surmise, Hadoop (NoSQL) has its place for certain use-cases, and the "right" solution depends on the "right" question.
I'm planning to rebuild this environment because it's been a couple of weeks since I last tinkered with it. I aim to provide more details on this blog for each step. I've also started working with R and exploring how, at the very least, I can use it for my everyday work.
Next in Series: #2 Starting from Scratch