
Friday, September 6, 2013

Adventures in Hadoop: #2 Starting from Scratch

As I mentioned in my previous post, I wanted to collect my experience from various sources. This post continues a series in which I'd like to share my learning experience with Hadoop.

Disclaimer: I have used Michael Noll's article as a base for my own exercise. I tend to deviate slightly from his instructions and will expand on those deviations in later posts.

Add Software Repositories

maazanjum@hadoop:~$ sudo apt-get install python-software-properties
[sudo] password for maazanjum: 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  python-software-properties
0 upgraded, 1 newly installed, 0 to remove and 212 not upgraded.
Need to get 19.1 kB of archives.
After this operation, 132 kB of additional disk space will be used.
Get:1 http://us.archive.ubuntu.com/ubuntu/ raring-updates/universe python-software-properties all 0.92.17.1 [19.1 kB]
Fetched 19.1 kB in 0s (78.6 kB/s)               
Selecting previously unselected package python-software-properties.
(Reading database ... 155358 files and directories currently installed.)
Unpacking python-software-properties (from .../python-software-properties_0.92.17.1_all.deb) ...
Setting up python-software-properties (0.92.17.1) ...

maazanjum@hadoop:~$ sudo add-apt-repository ppa:ferramroberto/java
You are about to add the following PPA to your system:
 PPA for the latest version of JAVA (the PPA prints this description in several languages)

by LffL http://www.lffl.org

 More info: https://launchpad.net/~ferramroberto/+archive/java
Press [ENTER] to continue or ctrl-c to cancel adding it

gpg: keyring `/tmp/tmpslw4kg/secring.gpg' created
gpg: keyring `/tmp/tmpslw4kg/pubring.gpg' created
gpg: requesting key 3ACC3965 from hkp server keyserver.ubuntu.com
gpg: /tmp/tmpslw4kg/trustdb.gpg: trustdb created
gpg: key 3ACC3965: public key "Launchpad lffl" imported
gpg: no ultimately trusted keys found
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)
OK

Update Source List

maazanjum@hadoop:~$ sudo apt-get update
[sudo] password for maazanjum:
Ign http://ppa.launchpad.net raring Release.gpg
Ign http://ppa.launchpad.net raring Release
...

Install Sun Java 6 JDK

maazanjum@hadoop:~$ sudo apt-get install sun-java6-jdk
Reading package lists... Done
Building dependency tree      
Reading state information... Done
Package sun-java6-jdk is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'sun-java6-jdk' has no installation candidate

According to Happy Coding, Sun JDK has been removed from the partner archives; therefore I’ll use the OpenJDK version instead.

maazanjum@hadoop:~$ sudo apt-get install openjdk-6-jdk
Reading package lists... Done
Building dependency tree      
Reading state information... Done
The following extra packages will be installed:
  ca-certificates-java icedtea-6-jre-cacao icedtea-6-jre-jamvm icedtea-netx icedtea-netx-common java-common
  libatk-wrapper-java libatk-wrapper-java-jni libgif4 libice-dev libnss3-1d libpthread-stubs0 libpthread-stubs0-dev libsm-dev
  libx11-6 libx11-dev libx11-doc libxau-dev libxcb1 libxcb1-dev libxdmcp-dev libxt-dev libxt6 openjdk-6-jre
Unpacking openjdk-6-jdk:amd64 (from .../openjdk-6-jdk_6b27-1.12.6-1ubuntu0.13.04.2_amd64.deb) ...
Processing triggers for ca-certificates ...
Updating certificates in /etc/ssl/certs... 0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d....done.
Processing triggers for doc-base ...
Processing 32 changed doc-base files, 2 added doc-base files...
Processing triggers for man-db ...
Processing triggers for bamfdaemon ...
done.
done.
maazanjum@hadoop:~$

Add a Hadoop system user

maazanjum@hadoop:~$ sudo addgroup hadoop
Adding group `hadoop' (GID 1001) ...
Done.
maazanjum@hadoop:~$ id hadoop
id: hadoop: no such user
maazanjum@hadoop:~$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
      Full Name []: Hadoop User
      Room Number []:
      Work Phone []:
      Home Phone []:
      Other []:
Is the information correct? [Y/n] y

Configure SSH

maazanjum@hadoop:~$ su - hduser
Password:
hduser@hadoop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
c6:e3:f1:a4:43:02:21:20:96:7f:29:ce:16:8d:ef:6a hduser@hadoop
The key's randomart image is:
+--[ RSA 2048]----+
|ooo .            |
|o. . .           |
|  . + .          |
|   = = .         |
|  o = . S .      |
|   + . = *       |
|  . .   + .      |
|   E .   .       |
|  ...            |
+-----------------+

hduser@hadoop:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

I had to install and configure SSH before the next few steps worked.

hduser@hadoop:~$ sudo apt-get install openssh-server
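
If the passwordless login below still prompts for a password, tightening permissions on the key files usually helps (a common fix, not part of the original article):

hduser@hadoop:~$ chmod 700 $HOME/.ssh
hduser@hadoop:~$ chmod 600 $HOME/.ssh/authorized_keys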

I also granted hduser sudo access, since later steps run sudo as that user.

maazanjum@hadoop:~$ sudo adduser hduser sudo
[sudo] password for maazanjum:
Adding user `hduser' to group `sudo' ...
Adding user hduser to group sudo
Done.

hduser@hadoop:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is 27:8e:e7:97:72:a2:08:5e:b2:4e:95:91:61:34:72:3a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 13.04 (GNU/Linux 3.8.0-19-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

Disable IPv6

hduser@hadoop:~$ sudo vi /etc/sysctl.conf
[sudo] password for hduser:

Append the following lines at the end of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
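
If you'd rather not reboot, reloading the kernel parameters applies the change immediately (an alternative to the reboot below, not from the original write-up):

hduser@hadoop:~$ sudo sysctl -p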

Reboot the machine and check whether it's actually disabled.

hduser@hadoop:~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1

Install Hadoop


hduser@hadoop:~$ cd /usr/local
hduser@hadoop:/usr/local$ ls -lhtr
total 32K
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 src
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 sbin
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 include
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 games
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 etc
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 bin
drwxr-xr-x 4 root root 4.0K Apr 24 10:04 lib
drwxr-xr-x 7 root root 4.0K Apr 24 10:05 share
lrwxrwxrwx 1 root root    9 Sep  2 22:31 man -> share/man
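
The tarball itself isn't downloaded anywhere above; if you still need it, the Apache archive hosts the 1.2.1 release (the mirror choice is mine, not from the original post):

hduser@hadoop:/usr/local$ wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz -P /tmp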

I use the -C option to extract the tar archive into a specific directory.

hduser@hadoop:/usr/local$ sudo tar -zxvf /tmp/hadoop-1.2.1.tar.gz -C /usr/local
hadoop-1.2.1/
hadoop-1.2.1/.eclipse.templates/
hadoop-1.2.1/.eclipse.templates/.externalToolBuilders/
hadoop-1.2.1/.eclipse.templates/.launches/
hadoop-1.2.1/bin/
hadoop-1.2.1/c++/
hadoop-1.2.1/c++/Linux-amd64-64/

I personally prefer to symlink the versioned directory to a generic "hadoop" path.

hduser@hadoop:/usr/local$ sudo ln -s hadoop-1.2.1/ hadoop
hduser@hadoop:/usr/local$ ls -lhtr
total 36K
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 src
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 sbin
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 include
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 games
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 etc
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 bin
drwxr-xr-x 4 root      root      4.0K Apr 24 10:04 lib
drwxr-xr-x 7 root      root      4.0K Apr 24 10:05 share
drwxr-xr-x 9 maazanjum maazanjum 4.0K Aug 15 22:15 hadoop-1.2.1
lrwxrwxrwx 1 root      root         9 Sep  2 22:31 man -> share/man
lrwxrwxrwx 1 root      root        19 Sep  3 12:24 hadoop -> hadoop-1.2.1/

Change ownership to hduser.

hduser@hadoop:/usr/local$ sudo chown -R hduser:hadoop hadoop*
hduser@hadoop:/usr/local$ ls -lhtr
total 36K
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 src
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 sbin
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 include
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 games
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 etc
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 bin
drwxr-xr-x  4 root   root   4.0K Apr 24 10:04 lib
drwxr-xr-x  7 root   root   4.0K Apr 24 10:05 share
drwxr-xr-x 15 hduser hadoop 4.0K Jul 22 15:26 hadoop-1.2.1
lrwxrwxrwx  1 root   root      9 Sep  2 22:31 man -> share/man
lrwxrwxrwx  1 hduser hadoop   13 Sep  3 13:14 hadoop -> hadoop-1.2.1/

Configure .bashrc file

Michael Noll's tutorial appends these settings to $HOME/.bashrc; I appended them to my .profile file instead.

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
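
Once appended, reload the file so the variables take effect in the current shell (the echo is just a sanity check I added):

hduser@hadoop:~$ source ~/.profile
hduser@hadoop:~$ echo $HADOOP_HOME
/usr/local/hadoop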

Since the article I followed details a single-node setup, I will stick with that for now. Later posts will cover a multi-node setup.

Configuring Hadoop


hadoop-env.sh

Find this line:

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

and change it to:

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64

conf/*-site.xml

I'll leave most settings at their defaults and create the working temporary directory for Hadoop.

hduser@hadoop:/usr/local/hadoop/conf$ sudo mkdir -p /app/hadoop/tmp
hduser@hadoop:/usr/local/hadoop/conf$ sudo chown hduser:hadoop /app/hadoop/tmp
hduser@hadoop:/usr/local/hadoop/conf$ sudo chmod 750 /app/hadoop/tmp

Follow the instructions in Michael Noll's article to configure the *-site.xml files.
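
For reference, these are the minimal single-node values from that article; the ports (54310/54311) and the temp path are the tutorial's defaults, so adjust them if your setup differs. Each <property> block goes inside the file's <configuration> element.

<!-- conf/core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>

<!-- conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>

<!-- conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>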

Getting Started


Format the HDFS from the NameNode

hduser@hadoop:~$ /usr/local/hadoop/bin/hadoop namenode -format
Warning: $HADOOP_HOME is deprecated.

13/09/03 13:36:28 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop/192.168.182.133
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG:   java = 1.6.0_27
************************************************************/
Re-format filesystem in /app/hadoop/tmp/dfs/name ? (Y or N) y
Format aborted in /app/hadoop/tmp/dfs/name
13/09/03 13:36:29 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop/192.168.182.133
************************************************************/

Note that the format actually aborted here: the re-format prompt in Hadoop 1.x is case-sensitive and only accepts an uppercase "Y", so my lowercase "y" cancelled it. Re-run the command and answer "Y" before starting the daemons.

Startup

hduser@hadoop:~$ /usr/local/hadoop/bin/start-all.sh
Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-namenode-hadoop.out
localhost: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-hadoop.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-secondarynamenode-hadoop.out
starting jobtracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-jobtracker-hadoop.out
localhost: starting tasktracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-hadoop.out

Check whether Hadoop is up and running.

hduser@hadoop:~$ jps
2565 DataNode
3218 Jps
2311 NameNode
2797 SecondaryNameNode
3102 TaskTracker
2873 JobTracker
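
The daemons also expose web interfaces on the standard Hadoop 1.x ports, which make for a handy second check alongside jps: the NameNode at http://localhost:50070/, the JobTracker at http://localhost:50030/, and the TaskTracker at http://localhost:50060/.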

Shutdown

hduser@hadoop:/usr/local/hadoop$ bin/stop-all.sh
Warning: $HADOOP_HOME is deprecated.

stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

In the next article, I will go over some of the examples I’ve found that are useful to understand questions Hadoop can help answer.

Cheers!





Saturday, August 31, 2013

Adventures in Hadoop: #1 The First Step is the Most Important

I consider myself a tenaciously curious person. In the spirit of "discovery," I've embarked on learning Hadoop and, subsequently, the various bits and pieces associated with it. Since I am now "into" blogging, it occurred to me that there might be others like me who are keen on learning about Hadoop. Below is a list of useful blogs I visited in my quest for knowledge.

This picture (from my daughter's room) aptly represents me vs. BIG DATA :)



Note: This blog is a work in progress (WIP). Please revisit it frequently for updated content :)

Hadoop
Without a doubt a useful technology when applied to the correct use case. I think it all boils down to "What is your question?" But before I get too philosophical, the more relevant question was "How does it work?" I stumbled onto Michael Noll's tutorial on configuring a Single Node Cluster. He did an amazing job creating step-by-step documentation of the setup. It was easy enough to configure and test with the Gutenberg examples.
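
For the curious, the Gutenberg exercise boils down to copying a few plain-text books into HDFS and running the wordcount job that ships with Hadoop 1.2.1; the paths below are illustrative and follow the tutorial's conventions:

hduser@hadoop:~$ hadoop fs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
hduser@hadoop:~$ hadoop jar /usr/local/hadoop/hadoop-examples-1.2.1.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
hduser@hadoop:~$ hadoop fs -cat /user/hduser/gutenberg-output/part-r-00000 | head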

Don't forget to check out the web interfaces for the NameNode, JobTracker, and TaskTracker.

Pig
At this point, I was thinking, "Great, I have a Hadoop install, but how do I easily get it to do my work?" I mean, I can program in Java, but I'm no ace! Enter Pig Latin.

Once again, I found an excellent article by Wayne Adams which outlined how to leverage Pig to "ask" the question. He used the data dumps available for New Issues Pool Statistics to illustrate how Pig Latin is utilized on Hadoop.

Hive
Again, as I mentioned above, I'm from a DBA background, so queries are familiar to me. Hive is a great add-on to Hadoop that provides a SQL-style interface over NoSQL data. I was thinking of external tables in Oracle as I created the tables from Ben Hidalgo's example.

Conclusion
I tend to drift toward over-simplification at times, and since I come from a DBA background with development roots, I like to use the "you get what you ask for" analogy when dealing with an instance. For example, if you ask for a lot of data, well, you're going to get it, and - unless you're on something like an Exadata machine - it might take a while. You know, the ask-a-stupid-question, get-a-stupid-answer type of deal. The point of my rant is that, from what I surmise, Hadoop (NoSQL) has its place for certain use cases, and the "right" solution depends on the "right" question.

I'm planning to rebuild this environment because it's been a couple of weeks since I last tinkered with it. I aim to provide more detail on this blog for each step. I've also started working with R and exploring how, at the very least, I can use it for my everyday work.

Next in Series: #2 Starting from Scratch

