
Friday, September 6, 2013

Adventures in Hadoop: #2 Starting from Scratch

As I mentioned in my previous post, I wanted to collect my experience from various sources. This post continues a series in which I'd like to share my learning experience with Hadoop.

Disclaimer: I have used Michael Noll's article as a base for my own exercise. I tend to deviate slightly from his instructions and will expand on those deviations in later posts.

Add Software Repositories

maazanjum@hadoop:~$ sudo apt-get install python-software-properties
[sudo] password for maazanjum: 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  python-software-properties
0 upgraded, 1 newly installed, 0 to remove and 212 not upgraded.
Need to get 19.1 kB of archives.
After this operation, 132 kB of additional disk space will be used.
Get:1 http://us.archive.ubuntu.com/ubuntu/ raring-updates/universe python-software-properties all 0.92.17.1 [19.1 kB]
Fetched 19.1 kB in 0s (78.6 kB/s)               
Selecting previously unselected package python-software-properties.
(Reading database ... 155358 files and directories currently installed.)
Unpacking python-software-properties (from .../python-software-properties_0.92.17.1_all.deb) ...
Setting up python-software-properties (0.92.17.1) ...

maazanjum@hadoop:~$ sudo add-apt-repository ppa:ferramroberto/java
You are about to add the following PPA to your system:
 PPA for the latest version of JAVA (the PPA prints this description in several languages)

by LffL http://www.lffl.org

 More info: https://launchpad.net/~ferramroberto/+archive/java
Press [ENTER] to continue or ctrl-c to cancel adding it

gpg: keyring `/tmp/tmpslw4kg/secring.gpg' created
gpg: keyring `/tmp/tmpslw4kg/pubring.gpg' created
gpg: requesting key 3ACC3965 from hkp server keyserver.ubuntu.com
gpg: /tmp/tmpslw4kg/trustdb.gpg: trustdb created
gpg: key 3ACC3965: public key "Launchpad lffl" imported
gpg: no ultimately trusted keys found
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)
OK

Update Source List

maazanjum@hadoop:~$ sudo apt-get update
[sudo] password for maazanjum:
Ign http://ppa.launchpad.net raring Release.gpg
Ign http://ppa.launchpad.net raring Release
...

Install Sun Java 6 JDK

maazanjum@hadoop:~$ sudo apt-get install sun-java6-jdk
Reading package lists... Done
Building dependency tree      
Reading state information... Done
Package sun-java6-jdk is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'sun-java6-jdk' has no installation candidate

According to Happy Coding, Sun JDK has been removed from the partner archives; therefore I’ll use the OpenJDK version instead.

maazanjum@hadoop:~$ sudo apt-get install openjdk-6-jdk
Reading package lists... Done
Building dependency tree      
Reading state information... Done
The following extra packages will be installed:
  ca-certificates-java icedtea-6-jre-cacao icedtea-6-jre-jamvm icedtea-netx icedtea-netx-common java-common
  libatk-wrapper-java libatk-wrapper-java-jni libgif4 libice-dev libnss3-1d libpthread-stubs0 libpthread-stubs0-dev libsm-dev
  libx11-6 libx11-dev libx11-doc libxau-dev libxcb1 libxcb1-dev libxdmcp-dev libxt-dev libxt6 openjdk-6-jre
Unpacking openjdk-6-jdk:amd64 (from .../openjdk-6-jdk_6b27-1.12.6-1ubuntu0.13.04.2_amd64.deb) ...
Processing triggers for ca-certificates ...
Updating certificates in /etc/ssl/certs... 0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d....done.
Processing triggers for doc-base ...
Processing 32 changed doc-base files, 2 added doc-base files...
Processing triggers for man-db ...
Processing triggers for bamfdaemon ...
done.
done.
maazanjum@hadoop:~$

Add a Hadoop system user

maazanjum@hadoop:~$ sudo addgroup hadoop
Adding group `hadoop' (GID 1001) ...
Done.
maazanjum@hadoop:~$ id hadoop
id: hadoop: no such user
maazanjum@hadoop:~$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
      Full Name []: Hadoop User
      Room Number []:
      Work Phone []:
      Home Phone []:
      Other []:
Is the information correct? [Y/n] y

Configure SSH

maazanjum@hadoop:~$ su - hduser
Password:
hduser@hadoop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
c6:e3:f1:a4:43:02:21:20:96:7f:29:ce:16:8d:ef:6a hduser@hadoop
The key's randomart image is:
+--[ RSA 2048]----+
|ooo .            |
|o. . .           |
|  . + .          |
|   = = .         |
|  o = . S .      |
|   + . = *       |
|  . .   + .      |
|   E .   .       |
|  ...            |
+-----------------+

hduser@hadoop:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

I had to install and configure SSH before the next few steps worked.

hduser@hadoop:~$ sudo apt-get install openssh-server
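
If the passwordless login below still prompts for a password, tightening permissions on the key files usually helps (a common fix, not part of the original article):

hduser@hadoop:~$ chmod 700 $HOME/.ssh
hduser@hadoop:~$ chmod 600 $HOME/.ssh/authorized_keys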

I also granted hduser sudo access, since later steps run sudo as that user.

maazanjum@hadoop:~$ sudo adduser hduser sudo
[sudo] password for maazanjum:
Adding user `hduser' to group `sudo' ...
Adding user hduser to group sudo
Done.

hduser@hadoop:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is 27:8e:e7:97:72:a2:08:5e:b2:4e:95:91:61:34:72:3a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 13.04 (GNU/Linux 3.8.0-19-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

Disable IPv6

hduser@hadoop:~$ sudo vi /etc/sysctl.conf
[sudo] password for hduser:

Append the following lines at the end of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
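
If you'd rather not reboot, reloading the kernel parameters applies the change immediately (an alternative to the reboot below, not from the original write-up):

hduser@hadoop:~$ sudo sysctl -p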

Reboot the machine and check whether it's actually disabled.

hduser@hadoop:~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1

Install Hadoop


hduser@hadoop:~$ cd /usr/local
hduser@hadoop:/usr/local$ ls -lhtr
total 32K
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 src
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 sbin
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 include
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 games
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 etc
drwxr-xr-x 2 root root 4.0K Apr 24 10:01 bin
drwxr-xr-x 4 root root 4.0K Apr 24 10:04 lib
drwxr-xr-x 7 root root 4.0K Apr 24 10:05 share
lrwxrwxrwx 1 root root    9 Sep  2 22:31 man -> share/man
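
The tarball itself isn't downloaded anywhere above; if you still need it, the Apache archive hosts the 1.2.1 release (the mirror choice is mine, not from the original post):

hduser@hadoop:/usr/local$ wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz -P /tmp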

I use the -C option to extract the tar archive into a specific directory.

hduser@hadoop:/usr/local$ sudo tar -zxvf /tmp/hadoop-1.2.1.tar.gz -C /usr/local
hadoop-1.2.1/
hadoop-1.2.1/.eclipse.templates/
hadoop-1.2.1/.eclipse.templates/.externalToolBuilders/
hadoop-1.2.1/.eclipse.templates/.launches/
hadoop-1.2.1/bin/
hadoop-1.2.1/c++/
hadoop-1.2.1/c++/Linux-amd64-64/

I personally prefer to symlink the versioned directory to a generic "hadoop" path.

hduser@hadoop:/usr/local$ sudo ln -s hadoop-1.2.1/ hadoop
hduser@hadoop:/usr/local$ ls -lhtr
total 36K
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 src
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 sbin
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 include
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 games
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 etc
drwxr-xr-x 2 root      root      4.0K Apr 24 10:01 bin
drwxr-xr-x 4 root      root      4.0K Apr 24 10:04 lib
drwxr-xr-x 7 root      root      4.0K Apr 24 10:05 share
drwxr-xr-x 9 maazanjum maazanjum 4.0K Aug 15 22:15 hadoop-1.2.1
lrwxrwxrwx 1 root      root         9 Sep  2 22:31 man -> share/man
lrwxrwxrwx 1 root      root        19 Sep  3 12:24 hadoop -> hadoop-1.2.1/

Change ownership to hduser.

hduser@hadoop:/usr/local$ sudo chown -R hduser:hadoop hadoop*
hduser@hadoop:/usr/local$ ls -lhtr
total 36K
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 src
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 sbin
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 include
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 games
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 etc
drwxr-xr-x  2 root   root   4.0K Apr 24 10:01 bin
drwxr-xr-x  4 root   root   4.0K Apr 24 10:04 lib
drwxr-xr-x  7 root   root   4.0K Apr 24 10:05 share
drwxr-xr-x 15 hduser hadoop 4.0K Jul 22 15:26 hadoop-1.2.1
lrwxrwxrwx  1 root   root      9 Sep  2 22:31 man -> share/man
lrwxrwxrwx  1 hduser hadoop   13 Sep  3 13:14 hadoop -> hadoop-1.2.1/

Configure .bashrc file

Michael Noll's tutorial appends these settings to $HOME/.bashrc; I appended them to my .profile file instead.

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
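
Once appended, reload the file so the variables take effect in the current shell (the echo is just a sanity check I added):

hduser@hadoop:~$ source ~/.profile
hduser@hadoop:~$ echo $HADOOP_HOME
/usr/local/hadoop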

Since the article I followed details a single-node setup, I will stick with that for now. Later posts will cover a multi-node setup.

Configuring Hadoop


hadoop-env.sh

Find this line:

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

and change it to:

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64

conf/*-site.xml

I'll leave most settings at their defaults and create the working temporary directory for Hadoop.

hduser@hadoop:/usr/local/hadoop/conf$ sudo mkdir -p /app/hadoop/tmp
hduser@hadoop:/usr/local/hadoop/conf$ sudo chown hduser:hadoop /app/hadoop/tmp
hduser@hadoop:/usr/local/hadoop/conf$ sudo chmod 750 /app/hadoop/tmp

Follow the instructions in Michael Noll's article to configure the *-site.xml files.
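
For reference, these are the minimal single-node values from that article; the ports (54310/54311) and the temp path are the tutorial's defaults, so adjust them if your setup differs. Each <property> block goes inside the file's <configuration> element.

<!-- conf/core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>

<!-- conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>

<!-- conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>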

Getting Started


Format the HDFS from the NameNode

hduser@hadoop:~$ /usr/local/hadoop/bin/hadoop namenode -format
Warning: $HADOOP_HOME is deprecated.

13/09/03 13:36:28 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop/192.168.182.133
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG:   java = 1.6.0_27
************************************************************/
Re-format filesystem in /app/hadoop/tmp/dfs/name ? (Y or N) y
Format aborted in /app/hadoop/tmp/dfs/name
13/09/03 13:36:29 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop/192.168.182.133
************************************************************/

Note that the format actually aborted here: the re-format prompt in Hadoop 1.x is case-sensitive and only accepts an uppercase "Y", so my lowercase "y" cancelled it. Re-run the command and answer "Y" before starting the daemons.

Startup

hduser@hadoop:~$ /usr/local/hadoop/bin/start-all.sh
Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-namenode-hadoop.out
localhost: starting datanode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-hadoop.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-secondarynamenode-hadoop.out
starting jobtracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-jobtracker-hadoop.out
localhost: starting tasktracker, logging to /usr/local/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-hadoop.out

Check whether Hadoop is up and running.

hduser@hadoop:~$ jps
2565 DataNode
3218 Jps
2311 NameNode
2797 SecondaryNameNode
3102 TaskTracker
2873 JobTracker
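
The daemons also expose web interfaces on the standard Hadoop 1.x ports, which make for a handy second check alongside jps: the NameNode at http://localhost:50070/, the JobTracker at http://localhost:50030/, and the TaskTracker at http://localhost:50060/.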

Shutdown

hduser@hadoop:/usr/local/hadoop$ bin/stop-all.sh
Warning: $HADOOP_HOME is deprecated.

stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

In the next article, I will go over some of the examples I’ve found that are useful to understand questions Hadoop can help answer.

Cheers!





Saturday, August 31, 2013

Adventures in Hadoop: #1 The First Step is the Most Important

I consider myself a tenaciously curious person. In the spirit of "discovery," I've embarked on learning Hadoop and, subsequently, the various bits and pieces associated with it. Since I am now "into" blogging, it occurred to me that there might be others like me who are keen on learning about Hadoop. Below is a list of useful blogs I visited in my quest for knowledge.

This picture (from my daughter's room) aptly represents me vs. BIG DATA :)



Note: This blog is a work in progress (WIP). Please revisit it frequently for updated content :)

Hadoop
Without a doubt a useful technology when applied to the correct use case. I think it all boils down to "What is your question?" But before I get too philosophical, the more relevant question was "How does it work?" I stumbled onto Michael Noll's tutorial on configuring a Single Node Cluster. He did an amazing job creating step-by-step documentation of the setup. It was easy enough to configure and test with the Gutenberg examples.
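
For the curious, the Gutenberg exercise boils down to copying a few plain-text books into HDFS and running the wordcount job that ships with Hadoop 1.2.1; the paths below are illustrative and follow the tutorial's conventions:

hduser@hadoop:~$ hadoop fs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
hduser@hadoop:~$ hadoop jar /usr/local/hadoop/hadoop-examples-1.2.1.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
hduser@hadoop:~$ hadoop fs -cat /user/hduser/gutenberg-output/part-r-00000 | head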

Don't forget to check out the web interfaces for the NameNode, JobTracker, and TaskTracker.

Pig
At this point, I was thinking, "Great, I have a Hadoop install, but how do I easily get it to do my work?" I mean, I can program in Java, but I'm no ace! Enter Pig Latin.

Once again, I found an excellent article by Wayne Adams which outlined how to leverage Pig to "ask" the question. He used the data dumps available for New Issues Pool Statistics to illustrate how Pig Latin is utilized on Hadoop.

Hive
Again, as I mentioned above, I'm from a DBA background, so queries are familiar to me. Hive is a great add-on to Hadoop that provides a SQL-style interface over NoSQL data. I was thinking of external tables in Oracle as I created the tables from Ben Hidalgo's example.

Conclusion
I tend to drift toward over-simplification at times, and since I come from a DBA background with development roots, I like to use the "you get what you ask for" analogy when dealing with an instance. For example, if you ask for a lot of data, well, you're going to get it, and - unless you're on something like an Exadata machine - it might take a while. You know, the ask-a-stupid-question, get-a-stupid-answer type of deal. The point of my rant is that, from what I surmise, Hadoop (NoSQL) has its place for certain use cases, and the "right" solution depends on the "right" question.

I'm planning to rebuild this environment because it's been a couple of weeks since I last tinkered with it. I aim to provide more detail on this blog for each step. I've also started working with R and exploring how, at the very least, I can use it for my everyday work.

Next in Series: #2 Starting from Scratch

