Saturday, August 31, 2013

Adventures in Hadoop: #1 The First Step is the Most Important

I consider myself a tenaciously curious person. In the spirit of "discovery" I've embarked on learning Hadoop and, subsequently, the various bits and pieces associated with it. Since I am now "into" blogging, it occurred to me that there might be others like myself who are keen on learning about Hadoop. Below is a list of useful blogs I visited in my quest for knowledge.

This picture (from my daughter's room) aptly represents me vs. BIG DATA :)



Note: This blog is a work in progress (WIP). Please revisit it frequently for updated content :)

Hadoop
Without a doubt, Hadoop is a useful technology when applied to the correct use case. I think it all boils down to "What is your question?". But before I got too philosophical, the more relevant question was "How does it work?". I stumbled onto Michael Noll's tutorial on configuring a Single Node Cluster. He did an amazing job creating step-by-step documentation on the setup, and it was easy enough to configure it and test it with the Gutenberg examples.
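
To give a sense of what the Gutenberg test actually runs, here is a minimal sketch of the classic WordCount MapReduce job in Java - roughly the kind of job the bundled examples run - not the tutorial's exact code. The input and output paths in main() are placeholders for your own HDFS directories.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in each input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the 1s emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the HDFS dir holding the Gutenberg texts
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // an output dir that does not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Package it into a jar and launch it with "hadoop jar", pointing the two arguments at the Gutenberg input directory and a fresh output directory.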

Don't forget to check out the web interfaces for the NameNode, JobTracker, and TaskTracker.

Pig
At this point, I was thinking, "Great, I have a Hadoop install, but how do I easily get it to do my work?" I mean, I can program in Java, but I'm no ace! Enter Pig Latin.

Once again, I found an excellent article by Wayne Adams which outlined how to leverage Pig to "ask" the question. He used the data dumps available for New Issues Pool Statistics to illustrate how Pig Latin is utilized on Hadoop.
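
Since I'm more at home in Java than at the Grunt shell, here is a small sketch of driving Pig Latin from Java with PigServer. The HDFS path, field names, and comma-delimited layout are hypothetical stand-ins for the pool-statistics dump, not Wayne Adams's actual schema.

```java
import java.io.IOException;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigPoolStats {
  public static void main(String[] args) throws IOException {
    // Run Pig Latin against the cluster (use ExecType.LOCAL to test without Hadoop)
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Hypothetical layout: a comma-delimited dump of pool statistics sitting in HDFS
    pig.registerQuery(
        "pools = LOAD '/user/hduser/pool_stats.csv' USING PigStorage(',') "
      + "AS (pool_id:chararray, issue_date:chararray, upb:double);");

    // Group by issue date and total the unpaid balance - the "question" we're asking
    pig.registerQuery("by_date = GROUP pools BY issue_date;");
    pig.registerQuery(
        "totals = FOREACH by_date GENERATE group AS issue_date, SUM(pools.upb) AS total_upb;");

    // Write the answer back to HDFS
    pig.store("totals", "/user/hduser/pool_totals");
  }
}
```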

Hive
Coming from a DBA background, queries are familiar territory for me. Hive is a great add-on to Hadoop that provides a SQL-like interface over NoSQL data. I was thinking of External Tables in Oracle when I created the tables from Ben Hidalgo's example.
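
For the Oracle-minded, here is a hedged sketch of that idea over JDBC against HiveServer2: an EXTERNAL table layered over files already sitting in HDFS, then a plain GROUP BY. The table name, columns, HDFS location, and credentials are hypothetical, not Ben Hidalgo's actual example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExternalTableExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver (hive-jdbc and its dependencies on the classpath)
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "hduser", "");
    Statement stmt = con.createStatement();

    // EXTERNAL keeps the data where it is on HDFS - Hive only layers a schema
    // over it, much like an Oracle external table pointing at flat files.
    stmt.execute(
        "CREATE EXTERNAL TABLE IF NOT EXISTS pool_stats ("
      + "  pool_id STRING, issue_date STRING, upb DOUBLE) "
      + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
      + "LOCATION '/user/hduser/pool_stats'");

    // Ask the question in SQL; Hive compiles it down to MapReduce jobs
    ResultSet rs = stmt.executeQuery(
        "SELECT issue_date, SUM(upb) FROM pool_stats GROUP BY issue_date");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
    }
    con.close();
  }
}
```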

Conclusion
I tend to drift towards over-simplification at times, and since I come from a DBA background with development roots, I like to use the "you get what you ask for" analogy when dealing with an instance. For example, if you ask for a lot of data, well, you're going to get it, and - unless you're on something like an Exadata machine - it might take a while. You know, an "ask a stupid question and you'll get a stupid answer" type of deal. The point of my rant is that, from what I surmise, Hadoop (NoSQL) has its place for certain use cases, and the "right" solution depends on the "right" question.

I'm planning to rebuild this environment because it's been a couple of weeks since I last tinkered with it, and I aim to provide more details on this blog for each step. I've also started working with R and am exploring how - at the very least - I can use it in my everyday work.

Next in Series: #2 Starting from Scratch

Other Useful Links
