Wednesday, 6 May 2015

What is Hadoop?

In this article I’ll briefly discuss what Hadoop is and why it is needed.

Definition – “Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers.”

Before breaking this definition down further, we can already see that Hadoop is a solution for large-scale distributed computing. So let us first discuss the limitations of traditional large-scale computation; we will then be able to appreciate the power of Hadoop.

Today we live in a data-driven world. Data is everywhere and in huge volumes. We need a mechanism not only to store such huge amounts of data but also to analyse it and make sense of it.

Disk storage capacity has increased considerably over time. With distributed computing it is possible to store such huge volumes of data across hundreds of computers, but the bottleneck is analysing the data, because most analysis tasks require the data to be copied from all the sources to the compute nodes. This brings with it many challenges, such as synchronization issues, load balancing and hardware failures. The programmer then ends up spending more time designing for failure than working on the actual problem.

Well then, what is the solution? It is Hadoop! It takes care of issues such as data synchronization and hardware failures, and it does so in a scalable and cost-effective way.

Hadoop is based on papers published by Google in 2003 and 2004 (on the Google File System and MapReduce). There are two core components of Hadoop:

    HDFS – the Hadoop Distributed File System, which stores data across the nodes of the cluster
    MapReduce – a programming model in which a job is broken into small pieces, each of which runs on a node of the cluster in parallel with the others (a small sketch follows this list)
    Newer releases also include YARN – a resource-management platform responsible for managing compute resources in the cluster and scheduling users' applications
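To make the MapReduce model a little more concrete, here is a minimal word-count sketch written against the standard Hadoop Java API (org.apache.hadoop.mapreduce). The class names and the two command-line path arguments are illustrative choices for this post, not anything mandated by Hadoop.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map step: runs in parallel on each split of the input,
        // emitting a (word, 1) pair for every word it sees.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce step: receives every count emitted for a given word
        // and sums them into the final total.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: configures and submits the job; the input and output
        // locations (HDFS paths here) are passed on the command line.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The framework splits the input files stored in HDFS, runs the map function on each split in parallel, groups the emitted pairs by word, and then runs the reduce function to produce the counts, so the same program works unchanged whether the data is a few megabytes or many terabytes.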

We will revisit the definition we provided at the start and try to break it down.

“Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers”

Open source – Hadoop is a free, Java-based programming framework. It is an Apache project sponsored by the Apache Software Foundation.

Distributed processing – Hadoop uses MapReduce as its programming model for analysis.

Large data sets – Hadoop is capable of handling files that are gigabytes or terabytes in size.

Clusters of commodity servers – Hadoop is a highly fault-tolerant system, so it does not require sophisticated hardware; it runs on commonly available machines.

This post only provides an overview of Hadoop. The ecosystem contains many components other than HDFS and MapReduce, such as Pig, Hive, HBase, Flume, Oozie and Sqoop.

We will cover HDFS and MapReduce in subsequent posts.





References:

    http://www-01.ibm.com/software/data/infosphere/hadoop/
    Hadoop: The Definitive Guide (Tom White)


If you know anyone who has started learning Hadoop, why not help them out? Just share this post with them. Thanks for studying today!
