
Executive Series: Hadoop Technical Overview

October 1, 2016 | Hadoop, Peaxy Aureum, Executive Series

What is Hadoop?

Hadoop is a distributed architecture and infrastructure for storing and processing Big Data. See the previous post, What is Hadoop and Why Should I Care?

What are the challenges of building a system like Hadoop?

Mainly, reliability and scalability. Suppose the data is distributed across 1,000 commodity computers. With that many machines, the mean time between failures for the cluster as a whole drops below one day. How do you keep the data on a failed node accessible? The system must tolerate frequent faults and still guarantee rapid access to data across a system of vast size.
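A rough back-of-the-envelope calculation shows why (the three-year per-machine figure is an assumption chosen only for illustration):

    MTBF of the cluster ≈ MTBF of one machine ÷ number of machines
                        = (3 × 365 days) ÷ 1,000
                        ≈ 1.1 days

In other words, at this scale something fails somewhere almost every day, so recovery has to be automatic rather than a manual repair job.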

What are Hadoop’s components?

Hadoop Distributed File System (HDFS): the way data is split into blocks, replicated, and stored across a cluster of commodity nodes (illustrated in the sketch after this list).

MapReduce: the programming model for posing questions of the data and processing it in parallel across the cluster.
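
As a concrete illustration of the HDFS side, here is a minimal Java sketch that copies a local file into HDFS using Hadoop's FileSystem API. The NameNode address and file paths are hypothetical placeholders, and the snippet assumes a Hadoop 2-style configuration key (fs.defaultFS).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch: copy a local file into HDFS.
    // The cluster address and file paths are hypothetical placeholders.
    public class HdfsPutExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the client at the HDFS NameNode (placeholder host and port).
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

            FileSystem fs = FileSystem.get(conf);

            // HDFS splits the file into blocks and replicates each block
            // across several DataNodes; the client does not need to know
            // which physical machines end up holding the data.
            fs.copyFromLocalFile(new Path("/tmp/sales-2016.csv"),
                                 new Path("/data/sales/sales-2016.csv"));

            fs.close();
        }
    }

The point of the API is that the caller addresses a single logical namespace; block placement and replication happen behind the scenes.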


How did they make it reliable?

Hadoop’s reliability comes from running on many identical nodes that can be replaced quickly and automatically in the event of a problem, and from keeping several copies of each block of data on different nodes, so no single failure makes data unavailable.

Each commodity machine (slave node) runs both pieces: an HDFS storage process (the DataNode) and a MapReduce worker process (the TaskTracker). When you need more bandwidth, storage, or processing power, you simply add more identical nodes to the cluster. Hadoop clusters scale quite linearly: performance improves roughly in proportion to the number of nodes added.

One machine is designated the master node and runs a central coordination process called the JobTracker, which divides each job into tasks and assigns each task to a slave node's TaskTracker. If a node fails, the JobTracker notices and simply reassigns its tasks to other nodes, which read their input from the surviving copies of the data.
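
The number of copies HDFS keeps of each block is an ordinary configuration setting. The excerpt below is a minimal, illustrative fragment of an hdfs-site.xml file showing the commonly used default of three replicas; a real deployment would contain many more properties.

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>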

How did they make it scalable?

Hadoop makes it easy to write distributed applications. This is the magic: you can write and test a program on one machine, and it will run without change on 400 or even 4,000 machines. The JobTracker takes care of dividing each computing job into independent tasks and assigning them to TaskTracker nodes across the cluster.
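
To make that concrete, below is a sketch of the classic word-count job written against Hadoop's Java MapReduce API. The input and output paths are placeholders, and the code is an illustrative outline rather than a production-hardened program; the point is that nothing in it refers to how many machines will run it.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map step: for every word in a line of input, emit (word, 1).
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce step: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Placeholder HDFS paths for input and output.
            FileInputFormat.addInputPath(job, new Path("/data/input"));
            FileOutputFormat.setOutputPath(job, new Path("/data/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged as a jar and submitted with the hadoop jar command, the same program runs on a single test machine or across thousands of nodes; the JobTracker splits the input into pieces and schedules one map task per piece, ideally on a node that already holds that piece of data.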

What components of Hadoop are worth learning about?

Hadoop has an ecosystem of components and tools. For example, Pig is a high-level language for writing MapReduce programs, while Hive is an alternative that’s more like SQL, a standard database language. Mahout lets programmers write machine-learning applications that can learn from the data they encounter. Other components are shown in the figure, with more details available in this video.

Where can I learn more about Hadoop?

We recommend these resources for a deeper dive into Hadoop:


The Peaxy Executive Series is designed to explain quickly and simply what business leaders need to know about using big data and data access systems.