Glossary

Hadoop architecture


A


B


C


D


E


F


G


H


I


J


K


L


M


N


O


P


Q


R


S


T


U


V


W


X


Y


Z

    In the early 2000s, a group of software engineers at the Apache Software Foundation, an NPO that builds open-source software, understood that new software tools would be needed to help organizations capitalize on the increasing availability of Big Data. The group created an open-source tool called Apache Hadoop that would allow users to store and analyze data sets that were much larger than could be stored or managed on a single physical storage device.

    Hadoop offers a flexible platform that makes it easy for organizations to modify their data systems, but the basic Hadoop architecture should remain stable across deployments. There are five essential building blocks that underlie the Apache Hadoop architecture and help to deliver the functions that organizations rely on for data management and processing capabilities.

    Cluster – A cluster represents the hardware portion of the Hadoop infrastructure. Instead of storing and reading data from a single hard disk, Hadoop uses a set of host machines called a “cluster.” The machines in a cluster are often referred to as “nodes” and they can be partitioned into “racks” (as in, server racks). Large IT organizations implement clusters with hundreds or even thousands of nodes to support their data processing needs.

    YARN Infrastructure – The YARN (Yet Another Resource Negotiator) infrastructure is a framework of tools that supply the CPUs, memory and other computational resources needed for Hadoop’s data processing functions. The YARN infrastructure has three important elements:

    • Resource manager – Each cluster has one resource manager that performs a variety of functions. The resource manager is the “master” of the node managers – it works to schedule resources, handle events, and monitor the activity of nodes in the cluster as well as applications.
    • Node manager – Each cluster has several node managers whose primary role is to communicate resource availability between the nodes and the resource manager. The resource manager collects information about the availability of working capacity from node managers and determines how this capacity should be used. The capacity offered by a specific node manager can be further segmented into containers.
    • Job submitter – The job submitter is the client – a piece of hardware or software that accesses a service that is available on a server (or in Hadoop’s case, on a cluster) through a network (more on this later)

    HDFS Federation – HDFS Federation is a way of creating and maintaining permanent, reliable and distributed data storage within the Hadoop Architecture. There are two parts of HDFS: NameSpace and Block Storage. NameSpace is responsible for file handling and storing metadata about the system, while Block Storage is involved with block handling and actual data storage. HDFS federation allows for horizontal scaling of NameNodes, so that the cluster will still be available if a single NameNode fails.

    Storage – Additional storage may be used as part of the Hadoop architecture, for example, Amazon’s Simple Storage Service (S3)

    The essence of the Hadoop architecture is that it can deliver data processing services faster and more efficiently than using a single hard disk.

    The Resource Manager receives communication from nodes and keeps track of resource availability within the cluster. When a client (Job Submitter) wants to run an application, the Resource Manager assigns a container (along with its memory and virtual processing power) to service the request. The Resource Manager contacts the Node Manager for that container and the Node Manager launches the container. The resources from that container are then used to execute the Application Master, which may request more containers from the Resource Manager and execute additional programs using those containers.

    Once the requested operations are completed, the data is stored in HDFS and later processed by MapReduce into a usable format and human-readable results.

    Sumo Logic helps IT organizations manage Hadoop architecture complexities