

Hadoop architecture



    Hadoop architecture is the framework that underlies Apache Hadoop, an open-source platform designed to store, manage, and process large data sets across clusters of computers using distributed storage and parallel processing. Developed by the Apache Software Foundation, Hadoop enables organizations to handle big data that cannot fit on a single physical storage device, making it a cornerstone of modern data processing and big data analytics.

    Hadoop offers a flexible platform that makes it easy for organizations to modify their data systems, but the basic Hadoop architecture remains consistent across deployments. Five essential building blocks underlie the Apache Hadoop architecture and together deliver the data management and processing capabilities that organizations rely on.

    Cluster – A cluster represents the hardware portion of the Hadoop infrastructure. Instead of storing and reading data from a single hard disk, Hadoop uses a set of host machines called a “cluster.” The machines in a cluster are often referred to as “nodes” and they can be partitioned into “racks” (as in, server racks). Large IT organizations implement clusters with hundreds or even thousands of nodes to support their data processing needs.

    YARN Infrastructure – The YARN (Yet Another Resource Negotiator) infrastructure is a framework of tools that supply the CPUs, memory and other computational resources needed for Hadoop’s data processing functions. The YARN infrastructure has three important elements:

    • Resource manager – Each cluster has one resource manager, which acts as the “master” of the node managers: it schedules resources, handles events, and monitors the activity of the nodes and applications in the cluster.
    • Node manager – Each cluster has several node managers, whose primary role is to report resource availability on their nodes to the resource manager. The resource manager collects this information about available working capacity and decides how it should be used. The capacity offered by a specific node manager can be further segmented into containers (see the sketch after this list).
    • Job submitter – The job submitter is the client: a piece of hardware or software that accesses a service available on a server (or, in Hadoop’s case, on a cluster) over a network. Its role in the resource allocation flow is described below.
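
    To make the division of labor concrete, here is a minimal sketch (in Java, using Hadoop’s YARN client API) that asks the resource manager for the capacity each node manager has reported. It assumes a reachable cluster whose settings are picked up from yarn-site.xml on the classpath; nothing in it comes from the article itself.

```java
// Illustrative sketch: query the resource manager for the capacity that each
// node manager has reported. Assumes yarn-site.xml is on the classpath and
// points at a running cluster.
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterCapacityReport {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml
        yarnClient.start();

        // The resource manager aggregates what every node manager reports.
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.printf("node=%s capacity=%s used=%s containers=%d%n",
                    node.getNodeId(),
                    node.getCapability(),      // memory and vcores the node manager offers
                    node.getUsed(),            // resources currently allocated to containers
                    node.getNumContainers());
        }
        yarnClient.stop();
    }
}
```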

    HDFS Federation – HDFS Federation is a way of creating and maintaining permanent, reliable, distributed data storage within the Hadoop architecture. HDFS has two main layers: NameSpace, which handles files and stores metadata about the system, and Block Storage, which handles blocks and stores the actual data. HDFS Federation allows NameNodes to scale horizontally and operate independently, so the failure of a single NameNode does not affect the namespaces served by the others.
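
    As a rough illustration of how a client works with HDFS, the sketch below writes a small file and reads it back through Hadoop’s FileSystem API. The NameNode (NameSpace layer) records the file’s metadata while DataNodes (Block Storage layer) hold the blocks, but the client code never deals with that split directly. The path and cluster settings are assumptions for illustration, not taken from the article.

```java
// Minimal HDFS round trip: create a file, then read it back.
// Assumes core-site.xml/hdfs-site.xml on the classpath define fs.defaultFS
// (e.g. an hdfs:// URI); the path below is made up for illustration.
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);                 // resolves fs.defaultFS
        Path path = new Path("/tmp/hadoop-glossary-demo.txt");

        // Write: the NameNode records metadata, DataNodes store the blocks.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and copy it to stdout.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, conf, false);
        }
        fs.close();
    }
}
```

    The same FileSystem API also works against other storage backends such as S3 (via an s3a:// URI), which is how the additional storage mentioned below plugs into the architecture.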

    Storage – Additional storage may be used as part of the Hadoop architecture, for example, Amazon’s Simple Storage Service (S3).

    MapReduce framework – MapReduce is Hadoop’s programming model for parallel processing: map tasks filter and transform data on the nodes where it is stored, and reduce tasks aggregate the intermediate results into final output.

    The essence of the Hadoop architecture is that it can deliver data processing services faster and more efficiently than a single hard disk could.

    The Resource Manager receives communication from nodes and keeps track of resource availability within the cluster. When a client (Job Submitter) wants to run an application, the Resource Manager assigns a container (along with its memory and virtual processing power) to service the request. The Resource Manager contacts the Node Manager for that container and the Node Manager launches the container. The resources from that container are then used to execute the Application Master, which may request more containers from the Resource Manager and execute additional programs using those containers.
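
    The sketch below shows what that flow looks like from the job submitter’s side, using the YARN client Java API: the client asks the Resource Manager for an application id, describes the container that will run the Application Master, and submits the application. The application name, command, and resource sizes are invented for illustration.

```java
// Illustrative YARN submission: the client-side half of the flow described
// above. The command, name and resource sizes are placeholder assumptions.
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the Resource Manager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
        context.setApplicationName("glossary-demo");

        // Describe the container that will run the Application Master.
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.emptyMap(),                    // local resources
                Collections.emptyMap(),                    // environment variables
                Collections.singletonList("sleep 30"),     // command run inside the AM container
                null, null, null);                         // service data, tokens, ACLs
        context.setAMContainerSpec(amContainer);
        context.setResource(Resource.newInstance(512, 1)); // 512 MB and 1 vcore for the AM

        // The Resource Manager asks a Node Manager to launch the AM container;
        // the AM can then request further containers on its own.
        ApplicationId appId = yarnClient.submitApplication(context);
        System.out.println("Submitted application " + appId);
        yarnClient.stop();
    }
}
```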

    Once the requested operations are completed, the data is stored in HDFS and later processed by MapReduce into usable, human-readable results.
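
    As a concrete example of the MapReduce step, here is a minimal sketch of the canonical word-count job: map tasks emit (word, 1) pairs and reduce tasks sum them. The input and output paths are taken from the command line and are assumptions for illustration.

```java
// Canonical word-count job: mappers emit (word, 1), reducers sum the counts.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);              // emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));      // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

    In a real deployment the job is packaged into a jar and launched with the hadoop jar command, at which point YARN allocates the containers described earlier.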

    Sumo Logic helps IT organizations manage Hadoop architecture complexities

    See how it works for yourself. Try Sumo Logic with a 30-day free trial.

    FAQs

    Log files are crucial for infrastructure management as they provide valuable insights into the performance, security and health of the IT infrastructure. By analyzing log files generated by different components such as servers, applications and network devices, IT professionals can monitor system activities, identify issues, troubleshoot problems, track user actions, and ensure system reliability. Log files are used to detect anomalies, troubleshoot performance issues, monitor security events, track changes made to the infrastructure, and analyze trends for capacity planning.

    Sumo Logic provides an end-to-end approach to monitoring and troubleshooting. Quickly detect anomalous events via pre-set alerts, then enable rapid root cause analysis through machine learning-aided technology and robust querying capabilities for your logs and metrics. Beyond getting to the root cause of issues in the moment, capabilities like our predict operator for querying logs or metrics can also help you plan for the future — preempting bottlenecks and informing infrastructure capacity planning.

    Compared to other infrastructure monitoring solutions, Sumo Logic supports log data with a professional-grade query language and standard security for all users, including encryption-at-rest and security attestations (PCI, HIPAA, FISMA, SOC2, GDPR, etc.) and FedRAMP — at no additional charge.