As early as 2001, analyst Doug Laney (then at META Group, which Gartner later acquired) introduced the 3Vs model, which described the expansion of Big Data along three frontiers: volume, velocity, and variety.
The volume component refers to the increasing quantities of data that organizations would have access to in the future. Early data storage was measured in kilobytes, but as data collection and storage matured, organizations began to collect and store megabytes, gigabytes, terabytes, and eventually petabytes (and beyond) of data.
The velocity component of Laney's model addressed how quickly data would move from its original source to the users, applications, and systems that make use of it. This represents the transition from batch data analysis towards near real-time and eventually real-time data transfer between source and destination.
The final component of the 3V concept is data variety. Laney saw that as the volume of data increased, so too would the variety of data sources, which would include social media signals, video and image metadata, audio and other user and computer-generated data types.
In the early 2000s, software engineers at the Apache Software Foundation, a non-profit organization that develops open-source software, recognized that new tools would be needed to help organizations capitalize on the increasing availability of Big Data. The result was Apache Hadoop, an open-source tool that allows users to store and analyze data sets far larger than could be stored or managed on a single physical storage device.
As an individual hard drive grows in size, reading data from specific parts of the drive becomes prohibitively resource-intensive, especially at high data volumes. The developers of Hadoop recognized that it would be more efficient to store data on many separate devices working in parallel than on a single large disk, and the Hadoop Architecture was designed accordingly. This innovation helped make Hadoop one of the most popular data processing platforms, reportedly used by as many as half of Fortune 500 companies.
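The storage idea described above can be illustrated with a toy Python sketch. This is not Hadoop code: the tiny block size, the round-robin placement policy, and the in-memory "nodes" are all simplifications (real HDFS uses 128 MB blocks, replication, and rack-aware placement), but the shape of the technique is the same: split a file into blocks, spread the blocks across nodes, and read the nodes in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4  # toy size in bytes; HDFS defaults to 128 MB blocks


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks, as HDFS splits files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, num_nodes):
    """Place (index, block) pairs on nodes round-robin (a toy policy)."""
    nodes = [[] for _ in range(num_nodes)]
    for i, block in enumerate(blocks):
        nodes[i % num_nodes].append((i, block))
    return nodes


def read_file(nodes):
    """Read every node concurrently, then reassemble the blocks in order."""
    def read_node(node):
        return node  # in a real cluster this would be a network read

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(read_node, nodes))
    ordered = sorted(pair for node in results for pair in node)
    return b"".join(block for _, block in ordered)


data = b"hello distributed world!"
nodes = place_blocks(split_into_blocks(data), num_nodes=3)
restored = read_file(nodes)  # reassembles the original bytes
```

Because each node only holds a fraction of the file, no single device has to scan the whole data set, which is the efficiency gain the Hadoop designers were after.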
Hadoop offers a flexible platform that makes it easy for organizations to modify their data systems, but the basic Hadoop architecture should remain stable across deployments. There are five essential building blocks that underlie the Apache Hadoop Architecture and help to deliver the functions that organizations rely on for data management and processing capabilities.
Cluster - A cluster represents the hardware portion of the Hadoop infrastructure. Instead of storing and reading data from a single hard disk, Hadoop uses a set of host machines called a "cluster". The machines in a cluster are often referred to as "nodes", and they can be partitioned into "racks" (as in, server racks). Large IT organizations implement clusters with hundreds or even thousands of nodes to support their data processing needs.
YARN Infrastructure - The YARN (Yet Another Resource Negotiator) infrastructure is a framework of tools that supply the CPUs, memory and other computational resources needed for Hadoop's data processing functions. The YARN infrastructure has three important elements:
- Resource Manager - Each cluster has one resource manager, which acts as the "master" of the node managers: it schedules resources, handles events, and monitors the nodes and applications in the cluster.
- Node Manager - Each cluster has several node managers whose primary role is to communicate resource availability between the nodes and the resource manager. The resource manager collects information about the availability of working capacity from node managers and determines how this capacity should be used. The capacity offered by a specific node manager can be further segmented into Containers.
- Job Submitter - The job submitter is the client: a piece of hardware or software that accesses a service available on a server (or, in Hadoop's case, on a cluster) over a network (more on this later).
HDFS Federation - HDFS Federation is a way of creating and maintaining permanent, reliable, and distributed data storage within the Hadoop Architecture. HDFS has two main layers: NameSpace, which is responsible for file handling and storing metadata about the system, and Block Storage, which handles blocks and the actual data. HDFS Federation allows the cluster to scale horizontally by adding NameNodes, each managing a portion of the namespace; because the NameNodes are independent, the failure of one does not affect the namespaces managed by the others.
Storage - Additional storage systems may be used as part of the Hadoop Architecture, for example, Amazon's Simple Storage Service (S3).
MapReduce Framework - The MapReduce framework is a layer of software that sits on top of the other Hadoop Architecture components and implements the MapReduce paradigm: map tasks read data from HDFS and transform each record into intermediate key-value pairs, and reduce tasks aggregate those pairs into final results. This allows Hadoop to perform large-scale computations, such as counts and other aggregations, in parallel across the cluster and to produce output suitable for analysis and presentation to humans.
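The MapReduce paradigm itself is simple enough to sketch in a few lines of Python. This is a single-process illustration of the map, shuffle, and reduce phases applied to a word count; real Hadoop runs each phase as distributed tasks over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)


def shuffle(pairs):
    """Shuffle: group the intermediate pairs by key, as the framework does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()


def reduce_phase(grouped):
    """Reduce: aggregate the list of values for each key."""
    return {key: sum(values) for key, values in grouped}


records = ["big data", "big clusters", "data everywhere"]
counts = reduce_phase(shuffle(map_phase(records)))
# counts == {"big": 2, "data": 2, "clusters": 1, "everywhere": 1}
```

Because each record is mapped independently and each key is reduced independently, both phases parallelize naturally across the nodes of a cluster, which is what makes the paradigm a good fit for the Hadoop Architecture.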
The essence of the Hadoop Architecture is that it can deliver data processing services faster and more efficiently than using a single hard disk.
The Resource Manager receives communication from nodes and keeps track of resource availability within the cluster. When a client (Job Submitter) wants to run an application, the Resource Manager assigns a Container (along with its memory and virtual processing power) to service the request. The Resource Manager contacts the Node Manager for that container and the Node Manager launches the Container. The resources from that Container are then used to execute the Application Master, which may request more containers from the Resource Manager and execute additional programs using those Containers.
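The allocation flow above can be sketched as a toy Python simulation. The class names, method names, and first-fit scheduling policy here are illustrative simplifications, not the real YARN API; the point is the division of labor: Node Managers report capacity, the Resource Manager picks a node, and the Container is launched on that node.

```python
# Toy single-process sketch of the YARN allocation flow.
# Names and the scheduling policy are illustrative, not the actual YARN API.
from dataclasses import dataclass, field


@dataclass
class Container:
    memory_mb: int
    vcores: int


@dataclass
class NodeManager:
    node_id: str
    free_memory_mb: int
    launched: list = field(default_factory=list)

    def launch(self, container):
        """Launch a Container locally and deduct its memory from capacity."""
        self.free_memory_mb -= container.memory_mb
        self.launched.append(container)
        return container


class ResourceManager:
    def __init__(self, node_managers):
        # In real YARN, capacity is reported via Node Manager heartbeats.
        self.node_managers = node_managers

    def allocate(self, memory_mb, vcores):
        """Pick the first node with enough free memory and launch there."""
        for nm in self.node_managers:
            if nm.free_memory_mb >= memory_mb:
                return nm.launch(Container(memory_mb, vcores))
        raise RuntimeError("no node has enough free capacity")


# A client (Job Submitter) asks the Resource Manager for a Container to run
# the Application Master; the Application Master then requests more
# Containers through the same allocate() path.
rm = ResourceManager([NodeManager("node-1", 4096), NodeManager("node-2", 8192)])
am_container = rm.allocate(memory_mb=2048, vcores=1)
extra = rm.allocate(memory_mb=4096, vcores=2)  # e.g. requested by the AM
```

With these numbers, the first request fits on node-1 and the second falls through to node-2, mirroring how the Resource Manager spreads work according to the capacity each Node Manager reports.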
Once the requested operations are completed, the data is stored in HDFS and later processed by MapReduce into a usable format and human-readable results.
One of the major challenges of the Hadoop Architecture is its technical complexity. Assembling huge systems to support large volumes of data, such as machine-generated log data from a hybrid cloud IT environment, can pose significant technical challenges. Sumo Logic handles the complexities associated with deploying, managing, and upgrading big data processing systems, so customers can get the benefits of big data processing without the administrative overhead and up-front costs of deploying their own infrastructure.