Big Data: Concepts, Architecture, and Technologies

Core Concepts of Big Data

Big Data is typically characterized by the "4Vs," which describe the challenges that traditional data processing systems struggle to address:

  • Volume: The sheer scale of data being generated, often reaching petabytes and beyond
  • Velocity: The speed at which new data is created and needs to be processed, often in real time
  • Variety: The diverse formats of data (structured, semi-structured, and unstructured)
  • Veracity: The uncertainty, biases, and potential inaccuracies in data that require verification

Hadoop

Hadoop emerged as one of the first comprehensive frameworks designed specifically for Big Data processing. Its distributed architecture allows for horizontal scaling by simply adding more commodity hardware nodes.

HDFS (Hadoop Distributed File System)

HDFS is a fault-tolerant distributed file system designed to run on commodity hardware. It provides high-throughput access to large datasets by splitting files into blocks and distributing them across a cluster.

  • NameNode (Master): Manages the file system namespace and regulates access to files
  • DataNode (Slave): Stores and retrieves blocks as directed by the NameNode
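
As a minimal sketch of this interaction (assuming the third-party `hdfs` Python package and a WebHDFS endpoint at a hypothetical `http://namenode:9870`), a client asks the NameNode for namespace operations while the actual block reads and writes happen on DataNodes:

```python
# Sketch: basic HDFS access via the third-party `hdfs` package (WebHDFS).
# The NameNode URL, user, and paths below are hypothetical placeholders.
from hdfs import InsecureClient

# The client contacts the NameNode for namespace operations (listing, metadata);
# block reads/writes are redirected to the DataNodes that hold the data.
client = InsecureClient("http://namenode:9870", user="analyst")

# Write a small file; HDFS splits large files into blocks behind the scenes.
with client.write("/data/raw/events.csv", overwrite=True) as writer:
    writer.write(b"event_id,timestamp\n1,2024-01-01T00:00:00\n")

# List the directory and read the file back.
print(client.list("/data/raw"))
with client.read("/data/raw/events.csv") as reader:
    print(reader.read().decode())
```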

YARN

YARN, introduced in Hadoop 2, is the resource management platform responsible for managing compute resources on Hadoop clusters. It enables multiple data processing engines to run alongside the traditional MapReduce paradigm.

Key components:

  • Resource Manager: Allocates resources among all applications in the system
  • Scheduler: Plans the execution of processes based on resource availability
  • Node Manager: Monitors resource usage and reports to the Resource Manager

These services work together to optimize execution and resource usage in the Hadoop environment, allowing for more efficient parallel processing.
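
For illustration, this is roughly how an application declares its resource needs to YARN when submitted from PySpark; the ResourceManager grants containers that NodeManagers launch on worker nodes (the memory and core values are arbitrary examples, and a Hadoop/YARN client configuration is assumed on the submitting machine):

```python
# Sketch: requesting resources from YARN for a PySpark application.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("yarn-resource-demo")           # visible in the ResourceManager UI
    .master("yarn")                           # delegate scheduling to YARN
    .config("spark.executor.instances", "4")  # how many containers to request
    .config("spark.executor.memory", "2g")    # memory per container
    .config("spark.executor.cores", "2")      # vcores per container
    .getOrCreate()
)

# Each executor runs inside a YARN container managed by a NodeManager.
print(spark.sparkContext.uiWebUrl)
spark.stop()
```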

Main components of Hadoop

Cluster Node Types

A typical Hadoop cluster consists of three types of nodes, each serving a specific purpose in the distributed processing framework.

Edge Node

Edge nodes act as the interface between users and the cluster:

  • Users submit jobs and queries through these nodes (e.g., Spark jobs, Hive queries)
  • They host client tools and interfaces (e.g., Hue, CLI tools)
  • They ensure secure access to the internal cluster

Master Nodes

Master nodes handle coordination and management of resources and tasks:

  • NameNode (HDFS): Tracks where data is stored
  • ResourceManager (YARN): Allocates computing resources
  • Hive Metastore / Impala Catalog: Manages metadata for querying

Multiple master nodes ensure high availability—if one fails, others take over.

Worker Nodes

Worker nodes store the actual data and perform computations:

  • DataNodes (HDFS): Store file blocks
  • NodeManagers (YARN): Execute processing tasks
  • Spark Executors / Hive Workers: Process data in parallel

The more worker nodes, the more data and jobs a cluster can handle.

HDFS: Fault-Tolerant Storage

HDFS, which stands for Hadoop Distributed File System, is the cluster's distributed file system and can be extended as required by simply adding nodes.

Its defining characteristic is fault tolerance: no information is lost in case of faults or malfunctions.

This is guaranteed by the replication scheme HDFS uses to store files: each file is split into blocks, and every block is replicated across multiple nodes, so several copies of the same data exist in the cluster.
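
To make the replication idea concrete, the standard Hadoop CLI can show where a file's blocks live and change how many copies exist; the sketch below simply wraps those commands from Python (the path and replication factor are illustrative, and the Hadoop client tools are assumed to be installed):

```python
# Sketch: inspecting and changing HDFS replication with the standard `hdfs` CLI.
import subprocess

path = "/data/raw/events.csv"  # illustrative path

# Show block locations and the current replication factor for the file.
subprocess.run(["hdfs", "fsck", path, "-files", "-blocks", "-locations"], check=True)

# Raise the replication factor to 3 and wait until the extra copies exist.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3", path], check=True)
```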

Hive: SQL Interface for Hadoop

Apache Hive provides a data warehouse infrastructure built on top of Hadoop for data summarization, querying, and analysis. It offers a SQL-like interface to Hadoop, making it accessible to users familiar with traditional database systems.

Hive consists of two main services:

  • Hive Server: The SQL engine that executes queries
  • Metastore: The table catalog that maintains metadata about tables and partitions
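
As a hedged example of this SQL-like interface, the snippet below uses PySpark with Hive support, so table definitions go through the Metastore while the SQL runs as a distributed job over data in HDFS (the table and column names are made up):

```python
# Sketch: defining and querying a Hive-managed table from PySpark.
# enableHiveSupport() makes Spark use the Hive Metastore as its table catalog.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-demo")
    .enableHiveSupport()
    .getOrCreate()
)

# The DDL is recorded in the Metastore; data lives in the warehouse directory on HDFS.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id   BIGINT,
        amount     DOUBLE,
        order_date DATE
    )
    STORED AS PARQUET
""")

# A familiar SQL aggregation, executed in parallel across the cluster.
spark.sql("""
    SELECT order_date, SUM(amount) AS daily_revenue
    FROM sales
    GROUP BY order_date
    ORDER BY order_date
""").show()
```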

HCatalog: Table and Storage Management Layer

HCatalog is a table and storage management layer for Hadoop. It provides a high-level abstraction to other frameworks (such as MapReduce, Spark, and Pig) by handling I/O operations against the distributed storage layer for Hive tables.

HCatalog comprises 3 key elements:

  • SerDe: Serialization and deserialization library to process various data formats
  • Metastore DB: Stores the schema of Hive tables
  • WebHCat/HCatalog REST: UI/REST layer on top of metastore DB for web clients

HBase: Distributed NoSQL Database

HBase is a distributed, scalable, big data store designed to support random, real-time read/write access to large datasets.

  • Data is partitioned based on RowKeys into Regions
  • Each Region contains a range of RowKeys based on their binary order
  • A RegionServer can contain several Regions
  • All Regions contained in a RegionServer share one write-ahead log (WAL)
  • Regions are automatically split if they become too large
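
A small sketch of random, RowKey-addressed reads and writes, using the third-party happybase client against a hypothetical HBase Thrift server (the host, table, and column family names are assumptions):

```python
# Sketch: RowKey-based access to HBase via the third-party `happybase` client.
# Assumes an HBase Thrift server at the hypothetical host below and an existing
# table 'sensor_readings' with a column family 'm'.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("sensor_readings")

# Writes and point reads are addressed by RowKey.
table.put(b"device42#2024-01-01T00:00:00", {b"m:temperature": b"21.5"})
print(table.row(b"device42#2024-01-01T00:00:00"))

# A range scan over a RowKey interval; HBase serves it from the Regions
# whose key ranges overlap the requested interval.
for key, data in table.scan(row_start=b"device42#", row_stop=b"device42$"):
    print(key, data)

connection.close()
```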

Kafka: Streaming Data Platform

Apache Kafka is a distributed streaming platform that provides three key capabilities:

  1. Publishing and subscribing to streams of records
  2. Storing streams of records in a fault-tolerant way
  3. Processing streams of records as they occur

4 Core APIs

  • Producer API: Allows applications to publish streams of records to one or more Kafka topics
  • Consumer API: Enables applications to subscribe to topics and process the streams of records
  • Streams API: Supports applications acting as stream processors, transforming input streams into output streams
  • Connector API: Facilitates building reusable producers or consumers that connect Kafka topics to existing applications or data systems
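
To illustrate the Producer and Consumer APIs, here is a minimal sketch using the third-party kafka-python client (the broker address, topic, and payload are placeholders):

```python
# Sketch: publish/subscribe with the third-party `kafka-python` client.
from kafka import KafkaProducer, KafkaConsumer

# Producer API: publish records to a topic.
producer = KafkaProducer(bootstrap_servers="broker:9092")
producer.send("machine-events", key=b"line-1", value=b'{"status": "ok"}')
producer.flush()

# Consumer API: subscribe to the topic and process records as they arrive.
consumer = KafkaConsumer(
    "machine-events",
    bootstrap_servers="broker:9092",
    group_id="real-time-engine",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.key, record.value)
    break  # stop after one record; a real consumer would keep iterating
```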

Spark

Spark introduces the concept of Resilient Distributed Datasets (RDDs)—immutable, fault-tolerant collections of objects that can be operated on in parallel.

  • RDDs can contain any type of object and are created by loading external datasets or distributing collections from the driver program
  • RDDs support two types of operations:
    • Transformations: Operations (such as map, filter, join, union) that yield a new RDD containing the result
    • Actions: Operations (such as reduce, count, first) that return a value after computation on an RDD
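
A short PySpark sketch of this distinction (the numbers are arbitrary): transformations only build a lazy lineage, and nothing executes until an action is called.

```python
# Sketch: RDD transformations vs. actions in PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Create an RDD by distributing a local collection from the driver program.
numbers = sc.parallelize(range(1, 11))

# Transformations: lazily describe new RDDs; nothing runs yet.
squares = numbers.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)

# Actions: trigger execution on the cluster and return values to the driver.
print(even_squares.count())                     # 5
print(even_squares.reduce(lambda a, b: a + b))  # 4 + 16 + 36 + 64 + 100 = 220
print(even_squares.first())                     # 4

spark.stop()
```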

Spark Core is the base engine for large-scale parallel and distributed data processing, responsible for:

  • Memory management and fault recovery
  • Scheduling, distributing, and monitoring jobs on a cluster
  • Interacting with storage systems

Other components include:

  • Spark SQL: Supports querying data via SQL or Hive Query Language
  • Spark Streaming: Enables real-time processing of streaming data
  • MLlib: Provides machine learning algorithms designed to scale out on a cluster
  • GraphX: Offers a library for manipulating and performing graph-parallel operations
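
As one hedged example of how these components build on Spark Core, the sketch below runs a small MLlib clustering job on a toy DataFrame (the data and parameters are made up purely for illustration):

```python
# Sketch: a tiny MLlib pipeline (KMeans) on a Spark DataFrame.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# A small in-memory DataFrame standing in for a large distributed dataset.
df = spark.createDataFrame(
    [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)],
    ["x", "y"],
)

# Assemble feature columns into a vector, then cluster into two groups.
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)
model = KMeans(k=2, seed=42).fit(features)
model.transform(features).select("x", "y", "prediction").show()

spark.stop()
```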

PSA - Persistent Staging Area

A Persistent Staging Area (PSA) serves as an intermediate storage layer between source systems and analytical data stores. Its key characteristics include:

  • Large data storage capacity for both structured and unstructured data
  • Ability to trace any data changes over time
  • Data always available via both SQL and file interfaces
  • Scalability through adding new nodes to the cluster
  • Parallel processing engine for analyzing any type of data
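
As an illustration of the "SQL and file interfaces" point (the paths and table names are hypothetical), data landed as files in HDFS can also be exposed to SQL engines as an external table, without moving it:

```python
# Sketch: landing data in a persistent staging area on HDFS and exposing it
# both as files (Parquet) and via SQL (an external Hive table).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("psa-demo").enableHiveSupport().getOrCreate()

# File interface: append today's extract as Parquet under a staging path.
extract = spark.createDataFrame(
    [(1, "created", "2024-01-01"), (1, "shipped", "2024-01-03")],
    ["order_id", "status", "change_date"],
)
extract.write.mode("append").parquet("/psa/orders_history")

# SQL interface: an external table over the same files, so changes over time
# can be traced with ordinary queries.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS psa_orders_history (
        order_id    BIGINT,
        status      STRING,
        change_date STRING
    )
    STORED AS PARQUET
    LOCATION '/psa/orders_history'
""")
spark.sql("SELECT * FROM psa_orders_history ORDER BY order_id, change_date").show()
```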

Real-time Architecture

Modern Big Data architectures often incorporate real-time processing capabilities to handle data as it arrives, enabling immediate insights and actions.

  1. Real-time Producer: Devices or sensors that generate raw data

    • Production line aggregators
    • IoT devices, web services, or REST APIs
  2. Broker: Buffer to prevent data loss during peak data rates

    • Publish/subscribe messaging system for asynchronous exchange
    • Works as a FIFO buffer to ensure all data is eventually processed
    • Publishers are the sensors or devices; the subscriber is the real-time engine
  3. Engine: Processes incoming data for predictions, alarms, or automated actions

    • May be implemented as dedicated software, generic frameworks, or custom solutions
  4. Data Lake: Storage for raw data used for statistical analysis or data warehouse enrichment

    • Typically stores data as object blobs or files
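
One hedged way to wire these pieces together is to use Spark Structured Streaming as the engine, consuming from the Kafka broker and archiving raw records to the data lake (the broker, topic, and paths are placeholders, and the spark-sql-kafka connector must be on the Spark classpath):

```python
# Sketch: a real-time engine reading from the broker (Kafka) and writing raw
# records to the data lake as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("realtime-engine").getOrCreate()

# Subscribe to the topic published by the real-time producers.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "machine-events")
    .load()
    .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
)

# Land the raw stream in the data lake; checkpointing makes the job restartable.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "/datalake/raw/machine_events")
    .option("checkpointLocation", "/datalake/checkpoints/machine_events")
    .start()
)
query.awaitTermination()
```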

Search Engine Integration

Search engines provide powerful capabilities for finding and analyzing data across the Big Data ecosystem.

  • Creates a "virtual" single repository by fusing structured, unstructured, and federated data
  • Provides secure and granular access to all applications and data stores
  • Supports tagging, commenting, organizing, and rating content to enhance navigation and discovery
  • Offers dynamic clustering and text analysis
  • Can ingest domain-specific taxonomies and dictionaries
  • Scales horizontally to handle growing data volumes
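
As one concrete (assumed) example, an engine such as Elasticsearch could index documents drawn from several stores and answer full-text queries with filters; the sketch uses the elasticsearch Python client with made-up host, index, and field names:

```python
# Sketch: indexing and searching documents with the `elasticsearch` Python client.
# Elasticsearch is only one possible engine; host, index, and fields are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document that fuses fields from structured and unstructured sources.
es.index(
    index="knowledge-base",
    id="doc-1",
    document={
        "title": "Line 3 maintenance report",
        "body": "Replaced bearing on conveyor after vibration alarm.",
        "tags": ["maintenance", "line-3"],
    },
)

# Full-text search on the body, filtered by tag.
response = es.search(
    index="knowledge-base",
    query={
        "bool": {
            "must": {"match": {"body": "vibration"}},
            "filter": {"term": {"tags": "maintenance"}},
        }
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_source"]["title"])
```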
