Big Data: Concepts, Architecture, and Technologies

Core Concepts of Big Data

Big Data is typically characterized by the "4Vs," which describe the challenges that traditional data processing systems struggle to address:

  • Volume: The sheer scale of data being generated, often reaching petabytes and beyond
  • Velocity: The speed at which new data is created and needs to be processed, often in real time
  • Variety: The diverse formats of data (structured, semi-structured, and unstructured)
  • Veracity: The uncertainty, biases, and potential inaccuracies in data that require verification

Hadoop

Hadoop emerged as one of the first comprehensive frameworks designed specifically for Big Data processing. Its distributed architecture allows for horizontal scaling by simply adding more commodity hardware nodes.

HDFS (Hadoop Distributed File System)

HDFS is a fault-tolerant distributed file system designed to run on commodity hardware. It provides high-throughput access to large datasets by splitting files into blocks and distributing them across a cluster.

  • NameNode (Master): Manages the file system namespace and regulates access to files
  • DataNode (Slave): Stores and retrieves blocks as directed by the NameNode
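
As a minimal sketch of this interaction (assuming the third-party `hdfs` Python package and a WebHDFS endpoint at a hypothetical `http://namenode:9870`), a client asks the NameNode for namespace operations while the actual block reads and writes happen on DataNodes:

```python
# Sketch: basic HDFS access via the third-party `hdfs` package (WebHDFS).
# The NameNode URL, user, and paths below are hypothetical placeholders.
from hdfs import InsecureClient

# The client contacts the NameNode for namespace operations (listing, metadata);
# block reads/writes are redirected to the DataNodes that hold the data.
client = InsecureClient("http://namenode:9870", user="analyst")

# Write a small file; HDFS splits large files into blocks behind the scenes.
with client.write("/data/raw/events.csv", overwrite=True) as writer:
    writer.write(b"event_id,timestamp\n1,2024-01-01T00:00:00\n")

# List the directory and read the file back.
print(client.list("/data/raw"))
with client.read("/data/raw/events.csv") as reader:
    print(reader.read().decode())
```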

YARN

YARN, introduced in Hadoop 2, is the resource management platform responsible for managing compute resources on Hadoop clusters. It enables multiple data processing engines to run alongside the traditional MapReduce paradigm.

Key components:

  • Resource Manager: Allocates resources among all applications in the system
  • Scheduler: Plans the execution of processes based on resource availability
  • Node Manager: Monitors resource usage and reports to the Resource Manager

These services work together to optimize execution and resource usage in the Hadoop environment, allowing for more efficient parallel processing.
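
For illustration, this is roughly how an application declares its resource needs to YARN when submitted from PySpark; the ResourceManager grants containers that NodeManagers launch on worker nodes (the memory and core values are arbitrary examples, and a Hadoop/YARN client configuration is assumed on the submitting machine):

```python
# Sketch: requesting resources from YARN for a PySpark application.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("yarn-resource-demo")           # visible in the ResourceManager UI
    .master("yarn")                           # delegate scheduling to YARN
    .config("spark.executor.instances", "4")  # how many containers to request
    .config("spark.executor.memory", "2g")    # memory per container
    .config("spark.executor.cores", "2")      # vcores per container
    .getOrCreate()
)

# Each executor runs inside a YARN container managed by a NodeManager.
print(spark.sparkContext.uiWebUrl)
spark.stop()
```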

Main components of Hadoop

Cluster Node Types

A typical Hadoop cluster consists of three types of nodes, each serving a specific purpose in the distributed processing framework.

Edge Node

Edge nodes act as the interface between users and the cluster:

  • Users submit jobs and queries through these nodes (e.g., Spark jobs, Hive queries)
  • They host client tools and interfaces (e.g., Hue, CLI tools)
  • They ensure secure access to the internal cluster

Master Nodes

Master nodes handle coordination and management of resources and tasks:

  • NameNode (HDFS): Tracks where data is stored
  • ResourceManager (YARN): Allocates computing resources
  • Hive Metastore / Impala Catalog: Manages metadata for querying

Multiple master nodes ensure high availability—if one fails, others take over.

Worker Nodes

Worker nodes store the actual data and perform computations:

  • DataNodes (HDFS): Store file blocks
  • NodeManagers (YARN): Execute processing tasks
  • Spark Executors / Hive Workers: Process data in parallel

The more worker nodes, the more data and jobs a cluster can handle.

HDFS: Fault-Tolerant Storage

HDFS, which stands for Hadoop Distributed File System, is the cluster's distributed file system and can be extended as required by simply adding nodes.

Its defining characteristic is fault tolerance: no information is lost in case of faults or malfunctions.

This is guaranteed by the replication scheme HDFS uses to store files: each file is split into blocks, and every block is replicated across multiple nodes, so several copies of the same data exist in the cluster.
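
To make the replication idea concrete, the standard Hadoop CLI can show where a file's blocks live and change how many copies exist; the sketch below simply wraps those commands from Python (the path and replication factor are illustrative, and the Hadoop client tools are assumed to be installed):

```python
# Sketch: inspecting and changing HDFS replication with the standard `hdfs` CLI.
import subprocess

path = "/data/raw/events.csv"  # illustrative path

# Show block locations and the current replication factor for the file.
subprocess.run(["hdfs", "fsck", path, "-files", "-blocks", "-locations"], check=True)

# Raise the replication factor to 3 and wait until the extra copies exist.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3", path], check=True)
```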

Hive: SQL Interface for Hadoop

Apache Hive provides a data warehouse infrastructure built on top of Hadoop for data summarization, querying, and analysis. It offers a SQL-like interface to Hadoop, making it accessible to users familiar with traditional database systems.

Hive consists of two main services:

  • Hive Server: The SQL engine that executes queries
  • Metastore: The table catalog that maintains metadata about tables and partitions
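
As a hedged example of this SQL-like interface, the snippet below uses PySpark with Hive support, so table definitions go through the Metastore while the SQL runs as a distributed job over data in HDFS (the table and column names are made up):

```python
# Sketch: defining and querying a Hive-managed table from PySpark.
# enableHiveSupport() makes Spark use the Hive Metastore as its table catalog.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-demo")
    .enableHiveSupport()
    .getOrCreate()
)

# The DDL is recorded in the Metastore; data lives in the warehouse directory on HDFS.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id   BIGINT,
        amount     DOUBLE,
        order_date DATE
    )
    STORED AS PARQUET
""")

# A familiar SQL aggregation, executed in parallel across the cluster.
spark.sql("""
    SELECT order_date, SUM(amount) AS daily_revenue
    FROM sales
    GROUP BY order_date
    ORDER BY order_date
""").show()
```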

HCatalog: Table and Storage Management Layer

HCatalog is a table and storage management layer for Hadoop. It provides a high-level abstraction to other frameworks (such as MapReduce, Spark, and Pig) by handling I/O operations against the distributed storage layer for Hive tables.

HCatalog comprises 3 key elements:

  • SerDe: Serialization and deserialization library to process various data formats
  • Metastore DB: Stores the schema of Hive tables
  • WebHCat/HCatalog REST: UI/REST layer on top of metastore DB for web clients

HBase: Distributed NoSQL Database

HBase is a distributed, scalable, big data store designed to support random, real-time read/write access to large datasets.

  • Data is partitioned based on RowKeys into Regions
  • Each Region contains a range of RowKeys based on their binary order
  • A RegionServer can contain several Regions
  • All Regions contained in a RegionServer share one write-ahead log (WAL)
  • Regions are automatically split if they become too large
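
A small sketch of random, RowKey-addressed reads and writes, using the third-party happybase client against a hypothetical HBase Thrift server (the host, table, and column family names are assumptions):

```python
# Sketch: RowKey-based access to HBase via the third-party `happybase` client.
# Assumes an HBase Thrift server at the hypothetical host below and an existing
# table 'sensor_readings' with a column family 'm'.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("sensor_readings")

# Writes and point reads are addressed by RowKey.
table.put(b"device42#2024-01-01T00:00:00", {b"m:temperature": b"21.5"})
print(table.row(b"device42#2024-01-01T00:00:00"))

# A range scan over a RowKey interval; HBase serves it from the Regions
# whose key ranges overlap the requested interval.
for key, data in table.scan(row_start=b"device42#", row_stop=b"device42$"):
    print(key, data)

connection.close()
```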

Kafka: Streaming Data Platform

Apache Kafka is a distributed streaming platform that provides three key capabilities:

  1. Publishing and subscribing to streams of records
  2. Storing streams of records in a fault-tolerant way
  3. Processing streams of records as they occur

4 Core APIs

  • Producer API: Allows applications to publish streams of records to one or more Kafka topics
  • Consumer API: Enables applications to subscribe to topics and process the streams of records
  • Streams API: Supports applications acting as stream processors, transforming input streams into output streams
  • Connector API: Facilitates building reusable producers or consumers that connect Kafka topics to existing applications or data systems
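
To illustrate the Producer and Consumer APIs, here is a minimal sketch using the third-party kafka-python client (the broker address, topic, and payload are placeholders):

```python
# Sketch: publish/subscribe with the third-party `kafka-python` client.
from kafka import KafkaProducer, KafkaConsumer

# Producer API: publish records to a topic.
producer = KafkaProducer(bootstrap_servers="broker:9092")
producer.send("machine-events", key=b"line-1", value=b'{"status": "ok"}')
producer.flush()

# Consumer API: subscribe to the topic and process records as they arrive.
consumer = KafkaConsumer(
    "machine-events",
    bootstrap_servers="broker:9092",
    group_id="real-time-engine",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.key, record.value)
    break  # stop after one record; a real consumer would keep iterating
```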

Spark

Spark introduces the concept of Resilient Distributed Datasets (RDDs)—immutable, fault-tolerant collections of objects that can be operated on in parallel.

  • RDDs can contain any type of object and are created by loading external datasets or distributing collections from the driver program
  • RDDs support two types of operations:
    • Transformations: Operations (such as map, filter, join, union) that yield a new RDD containing the result
    • Actions: Operations (such as reduce, count, first) that return a value after computation on an RDD
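
A short PySpark sketch of this distinction (the numbers are arbitrary): transformations only build a lazy lineage, and nothing executes until an action is called.

```python
# Sketch: RDD transformations vs. actions in PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Create an RDD by distributing a local collection from the driver program.
numbers = sc.parallelize(range(1, 11))

# Transformations: lazily describe new RDDs; nothing runs yet.
squares = numbers.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)

# Actions: trigger execution on the cluster and return values to the driver.
print(even_squares.count())                     # 5
print(even_squares.reduce(lambda a, b: a + b))  # 4 + 16 + 36 + 64 + 100 = 220
print(even_squares.first())                     # 4

spark.stop()
```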

Spark Core is the base engine for large-scale parallel and distributed data processing, responsible for:

  • Memory management and fault recovery
  • Scheduling, distributing, and monitoring jobs on a cluster
  • Interacting with storage systems

Other components include:

  • Spark SQL: Supports querying data via SQL or Hive Query Language
  • Spark Streaming: Enables real-time processing of streaming data
  • MLlib: Provides machine learning algorithms designed to scale out on a cluster
  • GraphX: Offers a library for manipulating and performing graph-parallel operations
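
As one hedged example of how these components build on Spark Core, the sketch below runs a small MLlib clustering job on a toy DataFrame (the data and parameters are made up purely for illustration):

```python
# Sketch: a tiny MLlib pipeline (KMeans) on a Spark DataFrame.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# A small in-memory DataFrame standing in for a large distributed dataset.
df = spark.createDataFrame(
    [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)],
    ["x", "y"],
)

# Assemble feature columns into a vector, then cluster into two groups.
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)
model = KMeans(k=2, seed=42).fit(features)
model.transform(features).select("x", "y", "prediction").show()

spark.stop()
```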

PSA - Persistent Staging Area

A Persistent Staging Area (PSA) serves as an intermediate storage layer between source systems and analytical data stores. Its key characteristics include:

  • Large data storage capacity for both structured and unstructured data
  • Ability to trace any data changes over time
  • Data always available via both SQL and file interfaces
  • Scalability through adding new nodes to the cluster
  • Parallel processing engine for analyzing any type of data
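
As an illustration of the "SQL and file interfaces" point (the paths and table names are hypothetical), data landed as files in HDFS can also be exposed to SQL engines as an external table, without moving it:

```python
# Sketch: landing data in a persistent staging area on HDFS and exposing it
# both as files (Parquet) and via SQL (an external Hive table).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("psa-demo").enableHiveSupport().getOrCreate()

# File interface: append today's extract as Parquet under a staging path.
extract = spark.createDataFrame(
    [(1, "created", "2024-01-01"), (1, "shipped", "2024-01-03")],
    ["order_id", "status", "change_date"],
)
extract.write.mode("append").parquet("/psa/orders_history")

# SQL interface: an external table over the same files, so changes over time
# can be traced with ordinary queries.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS psa_orders_history (
        order_id    BIGINT,
        status      STRING,
        change_date STRING
    )
    STORED AS PARQUET
    LOCATION '/psa/orders_history'
""")
spark.sql("SELECT * FROM psa_orders_history ORDER BY order_id, change_date").show()
```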

Real-time Architecture

Modern Big Data architectures often incorporate real-time processing capabilities to handle data as it arrives, enabling immediate insights and actions.

  1. Real-time Producer: Devices or sensors that generate raw data

    • Production line aggregators
    • IoT devices, web services, or REST APIs
  2. Broker: Buffer to prevent data loss during peak data rates

    • Publish/subscribe messaging system for asynchronous exchange
    • Works as a FIFO buffer to ensure all data is eventually processed
    • Publishers are the sensors or devices; the subscriber is the real-time engine
  3. Engine: Processes incoming data for predictions, alarms, or automated actions

    • May be implemented as dedicated software, generic frameworks, or custom solutions
  4. Data Lake: Storage for raw data used for statistical analysis or data warehouse enrichment

    • Typically stores data as object blobs or files
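
One hedged way to wire these pieces together is to use Spark Structured Streaming as the engine, consuming from the Kafka broker and archiving raw records to the data lake (the broker, topic, and paths are placeholders, and the spark-sql-kafka connector must be on the Spark classpath):

```python
# Sketch: a real-time engine reading from the broker (Kafka) and writing raw
# records to the data lake as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("realtime-engine").getOrCreate()

# Subscribe to the topic published by the real-time producers.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "machine-events")
    .load()
    .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
)

# Land the raw stream in the data lake; checkpointing makes the job restartable.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "/datalake/raw/machine_events")
    .option("checkpointLocation", "/datalake/checkpoints/machine_events")
    .start()
)
query.awaitTermination()
```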

Search Engine Integration

Search engines provide powerful capabilities for finding and analyzing data across the Big Data ecosystem.

  • Creates a "virtual" single repository by fusing structured, unstructured, and federated data
  • Provides secure and granular access to all applications and data stores
  • Supports tagging, commenting, organizing, and rating content to enhance navigation and discovery
  • Offers dynamic clustering and text analysis
  • Can ingest domain-specific taxonomies and dictionaries
  • Scales horizontally to handle growing data volumes
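
As one concrete (assumed) example, an engine such as Elasticsearch could index documents drawn from several stores and answer full-text queries with filters; the sketch uses the elasticsearch Python client with made-up host, index, and field names:

```python
# Sketch: indexing and searching documents with the `elasticsearch` Python client.
# Elasticsearch is only one possible engine; host, index, and fields are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document that fuses fields from structured and unstructured sources.
es.index(
    index="knowledge-base",
    id="doc-1",
    document={
        "title": "Line 3 maintenance report",
        "body": "Replaced bearing on conveyor after vibration alarm.",
        "tags": ["maintenance", "line-3"],
    },
)

# Full-text search on the body, filtered by tag.
response = es.search(
    index="knowledge-base",
    query={
        "bool": {
            "must": {"match": {"body": "vibration"}},
            "filter": {"term": {"tags": "maintenance"}},
        }
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_source"]["title"])
```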
