Big Data: Concepts, Architecture, and Technologies
- Author: Mai TIEU Khoi (@tieukhoimai)

Core Concepts of Big Data
Big Data is typically characterized by the "4Vs," which describe the challenges that traditional data processing systems struggle to address:
| Characteristic | Description |
| --- | --- |
| Volume | The sheer scale of data being generated, often reaching petabytes and beyond |
| Velocity | The speed at which new data is being created and needs to be processed, often in real-time |
| Variety | The diverse formats of data—structured, semi-structured, and unstructured |
| Veracity | The uncertainty, biases, and potential inaccuracies in data requiring verification |
Hadoop
Hadoop emerged as one of the first comprehensive frameworks designed specifically for Big Data processing. Its distributed architecture allows for horizontal scaling by simply adding more commodity hardware nodes.
HDFS (Hadoop Distributed File System)
HDFS is a fault-tolerant distributed file system designed to run on commodity hardware, whether on-premises or in the cloud. It provides high-throughput access to large datasets by splitting files into blocks and distributing them across a cluster.
- NameNode (Master): Manages the file system namespace and regulates access to files
- DataNode (Slave): Stores and retrieves blocks as directed by the NameNode
YARN
YARN, introduced in Hadoop 2, is the resource management platform responsible for managing compute resources on Hadoop clusters. It enables multiple data processing engines to run alongside the traditional MapReduce paradigm.
Key components:
- Resource Manager: Allocates resources among all applications in the system
- Scheduler: Plans the execution of processes based on resource availability
- Node Manager: Monitors resource usage and reports to the Resource Manager
These services work together to optimize execution and resource usage in the Hadoop environment, allowing for more efficient parallel processing.
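As a rough illustration of how an application negotiates with YARN, the PySpark sketch below asks the Resource Manager for a fixed number of executor containers. It assumes a client machine where Spark and the Hadoop/YARN configuration are already in place; the application name and resource values are placeholders.

```python
from pyspark.sql import SparkSession

# Sketch: a Spark application submitted to YARN. The Resource Manager grants
# containers, and Node Managers on the worker nodes launch the executors.
spark = (
    SparkSession.builder
    .appName("yarn-resource-demo")
    .master("yarn")                            # hand scheduling over to YARN
    .config("spark.executor.instances", "4")   # number of executor containers requested
    .config("spark.executor.memory", "2g")     # memory per container
    .config("spark.executor.cores", "2")       # vcores per container
    .getOrCreate()
)

# A trivial job, just to make the granted executors do some work
print(spark.sparkContext.parallelize(range(1_000_000)).sum())

spark.stop()
```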
Main Components of Hadoop
Cluster (3 Node Types)
A typical Hadoop cluster consists of three types of nodes, each serving a specific purpose in the distributed processing framework.
Edge Node
Edge nodes act as the interface between users and the cluster:
- Users submit jobs and queries through these nodes (e.g., Spark jobs, Hive queries)
- They host client tools and interfaces (e.g., Hue, CLI tools)
- They ensure secure access to the internal cluster
Master Nodes
Master nodes handle coordination and management of resources and tasks:
- NameNode (HDFS): Tracks where data is stored
- ResourceManager (YARN): Allocates computing resources
- Hive Metastore / Impala Catalog: Manages metadata for querying
Multiple master nodes ensure high availability—if one fails, others take over.
Worker Nodes
Worker nodes store the actual data and perform computations:
- DataNodes (HDFS): Store file blocks
- NodeManagers (YARN): Execute processing tasks
- Spark Executors / Hive Workers: Process data in parallel
The more worker nodes, the more data and jobs a cluster can handle.
HDFS: Fault-Tolerant Storage
HDFS, the Hadoop Distributed File System, can be extended as required by simply adding nodes.
It is designed to be a fault-tolerant file system: no information is lost in case of faults or malfunctions.
This is guaranteed by the replication mechanism HDFS uses to store files. Each file is split into blocks, and every block is replicated across the cluster, so multiple copies of the same data always exist.
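A minimal sketch of these ideas, using the third-party `hdfs` Python package (a WebHDFS client): the NameNode address, user, and paths are placeholders, and the snippet assumes WebHDFS is enabled on the cluster.

```python
from hdfs import InsecureClient

# Placeholder NameNode address and user for a real cluster
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small file; HDFS splits larger files into blocks and replicates each block
client.write("/data/raw/example.csv", data=b"id,value\n1,42\n", overwrite=True)

# The NameNode reports per-file metadata, including block size and replication factor
status = client.status("/data/raw/example.csv")
print(status["length"], status["blockSize"], status["replication"])
```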
Hive: SQL Interface for Hadoop
Apache Hive provides a data warehouse infrastructure built on top of Hadoop for data summarization, querying, and analysis. It offers a SQL-like interface to Hadoop, making it accessible to users familiar with traditional database systems.
Hive consists of two main services:
- Hive Server: The SQL engine that executes queries
- Metastore: The table catalog that maintains metadata about tables and partitions
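For illustration, the sketch below submits a query to a HiveServer2 endpoint through the PyHive library. The host, port, username, and the `sales` table are assumptions, not defaults of a stock installation.

```python
from pyhive import hive

# Placeholder HiveServer2 connection details
conn = hive.Connection(host="hive-server", port=10000, username="analyst", database="default")
cursor = conn.cursor()

# HiveQL looks like SQL; Hive compiles it into jobs that run on the cluster
cursor.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
for region, total in cursor.fetchall():
    print(region, total)

cursor.close()
conn.close()
```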
HCatalog: Table and Storage Management Layer
HCatalog is a table and storage management layer for Hadoop. It gives other frameworks, such as MapReduce, Spark, and Pig, a high-level table abstraction by performing I/O operations against the distributed storage layer for Hive tables.
HCatalog comprises 3 key elements:
- SerDe: Serialization and deserialization library to process various data formats
- Metastore DB: Stores the schema of Hive tables
- WebHCat/HCatalog REST: UI/REST layer on top of metastore DB for web clients
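As a hedged sketch of the REST element, the snippet below queries WebHCat with the `requests` library. It assumes a WebHCat service on its default port 50111, a user named `analyst`, and an existing `sales` table.

```python
import requests

# Placeholder WebHCat endpoint and user
base = "http://webhcat-host:50111/templeton/v1"
params = {"user.name": "analyst"}

# List the tables that the metastore knows about in the `default` database
resp = requests.get(f"{base}/ddl/database/default/table", params=params)
resp.raise_for_status()
print(resp.json())

# Describe a single table's schema (assumes a table named `sales` exists)
resp = requests.get(f"{base}/ddl/database/default/table/sales", params=params)
print(resp.json())
```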
HBase: Distributed NoSQL Database
HBase is a distributed, scalable, big data store designed to support random, real-time read/write access to large datasets.
- Data is partitioned based on RowKeys into Regions
- Each Region contains a range of RowKeys based on their binary order
- A RegionServer can contain several Regions
- All Regions contained in a RegionServer share one write-ahead log (WAL)
- Regions are automatically split if they become too large
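To illustrate RowKey-oriented access, the sketch below uses the happybase client over the HBase Thrift gateway. The host, the `sensor_readings` table, and its `m` column family are assumptions and must already exist.

```python
import happybase

# Placeholder Thrift gateway host; the table and column family are assumed to exist
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("sensor_readings")

# Rows are addressed by RowKey and kept in binary RowKey order, so key design
# (here: device id + timestamp) determines which Region serves the request
table.put(b"device42#2024-01-01T12:00:00", {b"m:temperature": b"21.5"})

# Scan a contiguous RowKey range, i.e. a slice of one or more Regions
for key, data in table.scan(row_start=b"device42#", row_stop=b"device42$"):
    print(key, data)

connection.close()
```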
Kafka: Streaming Data Platform
Apache Kafka is a distributed streaming platform that provides three key capabilities:
- Publishing and subscribing to streams of records
- Storing streams of records in a fault-tolerant way
- Processing streams of records as they occur
4 Core APIs
| API | Description |
| --- | --- |
| Producer API | Allows applications to publish streams of records to one or more Kafka topics |
| Consumer API | Enables applications to subscribe to topics and process the streams of records |
| Streams API | Supports applications acting as stream processors, transforming input streams to output streams |
| Connector API | Facilitates building reusable producers or consumers that connect Kafka topics to existing applications or data systems |
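The sketch below exercises the Producer and Consumer APIs with the kafka-python client; the broker address, topic name, and consumer group are placeholders.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer API: publish a keyed record to a topic
producer = KafkaProducer(bootstrap_servers="broker:9092")
producer.send("machine-events", key=b"machine-7", value=b'{"temp": 78.2}')
producer.flush()

# Consumer API: subscribe to the topic and process the stream of records
consumer = KafkaConsumer(
    "machine-events",
    bootstrap_servers="broker:9092",
    group_id="quality-checks",       # consumers in the same group share partitions
    auto_offset_reset="earliest",    # start from the oldest retained records
)
for record in consumer:              # blocks and keeps polling for new records
    print(record.key, record.value, record.offset)
```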
Spark
Spark introduces the concept of Resilient Distributed Datasets (RDDs)—immutable, fault-tolerant collections of objects that can be operated on in parallel.
- RDDs can contain any type of object and are created by loading external datasets or distributing collections from the driver program
- RDDs support two types of operations:
- Transformations: Operations (such as map, filter, join, union) that yield a new RDD containing the result
- Actions: Operations (such as reduce, count, first) that return a value after computation on an RDD
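The minimal PySpark sketch below runs in local mode and shows the difference between lazy transformations and the actions that actually trigger computation.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Create an RDD by distributing a collection from the driver program
numbers = sc.parallelize(range(1, 11))

# Transformations are lazy: they only describe new RDDs
evens = numbers.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Actions trigger the computation and return values to the driver
print(squares.count())                      # 5
print(squares.reduce(lambda a, b: a + b))   # 4 + 16 + 36 + 64 + 100 = 220
print(squares.first())                      # 4

sc.stop()
```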
Spark Core is the base engine for large-scale parallel and distributed data processing, responsible for:
- Memory management and fault recovery
- Scheduling, distributing, and monitoring jobs on a cluster
- Interacting with storage systems

Other components include:
- Spark SQL: Supports querying data via SQL or Hive Query Language
- Spark Streaming: Enables real-time processing of streaming data
- MLlib: Provides machine learning algorithms designed to scale out on a cluster
- GraphX: Offers a library for manipulating and performing graph-parallel operations
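As a small illustration of Spark SQL sitting on top of Spark Core, the sketch below builds a DataFrame in local mode and queries it with SQL; the table and column names are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").master("local[*]").getOrCreate()

# Register an in-memory DataFrame as a temporary SQL view
df = spark.createDataFrame(
    [("EU", 120.0), ("US", 340.5), ("EU", 80.0)],
    ["region", "amount"],
)
df.createOrReplaceTempView("sales")

# The same engine also accepts Hive Query Language when Hive support is enabled
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

spark.stop()
```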
PSA - Persistent Staging Area
A Persistent Staging Area (PSA) serves as an intermediate storage layer between source systems and analytical data stores. Key capabilities include:
- Large data storage capacity for both structured and unstructured data
- Ability to trace any data changes over time
- Data always available via both SQL and file interfaces
- Scalability through adding new nodes to the cluster
- Parallel processing engine for analyzing any type of data
Real-time Architecture
Modern Big Data architectures often incorporate real-time processing capabilities to handle data as it arrives, enabling immediate insights and actions.
Real-time Producer: Devices or sensors that generate raw data
- Production line aggregators
- IoT devices, web services, or REST APIs
Broker: Buffer to prevent data loss during peak data rates
- Publish/subscribe messaging system for asynchronous exchange
- Works as a FIFO buffer to ensure all data is eventually processed
- Publishers are the sensors/devices; the subscriber is the real-time engine
Engine: Processes incoming data for predictions, alarms, or automated actions
- May be implemented as dedicated software, generic frameworks, or custom solutions
Data Lake: Storage for raw data used for statistical analysis or data warehouse enrichment
- Typically stores data as object blobs or files
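A minimal sketch of the engine side of this pattern, assuming Kafka as the broker, sensors publishing JSON readings to a `sensor-readings` topic, and a local file standing in for the data lake; the threshold is an arbitrary stand-in for a real model or rule set.

```python
import json
from kafka import KafkaConsumer

# Subscribe to the broker as the real-time engine
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="broker:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

with open("raw_sensor_readings.jsonl", "a") as lake:    # stand-in for the data lake
    for record in consumer:
        reading = record.value
        lake.write(json.dumps(reading) + "\n")          # keep the raw event for later analysis
        if reading.get("temperature", 0) > 90:          # simple rule instead of a trained model
            print("ALARM: overheating on", reading.get("device_id"))
```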
Search Engine Integration
Search engines provide powerful capabilities for finding and analyzing data across the Big Data ecosystem.
- Creates a "virtual" single repository by fusing structured, unstructured, and federated data
- Provides secure and granular access to all applications and data stores
- Supports tagging, commenting, organizing, and rating content to enhance navigation and discovery
- Offers dynamic clustering and text analysis
- Can ingest domain-specific taxonomies and dictionaries
- Scales horizontally to handle growing data volumes