— blog
All Posts
Let's explore the source of my reflections on data, AI, and everything that sits between a raw dataset and a good decision, turn complexity into clarity, and share it all in the most naive way.
- Nov 4 · 10 min
Digital Twins: Definition, Knowledge Graph Integration and Human Roles
Tags: Digital Twins
An exploration of Digital Twin technologies, their evolution, and the integration of Knowledge Graphs for enhanced capabilities in smart manufacturing and Industry 5.0 applications.
- Nov 1 · 6 min
LLM-based Agents: Single and Multi-Agent Systems
Tags: LLMs, Agentic AI
Overview of LLM-based agents, including single-agent and multi-agent systems, core components (planning, memory, tool use), and common coordination and planning patterns.
- Oct 26 · 9 min
Preference Alignment: RLHF and DPO
Tags: LLMs, Fine-tuning, RLHF, DPO
An in-depth exploration of preference alignment techniques for LLMs, including Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO).
- Jul 4 · 7 min
Monolithic Data Lake vs Data Mesh
Tags: Data Engineering
This article compares monolithic data lake architecture with the decentralized data mesh approach. While data lakes centralize data for easier access, they face scalability challenges. Data mesh treats data as a product owned by domain teams, enhancing agility through four key principles: domain-oriented ownership, data as a product, self-serve infrastructure, and federated governance.
- Jul 3 · 7 min
Big Data: Concepts, Architecture, and Technologies
Tags: Data Engineering, Data Warehouse
This article explores the world of Big Data, covering core concepts like the 4Vs (Volume, Velocity, Variety, Veracity), key technologies including Hadoop, Kafka, and Spark, and modern architectures such as Persistent Staging Areas and real-time processing systems. It provides a comprehensive overview of the technologies that emerged to address the challenges of modern data growth beyond traditional database capabilities.
- Jul 3 · 6 min
Data Integration - Part 1: ETL, Pushdown and Data Orchestrator
Tags: Data Engineering, Data Warehouse, ETL, ELT
This article discusses data integration, focusing on combining data from multiple sources into a unified view for better analysis. It contrasts ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) methodologies, explaining their approaches to data handling. The article also highlights three types of ETL pushdown techniques and the concept of data orchestrators.
- Jul 2 · 6 min
Data Integration - Part 2: Loading Strategies, Change Data Capture and Data Layers
Tags: Data Engineering, Data Warehouse, CDC
This article explores data loading methodologies, including batch and streaming approaches, and various loading strategies such as full and incremental loads. It also examines Change Data Capture (CDC) techniques and the layered architecture of modern data warehouses, from raw data ingestion to presentation marts.
- Jul 2 · 4 min
Dimensional Modeling - Part 3: Dimensions Hierarchy
Tags: Data Warehouse, Database, Data Engineering, Data Modeling
This article examines various hierarchy types in data modeling, including Fixed Depth Positional Hierarchies, Slightly Ragged Hierarchies, and Ragged Hierarchies. Fixed Depth Hierarchies feature clear many-to-one relationships, such as product to brand, allowing for easy navigation and quick queries. The article discusses strategies for managing ragged hierarchies through the use of bridge tables and pathstring attributes to simplify analysis and improve performance.
- Jul 1 · 6 min
Data Modeling
Tags: Data Warehouse, Database, Data Modeling, Data Engineering
This article explores the importance of Data Modeling as a foundational blueprint for organizing information within a business, aiding in the development of a data warehouse. It emphasizes the role of a Logical Data Model (LDM) in establishing frameworks for business intelligence and analytics, ensuring data consistency, quality, and effective communication. The article also contrasts Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP) models and summarizes the transition from a Logical to a Physical Data Model (PDM) for enhanced database performance.
- Jul 1 · 3 min
Dimensional Modeling - Part 0: 4-Step Design Process
Tags: Data Warehouse, Database, Data Engineering, Data Modeling
This article outlines a structured approach to Data Design, detailing essential steps for creating effective data models that align with business needs. It covers gathering business requirements, conducting collaborative workshops, and following the Four-Step Dimensional Design Process. The article also features a use case for modeling data in a Fast Food outlet and discusses managing changes in dimension data using Slowly Changing Dimensions (SCDs).
- Jun 30 · 11 min
ER Modeling and Normalization
Tags: Database, Data Modeling
This article covers Entity-Relationship (E-R) Modeling, which visually depicts the relationships between identifiable entities in a database. It discusses key concepts such as entities and their attributes, relationships and cardinality, and the importance of normalization (1NF, 2NF, 3NF) to optimize data structure and reduce redundancy for improved data integrity and efficiency.
- Jun 29 · 12 min
Relational Database Management System
Tags: Database, Data Engineering
This article provides an overview of Relational Database Management Systems (RDBMS), covering key features such as data representation through tables, transaction principles (Atomicity, Consistency, Isolation, Durability), and concepts like primary and foreign keys. It highlights the importance of data integrity, security, normalization, and referential integrity in maintaining valid relationships between tables for effective data management.
- Mar 14 · 8 min
Retrieval Augmented Generation (RAG) with Vector Databases
Tags: LLMs, RAG, Vector Databases, Context Engineering
The article delves into Retrieval-Augmented Generation (RAG), which integrates retrieval and generative models to enhance GenAI applications efficiently. It highlights the architecture of RAG, utilizing vector databases for data retrieval and response generation.
- Dec 12 · 18 min
From Transformer to LLMs
Tags: LLMs, RAG, Prompt Engineering, Fine-tuning
This article explores the evolution from Transformers to Large Language Models (LLMs), detailing the mechanisms of self-attention and multi-head attention, the role of position embeddings, various types of transformer models, and the training and fine-tuning processes of LLMs.
- Jul 1 · 11 min
LLMs in LangChain - Part 2. LLMs Core Concepts
Tags: LLMs, Agentic AI
This article provides an overview of the core concepts of Large Language Models (LLMs) in LangChain, including LLM components, prompt templates, indexing, memory, chains, and agents.
- Jun 30 · 6 min
LLMs in LangChain - Part 1. Conceptual
Tags: LLMs, Agentic AI
This article provides an overview of Large Language Models (LLMs), their evolution, and the core concepts of LangChain, an open-source framework for building applications based on LLMs.
- Apr 1 · 9 min
Vector Database - Pinecone
Tags: Database, Vector Databases
In this article, we delve into vector databases using Pinecone and explore the fundamentals of vector embeddings, indexes, and the essential components of these databases. The article also provides a guide to setting up a vector database on Pinecone, walking through the installation process, obtaining API keys, and initializing client connections.
- Jan 6 · 7 min
Principle of Design in Data Visualization
Tags: Data Visualization
A data visualization, while creatively pleasing, should also serve a functional purpose by communicating data effectively, which is often described as combining art and science. In this context, we can refer to data visualization as a modern art form. Instead of providing step-by-step instructions for enhancing your dashboard, this article introduces seven design principles as a solid foundation for this “modern art” field.
- Nov 1 · 6 min
Gaussian Naive Bayes
Tags: Machine Learning
We delve into the intricacies of Gaussian Naive Bayes classification. The focus is on determining the probability of a data point belonging to a specific class among several, emphasizing probabilistic assessment over precise labeling. The article breaks down key concepts, from Bayesian decision theory to Bayes' theorem, and provides a step-by-step implementation using the Iris dataset.
- Aug 8 · 9 min
Dimensional Modeling - Part 2: Basic Dimension Table Techniques
Tags: Data Warehouse, Database, Data Engineering, Data Modeling, Dimensional Modeling
The topics covered include Degenerate Dimension, Conformed Dimension, Role-Playing Dimension, Junk Dimension, Outrigger Dimension, and Slowly Changing Dimensions (SCD). The SCD section covers Type 0 through Type 7, each with its own approach to handling historical and changing data.
- Apr 213 min
Dimensional Modeling - Part 1: Basic Fact Table Techniques
Tags: Data Warehouse, Database, Data Engineering, Data Modeling, Dimensional Modeling
In this article, I will introduce the concept of the Basic Fact table in Dimensional data modeling. To understand this technique, we will explore the different types of data modeling and recap some fundamental knowledge, including the star and snowflake schemas, and the concepts of normalization.
- Dec 18 · 5 min
Probability and Statistics: Two Sides of the Same Coin
Tags: Statistics
Some people asked me about statistics and probability, and I answered that I had studied only a little and knew very little about statistics. Then they asked, "Aren't these the same thing?" In fact, statistics and probability are distinct from one another.
- Sep 25 · 6 min
Colors and Data
Tags: Data Visualization
Color makes a chart look better and makes it easier for people to understand the data it shows. Based on the types of data, the colors used for data visualization can be put into three groups: categorical colors, sequential colors, and diverging colors.
- Aug 18 · 7 min
Think About Data
Tags: Data Awareness, Data Informed, Data Driven
This post discusses three ways of thinking about data (data aware, data informed, and data driven) as a data strategy for various jobs or in daily life. It is part of a sharing session I presented at Fossil Vietnam in August 2022.