Retrieval Augmented Generation (RAG) with Vector Databases
Author: Mai TIEU Khoi (@tieukhoimai)

I. Why Retrieval-Augmented Generation (RAG)?
Retrieval-Augmented Generation refers to an advanced natural language processing technique that combines the strengths of both retrieval models and generative models.
Source: RAG Vs VectorDB
RAG is often the preferred approach from a cost-effectiveness perspective. There are three main strategies for enhancing GenAI applications; let's examine each of them and why RAG stands out for its practical effectiveness.
1. Fine-tuning and Parameter-efficient Fine-tuning (PEFT)
- For smaller models like BERT, we can easily do “full” fine-tuning
- For larger LLMs (like Llama, Phi, Mistral, Qwen, …), this is more complicated, so we optimize only a small fraction of the parameters, an approach called Parameter-Efficient Fine-Tuning (PEFT). Common PEFT techniques include:
- Adapters: Insert small trainable modules between layers
- LoRA (Low-Rank Adaptation): Modifies the model weights using low-rank matrices (see the sketch after this list)
- Soft Prompting: Appends trainable vectors to the input prompt
- IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations): Rescales attention and feed-forward activations in transformer layers with learned vectors
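To make PEFT concrete, here is a minimal LoRA sketch using the Hugging Face peft library; the model name, target modules, and hyperparameters are illustrative assumptions rather than a recommended setup.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumed base model; any causal LM from the Hugging Face Hub could be used
base_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor applied to the LoRA updates
    target_modules=["q_proj", "v_proj"],  # which attention projections to adapt (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the frozen base model so that only the LoRA parameters are trainable
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters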
However, fine-tuning involves recurring, costly, and time-intensive labeling by experts, along with constant quality monitoring due to data changes and model-accuracy drift.
2. Model alignment
Getting the model to output “what we want” is often called alignment. We seek to “align” the model with certain values & rules of conduct. This is not purely technological; it involves deep ethical and political considerations.
Two common alignment strategies are instruction tuning and preference alignment.
2.1. Instruction Tuning
Classical fine-tuning on a language modelling objective (= next token prediction), but on a dataset of <instruction, correct response> pairs.
How can such a dataset be collected?
- Manually, by asking humans to write a question/problem and a correct answer to that problem
- By converting existing datasets for supervised NLP tasks (for text classification, sentiment analysis, NER, machine translation, etc.) to prompts → answers
- By using LLMs to generate instruction → response pairs, based on external databases, general guidelines, etc. (a minimal formatting sketch follows this list)
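As a small sketch of what such data looks like in practice, an <instruction, correct response> pair can be rendered into a single training sequence with a prompt template; the template wording below is an assumption, since real instruction-tuning datasets use many different formats.

# Hypothetical <instruction, response> pair rendered into one training sequence
example = {
    "instruction": "Summarize the plot of the movie Inception in one sentence.",
    "response": "A thief who steals secrets through dreams is hired to plant an idea in a target's mind.",
}

# Assumed template; the model is then trained with next-token prediction on the full text
training_text = (
    "### Instruction:\n"
    f"{example['instruction']}\n\n"
    "### Response:\n"
    f"{example['response']}"
)
print(training_text)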
2.2. Preference Alignment
One can also improve the model alignment using preference data, such as ratings of answers to prompts
- Several formats are possible, but a common one is to use triples
<prompt, chosen response, rejected response>
- The ratings can be collected by paying crowdworkers (based on guidelines that define what constitutes a “good” response), or by asking users to rate the responses they get
Given a dataset of human preferences, the goal is to optimize the LLM to maximize the probability of getting a positive rating and minimize the probability of a negative rating. Several optimization methods are possible, most prominently Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO).
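For intuition, here is a minimal sketch of the DPO objective, assuming we already have the summed log-probabilities of the chosen and rejected responses under the current policy and a frozen reference model; the beta value and the toy numbers are assumptions.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the policy vs. the reference model for chosen and rejected responses
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response over the rejected one
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy batch of two preference triples with made-up log-probabilities
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.5]))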
Despite its advantages, model alignment is complex, involving constant refinement to ensure models remain aligned with evolving ethical and user preferences. It can be resource-intensive, requiring substantial manual effort to curate datasets and provide feedback.
3. Prompt Engineering
A prompt is simply the input to the LLM, which can be a question or instructions on the task to perform. Many LLMs distinguish between the system and user prompt:
- System prompt: Generic instructions, defined by the system provider, on how the system should respond (be helpful, avoid offensive language, etc.) and how it should present itself to the user (“You are ChatGPT, a large language model trained by OpenAI…”)
- User prompt: Specific instructions or questions provided by the user
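For illustration, system and user prompts are typically passed as separate messages to a chat model; the sketch below uses the OpenAI Python client, and the model name and prompt wording are assumptions.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Avoid offensive language."},
        {"role": "user", "content": "Explain retrieval-augmented generation in two sentences."},
    ],
)
print(response.choices[0].message.content)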
Prompt engineering involves experimenting with and fine-tuning the instructions given to a model to elicit desired performances. This approach is the most cost-effective method to enhance the precision of your GenAI application since prompt adjustments can be implemented swiftly with minimal code modifications.
Though this technique refines the outputs generated by LLMs, it does not equip them with new or dynamic context. Consequently, GenAI applications might still encounter limitations in accessing up-to-date information, making them prone to inaccuracies or context misinterpretations (often referred to as "hallucinations").
II. What is RAG
1. Basic Idea
RAG combines two key processes to enhance response generation:
- Retrieval: Given a user prompt, we first search for relevant documents in a text database
- Sparse retrieval: Utilizes algorithms that search through indexed terms
- Dense retrieval: Employs document embeddings to find semantically relevant texts
- Retrieval can also draw on non-text sources
- Generation: The most relevant texts are then added to the prompt to generate the response
Given the following information: [RETRIEVED TEXTS]
Answer the following question: [USER QUESTION]
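Putting the two steps together, a minimal dense-retrieval sketch could look like the following; the sentence-transformers model and the toy documents are assumptions, and a sparse retriever (e.g. BM25) would simply replace the embedding similarity with term-matching scores.

from sentence_transformers import SentenceTransformer, util

documents = [
    "Inception (2010) is a science-fiction heist film directed by Christopher Nolan.",
    "The Godfather (1972) is a crime film directed by Francis Ford Coppola.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
doc_embeddings = model.encode(documents, convert_to_tensor=True)

question = "Who directed Inception?"
query_embedding = model.encode(question, convert_to_tensor=True)

# Retrieve the most semantically similar document via cosine similarity
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best_doc = documents[int(scores.argmax())]

prompt = f"Given the following information: {best_doc}\nAnswer the following question: {question}"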
2. RAG architecture with Vector Database
The architecture for RAG is implemented using transformers and consists of two parts:
- An encoder: when a user asks a question, the input text is encoded into vectors capturing the meaning of the words
- A decoder: the encoded query is matched against our document index, and the decoder generates new text based on the user query and the retrieved documents
The LLM uses an encoder-decoder model to generate the output.
2.1. Creating a knowledge base
Providing Large Language Models (LLMs) with missing knowledge involves using a vector database to store private data. Unlike traditional databases, vector databases are specialized in managing and searching embedded vectors, which store numerical representations of documents. This breakdown into numerical embeddings enables AI systems to comprehend and process the data more effectively.

Source: https://www.pinecone.io/learn/retrieval-augmented-generation/
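As a hedged sketch, creating such a vector index with the Pinecone client might look as follows; the index name, dimension (1536 matches OpenAI's text-embedding-3-small), and serverless region are assumptions.

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")  # assumed credentials

# Create an index whose dimension matches the embedding model's output size
pc.create_index(
    name="vector-db-name",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)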
2.2. From text to Embeddings
Before storing data in the database, it must be converted into vector embeddings. This involves:
- Chunking: Segmenting text at the sentence or paragraph level. This approach derives meaning from surrounding words and can include additional context such as document titles or adjacent text.
- Embedding Models: Once text is chunked, embedding models are used to convert it into vectors that can be efficiently stored and processed.
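A brief sketch of these two steps with LangChain utilities might look as follows; the file name, chunk sizes, and embedding model are assumptions, and import paths vary slightly between LangChain versions.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

text = open("movie_notes.txt").read()  # hypothetical source document

# Chunking: split the text into overlapping, semantically coherent pieces
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(text)

# Embedding: convert each chunk into a vector that can be stored in the database
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")
vectors = embeddings_model.embed_documents(chunks)  # one vector per chunk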

2.3. Retrieval and Search
Several methods can be utilized for searching within the database:
- Keyword Search: For traditional text queries.
- Semantic Search: Based on the semantic meaning of words.
- Vector Search: Converts documents from text to vector representations using embedding models, retrieving documents whose vectors closely match the user query.
- Hybrid Search: Combines both keyword and vector search capabilities.
The retriever scans the knowledge database to find embeddings that are close together—essentially identifying texts that are similar in context or content.
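Under the hood, "closeness" is usually measured with cosine similarity (or dot product / Euclidean distance). A minimal NumPy sketch with made-up vectors:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.1, 0.8, 0.3])        # toy query embedding
doc_vecs = [np.array([0.2, 0.7, 0.4]),       # toy document embeddings
            np.array([0.9, 0.1, 0.0])]

scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
best = int(np.argmax(scores))  # index of the most similar document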

It’s important to note that vectors can be created, ingested into the database, and the index updated in real-time, addressing the recency problem for LLMs in GenAI applications.
For instance, automated processes can be set up to generate vectors for new product launches, updating the index with each release. This mechanism allows a company's support chatbot to utilize RAG in accessing the latest product information and customer-specific data during interactions.
III. How it works - A RAG chatbot
1. Build a knowledge base
To create a knowledge base for chatbot use, documents must be stored as embeddings in a vector database. This requires an embedding model along with a vector database setup. For detailed instructions, refer to the articles on vector databases.

The process involves two main steps:
- Chunking: Content should be segmented based on structure to achieve semantically coherent pieces. Strategies include Fixed-size chunking and "Content-aware" chunking.
- Embedding and Upsertion: Create vector embeddings for each segment and upsert each chunk as an individual record into a namespace.
Once documents are stored as embeddings, retrieval is conducted by querying for vector representations closest to the user question. Sorting may be necessary to ensure the results are prioritized by relevance.
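As a hedged sketch of this upsert-and-query flow with the Pinecone client (the index name, namespace, and placeholder vectors are assumptions; the embedding step from section 2.2 would normally produce the vectors, and response shapes may differ slightly between client versions):

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")  # assumed credentials
index = pc.Index("vector-db-name")     # assumed existing index

chunk_vector = [0.01] * 1536           # placeholder embedding (normally from the embedding model)
chunk_text = "Inception (2010) is a science-fiction heist film."
question_vector = [0.01] * 1536        # placeholder embedding of the user question

# Upsert each chunk as an individual record into a namespace
index.upsert(
    vectors=[{"id": "doc1-chunk1", "values": chunk_vector, "metadata": {"text": chunk_text}}],
    namespace="movies",
)

# Query for the records whose vectors are closest to the (embedded) user question
results = index.query(vector=question_vector, top_k=5, include_metadata=True, namespace="movies")
for match in results.matches:  # matches come back ranked by similarity score
    print(match.score, match.metadata["text"])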

2. Use Pinecone and LangChain for RAG
Integrating LLM capabilities allows for generating responses grounded in specific data using tools like Pinecone and LangChain. Here’s an example of how to set up the system:
import streamlit as st
from langchain.chains import ConversationChain
from langchain.prompts import (ChatPromptTemplate, HumanMessagePromptTemplate,
                               MessagesPlaceholder, SystemMessagePromptTemplate)
from langchain_community.vectorstores import Pinecone as CommunityPinecone
from langchain_openai import ChatOpenAI, OpenAIEmbeddings  # import paths may differ slightly between LangChain versions


class Pipeline:
    def __init__(self):
        # Threshold and number of neighbours used when finding similar documents in the vector database
        self.score_threshold = 0.4
        self.k = 5
        # Initialize the chat model and the Pinecone index
        self.index_name = "vector-db-name"
        self.llm = ChatOpenAI(model_name="gpt-4o", temperature=0.9)
        self.vector_store = CommunityPinecone.from_existing_index(
            self.index_name,
            OpenAIEmbeddings(model="text-embedding-3-small"),
        )
        # Initialize session state variables (e.g. st.session_state.buffer_memory)
        self.initialize_session_state()
        # Initialize the conversation chain
        self.initialize_conversation_chain()

    ...

    # Retrieve documents similar to the user input
    def find_match(self, input_text):
        result = self.vector_store.similarity_search_with_score(input_text, k=self.k)
        matches = [doc.page_content for doc, score in result if score > self.score_threshold]
        return "\n".join(matches)

    ...

    def initialize_conversation_chain(self):
        self.system_msg_template = SystemMessagePromptTemplate.from_template(
            template="""
            As an experienced filmmaker ...
            """
        )
        self.human_msg_template = HumanMessagePromptTemplate.from_template(template="{input}")
        self.prompt_template = ChatPromptTemplate.from_messages([
            self.system_msg_template,
            MessagesPlaceholder(variable_name="history"),  # conversation history kept in memory
            self.human_msg_template,
        ])
        self.conversation = ConversationChain(
            memory=st.session_state.buffer_memory,
            prompt=self.prompt_template,
            llm=self.llm,
            verbose=True,
        )


pipeline = Pipeline()
...
# Retrieve relevant context and pass it to the LLM together with the user query
context = pipeline.find_match(query)
response_text = pipeline.conversation.predict(input=f"Context:\n{context}\n\nQuery:\n{query}")
For example, the pipeline retrieves movie-related data from a vector database and uses it to generate responses. As shown in the image below, when asked to generate movie ideas, it references the stored information from the vector database to compose its answer.

Reference
- Retrieval Augmented Generation (RAG)
- Generative AI for Beginners - Retrieval Augmented Generation (RAG) and Vector Databases
- Build a RAG chatbot
- Fixing Hallucination with Knowledge Bases
- Lecture 8 - Large Language Models, IN4080, Autumn 2024, University of Oslo
- Vector Database - Pinecone
- Pinecone GitHub: LangChain with Azure OpenAI - Retrieval Examples
- Pinecone GitHub: LangChain Multi-query