Mona the MongoDB leaf

Baby RAG

A Gentle Introduction to Retrieval Augmented Generation

It's been a couple of years since humanity met LLMs, and today, even my mom talks to ChatGPT. Kids on TikTok ask their parents to ask "tititi" for answers.

And you may know that LLMs are great at generating answers, but not so great at making sure those answers are accurate. This happens for multiple reasons, but one is that they simply might not have the information in their training data.

That's where our work as developers starts: giving LLMs the information they need to generate better answers. This process is commonly known as RAG (Retrieval Augmented Generation), and it is what I've seen 90% of companies trying to implement, with varying levels of sophistication.

In this short article, I'll provide an introduction to the main concepts to help you understand how to make the most out of LLMs and your own data.

The Setting

I've been collecting my notes on my research about AI in this repository: AI-Study-Group. It contains books, videos, tutorials, papers, and more.

You could go to ChatGPT and ask him/her/it for recommendations. Some books might have been part of its training data, and it might try to go to the internet and fetch some articles. However, it doesn't know my thoughts on them. So in this case, I want to give an LLM the power to answer questions based on my data: the books I've read, my notes, etc.

This is my assistant answering questions about materials for learning AI, based on my knowledge:

A RAG-powered assistant using my curated AI book collection

In your case, you could use the LLM to answer questions about your business:

In those examples, it is very likely that the public ChatGPT, Gemini, Claude, or even the most powerful LLM in the world won't be able to answer, because that data was not part of their training and is not available on the internet. Even worse, the LLM will probably generate a wrong answer anyway.

The Building Blocks

Ok, we understand the problem, now give me the answer!

And to be honest, the solution is pretty simple (simple to start with, difficult to get right). We just need a few things:

  1. Our data source: database, documents, expert knowledge
  2. A chunking strategy
  3. An embedding model
  4. A vector search engine
  5. An LLM

I hope you're still with me: the items above are fancy words for very straightforward concepts. We'll cover them as we progress.

Mona looking confused

Hands On

Note: There are many ways of implementing the following. Use this as a guide, not as a collection of best practices.

First, we start by sourcing our data. We know the LLM has a lot of knowledge and can potentially search the internet for more. But we still need to provide our own domain-specific data.

In our case, we want it to have our resources. This can come in multiple formats: PDFs, spreadsheets, images, audio, video, you name it.

We need to store them (in case you don't already have them in a database); this is usually called ingestion.

RAW data → Database

For this use case, I've found non-relational databases to be a better fit. MongoDB's document model is flexible, so it allows you to iterate quickly and find the right structure for your data before committing to one. (PS: I work at MongoDB.)

I won't get into the details on how to ingest the data into MongoDB, but it is pretty easy.

Let's say my first item is a book; it has a title, the author, my review of it, and a link.

In MongoDB it would be just something like:

resources.insert_one(book)

Where resources is the collection, and book is a document.
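
Just to make it concrete, here's a minimal sketch of the ingestion, assuming PyMongo and a local MongoDB instance (the connection string and database name are placeholders):

from pymongo import MongoClient

# Connect to MongoDB (replace with your own connection string)
mongo_client = MongoClient("mongodb://localhost:27017")
resources = mongo_client["ai_study_group"]["resources"]

# A book is just a Python dictionary
book = {
    "title": "AI Engineering: Building Applications with Foundation Models",
    "author": "Chip Huyen",
    "review": "If you feel lost and don't know where to start, ...",
    "link": "https://www.oreilly.com/library/view/ai-engineering/9781098166298/"
}

resources.insert_one(book)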

There might be cases where I have an item that looks a bit different, for example a book I'm still reading (a hypothetical example, just for illustration):
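
# Hypothetical: different fields, and no review yet
book = {
    "title": "Prompt Engineering for LLMs",
    "author": "...",
    "status": "currently reading",
    "topics": ["prompting", "LLM applications"],
    "link": "..."
}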

The structure is different from my first book's! And that's why I love MongoDB: as I'm building, I don't need to worry about tables and schemas. This matters because there's no magic structure that works best for all use cases; you have to find the right one for your use case, and the ability to iterate is key.

That being said, the code to store the new book stays the same:

resources.insert_one(book)

Ok, now let's pretend I've added all the books in my repository. My database would look something like this:

{
  "_id": "67f4a74759d7b45f2e180317",
  "title": "AI Engineering: Building Applications with Foundation Models",
  "author": "Chip Huyen",
  "review": "If you feel lost and don't know where to start, ...",
  "link": "https://www.oreilly.com/library/view/ai-engineering/9781098166298/"
}

Great, now we have our data in a database. And if your data was already in a database, then even better.

The next step is to allow the LLM to discover this new data.

If you have experience with databases, you know you can query your database to get results.

"Find the top 5 books that talk about language models".

We can perform a query to get books that contain "language models" in their description, and limit them to five results.

But what if the description I wrote doesn't include "language models"? The text search might not return any results, even if we have books that talk about related topics.
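
To see the limitation, here's what a naive keyword search could look like with the collection from before (just a sketch; a real setup would more likely use a proper text index):

# Naive keyword search: only matches reviews that literally contain the phrase
results = resources.find(
    {"review": {"$regex": "language models", "$options": "i"}}
).limit(5)

for book in results:
    print(book["title"])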

Embeddings & Vector Search

For this, we need semantic search. Another fancy term to say we want to search for "similar meaning".

A description like:

"This book covers in detail how models like GPT, and Sonnet work".

Is related to my query, even if we don't explicitly say that GPT and Sonnet are language models. But our "AI" is smart enough to understand that meaning.

But how?

Using an embedding model. Another fancy word to say: we take text, and convert it into a vector of numbers that we can then use to search for similarities. Popular options include OpenAI's text-embedding-3-small, VoyageAI, or open-source alternatives like sentence-transformers.

Text: "This book covers in detail how models like GPT, and Sonnet work"

embedded_text = embedding_function(text)

# Result: a vector of 1536 dimensions
embedded_text = [0.023, -0.041, 0.152, -0.089, 0.031, ...]

The embedded vector is not human readable, but we can use it to compute "distances", just as we would with any other vector. Want to see what this looks like? Check this visualization of different embeddings.
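
To build some intuition, here's a tiny sketch that embeds two sentences and computes their cosine similarity. It assumes the open-source sentence-transformers library and the all-MiniLM-L6-v2 model, but any embedding model works the same way:

from sentence_transformers import SentenceTransformer
import numpy as np

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

a = embedding_model.encode("This book covers in detail how models like GPT and Sonnet work")
b = embedding_model.encode("Find the top 5 books that talk about language models")

# Cosine similarity: the closer to 1.0, the closer the meaning
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(similarity)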

The embedding vector can be stored next to each book. Since MongoDB is flexible, I can just append it to the book document:

# Generate the embedding using your model of choice
embedding = embedding_model.encode(book['review'])

# Store it alongside your document
resources.update_one(
    {'_id': book['_id']},
    {'$set': {'embedding': embedding}}
)

In MongoDB, support for storing and searching embedding vectors is built into the database. If you are using a different database, check its support for storing vectors.

Chunking

Embedding models have a limit on the input they can take, and some providers don't recommend feeding them huuuge strings because precision and accuracy may suffer. So the general idea is to "chunk" our strings.

And chunking goes beyond our example. If you have large documents, you need to chunk them. Maybe per page? Per paragraph? Per chapter? It really depends on your use case.

Mona sweating nervously

But it is important, because the size of your chunks and the strategy you use to build them will impact the results.

Some examples: fixed-size chunks (a set number of characters or tokens), structural chunks (per paragraph, section, or chapter), or semantic chunking that tries to keep related sentences together.

There are various techniques, and depending on how you approach this, your solution might work better or worse, so just try things and find what works.

VoyageAI released a document-aware embedding model that takes the surrounding document into account when embedding each chunk. Read more about it here.

There's also a concept called chunk overlap, where contiguous chunks share a portion of text so that context isn't lost at the boundaries. The overlap window will of course yield some duplicated data, so be aware of this.
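
Here's a minimal sketch of fixed-size chunking with overlap (the sizes are arbitrary placeholders; real pipelines often chunk by tokens or by document structure instead):

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks that share `overlap` characters."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Each chunk is then embedded and stored individually
chunks = chunk_text(book["review"])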

Performing the Search

Now, we need to perform that search. And we know our database has the vector with the "meaning" of the text. But what do we compare it against?

The user prompt!

user_question = "Recommend me a book that teaches the very basics of how ChatGPT or similar works"

# We generate an embedding on that string
query_embedding = embedding_function(user_question)

Now, as you can imagine, we will compute distances between the vector we got from the question and the ones stored in our database.

Luckily, we don't need to do it manually. MongoDB has built-in support for vector search. Again, your database might have a plugin to perform something similar.

In a nutshell, in MongoDB we ask it to bring back the top X results that are closest to our query:

pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",       # Name of your vector search index
            "path": "embedding",           # Field containing the embeddings
            "queryVector": query_embedding,
            "numCandidates": 100,          # Number of candidates to consider (higher = more accurate)
            "limit": top_k                 # Number of results to return
        }
    },
    {
        "$project": {
            "title": 1, "author": 1, "review": 1, "link": 1,
            "score": {"$meta": "vectorSearchScore"}
        }
    }
]

You need to define a vector search index first; more info here.
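
As a sketch, the index can also be created from Python, assuming a recent PyMongo (4.7+) and MongoDB Atlas; the index name and dimensions must match the pipeline above and your embedding model:

from pymongo.operations import SearchIndexModel

index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",      # field that stores the embeddings
                "numDimensions": 1536,    # must match your embedding model
                "similarity": "cosine"
            }
        ]
    },
    name="vector_index",
    type="vectorSearch",
)

resources.create_search_index(model=index_model)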

The query will return the documents that best match the criteria, along with a similarity score.

Now, we can use that information to feed the LLM with context to generate a response.
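
Putting it together, here's a small sketch of running the pipeline and turning the results into a plain-text context for the prompt (the formatting is arbitrary, and top_k from the pipeline is assumed to be something small like 5):

results = list(resources.aggregate(pipeline))

# Turn the retrieved documents into a plain-text context string
context = "\n\n".join(
    f"Title: {doc['title']} by {doc['author']}\nReview: {doc['review']}\nLink: {doc['link']}"
    for doc in results
)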

Prompt Engineering

If you have talked to ChatGPT, Gemini, Claude et al., you've been doing prompting. Giving instructions in natural language for the model to perform a task. But the LLM is likely receiving other instructions before and after yours.

For example: "What's the capital of Ukraine?"

[Prompt by provider]
[Your prompt, your question about Ukraine]
[Prompt by provider]

Those prompts at the beginning and at the end are invisible to you, and are likely warm-up instructions and guardrails that make the assistant more effective. There's more to it, but let's keep it simple.

There's a lot to say about prompting: Effective context engineering for AI agents by Anthropic, or the Prompt Engineering for LLMs book. It is important to find the right balance between the context you feed in, the prompt, the model used, etc. But for now, let's stick to the basics.

To feed the context to an LLM (GPT in our example), we could do something like:

prompt = f"""You are an AI book recommendation assistant specializing in AI and machine learning books.

User Query: {query}

Based on the following relevant books from our database:

{context}

Please provide a helpful recommendation response that:
1. Addresses the user's specific query
2. Recommends the most suitable books from the list above
3. Explains why each book is relevant to their needs
4. Provides a brief summary of what they can expect from each recommendation
5. Suggests a reading order if applicable

Keep your response conversational and helpful."""

Here, context holds the information we got from the database, with all the info we care about: title, author, notes, etc.

Finally, we simply call the LLM service:

# Assuming an OpenAI client was created beforehand, e.g.:
# from openai import OpenAI
# client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful AI book recommendation assistant."},
        {"role": "user", "content": prompt}
    ],
    max_tokens=800,
    temperature=0.7
)

ai_response = response.choices[0].message.content

return ai_response  # assuming this code lives inside a helper function

We are now ready to receive the response from the LLM. Usually, you call an API either in the cloud or locally (using something like Ollama).
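
If you want to run it locally, one option (my assumption, not something this example depends on) is Ollama's OpenAI-compatible endpoint, so the call above barely changes:

from openai import OpenAI

# Point the same client at a local Ollama server instead of OpenAI's API
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1",   # any chat model you've pulled locally
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)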

With the result, we can provide the response to the user. And continue the conversation.

Wrap Up

This is it: the core of what most businesses are building on top of LLMs. Keep in mind that to make this truly useful for you, your users, and your business, there's still a long way to go. The steps above are not enough for a production-ready system on their own; there's more to learn.

Advanced Concepts to Explore

Want to learn more about AI? Take a look at this repository.

Thanks for reading.