RAG lets LLMs answer questions using your own data. Here is how it works under the hood, and which architecture decisions matter.

The Problem RAG Solves

Large language models have two structural limitations:

Knowledge cutoff: They only know what was in their training data.
Hallucination: They confidently make up plausible-sounding facts when asked questions outside their knowledge.

Retrieval-Augmented Generation (RAG) addresses both by letting the model look things up before answering.

The Basic Architecture

1. Index your data

Split your documents into chunks (usually 200-1000 tokens). Embed each chunk into a vector using a sentence-transformer model. Store the vectors in a vector database.
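
The indexing step above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `toy_embed` is a hypothetical hashed bag-of-words function standing in for a real embedding model, words approximate tokens, and a plain Python list stands in for the vector database.

```python
import hashlib
import math
import re

DIM = 64  # toy embedding width; real embedding models use hundreds of dimensions


def toy_embed(text: str) -> list[float]:
    """Hashed bag-of-words vector, L2-normalised. Stand-in for a real embedding model."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def split_into_chunks(text: str, max_words: int = 50) -> list[str]:
    """Split on whitespace into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(docs: dict[str, str], max_words: int = 50) -> list[dict]:
    """Return one record per chunk; the list plays the role of the vector database."""
    index = []
    for doc_id, text in docs.items():
        for n, chunk in enumerate(split_into_chunks(text, max_words)):
            index.append({"doc_id": doc_id, "chunk_no": n, "text": chunk, "vector": toy_embed(chunk)})
    return index


docs = {
    "handbook": "Employees accrue vacation days monthly. " * 20,  # 100 words
    "faq": "Password resets are handled by the IT help desk.",
}
index = build_index(docs, max_words=50)
```

Keeping `doc_id` and `chunk_no` next to each vector is what later makes citations and metadata filtering possible.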

2. Query

When a user asks a question:

Embed the question into a vector
Find the top-K most similar chunks in the vector database (by cosine similarity or another distance metric)
Pass the retrieved chunks to the LLM along with the question
The LLM generates an answer grounded in the retrieved context
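
The query steps above, sketched under simplifying assumptions: `embed` is a hypothetical word-count embedding (a real system would call an embedding model), the chunks live in a list rather than a vector database, and the final LLM call is left out, so the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Score every chunk against the question and keep the k best."""
    q = embed(question)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(question: str, retrieved: list[tuple[float, str]]) -> str:
    """Put the retrieved chunks in front of the question."""
    context = "\n".join(f"- {text}" for _, text in retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


chunks = [
    "Vacation days accrue at 1.5 days per month.",
    "Password resets are handled by the IT help desk.",
    "The cafeteria is open from 8am to 3pm.",
]
question = "How do I reset my password?"
hits = top_k(question, chunks, k=2)
prompt = build_prompt(question, hits)
```

Note that the toy embedding only matches on the shared word "password"; a real embedding model would also connect "reset" with "resets" and with paraphrases that share no words at all.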

3. Citations

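Modern RAG systems also return citations, that is, the source chunks used to produce the answer. This helps against hallucination because the model is asked to reference retrieved material, and a reader can check each claim against its source. It reduces unsupported answers but does not rule them out.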
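
One way to wire up citations, as a rough sketch: number the retrieved chunks in the prompt, ask the model to cite them as `[n]`, then map the markers in the answer back to sources. The prompt wording, the `[n]` convention, and the hard-coded `answer` string are all assumptions made for illustration; no LLM is called here.

```python
import re


def build_cited_prompt(question: str, chunks: list[dict]) -> str:
    """Number each chunk so the model can refer to it as [1], [2], ..."""
    sources = "\n".join(f"[{i}] {c['text']}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using only the sources below. "
        "Cite the sources you use as [n].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )


def extract_citations(answer: str, chunks: list[dict]) -> list[dict]:
    """Map [n] markers in the answer back to chunks, ignoring out-of-range numbers."""
    cited = []
    # dict.fromkeys removes duplicate markers while keeping first-seen order.
    for n in dict.fromkeys(int(m) for m in re.findall(r"\[(\d+)\]", answer)):
        if 1 <= n <= len(chunks):
            cited.append(chunks[n - 1])
    return cited


chunks = [
    {"source": "handbook.pdf", "text": "Vacation days accrue at 1.5 days per month."},
    {"source": "it-faq.md", "text": "Password resets are handled by the IT help desk."},
]
prompt = build_cited_prompt("Who handles password resets?", chunks)

# Hypothetical model output, hard-coded because this sketch does not call an LLM.
answer = "The IT help desk handles password resets [2]. See also [7]."
citations = extract_citations(answer, chunks)
```

Dropping the out-of-range `[7]` matters: a model can invent a citation marker just as it can invent a fact, so markers should be validated against what was actually retrieved.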

Why RAG Is Popular

Cheap: the LLM doesn't need to be retrained on your data
Fast to iterate: update the vector database, not the model weights
Controllable: you choose what goes into retrieval
Auditable: answers can cite sources

Chunking: the Under-Appreciated Decision

Your chunking strategy makes or breaks RAG quality:

Fixed-size chunks (simple, common)

Split every N tokens (for example, 500). Easy to implement, but it ignores semantic boundaries, so a chunk can start or end mid-sentence.

Semantic chunks

Split at sentence or paragraph boundaries; keep chunks that discuss the same topic together.
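
A minimal sketch of boundary-aware chunking, assuming paragraphs are separated by blank lines and counting words instead of tokens. It only respects paragraph boundaries; it makes no attempt to detect whether neighbouring paragraphs share a topic, which would need an embedding or similar signal.

```python
def paragraph_chunks(text: str, max_words: int = 120) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words becomes its own oversized chunk
    rather than being cut in the middle.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


text = (
    "First topic sentence one. Sentence two.\n\n"
    "Still the first topic.\n\n"
    "A new topic starts here and runs a little longer than the others."
)
chunks = paragraph_chunks(text, max_words=12)
```

With a 12-word budget the first two paragraphs share a chunk and the third, which is over budget on its own, is kept whole in a second chunk.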

Hierarchical chunks

Embed at both the paragraph level and the document level; retrieve at the finer grain, but surface context from the coarser one.
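
The "retrieve fine, return coarse" idea can be illustrated with a parent lookup. This sketch is deliberately simplified: `overlap_score` is a hypothetical word-overlap measure standing in for vector similarity, and the parent document is fetched by id rather than embedded separately.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, text: str) -> float:
    """Jaccard word overlap; a placeholder for vector similarity."""
    q, t = words(query), words(text)
    return len(q & t) / len(q | t) if q | t else 0.0


documents = {
    "leave-policy": [
        "Vacation days accrue at 1.5 days per month.",
        "Unused vacation days expire at the end of March.",
    ],
    "it-faq": [
        "Password resets are handled by the IT help desk.",
        "Laptops are replaced every three years.",
    ],
}

# Fine-grained index: one entry per paragraph, each pointing at its parent document.
paragraph_index = [
    {"doc_id": doc_id, "para_no": i, "text": para}
    for doc_id, paras in documents.items()
    for i, para in enumerate(paras)
]


def retrieve_with_parent(query: str) -> dict:
    """Match at paragraph level, then return the whole parent document as context."""
    best = max(paragraph_index, key=lambda p: overlap_score(query, p["text"]))
    parent_text = " ".join(documents[best["doc_id"]])
    return {"match": best["text"], "doc_id": best["doc_id"], "context": parent_text}


result = retrieve_with_parent("When do unused vacation days expire?")
```

The matched paragraph pins down where the answer is, while the surrounding document gives the LLM the accrual rule it would otherwise not see.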

Chunk overlap

An overlap of 10-30 tokens between adjacent chunks reduces the chance that query-relevant information is split across a boundary.
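
A sketch of fixed-size chunking with overlap, assuming the text has already been tokenized into a list (the token strings below are made up). Setting `overlap=0` gives plain fixed-size chunks.

```python
def chunk_with_overlap(tokens: list[str], size: int = 500, overlap: int = 20) -> list[list[str]]:
    """Fixed-size windows; each chunk repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end, to avoid a trailing chunk
        # that contains nothing but already-covered overlap.
        if start + size >= len(tokens):
            break
    return chunks


tokens = [f"t{i}" for i in range(25)]
chunks = chunk_with_overlap(tokens, size=10, overlap=3)
```

Here the 25 tokens become four chunks, and the last three tokens of each chunk reappear as the first three of the next.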

Retrieval Strategies

Dense retrieval (semantic)

Vector similarity. Great for paraphrased queries.

Sparse retrieval (BM25, keyword)

Traditional text search. Great for exact-term queries.
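
For reference, a compact sketch of BM25 scoring. It uses the common defaults `k1=1.5` and `b=0.75` and the IDF variant with `log(1 + ...)` that keeps scores non-negative; the tokenizer is a simple regex chosen for the example, and real search engines add stemming, stop words, and an inverted index.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with the Okapi BM25 formula."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long documents need more occurrences to score the same.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "Error code E1234 means the disk is full.",
    "Error codes are listed in the appendix.",
    "The disk should be replaced every five years.",
]
scores = bm25_scores("E1234", docs)
```

An identifier like `E1234` is the kind of exact-term query where keyword scoring tends to be more dependable than embeddings: only the first document contains the term, so only it gets a non-zero score.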

Hybrid retrieval

Combine both. Typically best in practice.
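
The text does not say how to combine the two result lists. One widely used option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the only approach; `k=60` is the constant commonly used with RRF, and the chunk ids are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids; each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_ranking = ["c3", "c1", "c7"]   # hypothetical output of vector search
sparse_ranking = ["c1", "c9", "c3"]  # hypothetical output of BM25
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

RRF works on ranks rather than raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales. Chunks that both retrievers return (`c1`, `c3`) end up ahead of chunks only one of them found.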

Reranking

After initial retrieval, run a slower but more accurate cross-encoder to re-rank the top candidates. This often improves quality substantially, at the cost of added latency.
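
The shape of a reranking stage, as a sketch. The key difference from first-stage retrieval is that the scorer sees the query and the candidate together. `toy_pair_scorer` is a hypothetical placeholder; in a real system that function would wrap a cross-encoder model.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-score each (query, candidate) pair jointly and keep the best top_n."""
    ordered = sorted(candidates, key=lambda text: score_pair(query, text), reverse=True)
    return ordered[:top_n]


def toy_pair_scorer(query: str, text: str) -> float:
    """Placeholder for a cross-encoder: fraction of query words found in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


# Hypothetical first-stage results, in the order the retriever returned them.
candidates = [
    "Refunds for annual plans are prorated.",
    "Monthly plans can be cancelled at any time.",
    "Annual plans renew automatically each year.",
]
best = rerank("are annual plans prorated", candidates, toy_pair_scorer, top_n=2)
```

Because the scorer runs once per candidate, the latency cost grows with the number of candidates, which is why reranking is usually applied to a short list rather than the whole index.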

Vector Database Options

Pinecone — managed, fast, reliable, expensive at scale
Qdrant — open source, self-hostable, good performance
Chroma — dev-friendly, embeddable
Weaviate — broad feature set
pgvector — Postgres extension; simple if you already run Postgres
Elasticsearch / OpenSearch — supports both dense and sparse; classic choice

Common Pitfalls

Chunks too big: LLM overwhelmed, less-relevant content dilutes the answer
Chunks too small: lose context, answers feel fragmented
No reranking: top-K dense retrieval alone can surface chunks that are topically adjacent but wrong, and the answer then cites them
Poor embedding model: mismatched domain (e.g. using a general embedding model for legal or medical text)
Ignoring metadata: filtering by date, source, or category before vector search can substantially improve quality
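
The metadata point in the list above can be sketched as a pre-filter that runs before any similarity scoring. The field names `source` and `updated` are assumptions for this example; most vector databases offer their own filter syntax for the same idea.

```python
from datetime import date
from typing import Optional


def filter_chunks(
    chunks: list[dict],
    source: Optional[str] = None,
    not_before: Optional[date] = None,
) -> list[dict]:
    """Keep only chunks whose metadata matches; vector search then runs on this subset."""
    kept = []
    for chunk in chunks:
        if source is not None and chunk["source"] != source:
            continue
        if not_before is not None and chunk["updated"] < not_before:
            continue
        kept.append(chunk)
    return kept


chunks = [
    {"text": "2021 pricing: 10 dollars per seat.", "source": "pricing", "updated": date(2021, 1, 5)},
    {"text": "2024 pricing: 14 dollars per seat.", "source": "pricing", "updated": date(2024, 2, 1)},
    {"text": "Seats can be added from the admin page.", "source": "help", "updated": date(2024, 3, 9)},
]
candidates = filter_chunks(chunks, source="pricing", not_before=date(2023, 1, 1))
```

The two pricing chunks are near-duplicates as far as an embedding is concerned, so similarity alone cannot tell the current one from the stale one. The date filter can.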

When to Use RAG vs Fine-Tuning vs Prompting

Prompting only

Small amount of context
General-purpose use
Fast iteration

RAG

Large corpus
Frequently updated data
Need citations
Cost-sensitive

Fine-tuning

Style/format adaptation (e.g. tone, structure)
Domain-specific reasoning
Very high volume (amortized cost pays off)

Many production systems combine RAG + light fine-tuning.

Evaluating RAG Quality

Retrieval recall: did the correct chunks make it to the model?
Answer faithfulness: did the answer stick to the retrieved context?
Answer completeness: did it address all parts of the question?

Use frameworks like RAGAS or build your own eval set. Don't ship without evals.
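
If you build your own eval set, retrieval recall is the easiest metric to start with. A minimal sketch, assuming each eval item records which chunk ids a human marked as relevant and which ids the retriever returned, in rank order (the ids below are made up):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear among the first k retrieved."""
    if not relevant_ids:
        return 0.0
    hits = relevant_ids & set(retrieved_ids[:k])
    return len(hits) / len(relevant_ids)


# Hypothetical eval set: human-labelled relevant ids plus the retriever's ranked output.
eval_set = [
    {"relevant": {"c1", "c4"}, "retrieved": ["c1", "c9", "c4", "c2"]},
    {"relevant": {"c7"}, "retrieved": ["c3", "c5", "c7", "c8"]},
]


def mean_recall(k: int) -> float:
    return sum(recall_at_k(e["retrieved"], e["relevant"], k) for e in eval_set) / len(eval_set)


mean_recall_at_2 = mean_recall(2)
mean_recall_at_3 = mean_recall(3)
```

Comparing recall at different values of K shows whether the right chunks are missing entirely or just ranked too low, which in turn suggests whether to work on the retriever or add a reranker. Faithfulness and completeness need a judgment on the generated answer and are not covered by this sketch.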

Key Takeaways

RAG = retrieve relevant chunks + generate answer grounded in them
Chunking strategy matters more than most teams realize
Hybrid retrieval + reranking typically beats dense-only
Build evals from day one
Use RAG when data is large / updated; fine-tune for style or domain reasoning


What Is RAG (Retrieval-Augmented Generation)? A Practical Primer