The Problem RAG Solves
Large language models have two structural limitations:
- Knowledge cutoff: They only know what was in their training data.
- Hallucination: They confidently make up plausible-sounding facts when asked questions outside their knowledge.
Retrieval-Augmented Generation (RAG) mitigates both by retrieving relevant documents at query time and giving them to the model as context before it answers.
The Basic Architecture
1. Index your data
Split your documents into chunks (usually 200-1000 tokens). Embed each chunk into a vector using a sentence-transformer model. Store the vectors in a vector database.
2. Query
When a user asks a question:
- Embed the question into a vector
- Find the top-K most similar chunks in the vector database (by cosine similarity or a comparable metric)
- Pass the retrieved chunks to the LLM along with the question
- Have the LLM generate an answer grounded in the retrieved context
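The index and query steps above can be sketched in a few lines. This is a minimal illustration, not a production setup: `embed` here is a toy term-frequency vector standing in for a real embedding model, and the "vector database" is just a Python list. The function names are made up for this example.

```python
import math
import re
from collections import Counter


def embed(text):
    """ASSUMPTION: toy stand-in for an embedding model.

    Returns a unit-length term-frequency vector as a dict {token: weight}.
    A real system would call a sentence-transformer or an embedding API here.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(weight * b.get(tok, 0.0) for tok, weight in a.items())


def build_index(chunks):
    # Step 1: embed every chunk once and store (chunk, vector) pairs.
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=3):
    # Step 2: embed the question and return the k most similar chunks.
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

The chunks returned by `retrieve` are what gets passed to the LLM together with the question. Swapping the toy `embed` for a real model leaves the rest of the flow unchanged.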
3. Citations
Modern RAG systems also return citations: the source chunks used to produce the answer. This is how RAG reduces hallucination: the model is instructed to reference retrieved material, and readers can check each claim against its source.
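One way to wire this up, as a sketch: number the retrieved chunks in the prompt, ask the model to cite them as `[n]`, then map the markers in the answer back to chunks. The prompt wording and the `[n]` convention are assumptions for illustration, not a standard.

```python
import re


def build_cited_prompt(question, chunks):
    # Number each retrieved chunk so the model can refer to it as [1], [2], ...
    sources = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the sources below. "
        "Cite sources as [n] after each claim. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )


def extract_citations(answer, chunks):
    """Map [n] markers in the model's answer back to the chunks they refer to.

    Out-of-range numbers are ignored; they indicate a citation the model made up.
    """
    cited = []
    for match in re.findall(r"\[(\d+)\]", answer):
        idx = int(match)
        if 1 <= idx <= len(chunks) and chunks[idx - 1] not in cited:
            cited.append(chunks[idx - 1])
    return cited
```

An answer with no valid citations is a useful signal: it may not be grounded in the retrieved context at all.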
Why RAG Is Popular
- Cheap: the LLM doesn't need to be retrained on your data
- Fast to iterate: update the vector database, not the model weights
- Controllable: you choose what goes into retrieval
- Auditable: answers can cite sources
Chunking: the Under-Appreciated Decision
Your chunking strategy makes or breaks RAG quality:
Fixed-size chunks (simple, common)
Split every N tokens (for example, 500). Easy to implement, but splits can land mid-sentence or mid-idea, so semantic boundaries are lost.
Semantic chunks
Split at sentence or paragraph boundaries; keep chunks that discuss the same topic together.
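A simple approximation, sketched below, is to split on paragraph boundaries and pack whole paragraphs into a chunk until a token budget is reached. It respects boundaries but does not detect topics. It assumes paragraphs are separated by blank lines and uses whitespace-separated words as a rough token count.

```python
import re


def chunk_by_paragraph(text, max_tokens=500):
    """Pack whole paragraphs into chunks of at most max_tokens (approximate).

    ASSUMPTION: tokens are approximated by whitespace-separated words.
    A single paragraph longer than max_tokens becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```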
Hierarchical chunks
Embed at both the paragraph level and the document level; retrieve at the finer grain, then surface context from the coarser one.
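A sketch of the idea: index paragraphs as children that remember their parent document, match at the paragraph level, and return the parent text as context. Word overlap stands in for vector similarity here to keep the example self-contained; the function names and data layout are illustrative assumptions.

```python
import re


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_children(documents):
    """documents: dict of doc_id -> full text. Returns (doc_id, paragraph) pairs."""
    children = []
    for doc_id, text in documents.items():
        for para in text.split("\n\n"):
            if para.strip():
                children.append((doc_id, para.strip()))
    return children


def retrieve_with_parent(children, documents, question, k=1):
    # ASSUMPTION: word overlap is a placeholder for embedding similarity.
    q = tokens(question)
    ranked = sorted(children, key=lambda c: len(q & tokens(c[1])), reverse=True)
    # Match at paragraph level, but hand back the parent document as context.
    return [
        {"match": para, "parent_id": doc_id, "parent_text": documents[doc_id]}
        for doc_id, para in ranked[:k]
    ]
```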
Chunk overlap
A 10-30 token overlap between adjacent chunks reduces the chance that query-relevant information is split across a boundary.
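Fixed-size chunking with overlap is a sliding window over the token list. A minimal sketch, assuming the text has already been tokenized (a plain `text.split()` is a rough substitute for a real tokenizer):

```python
def chunk_fixed(tokens, size=500, overlap=20):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window reaches the end, to avoid a trailing sub-chunk
        # that only repeats tokens already covered.
        if start + size >= len(tokens):
            break
    return chunks
```

With `size=4` and `overlap=1`, ten tokens become three chunks, each sharing one token with its neighbor.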
Retrieval Strategies
Dense retrieval (semantic)
Vector similarity. Great for paraphrased queries.
Sparse retrieval (BM25, keyword)
Traditional text search. Great for exact-term queries.
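For intuition, here is a compact BM25 scorer. It is a sketch for illustration: the tokenizer is a simple regex, `k1=1.5` and `b=0.75` are conventional defaults rather than tuned values, and the IDF uses a common non-negative variant. In practice you would rely on a search engine or library instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

A query for an identifier such as an error code scores only the documents that contain that exact token, which is where keyword search tends to beat dense retrieval.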
Hybrid retrieval
Combine both. Typically best in practice.
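One common way to combine the two result lists is reciprocal rank fusion (RRF), which needs only ranks, not comparable scores. The sketch below assumes each retriever returns an ordered list of document ids; `k=60` is a commonly used default constant. Weighted score blending is another option and is not shown here.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

For example, fusing a dense ranking `["a", "b", "c"]` with a sparse ranking `["b", "d", "a"]` puts `"b"` first, because it ranks highly in both lists.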
Reranking
After initial retrieval, run a slower but more accurate cross-encoder to re-rank top candidates. Substantially improves quality at the cost of latency.
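The two-stage pattern looks roughly like this. The scorer is passed in as a function that takes `(question, candidate)` pairs and returns one score per pair; with the sentence-transformers library, a `CrossEncoder` model's `predict` method should fit that shape. `toy_overlap_scorer` is a hypothetical placeholder so the sketch runs without a model.

```python
def rerank(question, candidates, score_pairs, top_n=3):
    """Re-order first-stage candidates using a slower, more accurate scorer."""
    if not candidates:
        return []
    scores = score_pairs([(question, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in ranked[:top_n]]


def toy_overlap_scorer(pairs):
    # ASSUMPTION: placeholder for a cross-encoder. Scores each pair by the
    # fraction of question words that appear in the candidate.
    scores = []
    for question, candidate in pairs:
        q_words = set(question.lower().split())
        c_words = set(candidate.lower().split())
        scores.append(len(q_words & c_words) / len(q_words) if q_words else 0.0)
    return scores
```

A typical setup retrieves a generous candidate set (say, the top 50) cheaply and reranks down to the handful that go into the prompt, which bounds the latency cost.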
Vector Database Options
- Pinecone — managed, fast, reliable, expensive at scale
- Qdrant — open source, self-hostable, good performance
- Chroma — dev-friendly, embeddable
- Weaviate — broad feature set
- pgvector — Postgres extension; simple if you already run Postgres
- Elasticsearch / OpenSearch — supports both dense and sparse; classic choice
Common Pitfalls
- Chunks too big: LLM overwhelmed, less-relevant content dilutes the answer
- Chunks too small: lose context, answers feel fragmented
- No reranking: top-K dense retrieval alone can surface chunks that are topically adjacent but wrong, and the answer ends up citing them
- Poor embedding model: mismatched domain (e.g. using a general embedding model for legal or medical text)
- Ignoring metadata: filtering by date, source, or category before vector search keeps out-of-scope chunks out of the candidate set and can substantially improve quality
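The metadata point can be sketched as a pre-filter in front of the similarity search. The record layout (`source`, `published`, `vector`) is an assumption for illustration; most vector databases expose an equivalent filter option on their query call.

```python
import math
from datetime import date


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filtered_search(records, query_vec, k=3, source=None, published_after=None):
    """Apply metadata filters first, then rank the survivors by similarity."""
    candidates = [
        r for r in records
        if (source is None or r["source"] == source)
        and (published_after is None or r["published"] >= published_after)
    ]
    candidates.sort(key=lambda r: cosine(r["vector"], query_vec), reverse=True)
    return candidates[:k]
```

Without the filter, a stale or off-source chunk with a near-identical embedding can outrank the one you actually want.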
When to Use RAG vs Fine-Tuning vs Prompting
Prompting only
- Small amount of context
- General-purpose use
- Fast iteration
RAG
- Large corpus
- Frequently updated data
- Need citations
- Cost-sensitive
Fine-tuning
- Style/format adaptation (e.g. tone, structure)
- Domain-specific reasoning
- Very high volume (amortized cost pays off)
Many production systems combine RAG with light fine-tuning.
Evaluating RAG Quality
- Retrieval recall: did the correct chunks make it to the model?
- Answer faithfulness: did the answer stick to the retrieved context?
- Answer completeness: did it address all parts of the question?
Use frameworks like RAGAS or build your own eval set. Don't ship without evals.
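If you build your own eval set, retrieval recall is the easiest metric to start with. A minimal sketch, assuming each eval item pairs a question with the ids of the chunks that should be retrieved, and `retrieve` is whatever function returns ranked chunk ids in your system:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(eval_set, retrieve, k=5):
    """eval_set: list of (question, relevant_ids); retrieve: question -> ranked ids."""
    if not eval_set:
        return 0.0
    total = sum(recall_at_k(retrieve(q), relevant, k) for q, relevant in eval_set)
    return total / len(eval_set)
```

Faithfulness and completeness are harder to score mechanically and usually need human review or an LLM-based judge.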
Key Takeaways
- RAG = retrieve relevant chunks + generate answer grounded in them
- Chunking strategy matters more than most teams realize
- Hybrid retrieval + reranking beats dense-only
- Build evals from day one
- Use RAG when data is large or frequently updated; fine-tune for style or domain reasoning