
What Is RAG (Retrieval-Augmented Generation)? A Practical Primer

RAG lets LLMs answer questions using your own data. Here is how it works under the hood and the architecture decisions that matter.

Catalayer · 2026-04-19 · 3 min read

The Problem RAG Solves

Large language models have two structural limitations:

  1. Knowledge cutoff: They know only what was in their training data, so newer or private information is invisible to them.
  2. Hallucination: They confidently make up plausible-sounding facts when asked questions outside their knowledge.

Retrieval-Augmented Generation (RAG) addresses both by letting the model look things up before answering.

The Basic Architecture

1. Index your data

Split your documents into chunks (usually 200-1000 tokens). Embed each chunk into a vector using a sentence-transformer model. Store the vectors in a vector database.

2. Query

When a user asks a question:

  • Embed the question into a vector
  • Find the top-K most similar chunks in the vector database (cosine similarity, etc.)
  • Pass the retrieved chunks to the LLM along with the question
  • Have the LLM generate an answer grounded in the retrieved context
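The index and query steps above can be sketched end to end in a few lines. This is a toy illustration, not a production design: `embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, the "vector database" is a plain Python list, and `build_prompt` is an assumed prompt layout.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the retrieved chunks and the question into one prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"

chunks = [
    "refunds are issued within 14 days",
    "our office is in berlin",
    "shipping takes 3 days",
]
question = "how many days until refunds are issued"
prompt = build_prompt(question, top_k(question, chunks, k=1))
```

In a real system the prompt would then be sent to the LLM; swapping `embed` for a real model and the list for a vector database leaves the overall flow unchanged.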

3. Citations

Modern RAG systems also return citations, that is, the source chunks used to produce the answer. This is how RAG combats hallucination: the model is asked to ground its answer in retrieved material, and readers can check each claim against its source.

This architecture has several practical advantages:

  • Cheap: the LLM doesn't need to be retrained on your data
  • Fast to iterate: update the vector database, not the model weights
  • Controllable: you choose what goes into retrieval
  • Auditable: answers can cite sources
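One way to make answers citable, sketched under assumptions: number each retrieved chunk in the prompt, ask the model to cite as `[n]`, then map those markers back to sources. The prompt wording, the `source`/`text` dictionary keys, and both function names are hypothetical choices for illustration.

```python
import re

def build_cited_prompt(question: str, chunks: list[dict]) -> str:
    """Number each retrieved chunk so the model can cite it as [n]."""
    sources = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the sources below. Cite sources as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )

def cited_sources(answer: str, chunks: list[dict]) -> list[str]:
    """Map [n] markers in the model's answer back to source identifiers."""
    ids = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    # Ignore markers that do not correspond to a chunk we actually supplied.
    return [chunks[i - 1]["source"] for i in sorted(ids) if 1 <= i <= len(chunks)]
```

Dropping out-of-range markers is a cheap guard: a citation to a source that was never supplied is a sign the answer drifted from the retrieved context.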

Chunking: the Under-Appreciated Decision

Your chunking strategy makes or breaks RAG quality:

Fixed-size chunks (simple, common)

Split every N tokens (for example, 500). Easy to implement, but the splits ignore semantic boundaries, so a sentence or idea can be cut in half.

Semantic chunks

Split at sentence or paragraph boundaries; keep chunks that discuss the same topic together.
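A minimal sketch of boundary-aware chunking, assuming paragraphs are separated by blank lines and using whitespace-separated words as a rough proxy for tokens. It only respects paragraph boundaries; grouping by topic would need an additional similarity check that is not shown here. The function name and `max_tokens` default are illustrative.

```python
def chunk_by_paragraph(text: str, max_tokens: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks without exceeding max_tokens.

    A single paragraph longer than max_tokens is kept intact as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for p in paragraphs:
        n = len(p.split())  # word count as a rough token estimate
        if current and current_len + n > max_tokens:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(p)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```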

Hierarchical chunks

Embed at both the paragraph level and the document level; retrieve at the finer grain but surface context from the coarser one.
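The "retrieve fine, surface coarse" idea can be sketched as follows. This is a simplification under stated assumptions: only the paragraph level is scored, `overlap` is a toy word-overlap score standing in for embedding similarity, and the parent document is looked up by a `doc_id` kept alongside each paragraph. All names are hypothetical.

```python
def build_index(docs: dict[str, str]) -> list[tuple[str, str]]:
    """Return (doc_id, paragraph) pairs: the fine-grained retrieval units."""
    return [
        (doc_id, p.strip())
        for doc_id, text in docs.items()
        for p in text.split("\n\n")
        if p.strip()
    ]

def overlap(a: str, b: str) -> int:
    """Toy similarity: number of distinct words the two strings share."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve_with_parent(question: str, docs: dict[str, str]) -> dict:
    """Match at paragraph level, then return the parent document as context."""
    index = build_index(docs)
    doc_id, paragraph = max(index, key=lambda item: overlap(question, item[1]))
    return {"match": paragraph, "doc_id": doc_id, "context": docs[doc_id]}
```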

Chunk overlap

An overlap of 10-30 tokens between adjacent chunks reduces the risk that query-relevant information is split across a boundary.
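Fixed-size chunking with overlap can be sketched as a sliding window. The function below assumes the text has already been tokenized into a list; the name `chunk_fixed` and its defaults are illustrative, not taken from any particular library.

```python
def chunk_fixed(tokens: list[str], size: int = 500, overlap: int = 20) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks
```

With `size=4` and `overlap=1`, a ten-token input yields three chunks, and the last token of each chunk reappears as the first token of the next.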

Retrieval Strategies

Dense retrieval (semantic)

Vector similarity. Great for paraphrased queries.

Sparse retrieval (BM25, keyword)

Traditional text search. Great for exact-term queries.
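For intuition, here is a compact sketch of BM25 scoring. It assumes whitespace tokenization, the commonly used defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))`; search engines differ in these details, so treat the numbers as illustrative.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A query for a rare exact term such as an error code scores only the documents that contain that literal token, which is the case where keyword search tends to outperform dense retrieval.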

Hybrid retrieval

Combine both. Typically best in practice.
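The article does not say how to combine the two result lists. One widely used option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the method the author has in mind. Each list contributes `1 / (k + rank)` per document, with `k=60` as a conventional default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense_hits = ["a", "b", "c"]   # hypothetical ids from vector search
sparse_hits = ["b", "d", "a"]  # hypothetical ids from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF uses only ranks, so it avoids having to put cosine similarities and BM25 scores on a common scale.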

Reranking

After initial retrieval, run a slower but more accurate cross-encoder to rerank the top candidates. This often improves quality substantially, at the cost of added latency.
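The two-stage pattern can be sketched with a pluggable scorer. In practice `score_pair` would wrap a cross-encoder model; here `toy_score` is a hypothetical placeholder based on word overlap so the example runs without any model.

```python
from typing import Callable

def rerank(
    question: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-score each (question, candidate) pair and keep the best top_n."""
    scored = [(score_pair(question, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

def toy_score(question: str, chunk: str) -> float:
    """Placeholder for a cross-encoder: fraction of question words found in the chunk."""
    q = set(question.lower().split())
    return len(q & set(chunk.lower().split())) / len(q) if q else 0.0
```

Because the scorer runs once per candidate, latency grows with the number of candidates passed in, which is why reranking is usually applied to a short list from the first-stage retriever.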

Vector Database Options

  • Pinecone — managed, fast, reliable, expensive at scale
  • Qdrant — open source, self-hostable, good performance
  • Chroma — dev-friendly, embeddable
  • Weaviate — broad feature set
  • pgvector — Postgres extension; simple if you already run Postgres
  • Elasticsearch / OpenSearch — supports both dense and sparse; classic choice

Common Pitfalls

  • Chunks too big: LLM overwhelmed, less-relevant content dilutes the answer
  • Chunks too small: lose context, answers feel fragmented
  • No reranking: top-K dense retrieval alone can surface chunks that are topically adjacent but do not actually answer the question
  • Poor embedding model: mismatched domain (e.g. using a general embedding model for legal or medical text)
  • Ignoring metadata: filtering by date / source / category before vector search can substantially improve quality
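Metadata pre-filtering can be sketched as "filter first, rank second". The record layout (`id`, `vector`, `meta`) and the `search` function are assumptions for illustration; real vector databases expose their own filter syntax.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query_vec: list[float], records: list[dict], k: int = 3, **filters) -> list[dict]:
    """Keep records whose metadata matches every filter, then rank by similarity."""
    candidates = [
        r for r in records
        if all(r["meta"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return candidates[:k]

records = [
    {"id": "a", "vector": [1.0, 0.0], "meta": {"source": "wiki", "year": 2023}},
    {"id": "b", "vector": [0.9, 0.1], "meta": {"source": "handbook", "year": 2025}},
    {"id": "c", "vector": [0.0, 1.0], "meta": {"source": "handbook", "year": 2025}},
]
```

Without a filter, record `a` is the closest match to `[1.0, 0.0]`; with `source="handbook"` it is excluded before ranking and `b` wins instead.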

When to Use RAG vs Fine-Tuning vs Prompting

Prompting only

  • Small amount of context
  • General-purpose use
  • Fast iteration

RAG

  • Large corpus
  • Frequently updated data
  • Need citations
  • Cost-sensitive

Fine-tuning

  • Style/format adaptation (e.g. tone, structure)
  • Domain-specific reasoning
  • Very high volume (amortized cost pays off)

Many production systems combine RAG + light fine-tuning.

Evaluating RAG Quality

  • Retrieval recall: did the correct chunks make it to the model?
  • Answer faithfulness: did the answer stick to the retrieved context?
  • Answer completeness: did it address all parts of the question?

Use frameworks like RAGAS or build your own eval set. Don't ship without evals.
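Of the three metrics, retrieval recall is the easiest to compute yourself. The sketch below assumes a hand-labeled eval set of `(question, relevant_chunk_ids)` pairs and a `retrieve` callable that returns ranked chunk ids; both are hypothetical interfaces. Faithfulness and completeness usually need a human or model judge and are not covered here.

```python
from typing import Callable

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def mean_recall_at_k(
    eval_set: list[tuple[str, set[str]]],
    retrieve: Callable[[str], list[str]],
    k: int,
) -> float:
    """Average recall@k over every question in the eval set."""
    if not eval_set:
        return 0.0
    total = sum(recall_at_k(retrieve(q), relevant, k) for q, relevant in eval_set)
    return total / len(eval_set)
```

Tracking this number across changes to chunking, embedding model, or retrieval strategy shows whether a change helped retrieval before any answer is generated.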

Key Takeaways

  • RAG = retrieve relevant chunks + generate answer grounded in them
  • Chunking strategy matters more than most teams realize
  • Hybrid retrieval + reranking beats dense-only
  • Build evals from day one
  • Use RAG when data is large / updated; fine-tune for style or domain reasoning

Explore [/glossary](/glossary) for related terms and [/topic/ai-infrastructure](/topic/ai-infrastructure) for live AI news.
