
LLM Evals in Production: How to Build Quality Measurement That Actually Scales

Eval loops separate ship-and-hope LLM projects from production-grade AI. Here is how to build eval infrastructure that scales beyond the first 100 test cases.

Catalayer 2026-04-19 3 min read

Why Evals Are the Hardest Part

Most LLM projects die not because the model is bad, but because the team can't measure whether changes improve or regress quality. Without evals, every prompt tweak becomes a coin flip. Shipping any meaningful change becomes risky.

The Eval Ladder

Level 1: Manual spot checks (20 test cases)

You run 20 examples and eyeball outputs. Good for initial iteration. Breaks down past 50 cases — humans can't hold context across that many examples.

Level 2: Labeled eval set (100-500 test cases)

You build a small labeled dataset. Each example has a prompt and a correct answer (or a rubric). New models / prompts can be scored against the set.

Level 3: Online evaluation (production traffic)

You sample production inputs and evaluate outputs continuously. Detects drift, regressions, and real-world edge cases.

Level 4: Hierarchical + trustworthy evals (1,000+ cases)

Mix of golden examples, synthetic edge cases, and user-feedback-flagged problems. Stratified across important subcategories.

Most production systems plateau at Level 2 or 3. Level 4 is expensive but matters for regulated or high-stakes domains.
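The stratification mentioned for Level 4 can be sketched roughly as follows. This is a minimal illustration, assuming each case is a dict with a hypothetical `category` field; the function name, the field name, and the fixed per-stratum quota are assumptions, not something the ladder prescribes.

```python
import random
from collections import defaultdict

def stratified_sample(cases, key, per_stratum, seed=0):
    """Pick up to `per_stratum` cases from each subcategory.

    `key` maps a case to its stratum name (hypothetical, e.g. a category).
    A fixed seed keeps the sample reproducible between runs.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for case in cases:
        groups[key(case)].append(case)

    sample = []
    for name in sorted(groups):  # sorted for a stable output order
        members = groups[name]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# Usage: 9 cases, heavily skewed toward "billing".
cases = [{"id": i, "category": "legal" if i % 3 == 0 else "billing"}
         for i in range(9)]
picked = stratified_sample(cases, key=lambda c: c["category"], per_stratum=2)
```

Sampling a fixed number per stratum keeps rare but important subcategories from being drowned out by the common ones.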

Building a Labeled Eval Set

Step 1: Collect

Sample real production prompts (redact sensitive info). Aim for diversity: common cases, edge cases, historical failures.

Step 2: Label

For each prompt, have a human write the ideal answer OR a rubric for scoring. Classification / extraction is easy (objective). Open-ended generation is hard (needs rubric).

Step 3: Automate scoring

For objective tasks: exact match, F1, BLEU, etc. For subjective tasks: LLM-as-judge with a carefully designed judge prompt.
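For the objective case, exact match and token-level F1 can be sketched in a few lines. The whitespace-and-lowercase normalizer below is an assumption chosen for brevity; real tasks often need task-specific normalization (punctuation, number formats, and so on).

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple normalizer)."""
    return text.lower().split()

def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(reference)

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, counting duplicates."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Exact match suits classification and short extraction; token F1 gives partial credit when the output contains the right answer plus a few extra words.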

Step 4: Version and compare

Run new prompts / models against the eval set. Track scores over time. Make improvement measurable.
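One way to make the comparison concrete is to score two versions on the same cases and list which cases got worse. In this sketch `EvalCase`, `run_eval`, and `compare` are hypothetical names, and `generate` stands in for whatever calls your model; none of it refers to a specific framework.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: str

def run_eval(cases, generate: Callable[[str], str],
             score: Callable[[str, str], float]) -> dict[str, float]:
    """Return {case_id: score} for one prompt/model version."""
    return {c.case_id: score(generate(c.prompt), c.expected) for c in cases}

def compare(baseline: dict[str, float], candidate: dict[str, float]) -> dict:
    """Compare two runs on the cases they share (assumes at least one)."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "baseline_mean": mean(baseline[c] for c in shared),
        "candidate_mean": mean(candidate[c] for c in shared),
        "regressed_cases": [c for c in shared if candidate[c] < baseline[c]],
    }

# Usage with stand-in "models" (plain functions instead of real LLM calls).
cases = [EvalCase("a", "2+2", "4"), EvalCase("b", "capital of France", "Paris")]
score = lambda output, expected: float(output == expected)
v1 = run_eval(cases, lambda p: "4" if p == "2+2" else "Lyon", score)
v2 = run_eval(cases, lambda p: "4" if p == "2+2" else "Paris", score)
report = compare(v1, v2)
```

Keeping per-case scores, not just the mean, matters: an unchanged average can hide one case improving while another regresses.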

LLM-as-Judge

Use a more capable model to score a weaker model's outputs. Works well when:

  • The judge is clearly better than the model being evaluated
  • The rubric is well-specified
  • You periodically audit judge agreement vs human labels

Common pitfalls:

  • Judge bias (same-family bias toward outputs from related models, position bias toward one slot in pairwise comparisons)
  • Rubric drift if you keep modifying it
  • High cost if run at scale
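Two of the points above lend themselves to a small sketch: position bias can be reduced by asking the judge twice with the answers swapped, and judge-versus-human agreement can be summarized with Cohen's kappa. The `Judge` interface here is hypothetical (a callable returning "first" or "second"); in practice it would wrap a model call with your judge prompt.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: (prompt, first_answer, second_answer)
# -> "first" or "second".
Judge = Callable[[str, str, str], str]

def judge_pairwise(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Ask twice with the order swapped; return 'a', 'b', or 'tie'."""
    forward = judge(prompt, a, b)    # a is shown first
    backward = judge(prompt, b, a)   # b is shown first
    forward_pick = "a" if forward == "first" else "b"
    backward_pick = "b" if backward == "first" else "a"
    # Disagreement between the two orders suggests position bias.
    return forward_pick if forward_pick == backward_pick else "tie"

def cohen_kappa(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Agreement between judge and human labels, corrected for chance."""
    judge_labels, human_labels = list(judge_labels), list(human_labels)
    n = len(judge_labels)
    if n == 0 or n != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in set(judge_labels) | set(human_labels)
    )
    if expected == 1.0:  # both sides used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

The double call roughly doubles judge cost, so some teams may prefer to apply it only to a sample. What kappa threshold counts as "good enough" is a judgment call for each task.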

Common Metrics

Classification / extraction

Accuracy, F1, precision/recall per class, confusion matrix

Generation

  • Factuality (can be auto-checked with retrieval-grounded systems)
  • Faithfulness (sticks to context)
  • Fluency (human or LLM-judge)
  • Style adherence (brand voice, formatting)
  • Length appropriateness

User-facing

  • Acceptance rate (user kept the suggestion?)
  • Edit distance (how much did user modify?)
  • Regenerations per session
  • User feedback (thumbs-up/down)
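Acceptance rate and edit distance can be computed from logged suggestion events along these lines. The `SuggestionEvent` shape is an assumption about what your logs contain, and normalizing by the longer string is one of several reasonable choices.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SuggestionEvent:
    suggested: str
    final: Optional[str]  # text the user kept; None means rejected

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

def summarize(events: list[SuggestionEvent]) -> dict[str, float]:
    """Acceptance rate, plus mean edit ratio over accepted suggestions."""
    accepted = [e for e in events if e.final is not None]
    ratios = [
        levenshtein(e.suggested, e.final) / max(len(e.suggested), len(e.final), 1)
        for e in accepted
    ]
    return {
        "acceptance_rate": len(accepted) / len(events) if events else 0.0,
        "mean_edit_ratio": sum(ratios) / len(ratios) if ratios else 0.0,
    }
```

A high acceptance rate combined with a high edit ratio may mean users keep suggestions only as a starting point, which is worth separating from true acceptance.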

Production Online Evaluation

Sample 1-5% of production traffic and run evaluations asynchronously:

  • Log prompt + response
  • Run LLM-as-judge or classifier evaluators
  • Aggregate by endpoint / user segment / time window
  • Alert on score regressions
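The sampling and aggregation steps above might look roughly like this. Hashing the request id is one possible way to get a stable sample (the same request is always in or out); the record layout of `(endpoint, unix_timestamp, score)` and the hourly window are assumptions for illustration.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean

def should_sample(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes to an approximately uniform value between 0 and 1.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

def aggregate(records):
    """Group (endpoint, unix_timestamp, score) records by endpoint and UTC hour.

    Returns {(endpoint, hour_start_iso): mean_score}.
    """
    buckets = defaultdict(list)
    for endpoint, ts, score in records:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).replace(
            minute=0, second=0, microsecond=0)
        buckets[(endpoint, hour.isoformat())].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}
```

The evaluation itself would run off the request path, for example from a queue, so that judge latency never affects users. Alerting then reduces to comparing each aggregated window against a baseline.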

Regression Detection

Before deploying a prompt / model change:

  • Run against full eval set
  • Confirm score doesn't drop on any major segment
  • Deploy canary to 5% of traffic; monitor for 24-48h
  • Full rollout only if online metrics stable
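The "no drop on any major segment" check can be expressed as a small gate function. The segment names and the `max_drop` tolerance below are illustrative assumptions; the right tolerance depends on how noisy your scores are.

```python
def eval_gate(baseline: dict[str, float], candidate: dict[str, float],
              max_drop: float = 0.01):
    """Return (passed, failures) for a per-segment regression check.

    A segment fails when its candidate score is missing or falls more than
    `max_drop` below the baseline score for that segment.
    """
    failures = {}
    for segment, base_score in baseline.items():
        cand_score = candidate.get(segment)
        if cand_score is None or cand_score < base_score - max_drop:
            failures[segment] = (base_score, cand_score)
    return (not failures), failures

# Usage: the overall mean improves, but one segment regresses.
baseline = {"billing": 0.90, "legal": 0.80, "smalltalk": 0.70}
candidate = {"billing": 0.95, "legal": 0.70, "smalltalk": 0.85}
passed, failures = eval_gate(baseline, candidate)
```

Gating per segment instead of on the overall mean is the point: here the average goes up while the "legal" segment regresses, which is exactly the case a single aggregate score would hide.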

Tools and Frameworks

  • RAGAS: open-source for RAG eval
  • Braintrust: commercial eval tooling
  • LangSmith / LangFuse: eval + observability
  • OpenAI Evals: framework for building test harnesses
  • PromptFoo: CLI eval tool

Organizational Pattern

  • One person owns the eval set (no ad-hoc additions by everyone)
  • Every prompt / model change goes through eval gate
  • Weekly review of score trends
  • Monthly addition of new edge cases (from user feedback, production failures)

Key Takeaways

  • Evals are the hardest part of production LLM systems
  • Build a labeled eval set early; iterate on it
  • Use LLM-as-judge cautiously for subjective tasks
  • Sample production traffic for drift detection
  • Make every prompt change score-gated

