Why Evals Are the Hardest Part
Most LLM projects die not because the model is bad, but because the team can't measure whether changes improve or regress quality. Without evals, every prompt tweak becomes a coin flip. Shipping any meaningful change becomes risky.
The Eval Ladder
Level 1: Manual spot checks (20 test cases)
You run 20 examples and eyeball the outputs. Good for initial iteration, but it breaks down past about 50 cases: humans can't hold context across that many examples.
Level 2: Labeled eval set (100-500 test cases)
You build a small labeled dataset. Each example has a prompt and a correct answer (or a rubric). New models / prompts can be scored against the set.
Level 3: Online evaluation (production traffic)
You sample production inputs and evaluate outputs continuously. Detects drift, regressions, and real-world edge cases.
Level 4: Hierarchical + trustworthy evals (1,000+ cases)
Mix of golden examples, synthetic edge cases, and user-feedback-flagged problems. Stratified across important subcategories.
Most production systems plateau at Level 2 or 3. Level 4 is expensive but matters for regulated or high-stakes domains.
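To make "stratified across important subcategories" concrete, here is a minimal sketch of drawing a fixed number of cases per subcategory so that rare categories are not drowned out by common ones. The `category` field, the category names, and the `per_stratum` parameter are assumptions made for illustration, not part of any particular tool.

```python
import random
from collections import defaultdict


def stratified_sample(cases: list[dict], key: str, per_stratum: int,
                      seed: int = 0) -> list[dict]:
    """Pick up to `per_stratum` cases from every value of `key`."""
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    groups: dict[str, list[dict]] = defaultdict(list)
    for case in cases:
        groups[case[key]].append(case)

    sample: list[dict] = []
    for name in sorted(groups):  # sorted for a stable order across runs
        members = groups[name]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Hypothetical pool: many common cases, few edge cases and past failures.
pool = (
    [{"id": i, "category": "common"} for i in range(90)]
    + [{"id": 100 + i, "category": "edge"} for i in range(8)]
    + [{"id": 200 + i, "category": "historical_failure"} for i in range(2)]
)
eval_subset = stratified_sample(pool, key="category", per_stratum=5)
```

A plain random sample of 12 cases from this pool would often contain no historical failures at all; the stratified version always includes every one that exists, up to the cap.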
Building a Labeled Eval Set
Step 1: Collect
Sample real production prompts (redact sensitive info). Aim for diversity: common cases, edge cases, historical failures.
Step 2: Label
For each prompt, have a human write the ideal answer OR a rubric for scoring. Classification and extraction tasks are easy to label because the answer is objective. Open-ended generation is harder because there is no single correct answer, so it needs a rubric.
Step 3: Automate scoring
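For objective tasks: exact match or F1. For generation tasks with a reference text, n-gram overlap metrics such as BLEU can serve as a rough signal. For subjective tasks: LLM-as-judge with a carefully designed judge prompt.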
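As an illustration of the objective scorers, here is a minimal sketch of exact match and token-level F1. The normalization (lowercasing and stripping punctuation) is an assumption; the right normalization depends on the task.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings contain the same normalized tokens in order."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Exact match suits classification labels and short extracted values; token F1 gives partial credit when the prediction contains extra or missing words.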
Step 4: Version and compare
Run new prompts / models against the eval set. Track scores over time. Make improvement measurable.
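A minimal sketch of that comparison loop, under stated assumptions: `EvalCase`, `run_eval`, and the two stand-in functions `baseline` and `candidate` are hypothetical names, and in practice each stand-in would call a model with a specific prompt version.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str


def run_eval(generate: Callable[[str], str],
             cases: list[EvalCase],
             score: Callable[[str, str], float]) -> float:
    """Mean score of `generate` over the whole eval set."""
    return mean(score(generate(c.prompt), c.expected) for c in cases)


def exact(prediction: str, expected: str) -> float:
    return float(prediction.strip().lower() == expected.strip().lower())


cases = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "Paris"),
    EvalCase("3*3", "9"),
]


# Hypothetical stand-ins for two prompt versions (canned answers, no model).
def baseline(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Lyon", "3*3": "9"}.get(prompt, "")


def candidate(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris", "3*3": "9"}.get(prompt, "")


# One score per version; persist this mapping to track scores over time.
history = {
    name: run_eval(fn, cases, exact)
    for name, fn in [("baseline", baseline), ("candidate", candidate)]
}
```

Storing the resulting scores alongside the prompt version and the eval set version keeps later comparisons meaningful when either one changes.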
LLM-as-Judge
Use a more capable model to score a weaker model's outputs. Works well when:
- The judge is clearly better than the model being evaluated
- The rubric is well-specified
- You periodically audit judge agreement vs human labels
Common pitfalls:
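- Judge bias: same-family bias (favoring outputs from its own model family) and position bias (favoring an answer because of where it appears in the judge prompt)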
- Rubric drift if you keep modifying it
- High cost if run at scale
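One way to audit judge agreement against human labels is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The choice of kappa is an assumption here; the sketch below computes it for two equal-length lists of categorical labels.

```python
from collections import Counter


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between human and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Fraction of examples where the two raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected if each rater labeled independently at its own rates.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Raw agreement can look high simply because most examples pass; kappa near zero in that situation suggests the judge is not adding information beyond the base rate.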
Common Metrics
Classification / extraction
Accuracy, F1, precision/recall per class, confusion matrix
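A minimal, dependency-free sketch of these classification metrics follows. The function names are illustrative; a metrics library would normally be used instead.

```python
from collections import Counter


def confusion_matrix(y_true: list[str], y_pred: list[str]) -> Counter:
    """Count of examples for each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


def per_class_metrics(y_true: list[str], y_pred: list[str]) -> dict[str, dict[str, float]]:
    """Precision, recall, and F1 for every label seen in either list."""
    matrix = confusion_matrix(y_true, y_pred)
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = matrix[(label, label)]
        fp = sum(c for (t, p), c in matrix.items() if p == label and t != label)
        fn = sum(c for (t, p), c in matrix.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Per-class numbers matter because overall accuracy can stay flat while a rare class quietly gets worse.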
Generation
- Factuality (can be auto-checked with retrieval-grounded systems)
- Faithfulness (sticks to context)
- Fluency (human or LLM-judge)
- Style adherence (brand voice, formatting)
- Length appropriateness
User-facing
- Acceptance rate (user kept the suggestion?)
- Edit distance (how much did user modify?)
- Regenerations per session
- User feedback (thumbs-up/down)
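Two of the user-facing metrics are easy to compute from logged events. The sketch below assumes each event records whether the suggestion was accepted, and uses character-level Levenshtein distance as the edit-distance measure; both the event shape and the choice of distance are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(suggested: str, final: str) -> float:
    """0.0 means the user kept the text unchanged; 1.0 means fully rewritten."""
    longest = max(len(suggested), len(final))
    return levenshtein(suggested, final) / longest if longest else 0.0


def acceptance_rate(events: list[dict]) -> float:
    """Share of events whose hypothetical `accepted` flag is true."""
    if not events:
        return 0.0
    return sum(bool(e["accepted"]) for e in events) / len(events)
```

Normalizing by length makes the edit distance comparable between short and long suggestions.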
Production Online Evaluation
Sample 1-5% of production traffic and run evaluations asynchronously:
- Log prompt + response
- Run LLM-as-judge or classifier evaluators
- Aggregate by endpoint / user segment / time window
- Alert on score regressions
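The steps above can be sketched as follows. Hashing the request ID gives a deterministic sample, so the same request is always in or out of the sample regardless of which server handles it. The function names, the 2% default rate, and the 0.05 tolerance are illustrative assumptions.

```python
import hashlib
from collections import defaultdict
from statistics import mean


def in_sample(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of all request IDs."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def aggregate(records: list[tuple[str, float]]) -> dict[str, float]:
    """Mean evaluator score per endpoint from (endpoint, score) records."""
    by_endpoint: dict[str, list[float]] = defaultdict(list)
    for endpoint, score in records:
        by_endpoint[endpoint].append(score)
    return {endpoint: mean(scores) for endpoint, scores in by_endpoint.items()}


def regressions(current: dict[str, float], baseline: dict[str, float],
                tolerance: float = 0.05) -> list[str]:
    """Endpoints whose score fell more than `tolerance` below the baseline."""
    return [
        endpoint for endpoint, score in current.items()
        if endpoint in baseline and baseline[endpoint] - score > tolerance
    ]
```

The same grouping works for user segments or time windows by changing the key stored in each record.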
Regression Detection
Before deploying a prompt / model change:
- Run against full eval set
- Confirm score doesn't drop on any major segment
- Deploy canary to 5% of traffic; monitor for 24-48h
- Roll out fully only if online metrics stay stable
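The per-segment check in the second step might look like the sketch below, which compares per-example scores from a baseline run and a candidate run. The segment names and the `max_drop` threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def segment_means(results: list[tuple[str, float]]) -> dict[str, float]:
    """Mean score per segment from (segment, score) pairs."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for segment, score in results:
        grouped[segment].append(score)
    return {segment: mean(scores) for segment, scores in grouped.items()}


def failing_segments(baseline_results: list[tuple[str, float]],
                     candidate_results: list[tuple[str, float]],
                     max_drop: float = 0.02) -> dict[str, float]:
    """Segments where the candidate dropped by more than `max_drop`.

    A segment missing from the candidate run is treated as a score of 0.0.
    """
    base = segment_means(baseline_results)
    cand = segment_means(candidate_results)
    drops = {seg: base[seg] - cand.get(seg, 0.0) for seg in base}
    return {seg: round(drop, 4) for seg, drop in drops.items() if drop > max_drop}


# Hypothetical runs: the candidate regresses only on the "refunds" segment.
baseline_run = [("billing", 1.0), ("billing", 1.0), ("refunds", 1.0), ("refunds", 0.0)]
candidate_run = [("billing", 1.0), ("billing", 1.0), ("refunds", 0.0), ("refunds", 0.0)]
blocked = failing_segments(baseline_run, candidate_run)
```

An empty result means the change can proceed to the canary stage; any entry names the segment to investigate first.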
Tools and Frameworks
- RAGAS: open-source for RAG eval
- Braintrust: commercial eval tooling
- LangSmith / LangFuse: eval + observability
- OpenAI Evals: framework for building test harnesses
- PromptFoo: CLI eval tool
Organizational Pattern
- One person owns the eval set, so cases are not added ad hoc by everyone
- Every prompt / model change goes through eval gate
- Weekly review of score trends
- Monthly addition of new edge cases (from user feedback, production failures)
Key Takeaways
- Evals are the hardest part of production LLM systems
- Build a labeled eval set early; iterate on it
- Use LLM-as-judge cautiously for subjective tasks
- Sample production traffic for drift detection
- Make every prompt change score-gated