Where LLM Costs Come From
- Input tokens: every character you send in, including system prompts, few-shots, retrieved context
- Output tokens: typically 3-5x more expensive than input tokens
- Tool calls / function calls: counted as both input and output tokens
- Cached context (when supported): 2-10x cheaper than fresh input
Ten Techniques
1. Right-size the model
The largest model isn't always needed. For classification and extraction, Claude Haiku 4.5 or GPT-5-mini can match bigger models at 10x lower cost.
Rule of thumb: start with the smallest model that passes your evals.
2. Use prompt caching
Claude, Gemini, and OpenAI all now support prompt caching. Cache expensive static prefixes (system prompts, few-shot blocks) and pay fraction on cached tokens.
Claude's cache can be 10% of input cost; huge savings on repetitive-system-prompt workloads.
3. Batch when possible
Most providers offer a batch API with 50% discount for asynchronous jobs. For non-real-time use cases (offline scoring, reports), always use batch.
4. Trim system prompts
5,000-token system prompts are rarely necessary. Audit what is actually used. A 1,500-token trim at 10 requests/second saves 54M tokens/day.
5. Cap output length
Set max_tokens to the smallest value that satisfies the use case. The model will stop earlier when it has a clear schema to fill.
6. Use structured outputs (function calling)
Function calling produces denser outputs — the model doesn't emit verbose commentary, just the structured arguments. Often 30-60% cheaper than freeform + parsing.
7. Retrieval > context stuffing
For Q&A on large corpora, retrieve 5-10 relevant chunks instead of passing the entire corpus. 20-100x token reduction.
8. Stream + early termination
Streaming allows you to terminate once you have the answer. For classification tasks answered in the first 10 tokens, don't pay for the next 100 the model would have generated.
9. Compress history in chat
For multi-turn conversations, summarize older turns into a compressed note instead of passing the full history. Keep the recent 3-5 turns verbatim and a summary of earlier ones.
10. Distill to smaller models
When a task is stable, use the big model to label training data for a smaller model. Deploy the smaller one. Training-free inference savings of 5-20x are common.
Monitoring
Always log:
- Tokens per request (input + output)
- Cost per request
- Cost per user / per endpoint
- P50 / P95 latency
Set alerts on sudden cost increases — a single bad deployment can 10x your daily spend overnight.
Common Cost Mistakes
Passing full conversation history forever
Every turn, cost grows linearly. Compress after 5-10 turns.
Using the biggest model for every task
Routing queries by complexity to different models can cut cost 50%+.
Not using caching
If your workload has any static prefix, caching is nearly free money.
Retrying without exponential backoff
Provider errors → spam retries → burn tokens and rate limits.
Sending retrieved context in both system AND user messages
Duplicate tokens for no benefit.
When Quality Matters More Than Cost
Sometimes cost optimization hurts quality:
- Financial recommendations
- Medical/legal content
- High-stakes decisions
For these, validate quality on evals before switching to a cheaper model.
Key Takeaways
- Right-size the model per task; don't default to the biggest
- Prompt caching, batch APIs, and function calling are cheap wins
- Retrieval beats context stuffing for large corpora
- Monitor and alert on cost per request
- Distill to smaller models for stable tasks
Track AI news at [/topic/ai-stocks](/topic/ai-stocks) and [/topic/ai-infrastructure](/topic/ai-infrastructure).