Production LLM costs can balloon 5-10x in weeks without discipline. Here are ten techniques that measurably cut spend without hurting quality.

Where LLM Costs Come From

Input tokens: every character you send in, including system prompts, few-shots, retrieved context
Output tokens: typically 3-5x more expensive than input tokens
Tool calls / function calls: counted as both input and output tokens
Cached context (when supported): 2-10x cheaper than fresh input

Ten Techniques

1. Right-size the model

The largest model isn't always needed. For classification and extraction, Claude Haiku 4.5 or GPT-5-mini can match bigger models at 10x lower cost.

Rule of thumb: start with the smallest model that passes your evals.

2. Use prompt caching

Claude, Gemini, and OpenAI all now support prompt caching. Cache expensive static prefixes (system prompts, few-shot blocks) and pay fraction on cached tokens.

Claude's cache can be 10% of input cost; huge savings on repetitive-system-prompt workloads.

3. Batch when possible

Most providers offer a batch API with 50% discount for asynchronous jobs. For non-real-time use cases (offline scoring, reports), always use batch.

4. Trim system prompts

5,000-token system prompts are rarely necessary. Audit what is actually used. A 1,500-token trim at 10 requests/second saves 54M tokens/day.

5. Cap output length

Set max_tokens to the smallest value that satisfies the use case. The model will stop earlier when it has a clear schema to fill.

6. Use structured outputs (function calling)

Function calling produces denser outputs — the model doesn't emit verbose commentary, just the structured arguments. Often 30-60% cheaper than freeform + parsing.

7. Retrieval > context stuffing

For Q&A on large corpora, retrieve 5-10 relevant chunks instead of passing the entire corpus. 20-100x token reduction.

8. Stream + early termination

Streaming allows you to terminate once you have the answer. For classification tasks answered in the first 10 tokens, don't pay for the next 100 the model would have generated.

9. Compress history in chat

For multi-turn conversations, summarize older turns into a compressed note instead of passing the full history. Keep the recent 3-5 turns verbatim and a summary of earlier ones.

10. Distill to smaller models

When a task is stable, use the big model to label training data for a smaller model. Deploy the smaller one. Training-free inference savings of 5-20x are common.

Monitoring

Always log:

Tokens per request (input + output)
Cost per request
Cost per user / per endpoint
P50 / P95 latency

Set alerts on sudden cost increases — a single bad deployment can 10x your daily spend overnight.

Common Cost Mistakes

Passing full conversation history forever

Every turn, cost grows linearly. Compress after 5-10 turns.

Using the biggest model for every task

Routing queries by complexity to different models can cut cost 50%+.

Not using caching

If your workload has any static prefix, caching is nearly free money.

Retrying without exponential backoff

Provider errors → spam retries → burn tokens and rate limits.

Sending retrieved context in both system AND user messages

Duplicate tokens for no benefit.

When Quality Matters More Than Cost

Sometimes cost optimization hurts quality:

Financial recommendations
Medical/legal content
High-stakes decisions

For these, validate quality on evals before switching to a cheaper model.

Key Takeaways

Right-size the model per task; don't default to the biggest
Prompt caching, batch APIs, and function calling are cheap wins
Retrieval beats context stuffing for large corpora
Monitor and alert on cost per request
Distill to smaller models for stable tasks

Track AI news at [/topic/ai-stocks](/topic/ai-stocks) and [/topic/ai-infrastructure](/topic/ai-infrastructure).

LLM Cost Optimization: 10 Techniques That Actually Reduce Production Spend