Token Optimization for Developers: How to Cut Your LLM Costs Without Cutting Quality
Tokens are the unit of currency in LLM APIs. Every character you send, every line of context you include, every example in your few-shot prompt — it all goes on the bill. If you’re building anything beyond a hobby project, token efficiency stops being a nice-to-have and starts being architecture.
This post covers the techniques that actually move the needle, with links to official documentation and benchmarks.
1. Understand What You’re Actually Paying For
Before optimizing, instrument first.
Both Anthropic and OpenAI return token usage in every API response. Log it from day one:
python
response = anthropic.messages.create(...)
print(response.usage.input_tokens) # what you sent
print(response.usage.output_tokens) # what the model generated
OpenAI equivalent:
python
response = openai.chat.completions.create(...)
print(response.usage.prompt_tokens)
print(response.usage.completion_tokens)
References:
- Anthropic usage object: docs.anthropic.com/en/api/messages
- OpenAI usage tracking: platform.openai.com/docs/api-reference/chat/object
- Tiktoken (count tokens before sending, for OpenAI models): github.com/openai/tiktoken
For Anthropic models, there’s no official client-side tokenizer, but Anthropic’s token counting endpoint (messages.count_tokens in the Python SDK) lets you count tokens before committing to a call, which is useful for trimming context programmatically.
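Once the counts are logged, turning them into dollars is simple arithmetic. Here is a minimal sketch, assuming the per-million-token prices from the table in section 5. The `PRICES` keys and the `estimate_cost` helper are hypothetical names, not part of any SDK, and the prices are a snapshot that will drift.

```python
# Prices in USD per 1M tokens (input, output), copied from the table in section 5.
# The keys are illustrative labels, not API model IDs.
PRICES = {
    "claude-sonnet-4": (3.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD of a single call."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Example: 1,200 input tokens and 300 output tokens on Sonnet.
cost = estimate_cost("claude-sonnet-4", 1200, 300)
print(f"${cost:.4f}")  # prints: $0.0081
```

Feed it the usage fields from each response and you have the per-call cost line for your logs.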
2. Prompt Engineering: Write Less, Mean More
The most direct lever. Verbose prompts are one of the most common sources of unnecessary input tokens in production systems.
Cut filler phrases
These add tokens and zero value:
| Verbose | Lean |
|---|---|
| “Please carefully analyze the following code and provide…” | “Analyze this code:” |
| “I would like you to help me write…” | “Write:” |
| “Can you please make sure to include…” | “Include:” |
Use structured input formats
JSON and XML are tokenizer-friendly when they’re terse. Markdown tables for data, not prose.
# BAD — prose context (47 tokens)
The user's name is Jorge, they are located in Porto, Portugal,
and their subscription plan is the Pro tier which costs $20/month.
# GOOD — structured (19 tokens)
user: {name: Jorge, location: Porto, plan: Pro, price: $20/month}
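If the context comes from a dict or a database row, you can generate the terse form instead of hand-writing it. A small sketch, where `to_compact` is a hypothetical helper and the format is the one shown above rather than any standard:

```python
def to_compact(label: str, fields: dict) -> str:
    """Render a flat dict as a terse one-line block for prompt context."""
    body = ", ".join(f"{key}: {value}" for key, value in fields.items())
    return f"{label}: {{{body}}}"

user = {"name": "Jorge", "location": "Porto", "plan": "Pro", "price": "$20/month"}
print(to_compact("user", user))
# prints: user: {name: Jorge, location: Porto, plan: Pro, price: $20/month}
```

This only suits flat records. Values that contain commas or braces would make the line ambiguous, and at that point plain JSON is the safer choice.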
Specify output length explicitly
Models tend to pad responses unless told otherwise. One line in your prompt saves dozens in the output:
Respond in under 3 sentences.
Return only the JSON object, no explanation.
One-word answer only.
References:
- Anthropic prompt engineering guide: docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview
- OpenAI prompt engineering guide: platform.openai.com/docs/guides/prompt-engineering
3. Context Window Management
The context window is the most expensive real estate in your architecture. Every token in it is paid on every call.
Trim conversation history aggressively
In multi-turn chat applications, most teams send the entire conversation history on every request. This compounds fast.
python
# Naive — sends everything
messages = conversation_history # might be 40 turns
# Better — sliding window
MAX_HISTORY_TOKENS = 2000
messages = trim_to_token_budget(conversation_history, MAX_HISTORY_TOKENS)
Common strategies:
- Sliding window — keep only the last N turns
- Summarization — compress older turns into a summary block before appending new ones
- Relevance filtering — only include turns that are semantically related to the current query (embed + cosine similarity)
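The snippet above calls trim_to_token_budget without defining it. One way it could look is sketched below, assuming a rough 4-characters-per-token estimate in place of a real tokenizer and assuming the trimmed window should start with a user turn. Both are assumptions for illustration, so swap in real token counts for production use.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)

def trim_to_token_budget(history: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most recent messages whose estimated tokens fit within max_tokens."""
    kept = []
    used = 0
    # Walk from newest to oldest so the most recent turns are kept first.
    for message in reversed(history):
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    # Assumption: the window should open with a user turn, so drop leading replies.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept

history = [
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
]
print(len(trim_to_token_budget(history, 25)))  # prints: 1
```

Each message here is estimated at 10 tokens. A budget of 25 fits the last two, and the leading assistant reply is then dropped so the window opens on a user turn.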
Don’t send what the model doesn’t need
File contents, database schemas, API specs — only include what’s directly relevant to the task at hand. If you’re asking the model to fix a bug in auth.py, it doesn’t need your entire codebase.
Retrieval-Augmented Generation (RAG) is the production-grade answer here: retrieve only the relevant chunks, inject only those.
References:
- Context window limits by model: docs.anthropic.com/en/docs/about-claude/models/overview
- LangChain conversation buffer with token limit: python.langchain.com/docs/modules/memory/types/token_buffer
4. Prompt Caching — the Biggest Bang Per Dollar
Anthropic’s prompt caching lets you cache a prefix of your prompt and pay a fraction of the price on subsequent calls. As of 2025, cached input tokens cost 90% less than uncached.
This is transformative for any pattern where you have a large, stable system prompt — a long system instruction, a big document the model needs to reference, a RAG chunk that applies to many queries.
python
response = anthropic.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": your_large_stable_context,
            "cache_control": {"type": "ephemeral"}  # mark this prefix for caching
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)
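How it works: the first call writes the prefix to the cache and is billed at a premium (25% above the base input price for the default 5-minute cache). Subsequent calls that share the exact same prefix up to the cache breakpoint pay only 10% of the normal input token cost for the cached portion. The cache lives for 5 minutes by default, and the timer resets on each cache hit. Prefixes below the model’s minimum cacheable length (1,024 tokens for Sonnet 4 and Opus 4) are not cached.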
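To see where caching starts to pay off, here is a back-of-the-envelope sketch. It assumes a 1.25x multiplier for the cache write and 0.10x for cache reads, and that every call after the first lands inside the cache lifetime. `prefix_cost` is a hypothetical helper and covers only the cost of the shared prefix.

```python
def prefix_cost(prefix_tokens: int, calls: int, price_per_m: float, cached: bool,
                write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost in USD of sending the same prefix on `calls` consecutive requests."""
    base = prefix_tokens * price_per_m / 1_000_000  # uncached cost of one send
    if not cached or calls == 0:
        return base * calls
    # The first call writes the cache; the remaining calls read from it.
    return base * write_mult + base * read_mult * (calls - 1)

# A 10,000-token prefix sent 100 times at $3.00 per 1M input tokens.
uncached = prefix_cost(10_000, 100, 3.00, cached=False)
cached = prefix_cost(10_000, 100, 3.00, cached=True)
print(f"uncached ${uncached:.2f} vs cached ${cached:.2f}")
```

Under these assumptions the prefix costs about $0.33 instead of $3.00. A single call costs more with caching than without, and the break-even comes on the second call.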
Where it compounds fast:
- System prompts with extensive instructions
- Documents or codebases in context
- Tool/function definitions (they can be large)
- Multi-turn agents with a stable base context
References:
- Anthropic prompt caching docs: docs.anthropic.com/en/docs/build-with-claude/prompt-caching
- Caching pricing breakdown: anthropic.com/pricing
OpenAI also supports prompt caching (called “Cached inputs”) for GPT-4o and o-series models, at a 50% discount on cached tokens:
- OpenAI prompt caching: platform.openai.com/docs/guides/prompt-caching
5. Model Routing: Don’t Use a Sledgehammer for Every Nail
The single highest-leverage architectural decision: route prompts to the cheapest model that can handle the task.
Not every request needs Claude Sonnet or GPT-4o. Most of your production traffic is probably:
- Classifying intent → Haiku or GPT-4o mini
- Extracting structured data from clean input → Haiku
- Simple Q&A with narrow scope → Haiku
- Code generation with reasoning → Sonnet
- Complex multi-step planning → Opus or o3
Rough pricing comparison (as of mid-2025):
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude Haiku 3.5 | $0.80 | $4.00 |
| Claude Sonnet 4 | $3.00 | $15.00 |
| Claude Opus 4 | $15.00 | $75.00 |
| GPT-4o mini | $0.15 | $0.60 |
| GPT-4o | $2.50 | $10.00 |
Routing 60% of your traffic to Haiku while keeping complex tasks on Sonnet cuts your bill by roughly 40-50%, assuming the routed requests use about as many tokens as the rest, without degrading user-facing quality on the simple calls.
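The arithmetic behind that estimate, as a quick sketch using the table prices. `routing_savings` is a hypothetical helper, and it assumes routed requests have the same token profile as the rest of your traffic.

```python
def routing_savings(cheap_share: float, cheap_price: float, base_price: float) -> float:
    """Fraction of spend saved by moving `cheap_share` of traffic to the cheaper model."""
    blended = cheap_share * cheap_price + (1 - cheap_share) * base_price
    return 1 - blended / base_price

# 60% of traffic moved from Sonnet ($3.00 input) to Haiku ($0.80 input).
print(f"{routing_savings(0.60, 0.80, 3.00):.0%}")  # prints: 44%
```

The output prices ($4.00 vs. $15.00) have the same ratio, so the saving is about 44% on both sides of the bill. If your simple requests are shorter than average, which they often are, the real saving will be lower.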
A basic router in Python:
python
import re

def route_prompt(prompt: str) -> str:
    """Pick a model ID from keyword hints in the prompt."""
    # Cheap, narrow tasks go to Haiku
    if re.search(r'\b(classify|is this|yes or no|extract)\b', prompt, re.I):
        return "claude-3-5-haiku-20241022"
    # Open-ended planning goes to Opus
    if re.search(r'\b(plan|design|architecture|strategy)\b', prompt, re.I):
        return "claude-opus-4-20250514"
    return "claude-sonnet-4-20250514"  # default
For production, consider an embedding-based classifier or a cheap LLM call to classify intent before routing to the target model.
References:
- Anthropic model overview with context windows and pricing: docs.anthropic.com/en/docs/about-claude/models/overview
- OpenAI model pricing: openai.com/api/pricing
6. Structured Outputs Reduce Output Tokens
When you need structured data back, tell the model to return only the structure. No preamble, no explanation, no “Here’s the JSON you requested:”.
Anthropic — enforce JSON via prompt:
python
system = "You are a data extractor. Respond ONLY with valid JSON. No explanation."
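OpenAI JSON mode guarantees syntactically valid JSON at the API level. It does not enforce a schema (that takes a response_format of type json_schema), and the prompt itself must mention JSON:
python
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[...]
)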
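To have the API enforce a specific shape rather than just valid JSON, response_format can carry a JSON Schema. A sketch with illustrative field names follows. The schema and its name are assumptions for the example, and strict mode expects every property to be listed as required with additionalProperties set to false.

```python
# Schema for the extraction result; the field names are illustrative.
extraction_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
        "plan": {"type": "string"},
    },
    "required": ["name", "email", "plan"],
    "additionalProperties": False,
}

# Value for the response_format parameter of chat.completions.create.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "extraction",
        "strict": True,
        "schema": extraction_schema,
    },
}
```

Pass it in place of the json_object setting. Check the structured outputs guide for which models support it.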
For Anthropic, prefilling the assistant turn is a reliable trick to cut output tokens:
python
messages = [
    {"role": "user", "content": "Extract: name, email, plan from this text: ..."},
    {"role": "assistant", "content": "{"}  # model continues from here
]
The model skips any preamble and starts directly inside the JSON object.
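One detail that is easy to miss: the response contains only the continuation, not the prefilled text, so the opening brace has to be put back before parsing. A minimal sketch, where `parse_prefilled_json` is a hypothetical helper and the completion string stands in for the text of a real response:

```python
import json

def parse_prefilled_json(prefill: str, completion: str) -> dict:
    """Rejoin the prefilled opening with the model's continuation and parse it."""
    return json.loads(prefill + completion)

# Stand-in for response.content[0].text from a real call.
completion = '"name": "Jorge", "email": "jorge@example.com", "plan": "Pro"}'
record = parse_prefilled_json("{", completion)
print(record["plan"])  # prints: Pro
```

Wrap the parse in a try/except for json.JSONDecodeError in real code, since a response truncated by max_tokens will not parse.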
References:
- Anthropic prefilling: docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prefill-claudes-response
- OpenAI structured outputs: platform.openai.com/docs/guides/structured-outputs
7. Batch API for Non-Realtime Workloads
If you have workloads that don’t need a synchronous response — evals, bulk data processing, nightly report generation — the Batch API is a straightforward 50% discount.
Anthropic Message Batches:
python
batch = anthropic.messages.batches.create(
    requests=[
        {"custom_id": "req-1", "params": {"model": "...", "messages": [...]}},
        {"custom_id": "req-2", "params": {"model": "...", "messages": [...]}},
    ]
)
# Poll for results later
Results are available within 24 hours. Same models, same quality, half the price.
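In practice the requests list is generated, not typed out. A sketch of a builder that mirrors the structure above: `build_batch_requests` is a hypothetical helper, and the max_tokens default is an arbitrary choice for the example.

```python
def build_batch_requests(prompts: list[str], model: str, max_tokens: int = 256) -> list[dict]:
    """Turn a list of prompts into batch request entries with unique custom_ids."""
    return [
        {
            "custom_id": f"req-{i}",  # used to match results back to inputs
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts, start=1)
    ]

requests = build_batch_requests(
    ["Summarize report A", "Summarize report B"],
    model="claude-sonnet-4-20250514",
)
print([r["custom_id"] for r in requests])  # prints: ['req-1', 'req-2']
```

Keep a mapping from custom_id to your own record IDs. Results are not guaranteed to come back in submission order.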
References:
- Anthropic Message Batches API: docs.anthropic.com/en/docs/build-with-claude/message-batches
- OpenAI Batch API: platform.openai.com/docs/guides/batch
8. Few-Shot Examples: Quality vs. Cost Tradeoff
Few-shot examples dramatically improve output quality but are expensive — each example consumes input tokens on every call.
Strategies to get the benefit without the full cost:
- Cache the examples — if your few-shot block is stable, it’s a perfect candidate for prompt caching (see section 4).
- Use one example instead of five — diminishing returns kick in fast. Test whether 1 example gets you 80% of the quality improvement of 5.
- Reference instead of repeat — for long or complex examples, store them in a vector DB and retrieve only the most relevant one based on the current input.
- Zero-shot with better instructions — sometimes a clearer instruction eliminates the need for examples entirely. Test zero-shot before adding shots.
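The "reference instead of repeat" idea can be sketched without a vector DB. Below, the embeddings are hand-made toy vectors (an assumption; in practice they would come from an embedding model), and `pick_example` is a hypothetical helper that returns the stored example closest to the query by cosine similarity.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def pick_example(query_vec: list[float], examples: list[tuple[list[float], str]]) -> str:
    """Return the example text whose embedding is most similar to the query."""
    return max(examples, key=lambda item: cosine(query_vec, item[0]))[1]

# Toy 2-d "embeddings" for two stored few-shot examples.
examples = [
    ([1.0, 0.0], "Example: handling a refund request"),
    ([0.0, 1.0], "Example: handling a login problem"),
]
print(pick_example([0.9, 0.1], examples))
# prints: Example: handling a refund request
```

You then inject only the selected example into the prompt, so each call pays for one shot instead of the whole library.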
References:
- Few-shot prompting research (Brown et al., 2020 — the original GPT-3 paper): arxiv.org/abs/2005.14165
- Anthropic’s guidance on example usage: docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/use-examples
The Checklist
Before going to production, run through this:
- Are you logging input_tokens and output_tokens on every call?
- Do you have a per-call cost estimate in your observability stack?
- Is your system prompt cached if it’s over ~1,000 tokens?
- Is your conversation history trimmed to a token budget?
- Are trivial/classification tasks routed to a cheaper model?
- Are your output formats constrained (JSON only, max length)?
- Could any batch of calls move to the Batch API?
- Have you tested with fewer few-shot examples?
Each “no” is a line item on your bill.
Tools Worth Knowing
- LangSmith — observability for LLM calls, including token usage per trace: smith.langchain.com
- Helicone — LLM proxy with cost tracking, caching, and rate limiting: helicone.ai
- Portkey — routing, fallbacks, and token analytics across providers: portkey.ai
- Braintrust — evals + cost tracking, good for measuring quality/cost tradeoffs: braintrust.dev
- Tiktoken — count OpenAI tokens before sending: github.com/openai/tiktoken
Token efficiency isn’t premature optimization — it’s the kind of engineering discipline that separates a prototype that works from a product that scales. The good news: most of the gains come from a handful of decisions made early, not from micro-optimizing every prompt.
Start with instrumentation. Everything else follows from seeing the data.
What’s the technique that moved the needle most for you? Leave a comment — I read everything.