Token Optimization for Developers: How to Cut Your LLM Costs Without Cutting Quality

Tokens are the unit of currency in LLM APIs. Every character you send, every line of context you include, every example in your few-shot prompt — it all goes on the bill. If you’re building anything beyond a hobby project, token efficiency stops being a nice-to-have and starts being architecture.

This post covers the techniques that actually move the needle, with links to official documentation and benchmarks.


1. Understand What You’re Actually Paying For

Instrument before you optimize.

Both Anthropic and OpenAI return token usage in every API response. Log it from day one:

python

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(...)

print(response.usage.input_tokens)   # what you sent
print(response.usage.output_tokens)  # what the model generated

OpenAI equivalent:

python

response = openai.chat.completions.create(...)

print(response.usage.prompt_tokens)
print(response.usage.completion_tokens)

For Anthropic models, there’s no official client-side tokenizer, but Anthropic’s token counting API (client.messages.count_tokens in the Python SDK) lets you count tokens before committing to a call, which is useful for trimming context programmatically.
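Once usage is logged, turning it into dollars is simple arithmetic. The sketch below assumes the per-million-token prices quoted in the pricing table in section 5 (mid-2025 figures that will drift). `PRICES_PER_MTOK`, its short model keys, and `estimate_cost` are illustrative names, not part of either SDK.

```python
# Illustrative price table in USD per 1M tokens: (input, output).
# Figures are the mid-2025 numbers quoted in section 5; check current pricing.
# The keys are shorthand labels chosen for this sketch, not API model IDs.
PRICES_PER_MTOK = {
    "claude-3-5-haiku": (0.80, 4.00),
    "claude-sonnet-4": (3.00, 15.00),
    "claude-opus-4": (15.00, 75.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one call, ignoring caching and batch discounts."""
    input_price, output_price = PRICES_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Example: a Sonnet call with 2,000 input tokens and 500 output tokens.
cost = estimate_cost("claude-sonnet-4", 2_000, 500)
print(f"${cost:.4f}")
```

Feeding `response.usage` values into a function like this on every call gives you the per-call cost figure the checklist at the end asks for.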


2. Prompt Engineering: Write Less, Mean More

The most direct lever. Verbose prompts are one of the most common sources of unnecessary input tokens in production systems.

Cut filler phrases

These add tokens and zero value:

Verbose | Lean
“Please carefully analyze the following code and provide…” | “Analyze this code:”
“I would like you to help me write…” | “Write:”
“Can you please make sure to include…” | “Include:”

Use structured input formats

JSON and XML are tokenizer-friendly when they’re terse. Use Markdown tables for data, not prose.

# BAD: prose context (about 47 tokens; exact counts depend on the tokenizer)
The user's name is Jorge, they are located in Porto, Portugal,
and their subscription plan is the Pro tier which costs $20/month.

# GOOD: structured (about 19 tokens)
user: {name: Jorge, location: Porto, plan: Pro, price: $20/month}
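The terse form can be generated rather than hand-written. This is a minimal sketch, assuming the context already lives in a plain dict; `compact_context` is a hypothetical helper name.

```python
def compact_context(label: str, fields: dict) -> str:
    """Render a dict as a single terse 'label: {key: value, ...}' line."""
    body = ", ".join(f"{key}: {value}" for key, value in fields.items())
    return f"{label}: {{{body}}}"


line = compact_context(
    "user",
    {"name": "Jorge", "location": "Porto", "plan": "Pro", "price": "$20/month"},
)
print(line)
```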

Specify output length explicitly

Models tend to pad responses unless told otherwise. One line in your prompt saves dozens in the output:

Respond in under 3 sentences.
Return only the JSON object, no explanation.
One-word answer only.


3. Context Window Management

The context window is the most expensive real estate in your architecture. Every token in it is billed on every call.

Trim conversation history aggressively

In multi-turn chat applications, many teams send the entire conversation history on every request. This compounds fast.

python

# Naive — sends everything
messages = conversation_history  # might be 40 turns

# Better — sliding window
MAX_HISTORY_TOKENS = 2000
messages = trim_to_token_budget(conversation_history, MAX_HISTORY_TOKENS)

Common strategies:

  • Sliding window — keep only the last N turns
  • Summarization — compress older turns into a summary block before appending new ones
  • Relevance filtering — only include turns that are semantically related to the current query (embed + cosine similarity)
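The `trim_to_token_budget` call in the snippet above is left undefined, so here is one way the sliding-window strategy might look. It is a sketch under stated assumptions: `estimate_tokens` uses a rough 4-characters-per-token heuristic instead of a real tokenizer, and messages are plain dicts with `role` and `content` keys.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def trim_to_token_budget(history: List[Message], max_tokens: int) -> List[Message]:
    """Keep the most recent messages whose estimated token total fits the budget."""
    kept: List[Message] = []
    used = 0
    # Walk backwards so the newest turns are considered first.
    for message in reversed(history):
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


conversation_history = [
    {"role": "user", "content": "a" * 400},       # ~100 estimated tokens
    {"role": "assistant", "content": "b" * 400},  # ~100 estimated tokens
    {"role": "user", "content": "c" * 400},       # ~100 estimated tokens
]
messages = trim_to_token_budget(conversation_history, 250)
print([m["role"] for m in messages])
```

A production version would likely swap in a real token count and may need to drop a leading assistant turn, since some APIs expect the first message to come from the user.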

Don’t send what the model doesn’t need

File contents, database schemas, API specs — only include what’s directly relevant to the task at hand. If you’re asking the model to fix a bug in auth.py, it doesn’t need your entire codebase.

Retrieval-Augmented Generation (RAG) is the production-grade answer here: retrieve only the relevant chunks, inject only those.


4. Prompt Caching — the Biggest Bang Per Dollar

Anthropic’s prompt caching lets you cache a prefix of your prompt and pay a fraction of the price on subsequent calls. As of 2025, reading cached input tokens costs 90% less than sending them uncached.

This is transformative for any pattern where you have a large, stable prefix: a long system instruction, a big document the model needs to reference, a RAG chunk that applies to many queries.

python

import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": your_large_stable_context,
            "cache_control": {"type": "ephemeral"},  # mark this prefix for caching
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)

How it works: the first call writes the prefix to the cache, which is billed at a premium over the normal input price (25% for the default 5-minute cache). Subsequent calls that share the exact same prefix up to the cache breakpoint pay only 10% of the normal input token cost for that prefix. The cache lives for 5 minutes by default, and the timer resets on each cache hit.
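To see how quickly this pays off, here is a back-of-the-envelope calculation. It assumes cache writes are billed at 1.25x the base input price and cache reads at 0.10x (Anthropic’s published multipliers for the 5-minute cache), and that every call after the first lands inside the cache lifetime. The function names are illustrative.

```python
def prefix_cost_uncached(prefix_tokens: int, calls: int, price_per_mtok: float) -> float:
    """USD cost of resending the same prefix on every call without caching."""
    return prefix_tokens * calls * price_per_mtok / 1_000_000


def prefix_cost_cached(
    prefix_tokens: int,
    calls: int,
    price_per_mtok: float,
    write_multiplier: float = 1.25,  # assumption: 5-minute cache write premium
    read_multiplier: float = 0.10,   # assumption: cache read rate
) -> float:
    """USD cost when the first call writes the cache and every later call hits it."""
    if calls == 0:
        return 0.0
    write = prefix_tokens * write_multiplier
    reads = prefix_tokens * read_multiplier * (calls - 1)
    return (write + reads) * price_per_mtok / 1_000_000


# A 10,000-token prefix sent 100 times at $3.00 per 1M input tokens.
uncached = prefix_cost_uncached(10_000, 100, 3.00)
cached = prefix_cost_cached(10_000, 100, 3.00)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}")
```

Under these assumptions the cached prefix already costs less on the second call: 1.25 + 0.10 = 1.35 units versus 2.00 uncached.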

Where it compounds fast:

  • System prompts with extensive instructions
  • Documents or codebases in context
  • Tool/function definitions (they can be large)
  • Multi-turn agents with a stable base context

OpenAI also supports prompt caching (called “Cached inputs”) for GPT-4o and o-series models, at a 50% discount on cached tokens.


5. Model Routing: Don’t Use a Sledgehammer for Every Nail

The single highest-leverage architectural decision: route prompts to the cheapest model that can handle the task.

Not every request needs Claude Sonnet or GPT-4o. Much of your production traffic probably falls into tiers like these:

  • Classifying intent → Haiku or GPT-4o mini
  • Extracting structured data from clean input → Haiku
  • Simple Q&A with narrow scope → Haiku
  • Code generation with reasoning → Sonnet
  • Complex multi-step planning → Opus or o3

Rough pricing comparison (as of mid-2025):

Model | Input (per 1M tokens) | Output (per 1M tokens)
Claude Haiku 3.5 | $0.80 | $4.00
Claude Sonnet 4 | $3.00 | $15.00
Claude Opus 4 | $15.00 | $75.00
GPT-4o mini | $0.15 | $0.60
GPT-4o | $2.50 | $10.00

Routing 60% of your traffic to Haiku while keeping complex tasks on Sonnet can cut your bill by 40-50%, assuming requests are similar in size, without degrading user-facing quality on the simple calls.
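The arithmetic behind that estimate is short. This sketch assumes every request uses roughly the same number of tokens, so traffic share equals token share, and it uses the Haiku and Sonnet input prices from the table above. `blended_cost_ratio` is an illustrative name.

```python
def blended_cost_ratio(cheap_share: float, cheap_price: float, default_price: float) -> float:
    """Cost of routed traffic relative to sending everything to the default model."""
    return (1 - cheap_share) + cheap_share * (cheap_price / default_price)


# 60% of traffic to Haiku ($0.80 per 1M input), the rest on Sonnet ($3.00 per 1M input).
ratio = blended_cost_ratio(0.60, 0.80, 3.00)
savings = 1 - ratio
print(f"bill is {ratio:.0%} of the all-Sonnet baseline, saving {savings:.0%}")
```

The output prices in the table have the same Haiku-to-Sonnet ratio ($4.00 / $15.00), so the result does not depend on the input/output mix.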

A basic router in Python:

python

import re

def route_prompt(prompt: str) -> str:
    if re.search(r'\b(classify|is this|yes or no|extract)\b', prompt, re.I):
        return "claude-3-5-haiku-20241022"
    if re.search(r'\b(plan|design|architecture|strategy)\b', prompt, re.I):
        return "claude-opus-4-20250514"
    return "claude-sonnet-4-20250514"  # default

For production, consider an embedding-based classifier or a cheap LLM call to classify intent before routing to the target model.


6. Structured Outputs Reduce Output Tokens

When you need structured data back, tell the model to return only the structure. No preamble, no explanation, no “Here’s the JSON you requested:”.

Anthropic — enforce JSON via prompt:

python

system = "You are a data extractor. Respond ONLY with valid JSON. No explanation."

OpenAI: use JSON mode, which guarantees syntactically valid JSON at the API level (to enforce a specific schema, use the json_schema response format instead):

python

response = openai.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[...]
)

For Anthropic, prefilling the assistant turn is a reliable trick to cut output tokens:

python

messages = [
    {"role": "user", "content": "Extract: name, email, plan from this text: ..."},
    {"role": "assistant", "content": "{"}  # model continues from here
]

The model skips any preamble and starts directly inside the JSON object.
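One practical detail: the API returns only the continuation, so the prefilled `{` has to be put back before parsing. A minimal sketch follows. The `continuation` string is a made-up stand-in for `response.content[0].text`, and `parse_prefilled_json` is a hypothetical helper.

```python
import json


def parse_prefilled_json(prefill: str, continuation: str) -> dict:
    """Rejoin the prefilled text with the model's continuation and parse the result."""
    return json.loads(prefill + continuation)


# Hypothetical continuation, standing in for response.content[0].text.
continuation = '"name": "Jorge", "email": "jorge@example.com", "plan": "Pro"}'
record = parse_prefilled_json("{", continuation)
print(record["plan"])
```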


7. Batch API for Non-Realtime Workloads

If you have workloads that don’t need a synchronous response — evals, bulk data processing, nightly report generation — the Batch API is a straightforward 50% discount.

Anthropic Message Batches:

python

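import anthropic

client = anthropic.Anthropic()
batch = client.messages.batches.create(
    requests=[
        # Each params dict takes the same fields as a regular Messages call.
        {"custom_id": "req-1", "params": {"model": "...", "messages": [...]}},
        {"custom_id": "req-2", "params": {"model": "...", "messages": [...]}},
    ]
)
# Poll for results later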

Results are available within 24 hours. Same models, same quality, half the price.
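The "poll for results later" step might look like the sketch below. It assumes the Anthropic Python SDK's `client.messages.batches.retrieve` call and a `processing_status` of `"ended"` once the batch is done. `wait_for_batch` is a hypothetical helper, and the fake client exists only so the control flow can be run offline.

```python
import time
from types import SimpleNamespace


def wait_for_batch(client, batch_id: str, poll_seconds: float = 60.0, sleep=time.sleep):
    """Poll until the batch reports processing_status == "ended", then return it."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        sleep(poll_seconds)


# Offline stand-in for the SDK client, only to demonstrate the loop.
class _FakeBatches:
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def retrieve(self, batch_id):
        return SimpleNamespace(id=batch_id, processing_status=next(self._statuses))


fake_client = SimpleNamespace(
    messages=SimpleNamespace(batches=_FakeBatches(["in_progress", "in_progress", "ended"]))
)
naps = []  # records each requested sleep instead of actually sleeping
done = wait_for_batch(fake_client, "batch_123", poll_seconds=5, sleep=naps.append)
print(done.processing_status, len(naps))
```

With a real client you would pass `batch.id` from the create call and then iterate `client.messages.batches.results(batch.id)`, matching each entry back to its request via `custom_id`.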


8. Few-Shot Examples: Quality vs. Cost Tradeoff

Few-shot examples often improve output quality substantially, but they are expensive: each example consumes input tokens on every call.

Strategies to get the benefit without the full cost:

  1. Cache the examples — if your few-shot block is stable, it’s a perfect candidate for prompt caching (see section 4).
  2. Use one example instead of five — diminishing returns kick in fast. Test whether 1 example gets you 80% of the quality improvement of 5.
  3. Reference instead of repeat — for long or complex examples, store them in a vector DB and retrieve only the most relevant one based on the current input.
  4. Zero-shot with better instructions — sometimes a clearer instruction eliminates the need for examples entirely. Test zero-shot before adding shots.
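Strategy 3 can be sketched without any infrastructure. The example below is an illustration under loud assumptions: `embed` is a toy word-count stand-in for a real embedding model, the example store is an in-memory list rather than a vector DB, and all names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pick_example(query: str, examples: list) -> dict:
    """Return the stored example whose input is most similar to the query."""
    query_vector = embed(query)
    return max(examples, key=lambda ex: cosine(query_vector, embed(ex["input"])))


examples = [
    {"input": "refund for a duplicate charge on my card", "output": "billing"},
    {"input": "app crashes when I open settings", "output": "bug"},
    {"input": "how do I export my data to csv", "output": "how-to"},
]
best = pick_example("I was charged twice and want a refund", examples)
print(best["output"])
```

Only the selected example goes into the prompt, so the per-call cost is one shot instead of the whole library.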


The Checklist

Before going to production, run through this:

  •  Are you logging input_tokens and output_tokens on every call?
  •  Do you have a per-call cost estimate in your observability stack?
  •  Is your system prompt cached if it’s over ~1,000 tokens?
  •  Is your conversation history trimmed to a token budget?
  •  Are trivial/classification tasks routed to a cheaper model?
  •  Are your output formats constrained (JSON only, max length)?
  •  Could any batch of calls move to the Batch API?
  •  Have you tested with fewer few-shot examples?

Each “no” is a line item on your bill.


Tools Worth Knowing

  • LangSmith — observability for LLM calls, including token usage per trace: smith.langchain.com
  • Helicone — LLM proxy with cost tracking, caching, and rate limiting: helicone.ai
  • Portkey — routing, fallbacks, and token analytics across providers: portkey.ai
  • Braintrust — evals + cost tracking, good for measuring quality/cost tradeoffs: braintrust.dev
  • Tiktoken — count OpenAI tokens before sending: github.com/openai/tiktoken

Token efficiency isn’t premature optimization — it’s the kind of engineering discipline that separates a prototype that works from a product that scales. The good news: most of the gains come from a handful of decisions made early, not from micro-optimizing every prompt.

Start with instrumentation. Everything else follows from seeing the data.


What’s the technique that moved the needle most for you? Leave a comment — I read everything.
