If you've looked at LLM API pricing, you've noticed the asymmetry: output tokens cost significantly more than input tokens. With GPT-4o, input is $2.50 per million tokens while output is $10 — a 4x multiplier. Claude 3.5 Sonnet charges $3 input vs $15 output — a 5x gap. Understanding why this happens and how to control it is one of the fastest ways to cut your API bill.
Why Output Tokens Cost More
Input tokens are processed in parallel. The model reads your entire prompt at once using matrix operations that GPUs handle efficiently. Output tokens are generated one at a time, sequentially. Each new token requires a full forward pass through the model, and the model must attend to all previous tokens (input + already-generated output) to produce the next one.
This sequential generation is computationally expensive. It ties up GPU memory and compute for the entire duration of the response. That's why providers charge a premium for output — it costs them more to produce.
The Pricing Landscape
Here are the input and output prices (per million tokens) and the resulting output-to-input cost ratio for the major models:
- GPT-4o: $2.50 / $10.00 → 4x
- GPT-4o-mini: $0.15 / $0.60 → 4x
- Claude 3.5 Sonnet: $3.00 / $15.00 → 5x
- Claude 3.5 Haiku: $0.80 / $4.00 → 5x
- Gemini 1.5 Pro: $1.25 / $5.00 → 4x
- o3: $10.00 / $40.00 → 4x (reasoning tokens are billed as output)
The ratio is consistently 4–5x across providers. This means a request that generates 2,000 output tokens costs as much as reading 8,000–10,000 input tokens.
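To make the equivalence concrete, here's a quick back-of-the-envelope calculation using the GPT-4o prices from the list above (prices hardcoded for illustration; check current pricing before relying on them):

# GPT-4o prices in dollars per token
INPUT_PRICE = 2.50 / 1_000_000
OUTPUT_PRICE = 10.00 / 1_000_000

output_tokens = 2_000
output_cost = output_tokens * OUTPUT_PRICE        # $0.02
equivalent_input = output_cost / INPUT_PRICE      # 8,000 input tokens
print(f"{output_tokens} output tokens cost the same as {equivalent_input:,.0f} input tokens")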
Techniques to Minimize Output Tokens
1. Set max_tokens Explicitly
Always set max_tokens (or max_completion_tokens) to the minimum needed for your use case. Without it, the model may generate far more text than necessary.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=150,  # Cap the response length
)
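If responses keep hitting the cap, check whether finish_reason on the returned choice is "length" and raise the limit deliberately, rather than letting answers truncate silently.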
2. Ask for Structured Output
Instead of letting the model write prose, request JSON or a specific format. A classification task that returns {"label": "positive", "confidence": 0.92} uses ~15 tokens. The same answer as prose — "The sentiment of this text is positive with a confidence of approximately 92%" — uses ~20 tokens. The savings compound across thousands of requests.
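As a rough sketch, the constraint can live entirely in the system prompt, optionally backed by JSON mode so the model can't wander into prose (the field names and prompt wording here are illustrative, not a fixed schema):

# Sketch: pin the output shape in the system prompt; JSON mode keeps it terse.
messages = [
    {
        "role": "system",
        "content": (
            "Classify the sentiment of the user's text. "
            'Respond with only JSON: {"label": "positive|negative|neutral", "confidence": <0-1>}'
        ),
    },
    {"role": "user", "content": "The onboarding flow was quick and painless."},
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    response_format={"type": "json_object"},  # JSON mode; requires "JSON" in the prompt
    max_tokens=30,
)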
3. Use "Answer Only" Instructions
Models tend to explain their reasoning unless told not to. Add explicit instructions:
# Instead of:
"What is the capital of France?"
# Use:
"What is the capital of France? Reply with only the city name."
The first prompt might generate 30+ tokens of explanation. The second generates 1–2 tokens.
4. Use Enums and Constrained Outputs
When the answer is one of a fixed set of values, use OpenAI's response_format with a JSON schema or Anthropic's tool use to constrain the output:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "classification",
            "strict": True,  # enforce the schema exactly
            "schema": {
                "type": "object",
                "properties": {
                    "category": {
                        "type": "string",
                        "enum": ["bug", "feature", "question"]
                    }
                },
                "required": ["category"],
                "additionalProperties": False
            }
        }
    }
)
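On the Anthropic side, the equivalent trick is a tool whose input schema carries the enum, with tool_choice forcing the call. This is a hedged sketch (the tool name, model string, and example ticket are placeholders):

import anthropic

anthropic_client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

response = anthropic_client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; use whichever Claude model you run
    max_tokens=50,
    tools=[{
        "name": "classify_ticket",  # hypothetical tool name
        "description": "Record the category of a support ticket.",
        "input_schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string", "enum": ["bug", "feature", "question"]}
            },
            "required": ["category"],
        },
    }],
    tool_choice={"type": "tool", "name": "classify_ticket"},  # force the tool call
    messages=[{"role": "user", "content": "The app crashes when I upload a PNG."}],
)

# The constrained answer comes back as a tool_use block, not prose
category = response.content[0].input["category"]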
5. Batch Similar Requests
Instead of making 10 separate API calls that each produce output overhead (JSON wrappers, repeated phrasing), send 10 items in one request and get a single structured response. The per-item output overhead drops significantly.
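Here is a sketch of that pattern, reusing the json_schema approach from above so each item costs only its label (the schema name, prompt, and example tickets are illustrative):

import json

tickets = [
    "App crashes when I upload a PNG",
    "Please add dark mode",
    "How do I export my data?",
]  # ...up to a batch of ~10 items

# One request, one schema-constrained array: each item adds a label,
# not a full JSON wrapper and preamble per call.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Classify each ticket as bug, feature, or question."},
        {"role": "user", "content": json.dumps(tickets)},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "batch_classification",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "labels": {
                        "type": "array",
                        "items": {"type": "string", "enum": ["bug", "feature", "question"]},
                    }
                },
                "required": ["labels"],
                "additionalProperties": False,
            },
        },
    },
)

labels = json.loads(response.choices[0].message.content)["labels"]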
The 80/20 Rule of Token Costs
In most applications, output tokens account for 60–80% of the total cost even though they're a smaller portion of the total token count. Optimizing output length is almost always higher-leverage than trimming input prompts.
Focus on output first. A 50% reduction in output tokens saves more money than a 50% reduction in input tokens at every major provider.
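As a quick sanity check, here's the same comparison at GPT-4o prices for a hypothetical request shape (1,000 input and 500 output tokens; the numbers are illustrative only):

# Hypothetical request: 1,000 input + 500 output tokens at GPT-4o prices
input_cost = 1_000 * 2.50 / 1_000_000    # $0.0025
output_cost = 500 * 10.00 / 1_000_000    # $0.0050 -> ~67% of the total
print(f"halve output: save ${output_cost / 2:.5f}; halve input: save ${input_cost / 2:.5f}")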