In development, hitting a token limit means a failed test. In production, it means a broken user experience, a lost customer request, or silent data loss. Building robust token limit handling is essential for any application that relies on LLM APIs. Here are the patterns that work.
The Three Failure Modes
Token limits can bite you in three ways:
- Input too long: Your prompt exceeds the model's context window. The API returns a 400 error.
- Output truncated: The model's response hits max_tokens and gets cut off mid-sentence. You get partial, unusable output.
- Combined overflow: Input + output together exceed the context window. The model starts generating but runs out of space, producing a short or degraded response.
Pattern 1: Pre-flight Token Check
Count tokens before sending the request. If the input is too large, truncate or chunk it before the API call — not after.
import tiktoken
from openai import OpenAI

client = OpenAI()

class TokenBudgetExceeded(Exception):
    """Raised when the input leaves no meaningful room for output."""

def safe_completion(messages, model="gpt-4o", max_output=4096):
    enc = tiktoken.encoding_for_model(model)
    context_limit = 128_000  # GPT-4o context window

    # Count input tokens (approximate per-message chat overhead)
    input_tokens = sum(
        len(enc.encode(m["content"])) + 4  # message overhead
        for m in messages
    ) + 3  # reply priming

    available = context_limit - input_tokens
    if available < max_output:
        if available < 100:
            raise TokenBudgetExceeded(
                f"Input uses {input_tokens} tokens, "
                f"only {available} left for output"
            )
        max_output = available  # Reduce output budget

    return client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_output
    )
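Calling it looks just like a direct API call, with one extra failure mode to handle. A minimal sketch: safe_completion and TokenBudgetExceeded are the names defined above, and long_ticket_text is a placeholder for whatever oversized input you are passing in.

try:
    response = safe_completion(
        [{"role": "user", "content": long_ticket_text}],
        max_output=2048,
    )
    answer = response.choices[0].message.content
except TokenBudgetExceeded:
    # Input alone nearly fills the window: truncate or chunk it (Patterns 2 and 3)
    ...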
Pattern 2: Smart Truncation
When input is too long, don't just chop off the end. Different content types need different truncation strategies:
def truncate_to_budget(text, max_tokens, model="gpt-4o",
                       strategy="tail"):
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text

    if strategy == "tail":
        # Keep the end (good for conversations)
        tokens = tokens[-max_tokens:]
    elif strategy == "head":
        # Keep the beginning (good for documents)
        tokens = tokens[:max_tokens]
    elif strategy == "middle_out":
        # Keep start and end, drop middle
        half = max_tokens // 2
        tokens = tokens[:half] + tokens[-half:]

    return enc.decode(tokens)
For conversations, keep the most recent messages (tail). For documents, keep the beginning, which usually carries the most important context (head). For code, keep the opening signatures and the end where the logic concludes (middle_out).
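To make that mapping concrete, here is how the same budget might be applied to different content types; chat_log, doc_text, and source_file are illustrative variables, not part of the function above.

recent_chat  = truncate_to_budget(chat_log, 8_000, strategy="tail")
doc_excerpt  = truncate_to_budget(doc_text, 8_000, strategy="head")
code_excerpt = truncate_to_budget(source_file, 8_000, strategy="middle_out")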
Pattern 3: Cascading Model Fallback
When a request is too large for one model, fall back to a model with a larger context window:
# call_model is whatever wrapper dispatches to the right provider SDK;
# TokenLimitError and AllModelsFailed are application-defined exceptions.
MODEL_CASCADE = [
    {"model": "gpt-4o-mini", "limit": 128_000, "cost": "low"},
    {"model": "gpt-4o", "limit": 128_000, "cost": "medium"},
    {"model": "gemini-1.5-pro", "limit": 2_000_000, "cost": "medium"},
]

async def completion_with_fallback(messages, input_tokens):
    for config in MODEL_CASCADE:
        # Leave headroom for the response, not just the input
        if input_tokens < config["limit"] - 4096:
            try:
                return await call_model(
                    config["model"], messages
                )
            except TokenLimitError:
                continue
    raise AllModelsFailed("Input too large for all models")
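The caller passes in the pre-flight count from Pattern 1 so the cascade can skip models the prompt cannot fit into. A minimal sketch, assuming the tiktoken count is an acceptable approximation across providers:

async def handle_request(messages):
    # Counting with the GPT-4o tokenizer is approximate for non-OpenAI models,
    # but close enough for a budget check
    enc = tiktoken.encoding_for_model("gpt-4o")
    input_tokens = sum(len(enc.encode(m["content"])) + 4 for m in messages) + 3

    try:
        return await completion_with_fallback(messages, input_tokens)
    except AllModelsFailed:
        # Last resort: shrink the input with Pattern 2, then retry the cascade
        raise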
Pattern 4: Detect Truncated Output
Always check the finish_reason in the API response. If it's "length" instead of "stop", the output was cut off:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=500
)

choice = response.choices[0]
if choice.finish_reason == "length":
    # Output was truncated — handle it
    logger.warning("Response truncated",
                   extra={"usage": response.usage})
    # Option A: Retry with higher max_tokens
    # Option B: Ask model to continue
    # Option C: Return partial result with warning
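Option B is worth spelling out, because the continuation request has to carry the partial output back to the model. A minimal sketch, assuming the client and messages from above; the round cap and the wording of the continuation prompt are arbitrary choices.

def complete_with_continuation(messages, max_rounds=3):
    parts = []
    history = list(messages)
    for _ in range(max_rounds):
        response = client.chat.completions.create(
            model="gpt-4o", messages=history, max_tokens=500
        )
        choice = response.choices[0]
        parts.append(choice.message.content)
        if choice.finish_reason != "length":
            break
        # Feed the partial answer back and ask the model to pick up where it stopped
        history.append({"role": "assistant", "content": choice.message.content})
        history.append({"role": "user", "content": "Continue exactly where you left off."})
    return "".join(parts)

The joined output can have a rough seam at each truncation point, so for structured output such as JSON it is usually safer to retry with a larger max_tokens (Option A) instead.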
Pattern 5: Token Budget Middleware
In production systems, wrap your LLM calls in middleware that enforces budgets and logs usage:
class TokenBudgetMiddleware:
    def __init__(self, daily_limit=10_000_000):
        self.daily_limit = daily_limit
        self.used_today = 0  # would be reset once a day, e.g. by a scheduled job

    async def call(self, messages, **kwargs):
        # count_tokens is the pre-flight counter from Pattern 1
        input_tokens = self.count_tokens(messages)
        if self.used_today + input_tokens > self.daily_limit:
            raise DailyBudgetExceeded(
                f"Used {self.used_today:,} of "
                f"{self.daily_limit:,} daily tokens"
            )

        response = await self.client.create(
            messages=messages, **kwargs
        )
        total = response.usage.total_tokens
        self.used_today += total
        return response
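Used from request handlers, the middleware becomes the single choke point for every LLM call. In this sketch the instance name llm and the fallback message are illustrative, and DailyBudgetExceeded is the exception raised above.

llm = TokenBudgetMiddleware(daily_limit=5_000_000)

async def answer(question: str) -> str:
    try:
        response = await llm.call(
            [{"role": "user", "content": question}],
            model="gpt-4o",
            max_tokens=1024,
        )
        return response.choices[0].message.content
    except DailyBudgetExceeded:
        # Degrade gracefully: queue the request or route to a cheaper model
        return "We're at capacity right now. Please try again shortly."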
Key Takeaways
- Always count tokens before the API call, not after
- Check finish_reason on every response to catch truncation
- Use strategy-appropriate truncation, not blind character slicing
- Build model fallback chains for handling oversized inputs
- Log token usage per request for monitoring and alerting
The best production systems never let a token limit error reach the user. They anticipate, adapt, and degrade gracefully.