Once your AI application moves past prototyping, you need a system to track token usage, enforce spending limits, and route requests to the most cost-effective model. Without it, a single runaway feature or a spike in traffic can blow through your monthly budget in hours. Here's how to architect a token budget system from the ground up.
Core Architecture
A token budget system has four components:
- Usage Tracker: Records every token consumed, broken down by user, feature, and model
- Budget Enforcer: Checks limits before each request and rejects calls that would exceed the budget
- Model Router: Selects the cheapest model that can handle the request
- Dashboard: Provides visibility into spending patterns and alerts
1. Usage Tracker
Every LLM call should pass through a wrapper that logs token usage:
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TokenUsageRecord:
    timestamp: datetime
    user_id: str
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

class UsageTracker:
    # Pricing in USD per million tokens
    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-sonnet": {"input": 3.00, "output": 15.00},
    }

    def __init__(self, storage):
        self.storage = storage  # DB, Redis, etc.

    def record(self, user_id, feature, model,
               input_tokens, output_tokens):
        pricing = self.PRICING[model]
        cost = (
            input_tokens * pricing["input"] / 1_000_000 +
            output_tokens * pricing["output"] / 1_000_000
        )
        record = TokenUsageRecord(
            timestamp=datetime.now(timezone.utc),
            user_id=user_id,
            feature=feature,
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost_usd=cost,
        )
        self.storage.insert(record)
        return record
2. Budget Enforcer
Check budgets at multiple levels — per-user, per-feature, and global — before allowing a request:
class BudgetExceeded(Exception):
    """Raised when a request would push spending past a configured limit."""

class BudgetEnforcer:
    def __init__(self, storage):
        self.storage = storage
        self.limits = {
            "user_daily_usd": 5.00,
            "feature_daily_usd": 100.00,
            "global_daily_usd": 500.00,
        }

    def check(self, user_id, feature, estimated_cost):
        """Raise BudgetExceeded if any budget would be exceeded."""
        # Per-user daily limit
        user_spent = self.storage.get_daily_spend(user_id=user_id)
        if user_spent + estimated_cost > self.limits["user_daily_usd"]:
            raise BudgetExceeded(
                f"User {user_id} daily limit reached: "
                f"${user_spent:.2f} / "
                f"${self.limits['user_daily_usd']:.2f}"
            )

        # Per-feature daily limit
        feature_spent = self.storage.get_daily_spend(feature=feature)
        if feature_spent + estimated_cost > self.limits["feature_daily_usd"]:
            raise BudgetExceeded(
                f"Feature '{feature}' daily limit reached"
            )

        # Global daily limit
        global_spent = self.storage.get_daily_spend()
        if global_spent + estimated_cost > self.limits["global_daily_usd"]:
            raise BudgetExceeded("Global daily budget exceeded")
3. Model Router
Route each request to the cheapest model that meets the quality requirements:
class NoSuitableModel(Exception):
    """Raised when no configured model can handle the request."""

class ModelRouter:
    # cost_per_1k is the combined input + output price per 1k tokens,
    # used only for ranking models relative to each other.
    MODELS = [
        {
            "name": "gpt-4o-mini",
            "cost_per_1k": 0.00015 + 0.0006,
            "max_context": 128_000,
            "capabilities": ["classification", "extraction",
                             "summarization", "simple_qa"],
        },
        {
            "name": "gpt-4o",
            "cost_per_1k": 0.0025 + 0.01,
            "max_context": 128_000,
            "capabilities": ["classification", "extraction",
                             "summarization", "simple_qa",
                             "complex_reasoning", "code_gen"],
        },
        {
            "name": "claude-sonnet",
            "cost_per_1k": 0.003 + 0.015,
            "max_context": 200_000,
            "capabilities": ["classification", "extraction",
                             "summarization", "simple_qa",
                             "complex_reasoning", "code_gen",
                             "long_context"],
        },
    ]

    def select(self, task_type, input_tokens):
        """Pick the cheapest model that can handle this task."""
        candidates = [
            m for m in self.MODELS
            if task_type in m["capabilities"]
            and input_tokens < m["max_context"] - 4096  # reserve room for output
        ]
        if not candidates:
            raise NoSuitableModel(
                f"No model supports '{task_type}' "
                f"with {input_tokens} tokens"
            )
        # Sort by cost, pick cheapest
        return min(candidates, key=lambda m: m["cost_per_1k"])
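In practice, a 3,000-token summarization request routes to gpt-4o-mini, a code-generation request of the same size skips it and lands on gpt-4o, and only claude-sonnet qualifies for the long-context task:

router = ModelRouter()
print(router.select("summarization", 3_000)["name"])   # gpt-4o-mini
print(router.select("code_gen", 3_000)["name"])        # gpt-4o
print(router.select("long_context", 150_000)["name"])  # claude-sonnet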
4. Putting It Together
Wrap everything into a single gateway that your application calls instead of the raw API:
class AIGateway:
    def __init__(self, storage):
        self.tracker = UsageTracker(storage)
        self.enforcer = BudgetEnforcer(storage)
        self.router = ModelRouter()

    async def complete(self, user_id, feature, task_type,
                       messages, **kwargs):
        # 1. Count input tokens
        input_tokens = count_tokens(messages)

        # 2. Route to the cheapest suitable model
        model_config = self.router.select(task_type, input_tokens)

        # 3. Estimate cost and check budgets
        estimated_cost = estimate_cost(
            model_config, input_tokens, max_output=4096
        )
        self.enforcer.check(user_id, feature, estimated_cost)

        # 4. Make the API call
        response = await call_llm(
            model_config["name"], messages, **kwargs
        )

        # 5. Record actual usage
        self.tracker.record(
            user_id, feature, model_config["name"],
            response.usage.prompt_tokens,
            response.usage.completion_tokens,
        )
        return response
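The gateway leans on three helpers that are left app-specific here: count_tokens, estimate_cost, and call_llm (the wrapper around your provider SDK). Below is one minimal sketch of the first two, assuming the tiktoken package and its o200k_base encoding as a rough tokenizer; the per-message overhead and the worst-case output assumption are approximations, not exact provider accounting:

import tiktoken

# o200k_base matches gpt-4o; treat it as an approximation for other providers.
_ENCODING = tiktoken.get_encoding("o200k_base")

def count_tokens(messages):
    # Sum tokens across message contents, plus a rough per-message
    # allowance for chat formatting overhead.
    return sum(len(_ENCODING.encode(m["content"])) + 4 for m in messages)

def estimate_cost(model_config, input_tokens, max_output):
    # Worst case: assume the response uses the full output budget.
    pricing = UsageTracker.PRICING[model_config["name"]]
    return (input_tokens * pricing["input"] +
            max_output * pricing["output"]) / 1_000_000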
Alerting and Monitoring
Set up alerts at key thresholds to catch problems early (a minimal check is sketched after this list):
- 50% of daily budget: Informational alert — check if usage is on track
- 80% of daily budget: Warning — investigate if this is expected
- Single request > $1: Immediate alert — likely a bug or abuse
- Per-user spike: Alert when a user's hourly usage exceeds 10x their average
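A minimal sketch of how the first three thresholds could be evaluated, assuming an alert(level, message) callable that hands off to your paging or chat tooling; the per-user spike check additionally needs per-user hourly aggregates and is omitted here:

DAILY_BUDGET_USD = 500.00  # mirrors the global_daily_usd limit above

def check_alerts(storage, alert):
    # Run on a schedule (cron job or background worker).
    spent = storage.get_daily_spend()
    if spent >= 0.8 * DAILY_BUDGET_USD:
        alert("warning", f"80% of daily budget used: ${spent:.2f}")
    elif spent >= 0.5 * DAILY_BUDGET_USD:
        alert("info", f"50% of daily budget used: ${spent:.2f}")

def check_request_alert(record, alert):
    # Call right after UsageTracker.record() returns.
    if record.cost_usd > 1.00:
        alert("critical",
              f"Single request cost ${record.cost_usd:.2f} "
              f"(user={record.user_id}, feature={record.feature})")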
Store usage data in a time-series database or append-only log. You'll want to query by user, feature, model, and time range for cost attribution and optimization.
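Against the SQLite sketch above (any time-series or columnar store works the same way), cost attribution is a single aggregation; the table and column names are the same assumptions as before:

def cost_by_feature(conn, days=7):
    # Spend per feature and model over the last N days, largest first.
    return conn.execute(
        """SELECT feature, model, SUM(cost_usd) AS spend
           FROM token_usage
           WHERE substr(timestamp, 1, 10) >= date('now', ?)
           GROUP BY feature, model
           ORDER BY spend DESC""",
        (f"-{days} days",),
    ).fetchall()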
Key Design Decisions
- Pre-check vs post-check: Always check budgets before the API call. Post-check only catches overages after you've already spent the money.
- Estimated vs actual cost: Use estimated cost for budget checks (fast), record actual cost from the API response (accurate).
- Graceful degradation: When a budget is exceeded, don't just error. Downgrade to a cheaper model, reduce max_tokens, or queue the request for later, as in the sketch below.
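One way to wire the last point into the gateway, shown as a sketch rather than a drop-in addition; the retry queue and the deferred-response shape are assumptions, and a cheaper-model or smaller max_tokens retry could be slotted in before the deferral:

import asyncio

retry_queue = asyncio.Queue()  # drained later by a background worker (not shown)

async def complete_or_defer(gateway, user_id, feature, task_type, messages):
    # Try the request now; if a budget is hit, defer it instead of failing.
    try:
        return await gateway.complete(user_id, feature, task_type, messages)
    except BudgetExceeded as exc:
        await retry_queue.put((user_id, feature, task_type, messages))
        return {"status": "deferred", "reason": str(exc)}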
A token budget system isn't optional at scale — it's the difference between a predictable $500/month bill and a surprise $5,000 invoice.