OpenAI's reasoning models — o3 and o4-mini — produce a type of token you never see in the response but always pay for: reasoning tokens. These "thinking" tokens are the model's internal chain-of-thought, and they can multiply your actual cost by 5–20x compared to what you'd expect from the visible output alone.

What Are Reasoning Tokens?

When you send a request to o3 or o4-mini, the model doesn't jump straight to an answer. It first generates an internal reasoning chain — a step-by-step thought process that works through the problem. These intermediate tokens are called reasoning tokens (sometimes called "thinking tokens").

You don't see them in the response. The API returns only the final answer. But the reasoning tokens are generated sequentially just like output tokens, they consume compute, and they appear on your bill.

How Reasoning Tokens Are Billed

Reasoning tokens are billed at the output token rate, which is the expensive side of the pricing split:

  • o3: $10/M input, $40/M output (reasoning tokens billed at $40/M)
  • o4-mini: $1.10/M input, $4.40/M output (reasoning tokens billed at $4.40/M)

The API response includes a usage object that breaks this down:

{
  "usage": {
    "prompt_tokens": 150,
    "completion_tokens": 8500,
    "completion_tokens_details": {
      "reasoning_tokens": 8200,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  }
}

In this example, the visible response is only 300 tokens (8,500 completion tokens minus 8,200 reasoning tokens), but your bill reflects all 8,500 completion tokens at the output rate.
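
The visible output count never appears as its own field; you derive it by subtraction. A minimal sketch in Python, working from the payload above as a plain dict:

# Derive the visible token count from the usage payload above.
usage = {
    "prompt_tokens": 150,
    "completion_tokens": 8500,
    "completion_tokens_details": {"reasoning_tokens": 8200},
}

reasoning = usage["completion_tokens_details"]["reasoning_tokens"]
visible = usage["completion_tokens"] - reasoning
print(f"visible: {visible}, hidden reasoning: {reasoning}")
# -> visible: 300, hidden reasoning: 8200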

The Cost Surprise

Here's a real scenario. You ask o3 to solve a coding problem. The visible answer is 200 tokens. But the model used 5,000 reasoning tokens to work through the solution. Your actual cost:

  • Input: 150 tokens × $10/M = $0.0015
  • Output (visible): 200 tokens × $40/M = $0.008
  • Reasoning: 5,000 tokens × $40/M = $0.20
  • Total: $0.21 — and 95% of the cost is reasoning tokens you never see

Compare this to GPT-4o for the same task: 150 input + 200 output = $0.0024. The reasoning model costs roughly 88x as much for this request.
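
That arithmetic is worth automating. Here is a small helper, a sketch that hardcodes the per-million rates quoted in this article (rates change, so treat these numbers as assumptions and verify them against the current pricing page):

# Rough per-request cost in dollars from token counts.
# Reasoning tokens are already folded into completion_tokens and
# bill at the output rate, so no separate term is needed.
RATES = {  # dollars per million tokens, per the figures above
    "o3":      {"input": 10.00, "output": 40.00},
    "o4-mini": {"input": 1.10,  "output": 4.40},
    "gpt-4o":  {"input": 2.50,  "output": 10.00},
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    r = RATES[model]
    return (prompt_tokens * r["input"] + completion_tokens * r["output"]) / 1_000_000

print(request_cost("o3", 150, 5200))     # ~0.2095, the scenario above
print(request_cost("gpt-4o", 150, 200))  # ~0.0024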

How to Control Reasoning Token Usage

1. Use reasoning_effort

OpenAI provides a reasoning_effort parameter that controls how much thinking the model does:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3",
    reasoning_effort="low",  # "low", "medium", or "high"
    messages=[{"role": "user", "content": prompt}]
)

Setting reasoning_effort to "low" can reduce reasoning tokens by 50–80% for simpler tasks. Use "high" only for genuinely complex problems like math proofs or multi-step code generation.
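
If you're unsure which level a task needs, it is cheap to measure rather than guess. A quick sketch that runs one prompt at each effort level and logs the reasoning spend (task_prompt is a placeholder, and client is the OpenAI client from above):

# Measure reasoning-token usage across effort levels for one prompt.
for effort in ("low", "medium", "high"):
    resp = client.chat.completions.create(
        model="o4-mini",  # cheaper model for experiments
        reasoning_effort=effort,
        messages=[{"role": "user", "content": task_prompt}],
    )
    details = resp.usage.completion_tokens_details
    visible = resp.usage.completion_tokens - details.reasoning_tokens
    print(f"{effort:>6}: {details.reasoning_tokens} reasoning, {visible} visible")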

2. Use max_completion_tokens

The max_completion_tokens parameter caps the total output including reasoning tokens. If you set it to 2,000, the model must fit both its thinking and its answer within that budget:

response = client.chat.completions.create(
    model="o3",
    max_completion_tokens=2000,
    messages=[{"role": "user", "content": prompt}]
)

Be careful: if the cap is too low, reasoning can consume the entire budget before any answer is written, and you get back a truncated or empty response while still paying for every reasoning token generated.
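
That failure mode is worth an explicit guard: when reasoning eats the whole budget, the choice comes back with finish_reason "length" and an empty message. A defensive sketch, continuing from the response above:

# Detect a response whose budget was consumed entirely by reasoning.
choice = response.choices[0]
if choice.finish_reason == "length" and not choice.message.content:
    # No visible answer was produced, but the reasoning tokens were
    # still billed. Retry once with a larger budget.
    response = client.chat.completions.create(
        model="o3",
        max_completion_tokens=8000,  # assumed retry budget
        messages=[{"role": "user", "content": prompt}]
    )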

3. Route by Task Complexity

Don't use reasoning models for everything. Build a router that sends simple tasks to GPT-4o and only uses o3/o4-mini for tasks that genuinely benefit from extended reasoning (a minimal router sketch follows the list):

  • GPT-4o: Classification, summarization, extraction, simple Q&A
  • o4-mini: Multi-step logic, code debugging, math problems
  • o3: Complex research, novel problem-solving, competition-level tasks
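
A sketch along those lines. The keyword heuristic here is deliberately crude and entirely illustrative; in production you would route on a cheap classifier call or on historical usage data:

# Route each prompt to the cheapest model that plausibly suffices.
# The keyword lists are hypothetical placeholders, not a real taxonomy.
HARD_HINTS = ("research", "novel", "prove", "competition")
REASONING_HINTS = ("debug", "step by step", "derive", "multi-step")

def pick_model(prompt: str) -> str:
    p = prompt.lower()
    if any(h in p for h in HARD_HINTS):
        return "o3"
    if any(h in p for h in REASONING_HINTS):
        return "o4-mini"
    return "gpt-4o"  # default: no extended reasoning needed

response = client.chat.completions.create(
    model=pick_model(prompt),
    messages=[{"role": "user", "content": prompt}]
)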

Monitoring Reasoning Token Spend

Always log the full usage object, including completion_tokens_details. Track the ratio of reasoning tokens to visible output tokens. If a task consistently uses 10,000+ reasoning tokens for a 100-token answer, it's a candidate for prompt optimization or model downgrade.
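
A logging sketch in that spirit; the 10,000/100 thresholds mirror the rule of thumb above, and print stands in for whatever metrics pipeline you already have:

# Surface tasks whose hidden reasoning spend dwarfs their visible output.
def log_reasoning_ratio(task_name: str, response) -> None:
    usage = response.usage
    reasoning = usage.completion_tokens_details.reasoning_tokens
    visible = usage.completion_tokens - reasoning
    ratio = reasoning / max(visible, 1)  # guard against zero visible tokens
    print(f"{task_name}: {reasoning} reasoning / {visible} visible ({ratio:.1f}x)")
    if reasoning >= 10_000 and visible <= 100:
        print(f"  -> {task_name}: consider prompt changes or a cheaper model")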

Reasoning tokens are the biggest hidden cost in modern LLM APIs. Monitor them, control them with reasoning_effort, and only use reasoning models when the task demands it.