A model's context window is the total number of tokens it can process in a single request — your input and its output combined. It's the working memory of the model. Everything the model knows about your conversation, your documents, and your instructions must fit inside this window.
What the Numbers Actually Mean
When a model advertises "128K context," that means roughly 128,000 tokens total. For English text, which averages about 0.75 words per token, that's roughly 96,000 words, or about 300 pages. But remember: this budget is shared between your prompt and the model's response.
If you send 120K tokens of input to a 128K model and set max_tokens to 16K for the response, the API will reject the request because 120K + 16K exceeds 128K. You need to plan your token budget carefully.
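A minimal pre-flight check captures this rule. The sketch below uses illustrative numbers and plain Python rather than any particular provider's SDK:

```python
# Check that input plus reserved output fits the context window
# before sending a request. The window size is illustrative (GPT-4o).
CONTEXT_WINDOW = 128_000

def fits_in_context(input_tokens: int, max_output_tokens: int,
                    context_window: int = CONTEXT_WINDOW) -> bool:
    """Return True if input + reserved output fits in the window."""
    return input_tokens + max_output_tokens <= context_window

# 120K input + 16K reserved output = 136K > 128K: rejected.
print(fits_in_context(120_000, 16_000))  # False
```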
Context Window Comparison
Here's how current models compare:
- GPT-3.5 Turbo: 16K tokens (~12K words)
- GPT-4: 8K or 32K tokens
- GPT-4o: 128K tokens (~96K words)
- GPT-4.1: 1M tokens (~750K words)
- Claude 3.5 Sonnet: 200K tokens (~150K words)
- Claude 3 Opus: 200K tokens (~150K words)
- Gemini 1.5 Pro: 2M tokens (~1.5M words)
- Gemini 2.5 Pro: 1M tokens (~750K words)
- LLaMA 3.1 405B: 128K tokens (~96K words)
- Mistral Large: 128K tokens (~96K words)
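If you select models in code, the comparison reduces to a simple lookup. A sketch with limits taken from the list above; the keys are informal labels, not exact API model identifiers:

```python
# Approximate context windows, in tokens, from the comparison above.
CONTEXT_WINDOWS = {
    "gpt-3.5-turbo": 16_000,
    "gpt-4o": 128_000,
    "gpt-4.1": 1_000_000,
    "claude-3.5-sonnet": 200_000,
    "gemini-1.5-pro": 2_000_000,
}

def models_that_fit(required_tokens: int) -> list[str]:
    """Return the models whose context window can hold the request."""
    return [name for name, window in CONTEXT_WINDOWS.items()
            if window >= required_tokens]

print(models_that_fit(150_000))
# ['gpt-4.1', 'claude-3.5-sonnet', 'gemini-1.5-pro']
```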
Bigger Isn't Always Better
A larger context window doesn't automatically mean better performance. Research has shown that models can struggle with information in the middle of very long contexts — a phenomenon called "lost in the middle." The model tends to pay more attention to the beginning and end of the input.
Performance Degradation
Several practical issues arise with very long contexts:
- Latency increases: Processing 1M tokens takes significantly longer than 10K tokens. Time-to-first-token can grow from well under a second to many seconds.
- Cost scales linearly: Input tokens are billed per token, so 10x more input means roughly 10x higher input cost per request.
- Accuracy can drop: When asked to find a specific fact buried in 500K tokens of context, models may miss it or hallucinate an answer.
Practical Context Window Strategies
For Short Conversations (Under 4K Tokens)
Any model works fine. Choose based on quality and cost, not context size. Most single-turn Q&A fits comfortably in 4K tokens.
For Document Analysis (4K–100K Tokens)
This is the sweet spot for most business applications. A 50-page report fits in about 15K–20K tokens. Use GPT-4o, Claude 3.5 Sonnet, or similar models. At this range, accuracy is high and latency is reasonable.
For Large Document Sets (100K+ Tokens)
When you need to process entire codebases, legal document sets, or book-length content, you have two approaches:
- Use a large-context model: Gemini 1.5 Pro (2M) or GPT-4.1 (1M) can ingest the entire set in a single request. Simple but expensive.
- Use RAG (Retrieval-Augmented Generation): Chunk your documents, embed them in a vector database, and retrieve only the relevant chunks for each query. More complex but much cheaper at scale; a minimal retrieval sketch follows below.
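Here is a toy version of the retrieval step. The hashed bag-of-words embed() is a deliberately crude stand-in for a real embedding model, and the sample documents are invented; a production system would call an embedding API and store vectors in a database, but the flow is the same:

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashed bag-of-words embedding. A crude stand-in for a real
    embedding model, enough to demonstrate the retrieval flow."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v

def chunk(text: str, chunk_size: int = 500) -> list[str]:
    """Naive fixed-size chunking by word count. Real systems usually
    split on sentence or section boundaries instead."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def top_k_chunks(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Rank chunks by cosine similarity to the query and keep the k best;
    only these go into the prompt, not the whole corpus."""
    q = embed(query)
    def score(c: str) -> float:
        v = embed(c)
        return float(np.dot(q, v) /
                     (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
    return sorted(chunks, key=score, reverse=True)[:k]

# Invented sample documents: retrieval sends a few relevant chunks to
# the model instead of the entire corpus.
docs = [
    "The termination clause allows either party to exit with 30 days notice.",
    "Payment terms are net 45 from the invoice date.",
    "Liability is capped at the fees paid in the prior 12 months.",
]
chunks = [c for d in docs for c in chunk(d)]
print(top_k_chunks("termination clause notice period", chunks, k=1))
```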
Calculating Your Context Budget
A practical formula for planning your context usage:
Available for user content = context window
− system prompt tokens
− conversation history tokens
− reserved output tokens (max_tokens)
− safety margin (5–10%)

For example, with GPT-4o (128K context):

128,000 total
− 1,000 system prompt
− 4,000 conversation history
− 4,096 reserved output
− 6,400 safety margin (5%)
= 112,504 tokens available for documents
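The same arithmetic as a small helper, with the GPT-4o numbers above used in the example call:

```python
def available_tokens(context_window: int, system_prompt: int,
                     history: int, reserved_output: int,
                     safety_fraction: float = 0.05) -> int:
    """Tokens left for documents after fixed overheads and a safety margin."""
    margin = int(context_window * safety_fraction)
    return context_window - system_prompt - history - reserved_output - margin

# The GPT-4o example above: 128,000 window, 1,000 system prompt,
# 4,000 history, 4,096 reserved output, 5% margin.
print(available_tokens(128_000, 1_000, 4_000, 4_096))  # 112504
```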
Always count your actual token usage rather than estimating. The difference between estimated and actual counts can be 20% or more, especially with code or non-English text.
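For OpenAI models, one way to get exact counts is the tiktoken library. A sketch, assuming a tiktoken version recent enough to know the model name (older versions raise KeyError, in which case you can fall back to a named encoding such as o200k_base, the one GPT-4o uses); other providers ship their own token counters:

```python
import tiktoken

# Count actual tokens instead of estimating from word counts.
try:
    enc = tiktoken.encoding_for_model("gpt-4o")
except KeyError:
    # Older tiktoken versions may not know the model name.
    enc = tiktoken.get_encoding("o200k_base")

text = "Count me precisely, not approximately."
print(len(enc.encode(text)), "tokens")
```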