Counting tokens before you send a request to an LLM API is one of the most practical safeguards you can build into an application. It prevents context window errors, lets you estimate costs upfront, and helps you decide when to truncate or chunk your input. Here's how to do it in Python, JavaScript, and Go.
Python: Using tiktoken (Exact Count)
OpenAI's tiktoken library is the gold standard for counting tokens for GPT models. It uses the exact same tokenizer the API uses, so the count is precise.
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for a given model."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Usage
prompt = "Explain quantum computing in simple terms."
tokens = count_tokens(prompt)
print(f"Token count: {tokens}")  # Token count: 7
# For chat messages, account for per-message overhead
def count_chat_tokens(messages, model="gpt-4o"):
    encoding = tiktoken.encoding_for_model(model)
    tokens_per_message = 3  # every message has a fixed overhead
    total = 0
    for message in messages:
        total += tokens_per_message
        for key, value in message.items():
            total += len(encoding.encode(value))
    total += 3  # reply priming
    return total

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]
print(count_chat_tokens(messages))  # ~21 tokens
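The count above ignores the optional name field a message can carry (used to distinguish multiple participants with the same role). OpenAI's token-counting cookbook charges one extra token per name for GPT-4-class models; the sketch below folds that in. The function name is mine, and the per-message constants follow the cookbook's current values, which may drift for future models:

# Variant that also charges for an optional "name" field
def count_chat_tokens_with_names(messages, model="gpt-4o"):
    encoding = tiktoken.encoding_for_model(model)
    tokens_per_message = 3  # fixed overhead per message
    tokens_per_name = 1     # extra cost when "name" is present
    total = 0
    for message in messages:
        total += tokens_per_message
        for key, value in message.items():
            total += len(encoding.encode(value))
            if key == "name":
                total += tokens_per_name
    total += 3  # reply priming
    return total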
Install with pip install tiktoken. The library downloads the tokenizer vocabulary on first use and caches it locally.
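One failure mode worth planning for: encoding_for_model raises a KeyError when it doesn't recognize the model name, which happens when a model is newer than your installed tiktoken. A common workaround is to fall back to a known base encoding; picking o200k_base (the GPT-4o-family encoding) as the default here is my assumption:

def get_encoding_safe(model: str):
    """Return the encoding for a model, falling back to o200k_base."""
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        # Model name not in tiktoken's mapping yet; fall back
        return tiktoken.get_encoding("o200k_base")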
JavaScript: Using js-tiktoken or Approximation
For browser or Node.js environments, you have two options: an exact count with js-tiktoken or a fast approximation.
Exact Count with js-tiktoken
import { encodingForModel } from "js-tiktoken";
const enc = encodingForModel("gpt-4o");
const tokens = enc.encode("Hello, how are you?");
console.log(tokens.length); // 6
Fast Approximation (No Dependencies)
If you don't need exact counts — for example, for a UI estimate — this approximation is within 5–10% for English text:
function estimateTokens(text) {
  // English averages roughly 4 characters per token;
  // adding the word count adjusts for whitespace and punctuation
  const charCount = text.length;
  const wordCount = text.split(/\s+/).filter(Boolean).length;
  return Math.ceil((charCount + wordCount) / 5);
}

console.log(estimateTokens("Hello, how are you?")); // ~5
This won't be accurate for code, non-English text, or text with lots of special characters. Use the exact library for anything billing-related.
Go: Using tiktoken-go
For Go services, the tiktoken-go package provides the same exact tokenization:
package main

import (
    "fmt"

    "github.com/pkoukk/tiktoken-go"
)

func countTokens(text, model string) (int, error) {
    enc, err := tiktoken.EncodingForModel(model)
    if err != nil {
        return 0, err
    }
    tokens := enc.Encode(text, nil, nil)
    return len(tokens), nil
}

func main() {
    count, err := countTokens("Hello, world!", "gpt-4o")
    if err != nil {
        panic(err)
    }
    fmt.Printf("Tokens: %d\n", count) // Tokens: 4
}
When to Count Tokens
Count tokens at these key points in your application:
- Before sending a request: Verify the total (system prompt + user input + expected output) fits within the context window
- When building prompts dynamically: If you're injecting retrieved documents into a prompt, count as you add each chunk and stop before hitting the limit (a sketch of this follows the list)
- For cost estimation: Show users an estimated cost before they confirm an expensive operation
- In logging and monitoring: Track token usage per request to identify optimization opportunities
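Here's the dynamic prompt-building case as a sketch. It greedily packs retrieved chunks into a fixed token budget, reusing the count_tokens helper defined earlier; the separator string and the stop-at-first-overflow policy are illustrative choices, not the only ones:

SEPARATOR = "\n\n"

def pack_chunks(chunks, budget):
    """Greedily add chunks until the token budget would be exceeded."""
    sep_tokens = count_tokens(SEPARATOR)
    selected, used = [], 0
    for chunk in chunks:
        # Each chunk after the first also pays for the separator
        cost = count_tokens(chunk) + (sep_tokens if selected else 0)
        if used + cost > budget:
            break  # this chunk would blow the budget; stop here
        selected.append(chunk)
        used += cost
    return SEPARATOR.join(selected)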
Handling the Context Window Budget
A practical pattern is to reserve space for the response and system prompt, then fill the remaining budget with user content:
MAX_CONTEXT = 128_000  # GPT-4o context window
RESERVED_OUTPUT = 4_096  # leave room for the response
SYSTEM_TOKENS = count_tokens(system_prompt)

available = MAX_CONTEXT - RESERVED_OUTPUT - SYSTEM_TOKENS

user_tokens = count_tokens(user_input)
if user_tokens > available:
    # Truncate or chunk the input
    user_input = truncate_to_tokens(user_input, available)
This ensures you never exceed the context window and always leave room for the model to respond.
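The snippet assumes a truncate_to_tokens helper that the text leaves undefined. A minimal version, using the same tiktoken setup as above, encodes, slices, and decodes; note that decoding a sliced token list can cut mid-word, so treat the boundary as rough:

def truncate_to_tokens(text: str, max_tokens: int, model: str = "gpt-4o") -> str:
    """Cut text down to at most max_tokens tokens."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # Decoding a truncated token list may split a word at the cut point
    return encoding.decode(tokens[:max_tokens])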