When you pass structured data inside an LLM prompt, the format you choose directly affects your token count. JSON, YAML, and XML all represent the same information, but they tokenize very differently. Picking the right format can cut token usage by 30–50% with zero loss of information.
A Real Comparison
Let's encode the same data — two users with names and roles — in all three formats and count the tokens using GPT-4o's tokenizer (o200k_base).
JSON (38 tokens)
{
"users": [
{"name": "Alice", "role": "admin", "active": true},
{"name": "Bob", "role": "editor", "active": false}
]
}
YAML (23 tokens)
users:
- name: Alice
role: admin
active: true
- name: Bob
role: editor
active: false
XML (53 tokens)
<users>
<user>
<name>Alice</name>
<role>admin</role>
<active>true</active>
</user>
<user>
<name>Bob</name>
<role>editor</role>
<active>false</active>
</user>
</users>
The results are clear: YAML uses ~40% fewer tokens than JSON, and XML uses ~40% more. The gap widens as data grows — with 100 records, XML can use 2x the tokens of YAML.
Why the Difference?
The token cost comes down to syntax overhead:
- JSON requires quotes around every key and string value, plus braces, brackets, and commas. Each
"and{consumes a token. - YAML uses indentation and colons instead of delimiters. No quotes needed for simple strings. Fewer special characters means fewer tokens.
- XML repeats every tag name twice (opening and closing), and angle brackets tokenize as separate tokens. A field like
<name>Alice</name>uses 7 tokens where YAML'sname: Aliceuses 3.
The Even Cheaper Option: CSV
For tabular data, CSV beats all three formats:
name,role,active
Alice,admin,true
Bob,editor,false
This encodes the same data in roughly 14 tokens — 63% fewer than JSON. The tradeoff is that CSV can't represent nested structures, so it only works for flat data.
When to Use Each Format
The best format depends on your use case:
- YAML — Best for passing structured data in prompts where you control the format. Lowest token cost with full nesting support.
- JSON — Best when you need the model to output structured data. Models are more reliable at generating valid JSON than YAML, and most APIs expect JSON responses.
- CSV — Best for flat, tabular data like lists of products, users, or log entries. Minimal token overhead.
- XML — Avoid in prompts unless the model specifically needs XML context (e.g., processing SOAP APIs or HTML). The token cost is rarely justified.
Practical Tip: Mixed Strategy
Use YAML or CSV for input data in your prompts, and ask the model to respond in JSON. This gives you the cheapest input tokens (which you pay for on every request) while getting reliably parseable output.
# System prompt
Respond in JSON with fields: name, summary, score.
# User data (YAML — cheap input)
products:
- name: Widget Pro
reviews: 4.5 stars, 230 reviews
price: $29.99
- name: Gadget Plus
reviews: 3.8 stars, 89 reviews
price: $49.99
Rule of thumb: YAML for input, JSON for output, CSV for flat data. Avoid XML unless you have a specific reason.