AI & Computing · Mar 30, 2026 · 4 min read

what are ai reasoning tokens and how they work


What Are AI Reasoning Tokens

AI reasoning tokens are the invisible chain of thought a model generates before it answers you. You never see them, yet providers bill you for every single one. Understanding how reasoning tokens accumulate is essential for anyone managing LLM costs at scale.

How the Two-Phase Token Pipeline Actually Works

Most developers assume they pay only for the visible output. That assumption creates expensive surprises.

The model follows a two-phase process. It first produces hidden reasoning tokens, then emits the final completion tokens. The architecture uses the same transformer weights for both phases. The only difference is that reasoning tokens are deliberately withheld from the client response.

Phase 1: Hidden Reasoning (the Tokens You Never See)

The model builds an internal monologue, tests assumptions, explores multiple paths, and backtracks on failures. None of this reaches your screen. Providers expose only summary counts in the API metadata through the reasoning_tokens field.

This creates the core hidden cost of reasoning tokens. You pay for substantial compute that leaves no trace in the final message.

Practical takeaway: Log the reasoning_tokens field on every single API call. Treat it as a first-class observability signal.
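A minimal sketch of that logging, assuming an OpenAI-style response payload (the `completion_tokens_details.reasoning_tokens` path matches OpenAI's Chat Completions usage object; other providers name the field differently):

```python
def log_reasoning_usage(response: dict, task_label: str) -> dict:
    """Pull visible and hidden token counts out of an API response."""
    usage = response.get("usage", {})
    details = usage.get("completion_tokens_details", {}) or {}
    record = {
        "task": task_label,
        "output_tokens": usage.get("completion_tokens", 0),
        "reasoning_tokens": details.get("reasoning_tokens", 0),
    }
    # Hidden reasoning often dwarfs the visible answer, so log both.
    print(record)
    return record
```

Emitting this record on every call gives your observability stack the same visibility into hidden tokens that it already has into visible ones.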

Phase 2: Visible Completion (the Tokens You Think You're Paying For)

The second phase generates the answer you see. These tokens typically cost less per thousand. A request that appears to use 800 output tokens may have consumed 8,000 reasoning tokens first. Your bill reflects the total, not the visible portion.
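The billing effect is easy to check with simple arithmetic. In this sketch the per-1K rate is taken from the pricing example later in this article and is illustrative, not any provider's actual price list:

```python
def request_cost(visible_tokens: int, reasoning_tokens: int,
                 rate_per_1k: float) -> float:
    """Billed cost when hidden reasoning tokens are charged alongside
    the visible completion (rate is illustrative)."""
    return (visible_tokens + reasoning_tokens) / 1000 * rate_per_1k

expected = request_cost(800, 0, 0.015)    # cost if only visible tokens billed
actual = request_cost(800, 8000, 0.015)   # cost with the hidden phase included
print(round(actual / expected, 1))        # 11.0 -- an 11x surprise
```

The 800-visible / 8,000-hidden request from the paragraph above costs eleven times what the visible output alone suggests.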

Token Volume Per Request: Hundreds for a Lookup, Tens of Thousands for a Proof

Token counts vary dramatically by task type. Simple factual lookups often stay under 500 reasoning tokens. Complex proofs can exceed 20,000.

The Complexity Ladder: Factual Lookup → Code Gen → Competition Math

  • Factual lookup: Minimal reasoning required
  • Code generation: Multiple implementation paths tested internally
  • Competition math: Formal proofs, lemma checking, and discarding false starts

Identical-looking prompts can trigger 10× different reasoning token counts. The determining factor isn't prompt length but the depth of inference the model must perform.

Qwen2.5-14B on GPQA: 1,200-1,800 reasoning tokens per typical question, based on public measurements.

[Figure: Reasoning token volume by task complexity]

The Real Cost Math: $0.015 vs. $0.003 per 1K Tokens at Scale


OpenAI and other providers charge significantly higher rates for reasoning tokens. The multiplier can reach 5× compared to standard output tokens.

Worked Example: At 50,000 queries per month, a standard model might cost $1,200 while the same workload on a reasoning-heavy model reaches $6,800 when hidden tokens dominate. These figures come from measured production workloads.
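Those monthly figures reduce to per-query arithmetic, which is worth scripting so the comparison stays visible as prices or volumes change (the numbers below come straight from the worked example above):

```python
queries_per_month = 50_000
standard_monthly = 1_200.0    # standard model, from the worked example
reasoning_monthly = 6_800.0   # reasoning-heavy model, same workload

per_query_standard = standard_monthly / queries_per_month    # $0.024/query
per_query_reasoning = reasoning_monthly / queries_per_month  # $0.136/query
annual_premium = (reasoning_monthly - standard_monthly) * 12

print(per_query_standard, per_query_reasoning, annual_premium)
```

Over a year, the hidden-token premium for this workload is $67,200, which is the number that makes routing (discussed below) worth engineering.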

Fine-tuning on reasoning traces creates additional risk. Models learn to emit long monologues even when unnecessary, inflating output tokens by 400-600% in reported cases.

Accuracy Gains vs. Token Overhead: The 5% / 5.3× Tradeoff Curve

Reasoning models deliver real accuracy improvements, but the cost curve is steep and convex.

Early o1 models moved from the 11th percentile to the 89th on Codeforces benchmarks. On GPQA Diamond, accuracy increased from 38.2% to 47.3% at 5.3× the token cost. Each additional point of accuracy becomes progressively more expensive.

[Figure: GPQA Diamond benchmark results]

When Reasoning Tokens Are Worth the Premium (and When They're Waste)

High-value use cases:

  • Medical diagnosis
  • Legal analysis
  • Competition-level code
  • Complex multi-step reasoning

Low-value traps:

  • Factual lookups
  • Classification tasks
  • Template generation

Using reasoning models on simple tasks just burns money.

Risk-Managed Implementation: Cut Reasoning Token Spend by 75%

The Routing Pattern: Send requests to a standard model first. Only escalate to a reasoning model when confidence is low or the task matches known high-difficulty categories. Teams using this pattern report 60-75% lower monthly bills.
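A minimal sketch of that routing, with the two model calls left as stand-ins; the category labels, the confidence source, and the 0.7 threshold are all assumptions to be replaced with your own:

```python
HARD_CATEGORIES = {"proof", "competition_math", "multi_step"}  # assumed labels

def standard_model(prompt: str) -> tuple[str, float]:
    """Stand-in for a cheap model call; returns (answer, confidence)."""
    return "draft answer", 0.9

def reasoning_model(prompt: str) -> str:
    """Stand-in for the expensive reasoning-model call."""
    return "reasoned answer"

def route(prompt: str, category: str, threshold: float = 0.7) -> str:
    """Escalate only on known-hard categories or low first-pass confidence."""
    if category in HARD_CATEGORIES:
        return reasoning_model(prompt)
    answer, confidence = standard_model(prompt)
    if confidence >= threshold:
        return answer
    return reasoning_model(prompt)
```

The confidence signal can come from the cheap model's own self-report, log-probabilities, or a separate classifier; the pattern is the same either way.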

Conditional Token Selection (CTS)

This technique stops reasoning once the model reaches sufficient confidence rather than running to a fixed length. Research shows 75.8% reduction in token usage with only a 5% accuracy drop when properly tuned.
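The source does not spell out the CTS mechanism, so as a rough sketch only: assuming a per-step confidence signal is available during generation, confidence-based early stopping replaces a fixed token budget like this (`next_step` is a stand-in for one decode step):

```python
def generate_with_cts(next_step, max_tokens: int = 4000,
                      stop_confidence: float = 0.95):
    """Emit reasoning tokens until a per-step confidence signal clears
    the threshold, instead of always running to a fixed budget."""
    tokens = []
    for _ in range(max_tokens):
        token, confidence = next_step()
        tokens.append(token)
        if confidence >= stop_confidence:
            break
    return tokens

steps = iter([("a", 0.5), ("b", 0.8), ("c", 0.96), ("d", 0.99)])
print(generate_with_cts(lambda: next(steps)))  # ['a', 'b', 'c']
```

The fourth token is never generated: once confidence clears the threshold at the third step, the loop stops, which is where the reported token savings come from.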

Prompt Engineering for Reasoning Models

Explicitly constrain chain length in system prompts. Instructions such as “keep reasoning under 2,000 tokens” are often respected and deliver better cost-accuracy tradeoffs than expected.

Monitoring Actual Token Usage

Check completion_tokens_details.reasoning_tokens (OpenAI) or equivalent fields on every response. Log three values per request: output tokens, reasoning tokens, and task category. Set alerts when the reasoning-to-output ratio exceeds your threshold.

Practical takeaway: Build per-request cost attribution instead of relying on averages. Track p95 and p99 reasoning token counts by task type. Averages hide dangerous outliers.
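A minimal sketch of that per-category tail tracking, using only the standard library (the category labels are whatever your router already assigns):

```python
import statistics
from collections import defaultdict

usage_log = defaultdict(list)  # task category -> reasoning token counts

def record(category: str, reasoning_tokens: int) -> None:
    usage_log[category].append(reasoning_tokens)

def tail_stats(category: str) -> dict:
    """Mean plus p95/p99 reasoning-token counts for one task category."""
    counts = sorted(usage_log[category])
    cuts = statistics.quantiles(counts, n=100)  # needs at least 2 samples
    return {"mean": statistics.mean(counts),
            "p95": cuts[94], "p99": cuts[98]}
```

Feeding `record` from every API response and alerting on `tail_stats` per category catches the outlier-heavy task types that an average would smooth over.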

The assumption that pricing tables give you the full picture doesn't survive contact with production data. Route simple tasks to cheap models and reserve reasoning capacity for problems that actually need it. Your accuracy stays high while your costs fall.

For implementation patterns and monitoring templates, see Guides & Research. Written by Josh Ausmus.
