## Pricing Comparison
Use this table for quick cost math. Prices are per 1M tokens, context windows are as of April 2026, and the Best Benchmark figures are pulled from available data.[1]
| Model | Input / Output | Context Window | Best Benchmark |
|---|---|---|---|
| Claude Opus 4.6 | $15 / $75 | 1M (beta) | 80.8% SWE-Bench |
| Claude Sonnet 4.6 | $3 / $15 | 1M (beta) | 79.6% SWE-Bench |
| Grok 4.20 | $3 / $15 | up to 2M | Real-time data + agents |
| GPT-5.4 | $2.50 / $15 | 1M | 87.3% investment benchmarks |
| Gemini 3.1 Pro | $2 / $12 | 1M | Multimodal leader |
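For the quick cost math, a minimal sketch in Python using the prices above; the dictionary keys are shorthand labels rather than official API model IDs, and the token counts in the example are made up:

```python
# Prices per 1M tokens (input, output), copied from the table above.
# Keys are shorthand labels, not official API model identifiers.
PRICES = {
    "claude-opus-4.6": (15.00, 75.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "grok-4.20": (3.00, 15.00),
    "gpt-5.4": (2.50, 15.00),
    "gemini-3.1-pro": (2.00, 12.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the listed per-1M-token rates."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical request: 40k prompt tokens, 2k completion tokens.
print(f"{request_cost('claude-sonnet-4.6', 40_000, 2_000):.2f}")  # 0.15
print(f"{request_cost('claude-opus-4.6', 40_000, 2_000):.2f}")    # 0.75
```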
## Task Recommendations
Pick the model, then follow the prompting tip exactly. These choices come from real benchmark gaps and failure modes I see on installs.
| Task | Model | Why | Prompting Tip |
|---|---|---|---|
| Coding | Claude Sonnet 4.6 | Near-Opus SWE-Bench at 1/5th the price. Handles multi-file refactors without overengineering. | Wrap specs in XML tags. Include file paths and key functions. End with "Output only the minimal diff in tags." On a recent install this cut token use by 40%. |
| Reasoning | Claude Opus 4.6 | Highest GPQA and sustained chain-of-thought. Sonnet gets close until the problem needs 30+ steps. | Use `<thinking>` tags for scratch space. Force one idea per block. Prompt: "Think step by step inside `<thinking>`. Only output the final answer after." |
| Creative | GPT-5.4 | Strongest on open-ended generation and stylistic consistency. Less censored than Claude. | Give 3-5 example snippets in the system prompt. Use JSON mode for structured output. Tell it "Match the voice and density of these examples exactly." |
| Agentic workflows | Grok 4.20 | Real-time data, parallel agents, and OpenAI-compatible tools. Handles dynamic web tasks better. | Structure as parallel tool calls. Give it an explicit "search first, then reason" in the system message. Use its native web search when possible. |
| Long context | Grok 4.20 or Gemini 3.1 Pro | 2M on Grok; 1M is reliable on both. Claude's 1M beta still loses coherence past ~600k in practice. | Put the entire document first, then the question. Use "Focus only on sections X and Y" to fight needle-in-a-haystack degradation. Test with known facts placed at different depths. |
| Multimodal | Gemini 3.1 Pro | Native vision and video; the others bolt it on. | Describe images in the same prompt stream. Chain vision then text reasoning in one call. Avoid a separate OCR step. |
| Budget-sensitive | Gemini 3.1 Pro, then Claude Sonnet 4.6 | At $2/$12, Gemini wins most volume work. Sonnet wins when you need clean code or XML structure. | Default to Gemini for analysis. Route code to Sonnet. Never use Opus unless the task fails twice on Sonnet (see the routing sketch below). |
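A minimal routing sketch for those defaults; the task labels and model strings are placeholders you would map to real API IDs:

```python
# Default model per task, following the table above.
# Model strings are placeholders, not official API identifiers.
ROUTES = {
    "coding": "claude-sonnet-4.6",
    "reasoning": "claude-opus-4.6",
    "creative": "gpt-5.4",
    "agentic": "grok-4.20",
    "long_context": "gemini-3.1-pro",
    "multimodal": "gemini-3.1-pro",
}

def pick_model(task: str, budget_sensitive: bool = False) -> str:
    """Pick the default model for a task; fall back to the cheapest for budget work."""
    if budget_sensitive and task not in ("coding", "multimodal"):
        return "gemini-3.1-pro"  # cheapest per the pricing table
    return ROUTES.get(task, "claude-sonnet-4.6")
```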
## Key API Differences
Prompting style changes output quality more than people admit. Match the model's native format or you waste tokens on corrections.
### Claude (XML tags, content blocks)

Use this exact skeleton. Claude respects the tags, and it reduces hallucinations.

```
<instructions>
Your role and rules here.
</instructions>
<task>
The actual work.
</task>
<example>
One full input-output pair.
</example>
```
Claude loves thinking blocks. Force them. It cuts refusal rates.
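A minimal sketch of that skeleton through the Anthropic Python SDK; the model ID and prompt contents are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# "claude-sonnet-4.6" is a placeholder; use the model ID your account exposes.
message = client.messages.create(
    model="claude-sonnet-4.6",
    max_tokens=2048,
    system="<instructions>You are a senior engineer. Follow the tags exactly.</instructions>",
    messages=[{
        "role": "user",
        "content": (
            "<task>Refactor the function below to remove duplication.</task>\n"
            "<example>One full input-output pair goes here.</example>\n"
            "Think step by step inside <thinking> tags, then output only the minimal diff."
        ),
    }],
)
print(message.content[0].text)
```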
### GPT (markdown, function calling)

Stick to markdown headers and JSON mode. Function calling works cleanly.

```
You are an expert X.

## Context
...

## Task
...

Respond with valid JSON only. Schema: { "reasoning": "...", "answer": "..." }
```
Use the `tools` parameter for agents. GPT follows the schema more tightly than Claude.
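A minimal JSON-mode sketch with the OpenAI Python SDK; the model ID is a placeholder and the schema mirrors the skeleton above:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "gpt-5.4" is a placeholder; substitute the model ID your account exposes.
response = client.chat.completions.create(
    model="gpt-5.4",
    response_format={"type": "json_object"},  # JSON mode
    messages=[
        {
            "role": "system",
            "content": (
                "You are an expert X. Respond with valid JSON only. "
                'Schema: {"reasoning": "...", "answer": "..."}'
            ),
        },
        {"role": "user", "content": "## Context\n...\n\n## Task\n..."},
    ],
)
print(response.choices[0].message.content)
```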
### Grok (OpenAI-compatible + web search)

Drop in your existing OpenAI code and add a web search instruction:

```
Use your real-time search tool first if the data might be after 2025.
Then reason.
Parallel tool calls are allowed and encouraged.
```
Its strength is current events and parallel execution. Lean into that.
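A minimal sketch of the drop-in OpenAI client pointed at xAI; the base URL is the commonly documented one, and the model ID and user question are placeholders:

```python
from openai import OpenAI

# Same client, different base URL. "grok-4.20" is a placeholder model ID.
client = OpenAI(base_url="https://api.x.ai/v1", api_key="YOUR_XAI_API_KEY")

response = client.chat.completions.create(
    model="grok-4.20",
    messages=[
        {
            "role": "system",
            "content": (
                "Use your real-time search tool first if the data might be after 2025. "
                "Then reason. Parallel tool calls are allowed and encouraged."
            ),
        },
        {"role": "user", "content": "Summarize today's notable model releases."},
    ],
)
print(response.choices[0].message.content)
```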
### Gemini (multimodal native)

Put images, video, or audio directly in the content array. Describe them in text too for the reasoning chain.

```
Analyze this image and the attached PDF page.
First describe what you see. Then extract data. Finally reason.
```
Gemini handles mixed media in one shot. No separate vision model call needed.
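A minimal mixed-media sketch, assuming the google-genai Python SDK; the model ID and file name are placeholders:

```python
from google import genai
from PIL import Image

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# "gemini-3.1-pro" is a placeholder; use the model ID your account exposes.
response = client.models.generate_content(
    model="gemini-3.1-pro",
    contents=[
        Image.open("report_page.png"),  # image and text go in the same request
        "First describe what you see. Then extract the data. Finally reason about it.",
    ],
)
print(response.text)
```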
## Gotchas
- Claude refuses more on edge prompts. Add "This is a hypothetical engineering exercise" when it stalls.
- GPT-5.4 output tokens cost the same as Sonnet but you often get more verbose answers. Force "be concise."
- Long context pricing can double past certain thresholds on GPT and Gemini. Measure first.
- Grok 2M context sounds great until you realize most tool chains still top out at 128k-256k effective.
- Always test the same prompt across two models on your actual workload. Benchmarks lie on toy tasks.
If your workload is 70%+ coding, default to Sonnet 4.6 and escalate to Opus only on the ~10% of tasks that fail. Everything else routes cheaper. The price gaps are large enough that a wrong default model shows up on the bill fast.
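A minimal escalation sketch for that coding default, assuming the Anthropic SDK; the model IDs and the validation hook are placeholders for your own checks (apply the diff, run the tests):

```python
import anthropic

client = anthropic.Anthropic()

def passes_validation(output: str) -> bool:
    """Placeholder check; in practice apply the diff and run the test suite."""
    return bool(output.strip())

def run_with_escalation(prompt: str, sonnet_tries: int = 2) -> str:
    """Try Sonnet first; escalate to Opus only after repeated failures."""
    models = ["claude-sonnet-4.6"] * sonnet_tries + ["claude-opus-4.6"]  # placeholder IDs
    text = ""
    for model in models:
        message = client.messages.create(
            model=model,
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )
        text = message.content[0].text
        if passes_validation(text):
            return text
    return text  # last attempt, even if validation never passed
```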