## Pricing Comparison
Use this table for quick cost math. Prices are per 1M tokens (input/output). Context windows are maximums. Benchmarks pulled from SWE-Bench Verified or closest equivalent as of April 2026.
| Model | Input / Output | Context Window | Best Benchmark / Strength |
|---|---|---|---|
| Claude Opus 4.6 | $15 / $75 | 1M | ~80.8% SWE-Bench Verified |
| Claude Sonnet 4.6 | $3 / $15 | 1M | 79.6% SWE-Bench Verified |
| Grok 4.20 | $3 / $15 | 256K-2M | Real-time data + parallel agents |
| GPT-5.4 | ~$2.50 / $15 | 1M | 87.3% investment benchmarks |
| Gemini 3.1 Pro | $2 / $12 | 1M+ (tiered) | Multimodal leader |
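The quick cost math from the table can be wrapped in a tiny helper. This is a sketch: the model keys and per-1M prices below simply mirror the comparison table above, not a live price feed.

```python
# Prices per 1M tokens as (input $, output $), copied from the table above.
# Model keys are informal labels for this guide, not official API model IDs.
PRICES = {
    "claude-opus-4.6":   (15.00, 75.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "grok-4.20":         (3.00, 15.00),
    "gpt-5.4":           (2.50, 15.00),
    "gemini-3.1-pro":    (2.00, 12.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single request."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```

For example, a 100k-token prompt with a 5k-token reply on Sonnet 4.6 costs `request_cost("claude-sonnet-4.6", 100_000, 5_000)`, i.e. $0.375.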
## Task Recommendations
Pick the model first, then tune the prompt. Sonnet 4.6 wins most day-to-day work for the price. Opus only when you need the extra 1-2% on hard problems. Gemini for anything with images or video. Grok for live data or agent loops. GPT-5.4 for polished investment-grade analysis.
| Task | Best Model | Why | Prompting Tip |
|---|---|---|---|
| Coding | Claude Sonnet 4.6 | Near-Opus SWE-Bench score at one-fifth the price. Handles multi-file refactors cleanly. | Feed the full repo context. Use XML tags: <files>, <thinking>, <code>. Demand one change at a time. |
| Reasoning | Claude Opus 4.6 | Sustained long-chain thinking beats the others on complex planning. | Chain-of-thought inside <thinking> tags. Tell it "take 30 seconds to think" for hard problems. |
| Creative | GPT-5.4 | Best prose flow and stylistic range. | Give 3-5 example paragraphs in the system prompt. Ask for 3 variants then pick. |
| Agentic workflows | Grok 4.20 | Native parallel agents and real-time web access. | Use tool calls liberally. Keep session state short. Prompt: "Run these three tools in parallel then synthesize." |
| Long context | Claude Sonnet 4.6 or Grok 4.20 | Both handle 1M+ reliably. Sonnet has better recall on needle-in-haystack tests. | Summarize every 200k tokens. Place key facts at the beginning and end. |
| Multimodal | Gemini 3.1 Pro | Native image/video understanding leads the pack. | Upload images directly. Describe what you want analyzed in plain text. Avoid vague "what do you see". |
| Budget-sensitive | Claude Sonnet 4.6 | Best performance per dollar across most benchmarks. | Use it for everything except pure vision. Cache prompts when possible. |
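The long-context tips in the table (summarize every 200k tokens, put key facts at the beginning and end) can be sketched as two small helpers. This is illustrative only: tokens are approximated as list items here, and a real pipeline would use the provider's tokenizer.

```python
# Sketch of the long-context row above: chunk input into ~200k-token
# checkpoints for rolling summarization, and "sandwich" key facts at both
# ends of the prompt to improve recall. Tokenization is simplified to a
# pre-split list; the 200_000 default comes from the table's tip.

def checkpoint_chunks(tokens: list[str], limit: int = 200_000) -> list[list[str]]:
    """Split a token list into segments no longer than `limit`."""
    return [tokens[i:i + limit] for i in range(0, len(tokens), limit)]

def sandwich_prompt(key_facts: str, body: str) -> str:
    """Place key facts at the beginning and end of the prompt."""
    return f"{key_facts}\n\n{body}\n\nReminder of key facts:\n{key_facts}"
```

Summarize each chunk as you go, then feed the running summary plus the current chunk into the sandwiched prompt.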
## Key API Differences
Claude, GPT, Grok, and Gemini don't speak the same prompting language. The differences show up in failure modes more than benchmark scores.
**Claude (XML tags and content blocks).** Claude responds best to structured XML:
```xml
<instructions>
You are an expert firmware engineer.
</instructions>
<context>
[paste relevant code or docs]
</context>
<task>
Refactor this driver for the new chip. Output only the changed functions.
</task>
```
Use <thinking> for reasoning and put the final answer in <answer> or <code>. Claude follows these blocks religiously; vague markdown prompts make it ramble.
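A tiny builder keeps this XML structure consistent across prompts. The tag names follow the example above; nothing here calls an API, it just assembles the prompt string.

```python
# Assemble the <instructions>/<context>/<task> structure shown above.
# Tag names come from this guide's example, not from any SDK.
def claude_prompt(instructions: str, context: str, task: str) -> str:
    return (
        f"<instructions>\n{instructions}\n</instructions>\n"
        f"<context>\n{context}\n</context>\n"
        f"<task>\n{task}\n</task>"
    )
```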
**GPT (markdown, function calling).** GPT-5.4 likes clean markdown and system prompts that set tone. It handles JSON-schema function calling better than the rest for agent loops. Prompt with "Use markdown. Show diffs with ```diff". For tools, define parameters clearly in the tool spec; it rarely ignores the schema.
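A clearly specified tool looks like this. The wrapper shape (`"type": "function"` with JSON-Schema `parameters`) follows the OpenAI-style function-calling format; the tool name and fields are made up for illustration.

```python
import json

# Hypothetical tool definition in OpenAI-style function-calling format.
# "show_diff" and its parameters are invented for this example.
show_diff_tool = {
    "type": "function",
    "function": {
        "name": "show_diff",
        "description": "Return a unified diff for a proposed code change.",
        "parameters": {
            "type": "object",
            "properties": {
                "file_path": {"type": "string"},
                "patch": {"type": "string", "description": "Unified diff body"},
            },
            "required": ["file_path", "patch"],
        },
    },
}

# Serializes cleanly, so it can go straight into a tools=[...] argument.
tool_json = json.dumps(show_diff_tool)
```

The tighter the `description` and `required` fields, the less the model improvises.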
**Grok (OpenAI-compatible + web search).** A drop-in replacement for the OpenAI SDK in most cases. Add instructions like "search the web for the latest datasheets" when you want real-time data. It handles parallel tool calls without extra coaxing. Its prompt style sits between GPT and Claude; avoid heavy XML.
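The "run these three tools in parallel then synthesize" pattern from the table reduces to a small dispatcher on the client side. This is a sketch with stand-in tool functions; a real agent loop would dispatch the model's parallel tool calls the same way.

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(tools, query):
    """Run every tool against the same query concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=len(tools)) as pool:
        return list(pool.map(lambda tool: tool(query), tools))

def synthesize(results):
    """Trivial synthesis step: join tool outputs for the model to read."""
    return "\n---\n".join(results)
```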
**Gemini (multimodal native).** Gemini treats images and video as first-class inputs: upload the file and ask direct questions in the same message. It understands visual context faster than the others. For text-only work it behaves like a slightly more verbose GPT. Use simple imperative sentences: "Analyze this schematic for noise issues."
## Gotchas
- Claude refuses certain creative tasks more often. Prefix with "This is a fictional exercise."
- Grok's real-time search can inject stale or biased web results. Always cross-check critical facts.
- Gemini pricing jumps after 200k tokens. Keep prompts under that unless you need the full context.
- Sonnet 4.6 is close enough to Opus on most coding tasks that the 5× price premium rarely justifies the flagship.
If your workload is 80% coding, route everything to Sonnet 4.6. Only escalate to Opus on the 20% that actually fail. That split saves real money at scale.
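That 80/20 split can be encoded as a one-function router. A minimal sketch: the model labels come from this guide, and the failure signal is whatever your eval harness or retry logic reports.

```python
# Route by the 80/20 rule above: default coding traffic to the cheap model,
# escalate to the flagship only after a failed attempt. Model labels are
# this guide's informal names, not official API IDs.
def route(task_type: str, previous_attempt_failed: bool = False) -> str:
    if task_type == "coding" and previous_attempt_failed:
        return "claude-opus-4.6"
    return "claude-sonnet-4.6"  # safe, cheap default for everything else
```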