Decision Tree (Text Flowchart)

Can better prompting solve this?
├── Yes → Stop here. Use advanced prompting + chain-of-thought + few-shot.
└── No → Does it need up-to-date or external facts/knowledge?
    ├── Yes → Use RAG (or RAG + prompting).
    └── No → Does it need specialized behavior, style, or consistent output format on fixed tasks?
        ├── Yes → Use fine-tuning.
        └── No → Does it require multi-step reasoning, tool use, or autonomous decisions?
            ├── Yes → Use agents (or agents + RAG).
            └── No → Combine prompting + RAG first. Re-evaluate.
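The flowchart above can be sketched as a single dispatch function. A minimal sketch; the predicate names are invented for illustration:

```python
def choose_approach(prompting_suffices: bool,
                    needs_external_knowledge: bool,
                    needs_fixed_behavior: bool,
                    needs_multi_step_tools: bool) -> str:
    """Encode the decision tree: each question is checked in order,
    and the first 'Yes' wins."""
    if prompting_suffices:
        return "advanced prompting"
    if needs_external_knowledge:
        return "RAG (or RAG + prompting)"
    if needs_fixed_behavior:
        return "fine-tuning"
    if needs_multi_step_tools:
        return "agents (or agents + RAG)"
    return "prompting + RAG, then re-evaluate"

print(choose_approach(False, False, True, False))
```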
Comparison Table
| Approach | Cost to start | Ongoing cost | Latency | Accuracy ceiling | Setup time | Maintenance | Best for |
|---|---|---|---|---|---|---|---|
| Better Prompting | $0 | API tokens only | Lowest | Medium (hallucinations persist) | Hours | Low (prompt tweaks) | Prototypes, simple tasks, formatting |
| RAG | Low (vector DB) | Storage + retrieval + tokens | Medium | High on knowledge tasks | Days | Medium (data updates, chunking) | Knowledge bases, docs, Q&A with citations |
| Fine-Tuning | Medium | Training compute + higher inference | Lowest after training | Highest on narrow tasks | Weeks | High (retrain on drift) | Domain-specific style, JSON output, tone |
| Agents | Medium-High | Tokens × steps + tool calls | Highest | High on complex workflows | Weeks | High (debug loops, tool reliability) | Tool-using workflows, multi-step planning |
Signs You Need RAG (6 items)
- Knowledge changes often. Retraining is too slow.
- You must cite sources or avoid hallucinations on facts.
- Dataset is large (thousands of docs) and mostly static.
- Users ask about specific products, policies, or manuals.
- You need to keep data private but still query it.
- Prompting works until the model makes up details.
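To make the retrieval step concrete, here is a toy sketch using bag-of-words cosine similarity over invented documents. A real pipeline would use a dense embedding model plus a vector database, but the shape of `retrieve()` is the same:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real systems use dense vectors
    # from an embedding model. This only shows the retrieval shape.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Rank docs by similarity to the query, return the top k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Refund policy: items may be returned within 30 days.",
    "Shipping takes 3-5 business days within the US.",
]
print(retrieve("what is the refund policy", docs))
```

The retrieved text then gets pasted into the prompt as grounding context, which is what keeps the model from making up details.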
Signs You Need Fine-Tuning (6 items)
- Output must follow a rigid format (JSON, XML) every time.
- You want a consistent brand voice or writing style.
- Task is narrow and repetitive with clear input-output pairs.
- Inference latency matters more than update frequency.
- You have 1K+ high-quality labeled examples.
- Model ignores instructions even after heavy prompting.
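If those signs apply, the first real task is data preparation. A minimal sketch of writing chat-format examples to JSONL, using the OpenAI-style `messages` layout; the examples and system prompt here are invented, so verify the exact schema against current fine-tuning docs:

```python
import json

# Hypothetical labeled pairs; a real dataset needs 1K+ of these.
examples = [
    ("Summarize: Q3 revenue rose 12%.", '{"summary": "Q3 revenue up 12%"}'),
    ("Summarize: Churn fell to 2%.", '{"summary": "Churn down to 2%"}'),
]

# One JSON object per line, each holding a "messages" conversation.
with open("train.jsonl", "w") as f:
    for user, assistant in examples:
        row = {"messages": [
            {"role": "system", "content": "Reply with strict JSON only."},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]}
        f.write(json.dumps(row) + "\n")
```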
Signs You Need Agents (6 items)
- Task involves multiple steps with conditional branching.
- You need to call external tools or APIs during reasoning.
- Goal is complex: research, booking, code debugging.
- Single prompt fails but breaking into steps succeeds.
- You accept variable latency for better outcomes.
- Hallucination in planning is tolerable with verification steps.
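At its core, an agent is a plan-act loop with a step budget. A minimal sketch with invented tool names and a scripted planner standing in for the LLM call:

```python
# Hypothetical tool registry; real agents expose tools via API schemas.
TOOLS = {
    "search": lambda q: f"results for {q!r}",
    "calculate": lambda expr: str(eval(expr)),  # demo only; never eval untrusted input
}

def run_agent(plan, goal: str, max_steps: int = 5) -> str:
    """Repeatedly ask the planner for an action, execute it, and feed
    the result back in. The step cap prevents infinite loops."""
    state = goal
    for _ in range(max_steps):
        action = plan(state)          # {"tool": ..., "arg": ...} or {"done": ...}
        if "done" in action:
            return action["done"]
        state = TOOLS[action["tool"]](action["arg"])
    return "stopped: step budget exhausted"

# Scripted planner in place of a real LLM call:
steps = iter([{"tool": "calculate", "arg": "6*7"}, {"done": "answer is 42"}])
print(run_agent(lambda s: next(steps), "what is 6*7?"))
```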
Cost Comparison Table: Fine-Tuning at Different Scales
Approximate training cost in USD, assuming ~500 tokens per example on average. OpenAI charges per-million-token training fees (roughly $8-25/M tokens processed, depending on the base model). Anthropic currently offers no public fine-tuning (API-only focus). Open-source figures assume LoRA on rented GPUs.
| Scale | OpenAI (e.g. GPT-5.4-mini class) | Anthropic | Open Source (LoRA on 70B, e.g. via Together/Fireworks) |
|---|---|---|---|
| 1K examples | ~$20 - 80 | N/A | $5 - 30 (few hours on A100) |
| 10K examples | ~$200 - 800 | N/A | $50 - 300 (1 - 2 days) |
| 100K examples | ~$2K - 8K | N/A | $500 - 3K (multi-day or distributed) |
Ongoing inference for fine-tuned models runs 2-8× base API cost; self-hosted open-source drops to hardware cost after training. [1] https://pricepertoken.com/fine-tuning
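The table's figures reduce to simple arithmetic: tokens processed times the per-million-token fee. A back-of-envelope estimator, with defaults drawn from the assumptions above ($8/M is the low end; real pricing and epoch counts vary by provider):

```python
def training_cost_usd(n_examples: int,
                      tokens_per_example: int = 500,
                      price_per_m_tokens: float = 8.0,
                      epochs: int = 3) -> float:
    """Back-of-envelope fine-tuning cost: total tokens processed
    (examples x tokens each x epochs) times the per-M-token fee."""
    total_tokens = n_examples * tokens_per_example * epochs
    return total_tokens / 1_000_000 * price_per_m_tokens

# 1K examples at the low- and high-end training fees:
print(training_cost_usd(1_000, price_per_m_tokens=8.0))
print(training_cost_usd(1_000, price_per_m_tokens=25.0))
```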
Common Failure Modes
Better Prompting
- Model still hallucinates facts.
- Prompt grows beyond context window.
- Inconsistent output across similar inputs.
- Breaks when instructions get complex.
RAG
- Bad chunking or embeddings → irrelevant retrieval.
- No citations or lost-in-the-middle problem.
- Vector DB gets stale without update pipeline.
- High token use from long retrieved context.
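The first two RAG failures usually trace back to chunking. A minimal fixed-size chunker with overlap, character-based for simplicity; token- or sentence-aware splitting typically retrieves better:

```python
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks where each chunk shares
    `overlap` characters with the previous one, so facts that straddle
    a boundary still appear whole in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Overlap trades a little extra storage and token use for fewer boundary-sliced facts; the right `size` depends on how self-contained your documents' paragraphs are.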
Fine-Tuning
- Catastrophic forgetting of general capabilities.
- Overfits to training data, poor on edge cases.
- Expensive to retrain when data drifts.
- Data preparation takes longer than expected.
Agents
- Loops forever or gets stuck in reasoning.
- Tool calls fail silently or with bad parameters.
- Latency explodes with more steps.
- Hard to debug without full trace logging.
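That last point is cheap to fix up front: wrap every tool in a tracing wrapper so each call, result, and error lands in a log. A minimal sketch with an invented tool and log fields:

```python
import time

def traced(tool_name, fn, log):
    """Wrap a tool so every call is appended to a shared trace log,
    including failures, before the exception propagates."""
    def wrapper(*args):
        entry = {"tool": tool_name, "args": list(args), "t": time.time()}
        try:
            entry["result"] = fn(*args)
            return entry["result"]
        except Exception as e:          # record the failure, then re-raise
            entry["error"] = repr(e)
            raise
        finally:
            log.append(entry)
    return wrapper

trace = []
search = traced("search", lambda q: f"3 hits for {q!r}", trace)  # invented tool
search("vector databases")
print(trace[0]["tool"], "->", trace[0]["result"])
```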
Use prompting first. Add RAG when knowledge is the gap. Fine-tune when behavior is the gap. Deploy agents only when the workflow demands planning and tools. Most production wins come from good prompting plus RAG. The rest is usually overkill until scale proves otherwise.