When to Use RAG, Fine-Tuning, Agents, or Just Better Prompting
Text-Based Decision Flowchart
Start: Can better prompting solve this?
├── Yes → Use Prompting
│         (Low volume, no private data, simple task)
└── No
    ├── Does it need fresh or private knowledge?
    │   ├── Yes → Use RAG
    │   │         (Docs, manuals, customer data that changes)
    │   └── No
    │       ├── Is it a narrow behavior or style change?
    │       │   ├── Yes → Consider Fine-Tuning
    │       │   └── No
    │       └── Does it require multi-step tool use or planning?
    │           ├── Yes → Use Agents (or Agentic RAG)
    │           └── No → Back to better prompting + hybrid
Stop at the first viable path. Hybrids win in production.[1]
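The tree above can be sketched as a plain function; the flag names are illustrative, not from any library:

```python
def choose_approach(prompting_works: bool,
                    needs_fresh_or_private_data: bool,
                    narrow_style_change: bool,
                    needs_multi_step_tools: bool) -> str:
    """Walk the decision tree, stopping at the first viable path."""
    if prompting_works:
        return "prompting"
    if needs_fresh_or_private_data:
        return "rag"
    if narrow_style_change:
        return "fine-tuning"
    if needs_multi_step_tools:
        return "agents"
    return "prompting + hybrid"

# Example: private knowledge base that changes weekly
print(choose_approach(False, True, False, False))  # rag
```

Note the ordering matters: a knowledge problem routes to RAG even if an agent could also solve it, because earlier branches are cheaper.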
Comparison Table
| Approach | Cost to Start | Ongoing Cost | Latency | Accuracy Ceiling | Setup Time | Maintenance | Best For |
|---|---|---|---|---|---|---|---|
| Prompting | Near zero | Token cost only | Lowest | Low-medium | Hours | Prompt tweaks | Simple tasks, quick prototypes |
| RAG | Moderate (vector DB + embeddings) | Storage + retrieval + tokens | Medium | High with good data | Days-weeks | Data updates | Knowledge-heavy, changing facts |
| Fine-Tuning | High (data prep + training) | Inference on tuned model | Low-medium | Very high for narrow tasks | Weeks | Retrain periodically | Style, tone, consistent reasoning |
| Agents | High (tools + orchestration) | Highest (multi-turn + tool calls) | Highest | High on complex workflows | Weeks-months | Tool upkeep + eval | Multi-step, tool-using workflows |
Prompting first. Everything else adds complexity.[2]
Checklist: Signs You Need RAG
- Your answers must cite specific documents or private data.
- Facts change often (policies, prices, inventory).
- The model hallucinates names, numbers, or recent events.
- You have 100+ pages of reference material.
- Users ask about specific records or files.
- You need source links or an audit trail.
Use RAG. Fine-tuning can't keep up with fresh data.
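The core of RAG is retrieval by similarity. A toy sketch using bag-of-words cosine similarity, stdlib only (real systems use a learned embedding model and a vector database; the documents here are made up):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': bag-of-words counts, lowercase tokens."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Refund policy: refunds within 30 days of purchase.",
    "Shipping: orders ship within 2 business days.",
    "Careers: we are hiring engineers.",
]
print(retrieve("what is the refund policy", docs, k=1))
```

The retrieved text is then prepended to the prompt, which is why RAG stays current: updating the docs updates the answers with no retraining.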
Checklist: Signs You Need Fine-Tuning
- Model consistently gets tone, format, or reasoning pattern wrong.
- Task is narrow and repetitive (classification, extraction, specific jargon).
- You have 1K+ high-quality labeled examples.
- Latency budget is tight and you can't afford extra context.
- Prompt is already 2K+ tokens and still flaky.
- You need the model to "just know" something without retrieval.
Fine-tuning bakes it in. Good for stable behavior.
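Most of the fine-tuning work is data preparation. A sketch that writes chat-style JSONL, one example per line (a common provider format, but check your provider's docs for the exact schema; the examples are invented):

```python
import json

# Hypothetical labeled examples: (input, desired output) pairs.
examples = [
    ("Classify sentiment: 'great product'", "positive"),
    ("Classify sentiment: 'arrived broken'", "negative"),
]

# One record per line; each record is a short chat transcript
# showing the model the exact behavior to bake in.
with open("train.jsonl", "w") as f:
    for prompt, completion in examples:
        record = {"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]}
        f.write(json.dumps(record) + "\n")
```

The 1K+ examples threshold above is about this file: quality and coverage of these pairs matter more than any training hyperparameter.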
Checklist: Signs You Need Agents
- Task requires multiple discrete steps or conditional branching.
- Needs to call external tools, APIs, or databases in sequence.
- Involves planning, self-correction, or iteration.
- Simple prompt or RAG fails on long horizons.
- Workflow looks like "research then act then verify."
- You accept higher cost and failure rate for autonomy.
Agents add loops. Use sparingly.
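The loop an agent adds can be sketched in a few lines. This is a minimal plan-act-verify skeleton with a hard step budget; the `llm` callable and `tools` dict are stubs standing in for a real model and real APIs:

```python
def run_agent(goal, tools, llm, max_steps=5):
    """Loop: ask the model for the next action, execute it, feed the
    result back. A hard step budget caps cost and infinite loops."""
    history = []
    for _ in range(max_steps):
        action, arg = llm(goal, history)
        if action == "finish":
            return arg
        result = tools[action](arg)
        history.append((action, arg, result))
    return None  # budget exhausted; surface for human review

# Stub model: search once, then finish with what it found.
def fake_llm(goal, history):
    if not history:
        return ("search", goal)
    return ("finish", history[-1][2])

tools = {"search": lambda q: f"top result for {q!r}"}
print(run_agent("refund policy", tools, fake_llm))
```

Every pass through the loop costs tokens and can fail, which is why the table above puts agents at the highest ongoing cost.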
Cost Comparison Table (Fine-Tuning, ~3 epochs assumed)
Estimated training cost by dataset size (2026 data)
| Examples (tokens) | OpenAI (smaller models) | Anthropic | Open Source (self-hosted, e.g. Llama on cloud GPUs) |
|---|---|---|---|
| 1K examples (~1M tokens) | $3-8 | Not offered (or very limited) | $0.5-3 (spot GPUs) |
| 10K examples (~10M tokens) | $30-80 | Not offered | $5-30 |
| 100K examples (~100M tokens) | $300-800+ | Not offered | $50-300+ (depends on cluster) |
OpenAI charges for training tokens plus inference on the tuned model. Public reports suggest Anthropic steers customers toward prompt caching rather than full fine-tuning. Open source shifts the cost to hardware and your time.[3]
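The table scales roughly linearly with dataset size. A back-of-envelope calculator (rates are the table's estimates, which already assume ~3 epochs, not current provider quotes):

```python
def training_cost(dataset_tokens_millions: float,
                  rate_low: float, rate_high: float) -> tuple:
    """Rough cost bounds: dataset size (in millions of tokens)
    times a per-1M-token rate range. Rates are estimates only."""
    return (dataset_tokens_millions * rate_low,
            dataset_tokens_millions * rate_high)

lo, hi = training_cost(10, 3, 8)  # 10M-token dataset at $3-8 per 1M
print(f"${lo:.0f}-${hi:.0f}")     # matches the $30-80 row above
```

Remember this is training only; the tuned model's per-token inference price is a separate, ongoing line item.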
Common Failure Modes
Prompting
- Prompt drift across model updates.
- Context window overflow.
- Inconsistent output format.
RAG
- Bad retrieval (irrelevant chunks).
- Lost in the middle (context ranking fails).
- Vector embedding mismatch on domain terms.
Fine-Tuning
- Catastrophic forgetting of general capabilities.
- Overfitting to training data quirks.
- Expensive to update when facts change.
Agents
- Infinite loops or tool thrashing.
- Error cascades across steps.
- High token burn with little progress.
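A cheap mitigation for loops and tool thrashing is to track recent actions and abort when the agent repeats itself. A sketch (the class and window size are illustrative; agent frameworks typically expose hooks where a check like this can run):

```python
from collections import deque

class ThrashGuard:
    """Flag a loop when the same (tool, args) call recurs
    within a small sliding window of recent actions."""
    def __init__(self, window: int = 4):
        self.recent = deque(maxlen=window)

    def check(self, tool: str, args: str) -> bool:
        """Return False if this call looks like a repeat."""
        key = (tool, args)
        if key in self.recent:
            return False
        self.recent.append(key)
        return True

guard = ThrashGuard()
print(guard.check("search", "refund policy"))  # True: first call
print(guard.check("search", "refund policy"))  # False: repeat
```

Combined with a hard step budget, this catches most runaway loops before they burn a meaningful number of tokens.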
Test simple prompting first on 50 real examples. Measure accuracy, latency, and cost. Move up the tree only when the numbers justify it. The decision chain is prompt quality first, then data access, then behavior change, then orchestration. Most teams over-engineer early and pay for it later.
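That 50-example test can be a tiny harness. A sketch that measures accuracy, latency, and a character-count proxy for cost; the stub model stands in for a real API call, and in practice you would swap in real token counts and pricing:

```python
import time

def evaluate(predict, examples):
    """Run `predict` over (input, expected) pairs and report
    accuracy, mean latency, and characters processed."""
    correct, latencies, chars = 0, [], 0
    for prompt, expected in examples:
        t0 = time.perf_counter()
        out = predict(prompt)
        latencies.append(time.perf_counter() - t0)
        chars += len(prompt) + len(out)
        correct += (out.strip().lower() == expected.strip().lower())
    return {
        "accuracy": correct / len(examples),
        "mean_latency_s": sum(latencies) / len(latencies),
        "chars_processed": chars,
    }

# Stub model standing in for a real API call.
stub = lambda p: "positive" if "great" in p else "negative"
examples = [("great product", "positive"), ("broke on day one", "negative")]
print(evaluate(stub, examples))
```

Run the same harness against each candidate approach and let the three numbers, not architecture enthusiasm, decide the move up the tree.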