Decision Flowchart
```
Start
 |
 +-- Need persistent knowledge/behavior changes?
      |
      +-- NO --> Does the relevant data fit in the context window (<~100k tokens)?
      |            |
      |            +-- YES --> Prompt Engineering
      |            +-- NO  --> RAG
      |
      +-- YES --> >10k labeled examples?
                    |
                    +-- YES --> Fine-tuning
                    +-- NO  --> Need complex multi-step reasoning?
                                 |
                                 +-- YES --> Agents
                                 +-- NO  --> Prompt Engineering
```
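The same triage as a small Python helper, in case you want to embed it in a script. The function name, arguments, and the 10k-example threshold simply restate the flowchart above and are otherwise illustrative:

```python
def choose_approach(needs_persistent_changes: bool,
                    fits_in_context: bool,
                    labeled_examples: int,
                    needs_multi_step_reasoning: bool) -> str:
    """Mirror of the decision flowchart; thresholds are rough heuristics, not hard rules."""
    if needs_persistent_changes:
        if labeled_examples > 10_000:
            return "Fine-tuning"
        if needs_multi_step_reasoning:
            return "Agents"
        return "Prompt Engineering"
    # No persistent change needed: decide where the knowledge should live.
    return "Prompt Engineering" if fits_in_context else "RAG"

print(choose_approach(True, False, 25_000, False))   # -> Fine-tuning
print(choose_approach(False, False, 0, False))       # -> RAG
```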
Solution Comparison
| Approach | Cost | Latency | Accuracy | Setup Time | Maintenance | When to Use |
|---|---|---|---|---|---|---|
| Fine-tuning | $2k-30k+ | Low | High | 2-6 weeks | High | Task-specific behaviors, consistent style/format |
| RAG | $50-500/mo | Medium | Medium-High | 2-5 days | Low | Knowledge-heavy tasks, fresh data needed |
| Prompt Eng | $10-100/mo | High | Medium | Hours | Low | Exploration, simple tasks, irregular usage |
| Agents | $200-2k/mo | Very High | Variable | 1-2 weeks | Medium | Complex workflows, tool use, multi-step tasks |
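To translate those ranges into your own usage level, a rough amortization sketch; the dollar figures are copied from the table, and treating the fine-tuning range as a one-time cost spread over a 12-month horizon is an assumption, not a rule:

```python
# Rough cost comparison using the ranges from the table above.
# Assumption: the fine-tuning figure is a one-time cost amortized over
# `horizon_months`; the other figures are recurring monthly costs.
ONE_TIME = {"Fine-tuning": (2_000, 30_000)}
MONTHLY = {"RAG": (50, 500), "Prompt Engineering": (10, 100), "Agents": (200, 2_000)}

def monthly_range(approach: str, horizon_months: int = 12) -> tuple[float, float]:
    if approach in ONE_TIME:
        low, high = ONE_TIME[approach]
        return low / horizon_months, high / horizon_months
    return MONTHLY[approach]

for name in list(ONE_TIME) + list(MONTHLY):
    low, high = monthly_range(name)
    print(f"{name:>18}: ${low:,.0f} - ${high:,.0f} per month (approx.)")
```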
Dataset Requirements
| Task Type | Min Examples | Format | Quality Bar |
|---|---|---|---|
| Classification | 1000/class | {"input": "", "label": ""} | 95% human agreement |
| Generation | 5000 pairs | {"prompt": "", "completion": ""} | Expert-level output |
| Style Transfer | 2000 pairs | {"source": "", "target": ""} | Consistent style |
| QA | 3000 pairs | {"question": "", "answer": ""} | Factually perfect |
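If it helps to see the formats concretely, here is a minimal sketch that writes one record per task type to JSONL; the file names and sample contents are made up:

```python
import json

# One example record per task type, matching the formats in the table above.
SAMPLES = {
    "classification.jsonl": {"input": "Great battery life", "label": "positive"},
    "generation.jsonl": {"prompt": "Summarize: ...", "completion": "..."},
    "style_transfer.jsonl": {"source": "hey whats up", "target": "Hello, how are you?"},
    "qa.jsonl": {"question": "What is the capital of France?", "answer": "Paris"},
}

for path, record in SAMPLES.items():
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```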
Fine-tuning Costs (OpenAI)
| Model | Training Cost/1k examples | Inference Cost/1k tokens |
|---|---|---|
| GPT-3.5 | $0.80 | $0.012 |
| GPT-4 | $2.40 | $0.030 |
| Davinci | $0.60 | $0.012 |
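A quick budgeting helper using the per-1k rates from the table; the rates are taken as given here, so check current pricing before relying on the output:

```python
def finetune_budget(n_examples: int, monthly_tokens: int,
                    train_rate_per_1k_examples: float,
                    infer_rate_per_1k_tokens: float) -> dict:
    """Rough budget: one-time training cost plus recurring inference cost."""
    training = n_examples / 1_000 * train_rate_per_1k_examples
    inference = monthly_tokens / 1_000 * infer_rate_per_1k_tokens
    return {"training_usd": training, "monthly_inference_usd": inference}

# Example: 10k examples on the GPT-3.5 row, 2M inference tokens/month.
print(finetune_budget(10_000, 2_000_000, 0.80, 0.012))
# -> {'training_usd': 8.0, 'monthly_inference_usd': 24.0}
```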
Evaluation Metrics
| Metric | Measures | When to Use | Gotchas |
|---|---|---|---|
| BLEU | Token overlap | Translation, generation | Penalizes valid paraphrasing |
| ROUGE | Summary quality | Summarization | Gaming via keyword stuffing |
| Perplexity | Prediction confidence | Language modeling | Not for non-language tasks |
| Human Eval | Overall quality | Complex tasks | Expensive, high variance |
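To make the overlap metrics concrete, a toy unigram sketch; this is not the official BLEU/ROUGE implementation (those add n-gram clipping, brevity penalties, and stemming), so use a maintained library for real evaluations:

```python
def unigram_overlap(candidate: str, reference: str) -> dict:
    """Toy unigram precision/recall: precision is BLEU-1-like, recall is ROUGE-1-like."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
    return {
        "precision_bleu1_like": overlap / len(cand) if cand else 0.0,
        "recall_rouge1_like": overlap / len(ref) if ref else 0.0,
    }

print(unigram_overlap("the cat sat on the mat", "the cat is on the mat"))
# Both scores ~0.83: high overlap despite one differing word.
```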
Common Failure Modes
| Approach | Failure Mode | Detection | Mitigation |
|---|---|---|---|
| Fine-tuning | Catastrophic forgetting | Eval on base tasks | Mix general data into training |
| RAG | Hallucinated citations | Citation checking | Strict validation |
| Prompt Eng | Context overflow | Token counting | Chunk/summarize input |
| Agents | Infinite loops | Timeout monitoring | Max step limits |
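A minimal sketch of the agent-row mitigation, combining a wall-clock timeout with a hard step cap; `run_step` is a placeholder for whatever executes one agent action:

```python
import time

MAX_STEPS = 10          # hard cap against infinite agent loops
TIMEOUT_SECONDS = 60    # wall-clock guard

def run_agent(task, run_step):
    """run_step(task, step) is a placeholder callable returning (done, result)."""
    start = time.monotonic()
    for step in range(MAX_STEPS):
        if time.monotonic() - start > TIMEOUT_SECONDS:
            raise TimeoutError(f"Agent exceeded {TIMEOUT_SECONDS}s on step {step}")
        done, result = run_step(task, step)
        if done:
            return result
    raise RuntimeError(f"Agent hit the {MAX_STEPS}-step limit without finishing")

# Toy step function that "finishes" after three steps.
demo = lambda task, step: (step >= 2, f"answered '{task}' at step {step}")
print(run_agent("summarize report", demo))
```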
Quick Validation Tests
```python
# Fine-tuning readiness check
def validate_dataset(examples):
    """examples: list of {"input": str, "label": str} records."""
    assert len(examples) >= 1000, "Need 1000+ examples"
    assert max(len(x["input"]) for x in examples) < 4000, "Examples too long"
    assert len(set(x["label"] for x in examples)) >= 2, "Need multiple classes"

# RAG document validation
def validate_chunks(chunks):
    """chunks: list of chunk strings destined for the retrieval index."""
    assert all(len(c) < 1000 for c in chunks), "Chunks too large"
    assert len(chunks) >= 100, "Need more context"
    assert len(set(c[:50] for c in chunks)) == len(chunks), "Duplicate chunks"
```
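A small usage example for the two checks above; the dataset and chunks are synthetic and exist only to show the calling convention:

```python
examples = [
    {"input": f"sample text {i}", "label": "pos" if i % 2 else "neg"}
    for i in range(1200)
]
chunks = [f"chunk {i}: unique leading text ..." for i in range(150)]

validate_dataset(examples)   # raises AssertionError with a reason if not ready
validate_chunks(chunks)
print("Dataset and chunks pass the basic checks")
```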
Red Flags
- Fine-tuning with <1000 examples
- RAG without document preprocessing
- Prompt engineering beyond 2000 tokens
- Agents without error recovery
- Missing held-out test set
- No automated evaluation metrics (a minimal fix for both is sketched below)
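The last two red flags are cheap to address. Below is one minimal sketch: a deterministic held-out split plus exact match as a first automated metric; both choices are illustrative, not the only reasonable ones:

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Deterministic held-out split so evaluations are comparable across runs."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def exact_match(predictions, references):
    """Simplest automated metric: fraction of exact (whitespace-trimmed) matches."""
    assert len(predictions) == len(references)
    return sum(p.strip() == r.strip() for p, r in zip(predictions, references)) / len(references)

train, test = train_test_split(list(range(100)))
print(len(train), len(test))                      # 80 20
print(exact_match(["Paris", "Rome"], ["Paris ", "Madrid"]))  # 0.5
```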


