Technology · Mar 30, 2026 · 3 min read

LLM Fine-Tuning Decision Flowchart: When to Fine-Tune LLMs

Decision Flowchart

Start
  |
  +-- Need persistent knowledge/behavior changes?
      YES -> Have 1000+ labeled examples?
             YES -> Fine-tuning
             NO  -> Need complex multi-step reasoning?
                    YES -> Agents
                    NO  -> Prompt Engineering
      NO  -> Data fits in context window?
             YES -> Prompt Engineering
             NO  -> RAG
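
One reasonable way to encode this kind of triage is a small helper function. This is a sketch: `choose_approach` and its parameter names are illustrative, and the 1,000-example threshold matches the dataset requirements and red flags later in this guide.

```python
def choose_approach(needs_persistent_changes, n_labeled_examples,
                    fits_in_context, needs_multi_step):
    """Illustrative encoding of the triage flow: persistent behavior
    changes with enough labeled data point to fine-tuning; otherwise
    fall back to prompting, agents, or RAG."""
    if needs_persistent_changes:
        if n_labeled_examples >= 1000:
            return "Fine-tuning"
        # Not enough data to fine-tune safely
        return "Agents" if needs_multi_step else "Prompt Engineering"
    # No persistent changes needed: pick based on how the data fits
    return "Prompt Engineering" if fits_in_context else "RAG"
```
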

Solution Comparison

Approach     Cost        Latency    Accuracy     Setup Time  Maintenance  When to Use
Fine-tuning  $2k-30k+    Low        High         2-6 weeks   High         Task-specific behaviors, consistent style/format
RAG          $50-500/mo  Medium     Medium-High  2-5 days    Low          Knowledge-heavy tasks, fresh data needed
Prompt Eng   $10-100/mo  High       Medium       Hours       Low          Exploration, simple tasks, irregular usage
Agents       $200-2k/mo  Very High  Variable     1-2 weeks   Medium       Complex workflows, tool use, multi-step tasks
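
One way to compare a one-time fine-tuning spend against a recurring approach is a simple break-even calculation. A sketch, using the table's cost ranges as illustrative inputs (it ignores inference-cost differences):

```python
def breakeven_months(one_time_cost, monthly_cost):
    """Months of recurring spend that equal a one-time cost.
    Deliberately crude: ignores inference pricing and maintenance effort."""
    return one_time_cost / monthly_cost
```

For example, a $2,000 fine-tune against RAG at $500/mo breaks even in 4 months; at the high end ($30,000 vs. $50/mo) it never pays off on cost alone.
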

Dataset Requirements

Task Type       Min Examples  Format                                  Quality Bar
Classification  1000/class    {"input": "", "label": ""}              95% human agreement
Generation      5000 pairs    {"prompt": "", "completion": ""}        Expert-level output
Style Transfer  2000 pairs    {"source": "", "target": ""}            Consistent style
QA              3000 pairs    {"question": "", "answer": ""}          Factually perfect
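
A small serializer can enforce these per-task schemas before export. This is a sketch; `TASK_SCHEMAS` and `to_jsonl` are hypothetical names mirroring the Format column above, not a library API.

```python
import json

# Hypothetical schema map mirroring the Format column above
TASK_SCHEMAS = {
    "classification": {"input", "label"},
    "generation": {"prompt", "completion"},
    "style_transfer": {"source", "target"},
    "qa": {"question", "answer"},
}

def to_jsonl(records, task):
    """Serialize records to JSONL, checking each record's keys
    against the expected schema for the task type."""
    expected = TASK_SCHEMAS[task]
    for r in records:
        assert set(r) == expected, f"expected keys {expected}, got {set(r)}"
    return "\n".join(json.dumps(r) for r in records)
```
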

Fine-tuning Costs (OpenAI)

Model    Training Cost / 1k examples  Inference Cost / 1k tokens
GPT-3.5  $0.80                        $0.012
GPT-4    $2.40                        $0.030
Davinci  $0.60                        $0.012
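
A back-of-envelope estimator built from these rates. The rates are copied from the table above and should be treated as illustrative rather than current pricing:

```python
# Per-1k-example training rates from the table above (illustrative, not live pricing)
TRAINING_RATE_PER_1K = {"gpt-3.5": 0.80, "gpt-4": 2.40, "davinci": 0.60}

def estimated_training_cost(model, n_examples):
    """Rough training cost in dollars for a dataset of n_examples."""
    return TRAINING_RATE_PER_1K[model] * n_examples / 1000
```
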

Evaluation Metrics

Metric      Measures               When to Use              Gotchas
BLEU        Token overlap          Translation, generation  Penalizes valid paraphrasing
ROUGE       Summary quality        Summarization            Gaming via keyword stuffing
Perplexity  Prediction confidence  Language modeling        Not for non-language tasks
Human Eval  Overall quality        Complex tasks            Expensive, high variance
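
The keyword-stuffing gotcha is easy to see in a simplified ROUGE-1 recall (clipped unigram overlap divided by reference length). A sketch, not the full ROUGE implementation:

```python
from collections import Counter

def rouge1_recall(reference, candidate):
    """Simplified ROUGE-1 recall: clipped unigram overlap / reference length."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(n, cand[tok]) for tok, n in ref.items())
    return overlap / max(sum(ref.values()), 1)
```

Note that a candidate that merely contains every reference word, in any order and padded with junk, scores a perfect 1.0, which is exactly how keyword stuffing games the metric.
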

Common Failure Modes

Approach     Failure Mode             Detection            Mitigation
Fine-tuning  Catastrophic forgetting  Eval on base tasks   Mixture of experts
RAG          Hallucinated citations   Citation checking    Strict validation
Prompt Eng   Context overflow         Token counting       Chunk/summarize input
Agents       Infinite loops           Timeout monitoring   Max step limits
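
The two agent mitigations, a hard step cap and a wall-clock timeout, fit in a few lines. This is a sketch; the `step_fn(state) -> (new_state, done)` interface is a hypothetical simplification of a real agent loop:

```python
import time

def run_agent(step_fn, state, max_steps=10, timeout_s=30.0):
    """Run an agent loop guarded by a step cap and a wall-clock timeout.
    step_fn(state) -> (new_state, done) is a hypothetical interface."""
    start = time.monotonic()
    for _ in range(max_steps):
        if time.monotonic() - start > timeout_s:
            raise TimeoutError("agent exceeded time budget")
        state, done = step_fn(state)
        if done:
            return state
    raise RuntimeError("agent hit max step limit")
```
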

Quick Validation Tests

# Fine-tuning readiness check (assumes classification-style records
# like {"input": ..., "label": ...})
def validate_dataset(examples):
    assert len(examples) >= 1000, "Need 1000+ examples"
    # Measure the input text, not the record dict itself
    assert max(len(x["input"]) for x in examples) < 4000, "Examples too long"
    assert len({x["label"] for x in examples}) >= 2, "Need multiple classes"

# RAG document validation
def validate_chunks(chunks):
    assert all(len(c) < 1000 for c in chunks), "Chunks too large"
    assert len(chunks) >= 100, "Need more context"
    # Crude near-duplicate check: compare the first 50 characters of each chunk
    assert len({c[:50] for c in chunks}) == len(chunks), "Duplicate chunks"

Red Flags

  • Fine-tuning with <1000 examples
  • RAG without document preprocessing
  • Prompt engineering beyond 2000 tokens
  • Agents without error recovery
  • Missing held-out test set
  • No automated evaluation metrics
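
The "prompt engineering beyond 2000 tokens" flag is cheap to check without a tokenizer. A sketch using the rough ~4-characters-per-token heuristic for English text; `prompt_too_long` is an illustrative name, not a library function:

```python
def rough_token_count(text):
    """Crude estimate: roughly 4 characters per token for English text."""
    return len(text) // 4

def prompt_too_long(prompt, limit=2000):
    """Flags prompts past the red-flag threshold above (default 2000 tokens)."""
    return rough_token_count(prompt) > limit
```
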
JA
Founder, TruSentry Security | Technology Editor, EG3

Founder of TruSentry Security. Installs the cameras, reads the datasheets, and writes about what the spec sheet got wrong.