"Should I fine-tune or use RAG?"
I've been asked this question 47 times in the past 3 months. Every time, the person asking thinks they need fine-tuning. About 90% of the time, they actually need RAG.
The confusion makes sense. Both techniques let you teach an AI model about your specific data. Both improve accuracy. Both cost money. But they solve completely different problems, and choosing wrong wastes weeks of engineering time and thousands of dollars.
After fine-tuning 12 models and building 8 RAG systems in production, here's the framework I use to decide which approach to use—and when you might need both.
The Core Difference
Fine-Tuning: Changes How the Model Thinks
Fine-tuning is retraining a model on your specific data. You're updating the model's weights (its internal parameters) to make it behave differently.
What it does:
- Changes the model's style, tone, or format
- Teaches the model new patterns or structures
- Makes the model follow specific instructions better
- Adjusts the model's personality or behavior
What it doesn't do:
- Add new factual knowledge reliably
- Keep information up-to-date
- Scale to millions of documents
- Let you easily update information
Example use case:
Before fine-tuning: "The server is experiencing issues."
After fine-tuning: "Incident detected: API gateway responding with 503 errors. Auto-scaling triggered. ETA: 2 minutes."
The model learned to output in your company's incident report format.
RAG: Gives the Model Access to Information
RAG (Retrieval-Augmented Generation) gives the model access to external knowledge at query time. You're not changing the model—you're giving it a search engine.
What it does:
- Lets the model access specific documents or data
- Keeps information current (update the database, not the model)
- Scales to millions of documents
- Provides sources and citations
- Works with any model (no retraining needed)
What it doesn't do:
- Change the model's output style or format
- Teach new reasoning patterns
- Work well without good retrieval quality
- Eliminate the need for prompt engineering
Example use case:
User: "What's our return policy for electronics?"
RAG system:
1. Searches knowledge base for "return policy electronics"
2. Finds relevant policy document
3. Feeds document + question to model
4. Model answers based on retrieved information
The Decision Matrix
Use this flowchart:
Do you need to change the MODEL'S BEHAVIOR?
│
├─ YES: Consider Fine-Tuning
│   │
│   └─ Do you have 500+ high-quality examples?
│       │
│       ├─ YES: Fine-tuning will likely work
│       └─ NO: Use few-shot prompting instead
│
└─ NO: Do you need to give the model ACCESS TO INFORMATION?
    │
    └─ YES: Use RAG
        │
        └─ Is the information constantly changing?
            │
            ├─ YES: Definitely RAG
            └─ NO: Still probably RAG (easier to maintain)
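Encoded as a toy helper, the flowchart reads like this (the function name and argument names are illustrative; the thresholds come straight from the tree above):

```python
def choose_approach(changes_behavior: bool,
                    num_examples: int,
                    needs_external_info: bool) -> str:
    """Toy encoding of the fine-tuning vs. RAG decision flowchart."""
    if changes_behavior:
        # Behavior changes point to fine-tuning, but only with enough data
        if num_examples >= 500:
            return "fine-tuning"
        return "few-shot prompting"
    if needs_external_info:
        return "RAG"
    # Even for static information, RAG is usually easier to maintain
    return "RAG"
```

For example, `choose_approach(True, 50, False)` returns `"few-shot prompting"`: you want behavior changes but lack the data to fine-tune.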
When to Use Fine-Tuning
Use Case 1: Consistent Output Format
Problem: You need every response in a specific JSON structure.
Example:
# Without fine-tuning: Inconsistent structure
{
  "answer": "The server is down",
  "severity": "high"
}

# Sometimes it outputs:
{
  "response": "The server is down",
  "priority": "P1"
}

# After fine-tuning: Consistent every time
{
  "status": "down",
  "severity": "critical",
  "affected_services": ["api", "web"],
  "estimated_recovery": "15 minutes"
}
Why fine-tuning works: The model learns the exact output structure from hundreds of examples.
Use Case 2: Custom Tone or Brand Voice
Problem: You need the model to write in your company's specific style.
Example (legal tech):
Standard GPT-4: "You might want to consider reviewing the contract."
Fine-tuned model: "Pursuant to Section 3.2(a), we recommend immediate review of the Master Service Agreement dated March 1, 2026, with particular attention to the liability cap provisions."
Why fine-tuning works: Tone and style are baked into the model's weights.
Use Case 3: Specialized Task Performance
Problem: The model needs to perform a specific task exceptionally well.
Examples:
- Code generation in your company's coding style
- Medical diagnosis following specific protocols
- Financial analysis with your firm's methodology
- Customer support responses matching your brand guidelines
Data needed: 500-10,000 examples of the task done correctly.
Use Case 4: Reducing Prompt Complexity
Problem: Your prompt is 3,000 tokens of instructions and examples.
Before fine-tuning:
prompt = f"""
You are a customer support agent for TechCorp.
[2,000 tokens of instructions, examples, edge cases, formatting rules]
User question: {question}
"""
Cost per call: $0.09 (mostly prompt tokens)
After fine-tuning:
prompt = f"User question: {question}"
Cost per call: $0.01 (90% reduction)
Why fine-tuning works: The instructions are embedded in the model. You don't need to repeat them.
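The arithmetic behind that reduction, as a quick sketch (assuming a hypothetical $30 per 1M input tokens, which reproduces the $0.09 figure above for a 3,000-token prompt; output tokens are ignored):

```python
# Hypothetical input-token price, chosen to match the article's numbers
PRICE_PER_1M_INPUT_TOKENS = 30.00

def prompt_cost(prompt_tokens: int) -> float:
    """Input-token cost of a single call, ignoring output tokens."""
    return prompt_tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS

before = prompt_cost(3_000)  # instruction-heavy prompt
after = prompt_cost(300)     # fine-tuned model needs only the question
print(round(before, 2), round(after, 3))  # 0.09 0.009
```

Shrinking the prompt by 10x shrinks the input cost by 10x, call after call, which is where the ongoing savings come from.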
When to Use RAG
Use Case 1: Internal Knowledge Base
Problem: You need the AI to answer questions about company docs, policies, procedures.
Example:
User: "What's our PTO policy for employees in California?"
RAG system:
1. Searches HR policy database
2. Finds California-specific PTO policy (updated last month)
3. Model answers based on current policy
Why RAG works:
- Policies change frequently (fine-tuning would be outdated immediately)
- You can cite sources ("According to HR Policy Doc #423...")
- Scales to thousands of documents
- No retraining needed when policies update
Use Case 2: Customer Support with Product Information
Problem: Support agents need information about 5,000+ products.
Fine-tuning approach:
- Train model on all product info
- Cost: $800-2,000
- When product changes: Retrain ($800-2,000 again)
- Model might hallucinate product details
RAG approach:
- Store product info in vector database
- Cost: $50/month
- When product changes: Update database (instant)
- Model only answers from retrieved docs (no hallucinations)
Winner: RAG (by a mile)
Use Case 3: Document Q&A
Problem: Users upload PDFs and ask questions about them.
Example:
User uploads 200-page contract, asks:
"What are the termination clauses?"
RAG system:
1. Chunks document into sections
2. Finds sections mentioning "termination"
3. Sends relevant sections to model
4. Model summarizes termination clauses
Why fine-tuning doesn't work: You can't fine-tune a model for every document users upload.
Use Case 4: Real-Time Information
Problem: The model needs current information (news, stock prices, weather, etc.)
Fine-tuning: Model's knowledge is frozen at training time.
RAG: Fetches current information from live APIs or databases.
Winner: RAG (only option that works)
Real Cost Comparison
Scenario: Customer Support Bot
Requirements:
- 10,000 product SKUs
- 500 support articles
- 1,000 FAQs
- Updated weekly
Fine-Tuning Approach
Initial setup:
- Data preparation: 40 hours ($6,000)
- Training: $1,200 (OpenAI fine-tuning cost)
- Testing and iteration: 20 hours ($3,000)
- Total: $10,200
Ongoing:
- Weekly retraining: $1,200/week = $4,800/month
- Data prep for updates: 10 hours/week = $6,000/month
- Monthly cost: $10,800
RAG Approach
Initial setup:
- Vector database setup: 8 hours ($1,200)
- Embedding generation: $50 (one-time)
- Integration: 12 hours ($1,800)
- Total: $3,050
Ongoing:
- Vector database hosting: $100/month
- Embedding updates: $10/month
- Monthly cost: $110
Savings: $10,690/month (99% cheaper)
When to Use BOTH
Some problems need both fine-tuning and RAG:
Example: Legal Document Analysis
Fine-tuning handles:
- Output format (structured JSON with citations)
- Legal reasoning patterns
- Citation style and format
- Specific terminology and phrasing
RAG handles:
- Access to case law database
- Current legal precedents
- Client-specific documents
- Jurisdiction-specific regulations
Architecture:
User question
↓
[RAG] Retrieve relevant legal documents
↓
[Fine-tuned model] Analyze using legal reasoning patterns
↓
Structured legal opinion with citations
Example: Code Generation in Company Style
Fine-tuning handles:
- Coding style (naming conventions, structure)
- Internal library usage patterns
- Error handling standards
- Documentation format
RAG handles:
- Internal API documentation
- Recent code examples from the repo
- Architecture decision records
- Library version specifics
Result: Code that matches your style AND uses current APIs correctly.
The Hybrid Pattern
Here's the architecture I use most often:
def hybrid_ai_system(user_query):
    # Step 1: RAG retrieves relevant information
    relevant_docs = vector_db.search(
        query=user_query,
        top_k=5
    )

    # Step 2: Fine-tuned model processes with context
    response = fine_tuned_model.generate(
        system_prompt=get_company_style_prompt(),
        context=relevant_docs,
        query=user_query,
        format="company_standard_json"
    )

    return response
Benefits:
- RAG keeps information current
- Fine-tuning ensures consistent output format
- Best of both worlds
Common Mistakes
Mistake 1: Fine-Tuning for Knowledge
Bad idea:
"I'll fine-tune GPT-4 on our entire product catalog so it knows about all our products."
Why it fails:
- Models struggle to memorize facts through fine-tuning
- They hallucinate similar but wrong information
- Updates require expensive retraining
- No way to cite sources
Right approach: Use RAG for knowledge, fine-tuning for behavior.
Mistake 2: Using RAG for Style
Bad idea:
"I'll put examples of our writing style in the vector database and retrieve them with every query."
Why it fails:
- Retrieval is too slow for style guidance
- Examples in context window waste tokens
- Inconsistent style based on which examples are retrieved
Right approach: Fine-tune for consistent style/format.
Mistake 3: Not Enough Fine-Tuning Data
Bad idea:
"I have 50 examples, I'll fine-tune GPT-4 on them."
Why it fails:
- Minimum for reasonable results: 500 examples
- Good results: 1,000-10,000 examples
- 50 examples = unstable, poor quality
Right approach: Use few-shot prompting (include examples in prompt) until you have enough data.
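A minimal sketch of that few-shot fallback, using the OpenAI chat message shape; `examples` is a hypothetical list of (input, output) pairs:

```python
def build_few_shot_messages(system_prompt, examples, user_query):
    """Turn a handful of (input, output) pairs into chat messages."""
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in examples:
        # Each example pair becomes a user turn followed by an assistant turn
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": user_query})
    return messages

msgs = build_few_shot_messages(
    "Answer in our incident-report format.",
    [("Server down", "Incident detected: ..."),
     ("Disk full", "Incident detected: ...")],
    "API latency spiking",
)
# 1 system + 2x2 example turns + 1 user query = 6 messages
```

Once you've collected 500+ real examples this way, the same pairs become your fine-tuning dataset.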
Mistake 4: Poor RAG Retrieval Quality
Bad idea:
"I'll just dump all our docs into a vector database and let the AI figure it out."
Why it fails:
- If retrieval returns irrelevant documents, the AI will give irrelevant answers
- Garbage in = garbage out
Right approach:
- Chunk documents strategically
- Use metadata filters
- Test retrieval quality before adding generation
- Consider hybrid search (vector + keyword)
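A minimal, illustrative chunker: fixed-size character windows with overlap, so text that straddles a boundary appears in both chunks (production systems often split on headings or paragraphs instead):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping fixed-size character chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Advance by the stride (chunk_size - overlap) so chunks overlap
        start += chunk_size - overlap
    return chunks
```

The `chunk_size` and `overlap` values are starting points to tune, not magic numbers; retrieval quality is very sensitive to them.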
Decision Checklist
Answer these questions:
1. What are you trying to change?
- How the model thinks/responds → Fine-tuning
- What information the model has access to → RAG
2. How often does the information change?
- Rarely (less than monthly) → Consider fine-tuning
- Frequently (weekly or more) → Definitely RAG
3. How much data do you have?
- Less than 500 examples → Few-shot prompting, not fine-tuning
- 500-10,000 examples → Fine-tuning is viable
- 10,000+ documents → RAG is the only scalable option
4. Do you need citations/sources?
- Yes → RAG (fine-tuning can't provide sources)
- No → Either could work
5. What's your budget?
- Limited → RAG (10-100x cheaper ongoing)
- Large budget for quality → Consider both
6. How fast do you need updates?
- Immediate → RAG (update database instantly)
- Can wait days/weeks → Fine-tuning acceptable
How to Start with RAG
Step 1: Set Up Vector Database
from pinecone import Pinecone

# Initialize
pc = Pinecone(api_key="your-api-key")
index = pc.Index("knowledge-base")

# Store documents
for doc in documents:
    embedding = get_embedding(doc.text)  # OpenAI ada-002
    index.upsert([(doc.id, embedding, {"text": doc.text})])
Cost: $0.096 per 1M queries (Pinecone serverless)
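For an even cheaper prototype, you can skip the hosted database entirely: a brute-force in-memory index with cosine similarity behaves the same way at small scale. A sketch, not production code; embeddings here are plain lists of floats from whatever embedding model you use:

```python
import math

class InMemoryIndex:
    """Brute-force cosine-similarity search; fine for a few thousand docs."""

    def __init__(self):
        self.items = []  # (doc_id, embedding, metadata) tuples

    def upsert(self, doc_id, embedding, metadata):
        self.items.append((doc_id, embedding, metadata))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def query(self, vector, top_k=5):
        scored = [(self._cosine(vector, emb), doc_id, meta)
                  for doc_id, emb, meta in self.items]
        scored.sort(key=lambda t: t[0], reverse=True)
        return scored[:top_k]
```

Swap it for a real vector database once you pass a few thousand documents; the query interface stays conceptually identical.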
Step 2: Retrieve Relevant Docs
def search(query, top_k=5):
    query_embedding = get_embedding(query)
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )
    return [match.metadata["text"] for match in results.matches]
Step 3: Generate with Context
from openai import OpenAI

client = OpenAI()

def rag_answer(question):
    # Retrieve
    context_docs = search(question, top_k=3)
    context = "\n\n".join(context_docs)

    # Generate
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": "Answer questions based only on the provided context."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )
    return response.choices[0].message.content
Time to first working prototype: 2-4 hours
How to Start with Fine-Tuning
Step 1: Prepare Training Data
Format (OpenAI):
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "4"}]}
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 3+3?"}, {"role": "assistant", "content": "6"}]}
Minimum: 500 examples
Recommended: 1,000-10,000 examples
Use my Fine-Tuning Data Formatter to convert CSV/JSON to JSONL.
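If you'd rather script the conversion yourself, the format is just one JSON object per line. A minimal sketch (`pairs` is a hypothetical list of (question, answer) tuples):

```python
import json

def to_openai_jsonl(pairs, system_prompt="You are a helpful assistant."):
    """Serialize (question, answer) pairs into OpenAI fine-tuning JSONL."""
    lines = []
    for question, answer in pairs:
        record = {"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

Validate every line with `json.loads` before uploading; a single malformed line can fail the whole training job.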
Step 2: Upload and Train
from openai import OpenAI

client = OpenAI()

# Upload training file
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# Start fine-tuning
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-2024-08-06"
)
Cost: $25-200 depending on model and data size
Time: 30 minutes to 6 hours
Step 3: Use Fine-Tuned Model
response = client.chat.completions.create(
    model=job.fine_tuned_model,  # Your custom model (populated once the job succeeds)
    messages=[{"role": "user", "content": "Your prompt"}]
)
Time to production: 2-3 weeks (including data prep and testing)
Advanced: RAG Optimization
Technique 1: Hybrid Search
Combine semantic search (vector) with keyword search:
def hybrid_search(query, top_k=5):
    # Semantic search
    semantic_results = vector_db.search(query, top_k=10)

    # Keyword search
    keyword_results = elasticsearch.search(query, top_k=10)

    # Merge and re-rank
    combined = merge_and_rerank(semantic_results, keyword_results)
    return combined[:top_k]
Improvement: 15-30% better retrieval accuracy
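One common way to implement that `merge_and_rerank` step is reciprocal rank fusion (RRF), which needs only each document's rank in each list. A sketch, assuming both inputs are ranked lists of document IDs; `k=60` is the constant commonly used in the RRF literature:

```python
def merge_and_rerank(semantic_results, keyword_results, k=60):
    """Reciprocal rank fusion: score = sum over lists of 1 / (k + rank)."""
    scores = {}
    for results in (semantic_results, keyword_results):
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

RRF's appeal is that it ignores the raw scores, which are not comparable between a vector index and a keyword engine, and fuses rankings instead.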
Technique 2: Metadata Filtering
def filtered_search(query, category=None, date_after=None):
    filter_dict = {}
    if category:
        filter_dict["category"] = category
    if date_after:
        filter_dict["date"] = {"$gte": date_after}

    return index.query(
        vector=get_embedding(query),
        filter=filter_dict,
        top_k=5
    )
Use case: "Show me AWS articles from the last month"
Technique 3: Re-Ranking
from cohere import Client

cohere_client = Client(api_key="...")

def rerank_results(query, documents):
    rerank_response = cohere_client.rerank(
        query=query,
        documents=documents,
        top_n=3,
        model="rerank-english-v3.0"
    )
    # Each result carries an index into the original documents list
    return [documents[result.index] for result in rerank_response.results]
Cost: $1 per 1,000 rerank calls
Improvement: 20-40% better final answer quality
Tools for Fine-Tuning and RAG
Free Tools I Built
- Fine-Tuning Data Formatter - Convert CSV/JSON to JSONL for OpenAI, Anthropic, Together AI
- Token Counter - Estimate fine-tuning costs before training
- AI Model Comparison - Compare which models support fine-tuning
- JSON Schema Generator - Create schemas for structured fine-tuning outputs
All free, no signup, run in your browser.
Summary Table
| Factor | Fine-Tuning | RAG |
|---|---|---|
| Best for | Behavior, style, format | Facts, knowledge, documents |
| Cost (setup) | $1,000-10,000 | $100-1,000 |
| Cost (ongoing) | $500-5,000/month | $50-500/month |
| Update speed | Days to weeks | Instant |
| Data needed | 500-10,000 examples | Any amount |
| Provides sources | No | Yes |
| Scales to millions of docs | No | Yes |
| Changes model behavior | Yes | No |
| Time to production | 2-4 weeks | 3-7 days |
What to Do Next
If you're just starting:
- Try RAG first - 90% of use cases need RAG, not fine-tuning
- Use existing tools - Pinecone, Weaviate, or Qdrant for vector databases
- Start simple - Basic RAG before optimization
- Measure retrieval quality - Test retrieval before adding generation
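To act on that last point, label a handful of (query, expected document) pairs and measure recall@k on the retriever before wiring up generation. A sketch; `search` here stands for whatever retrieval function you're testing, returning a ranked list of document IDs:

```python
def recall_at_k(search, labeled_queries, k=5):
    """Fraction of queries whose expected doc appears in the top-k results."""
    hits = 0
    for query, expected_doc_id in labeled_queries:
        results = search(query, top_k=k)
        if expected_doc_id in results[:k]:
            hits += 1
    return hits / len(labeled_queries)
```

Even 20-30 labeled pairs are enough to catch gross retrieval failures before you spend time tuning prompts.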
If you definitely need fine-tuning:
- Collect 1,000+ examples - Quality over quantity
- Use the data formatter tool - Get format right first time
- Start with a small model - GPT-3.5 is cheaper for testing
- Measure before/after - Quantify improvement
If you need both:
- Build RAG first - Easier to iterate
- Add fine-tuning later - Once you understand the problem
- Optimize separately - Don't optimize both at once
Further Reading:
- How to Use AI Coding Assistants for Infrastructure
- LLM Context Windows Explained
- Fine-Tuning Data Formatter Tool
- Complete AI Model Comparison
Still not sure which to use? Email me your specific use case: phaqqani@gmail.com or connect on LinkedIn.