From API Bills to Custom Models: The Fine-Tuning Playbook
In AI 09, we built quality gates. In AI 10, we built better interfaces. Now comes the problem every AI team eventually hits: the API bill.
Month 1: a GPT-4o prototype that impresses everyone. Month 6: 100 internal users and a $20,000/month API invoice. Month 12: the CFO on the phone.
This is the API cost wall — and it’s not just about money. It’s also about latency (2-5 seconds per call), data privacy (your financial data leaves your network), and vendor lock-in (prices can be raised at any time). The solution isn’t to switch to a cheaper API. The solution is to stop renting intelligence you can own.
Fine-tuning lets you take the capability a large model demonstrates on your domain tasks and bake it permanently into a small model that costs 100× less to run, responds in milliseconds, and never leaves your network.
TL;DR
Use prompting and RAG for most things. Fine-tune when you need to change the model’s behavior — its style, format, reasoning patterns — or when cost and latency force you to stop using APIs.
The Customization Spectrum — Choose the Right Tool:
┌──────────────────────────────────────────────────────────────┐
│ AI Customization Spectrum │
│ │
│ Prompt (AI 01) RAG (AI 03) Fine-Tuning (AI 11) │
│ ║ ║ ║ │
│ What changes: What changes: What changes: │
│ Activation Context window Model weights │
│ │
│ Cost: Lowest Medium Highest upfront, │
│ Lowest per-query │
│ │
│ Best for: Best for: Best for: │
│ Task framing, Knowledge Style/format, │
│ few-shot injection, Behavior change, │
│ examples time-sensitive Privacy-required, │
│ (no training) data Low latency │
│ │
│ Persists: No No (query-time) Yes (weights = memory) │
└──────────────────────────────────────────────────────────────┘
Article Map
I — Theory Layer (when and why)
- The API Cost Wall — The three forces that push toward self-hosting
- When to Fine-tune vs. RAG vs. Prompt — Decision framework
- The Rise of SLMs — Small Language Models that punch above their weight
II — Technique Layer (how to do it)
- Fine-Tuning Methods: Full, LoRA, QLoRA — The math and intuition
- Data Preparation: The 80% of the Work — Quality beats quantity
- Knowledge Distillation: Teacher → Student — From GPT-4o to Llama-8B
III — Engineering Layer (production)
- Training Infrastructure — Where to train and what tools to use
- Evaluation & Iteration — Connecting AI 09 to fine-tuning
- Deployment: Serving Your Custom Model — vLLM, Ollama, llama.cpp
- Cost Analysis: API vs. Self-Hosted — The ROI framework
- Key Takeaways
1. The API Cost Wall
1.1 The Three Forces
Every AI application that finds product-market fit eventually faces three forces that make large-model APIs unsustainable at scale:
The Three Forces That Drive Teams to Fine-tuning:
Force 1: Cost
─────────────
GPT-4o: ~$10/1M input tokens + $30/1M output tokens
Financial AI (AI 08) monthly close, 1,000 users/day:
Each query: ~2,000 tokens in + ~500 tokens out
Per query: 2,000 × $10/1M + 500 × $30/1M = $0.035
Daily cost: 1,000 × $0.035 = $35/day
Annual: $12,775
At 10,000 users/day (Series B scale):
Annual: $127,750 ← HR budget for two engineers
Force 2: Latency
────────────────
GPT-4o API round-trip: 2-5 seconds (TTFT + generation)
For real-time financial workflows:
User submits bank statement
System extracts 847 transactions
Each transaction: 1 LLM call for classification
847 × 2 seconds = 28 minutes
With fine-tuned Llama-8B on-premise:
847 × 0.05 seconds = 42 seconds
Force 3: Privacy
────────────────
Every API call sends your data to a third-party server.
For financial AI:
├── Bank account numbers sent to OpenAI
├── Vendor invoices sent to Anthropic
├── Salary data sent to Google
└── Your GDPR/SOC2/ISO27001 auditors are not happy
Fine-tuned local model: zero data egress, ever.
1.2 The Math Doesn’t Lie
API Cost Scaling Reality:
Monthly Users GPT-4o API Cost Llama-8B Self-Hosted
───────────── ─────────────── ─────────────────────
100 $1,065/mo $150/mo (server)
1,000 $10,650/mo $150/mo
10,000 $106,500/mo $600/mo (4× GPU)
100,000 $1,065,000/mo $2,400/mo (16× GPU)
On serving cost alone, the crossover sits below 100 users/month;
even after amortizing training and engineering effort, break-even
typically arrives within the first few hundred users.
Above that threshold: every dollar of API cost is a dollar
you could be spending on your own infrastructure.
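The back-of-envelope math above can be reproduced in a few lines (a sketch: the per-token rates and the ~2,000-in/500-out query shape are the Force 1 assumptions, and `$150/mo` is the small-server figure from the table):

```python
# Back-of-envelope API vs. self-hosted cost model.
# Assumed rates (from Force 1): GPT-4o at $10/1M input, $30/1M output.

def api_cost_per_query(tokens_in: int = 2_000, tokens_out: int = 500,
                       in_rate: float = 10.0, out_rate: float = 30.0) -> float:
    """Dollar cost of one API call (rates are $/1M tokens)."""
    return tokens_in / 1e6 * in_rate + tokens_out / 1e6 * out_rate

def monthly_api_cost(queries_per_day: int) -> float:
    """API spend over a 30-day month at a fixed daily query volume."""
    return api_cost_per_query() * queries_per_day * 30

def breakeven_queries_per_day(server_monthly: float = 150.0) -> int:
    """Daily query volume at which a fixed-cost server beats the API."""
    return round(server_monthly / (api_cost_per_query() * 30))

print(f"${api_cost_per_query():.3f}/query")              # $0.035/query
print(f"${monthly_api_cost(1_000):,.0f}/mo at 1k/day")   # $1,050/mo at 1k/day
print(breakeven_queries_per_day())                       # 143
```

At these assumed rates, a $150/month server pays for itself at roughly 143 queries per day — which is why the crossover arrives so early in a product's life.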
🔧 Engineer’s Note: The API bill is a tax on not owning your models. Every team eventually pays it. The question is whether you pay it forever or invest the equivalent in fine-tuning and infrastructure that pays you back for years. The amortized cost of a fine-tuned Llama-8B running on a $4,000 server drops to near zero after 12 months of operation — while API costs grow linearly with users.
2. When to Fine-tune vs. RAG vs. Prompt
2.1 The Decision Framework
The most common mistake in AI engineering is reaching for fine-tuning when another technique would work better, faster, and cheaper. The second-most common mistake is not reaching for it when you should:
Decision Tree: Which Customization Technique?
START: What's your problem?
│
▼
Does the model lack factual knowledge (dates, docs, prices)?
└── YES → RAG (AI 03). Fine-tuning doesn't inject facts well.
│
▼
Is the issue format, style, domain-specific reasoning patterns?
└── YES → Fine-tuning is the right tool.
│
▼
Can few-shot examples in the system prompt solve it?
└── YES → Try prompting first. Zero training cost.
│
▼
Are you paying >$5,000/month in API costs?
└── YES → Fine-tuning break-even case is strong.
│
▼
Does your data privacy/regulation prohibit cloud APIs?
└── YES → Fine-tune + self-host. No discussion needed.
│
▼
Do you need <200ms inference latency?
└── YES → Self-hosted SLM required. Fine-tune for quality.
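The tree is first-match-wins, which makes it natural to encode as a function (a sketch: `choose_customization` and its argument names are illustrative, but the question order and the $5,000/month threshold are exactly the ones above):

```python
def choose_customization(
    lacks_facts: bool,          # missing dates, docs, prices?
    behavior_issue: bool,       # wrong format, style, reasoning pattern?
    few_shot_fixable: bool,     # could examples in the prompt solve it?
    monthly_api_cost: float,    # current API spend in dollars
    privacy_blocked: bool,      # regulation prohibits cloud APIs?
    needs_low_latency: bool,    # <200ms inference required?
) -> str:
    """Walk the decision tree top-down; the first YES wins."""
    if lacks_facts:
        return "RAG"                          # fine-tuning doesn't inject facts well
    if behavior_issue:
        return "fine-tuning"
    if few_shot_fixable:
        return "prompting"                    # zero training cost
    if monthly_api_cost > 5_000:
        return "fine-tuning"                  # break-even case is strong
    if privacy_blocked:
        return "fine-tune + self-host"        # no discussion needed
    if needs_low_latency:
        return "fine-tune + self-hosted SLM"
    return "prompting"                        # default to the cheapest lever

choose_customization(lacks_facts=True, behavior_issue=False,
                     few_shot_fixable=False, monthly_api_cost=0,
                     privacy_blocked=False, needs_low_latency=False)
# → "RAG"
```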
2.2 Side-by-Side Comparison
| Situation | Best Solution | Why Not Fine-tune Here? |
|---|---|---|
| Model needs IFRS 2025 updates | RAG (AI 03) | Can’t bake knowledge into weights — data changes |
| Model needs to classify like a CPA | Fine-tuning | Behavior pattern, not knowledge — perfect fit |
| Response format wrong | Prompting + structured output | Few-shot or response_format fixes this |
| Data privacy required | Fine-tune + self-host | Zero data leaves your network |
| <200ms latency required | Fine-tune + SLM | API latency irreducible; local SLM = fast |
| Domain vocabulary/jargon | Fine-tuning | "Lexical alignment" = style change |
| 10,000 identical tasks/day | Fine-tuning | Amortized training cost trivial vs API savings |
| Occasional edge cases | RAG or prompting | Too rare to justify training data collection |
🔧 Engineer’s Note: The most common and expensive mistake: fine-tuning to inject knowledge. If you try to fine-tune a model to “know” that IFRS 16 was updated in 2023, you will: (1) spend weeks collecting training data, (2) train the model, (3) discover it still hallucinates the specific subsections, (4) realize RAG would have solved this in a weekend. Fine-tuning teaches how to think and respond. RAG teaches what facts to use. Use both, but don’t confuse them.
3. The Rise of SLMs
3.1 Small Language Models: Punching Above Their Weight
The frontier models (GPT-4o, Claude 3.7 Sonnet, Gemini 2.0) are the most capable AI systems ever built. They are also expensive, slow, and require internet connectivity. For the narrow domain tasks where most enterprise AI value lives, they’re often overkill.
Small Language Models (SLMs) — typically 1B–14B parameters — have been closing the gap rapidly:
SLM Landscape (Open-Source, Fine-tunable):
┌────────────────────────────────────────────────────────────────────┐
│ Model │ Params │ Key Strength │ VRAM Needed │
├────────────────┼──────────┼──────────────────────┼────────────────┤
│ Llama 3.2 │ 1B, 3B │ Mobile, edge deploy │ 2–4 GB │
│ Llama 3.1/3.3 │ 8B, 70B │ Versatile, strong │ 8–48 GB │
│ Phi-4 │ 14B │ Reasoning, STEM │ 12 GB │
│ Phi-3 │ 3.8B │ Code + reasoning │ 6 GB │
│ Gemma 2 │ 2B, 9B │ Multilingual │ 4–10 GB │
│ Qwen 2.5 │ 0.5–72B │ Chinese, code │ 2–48 GB │
│ Mistral 7B │ 7B │ European, efficient │ 8 GB │
│ DeepSeek-R1* │ 1.5–70B │ Reasoning distill │ 3–48 GB │
└────────────────┴──────────┴──────────────────────┴────────────────┘
*DeepSeek-R1-Distill: Reasoning capability distilled from R1 into
smaller models — the best reasoning per parameter currently available.
Hardware to run them:
├── RTX 4090 (24 GB VRAM): Llama 8B comfortably, 70B with QLoRA
├── Apple M2 Max (96 GB unified): 70B locally, excellent performance
├── RTX 3090 (24 GB): Same as 4090, slightly slower
└── 2× A100 (80 GB each): 70B at full 16-bit precision
3.2 The Chinchilla Principle Applied
In AI 00 §7.5, we covered Chinchilla’s Law: optimal model performance comes from the right balance of parameters and training data, not just more parameters. The same principle applies to fine-tuning:
Chinchilla Applied to Fine-tuning:
A 7B model fine-tuned on 10,000 domain-specific examples
will outperform a 70B model prompted to do the same task.
Why?
┌────────────────────────────────────────────────────────┐
│ General (70B, zero-shot): │
│ "Knows" everything. Good at nothing specific. │
│ Generalizes across all domains. │
│ │
│ Specialist (7B, fine-tuned on your tasks): │
│ "Knows" your domain. Excellent at your tasks. │
│ Fails at unrelated tasks (but you don't need those). │
└────────────────────────────────────────────────────────┘
Medical diagnosis: Specialist physician > generalist doctor
Domain AI: Fine-tuned SLM > general frontier model
On your specific benchmark: fine-tuned 7B = frontier 70B
Cost per query: fine-tuned 7B = 0.1% of frontier API
🔧 Engineer’s Note: “Task-specific SLM ≈ Frontier LLM” for narrow domains is not marketing — it’s been empirically verified across financial NLP (FinBERT, BloombergGPT), medical Q&A (MedPaLM), legal reasoning (LexCompute), and coding (CodeLlama). The pattern is consistent: a 7B model fine-tuned on 5,000 domain-specific examples consistently reaches 85-95% of GPT-4o accuracy on the target task — while running 100× cheaper and 20× faster.
4. Fine-Tuning Methods: Full, LoRA, QLoRA
4.1 Full Fine-Tuning (Expensive, Rarely Needed)
Full fine-tuning updates all model parameters. For a 7B model: all 7 billion floating-point weights get gradients computed and updated on every training step.
Full Fine-Tuning:
Llama 3.1 8B: 8,000,000,000 parameters
Each in fp16: 2 bytes → model itself = 16 GB
Training memory per parameter:
├── Model weights: 16 GB
├── Optimizer states: 64 GB (Adam keeps 2 momentum estimates)
├── Gradients: 16 GB
└── Activations: ~32 GB (batch size dependent)
Total: ~128 GB VRAM for 8B model
Hardware needed: 2× A100-80GB ($20,000+)
Training 8B on 10k examples: ~$300 on cloud, ~8 hours
Risk: Catastrophic Forgetting
─────────────────────────────
Fine-tuning on domain data overwrites general knowledge.
After fine-tuning on financial texts only:
Q: "Who wrote Hamlet?" → Model: "Unable to process query" 😵
When to use: Never for most teams. Use LoRA instead.
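The memory breakdown above generalizes to any model size (a sketch using the same assumptions: fp16 weights and gradients, fp32 Adam moments at 8 bytes/param, and the text's rough ~32 GB batch-dependent activation estimate):

```python
def full_finetune_vram_gb(params_b: float, activations_gb: float = 32.0) -> dict:
    """Rough VRAM budget for full fine-tuning with Adam.
    params_b is the model size in billions of parameters."""
    n = params_b * 1e9
    weights   = n * 2 / 1e9   # fp16 weights: 2 bytes/param
    gradients = n * 2 / 1e9   # fp16 gradients: 2 bytes/param
    optimizer = n * 8 / 1e9   # Adam: two fp32 moment estimates, 8 bytes/param
    return {
        "weights_gb": weights,
        "gradients_gb": gradients,
        "optimizer_gb": optimizer,
        "activations_gb": activations_gb,
        "total_gb": weights + gradients + optimizer + activations_gb,
    }

full_finetune_vram_gb(8.0)["total_gb"]   # → 128.0, matching the 8B breakdown
```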
4.2 LoRA: Low-Rank Adaptation (The Standard Approach)
LoRA adds small, trainable matrices alongside the frozen pretrained weights. Instead of updating the weight matrix W directly, it learns two small matrices A and B whose product approximates the update:
LoRA: The Math Made Intuitive
Without LoRA:
W_new = W_original + ΔW
Where ΔW is a d×d matrix (huge)
With LoRA:
W_new = W_original + B × A
Where:
B is (d × r) and A is (r × d)
d = layer dimension, r = rank (tiny — typically 8, 16, or 32)
For Llama 8B attention layer: d = 4096
Without LoRA: ΔW size = 4096 × 4096 = 16,777,216 params
With LoRA r=16: A + B = 4096×16 + 16×4096 = 131,072 params
Parameter reduction: 131,072 / 16,777,216 = 0.78%
You're training 0.78% of the parameters.
You're capturing 85-95% of the fine-tuning effect.
Why does this work?
────────────────────
The insight: behavior change lives in a low-dimensional
subspace of the weight space. You don't need to update
every weight to change how the model classifies IFRS
transactions — the meaningful change occupies a tiny
fraction of the weight space.
Think of it like PCA (AI 00 §5.3):
PCA finds the top-k principal components of a dataset.
LoRA finds the top-r directions of weight change needed.
Both compress a high-dimensional problem into its
essential low-dimensional structure.
# LoRA implementation with HuggingFace PEFT
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
# Load base model (weights stay frozen)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3.1-8B-Instruct",
torch_dtype = "auto",
device_map = "auto",
)
# LoRA configuration
lora_config = LoraConfig(
task_type = TaskType.CAUSAL_LM,
r = 16, # rank — higher = more capacity, more VRAM
lora_alpha = 32, # scaling factor: ΔW contributes at (alpha/r) ratio
lora_dropout = 0.05, # regularization
target_modules = [ # which attention layers to adapt
"q_proj", # Query projection
"k_proj", # Key projection
"v_proj", # Value projection
"o_proj", # Output projection
],
bias = "none",
)
# Add LoRA adapters
model = get_peft_model(model, lora_config)
# Verify: only a tiny % of params are trainable
model.print_trainable_parameters()
# trainable params: 20,971,520 || all params: 8,051,118,080
# trainable%: 0.2604%
# The base model is frozen — only A and B matrices train
# After training: save only the adapters (not the full model)
model.save_pretrained("./financial-ai-adapter")
# Saved: ~40 MB LoRA adapter (vs 16 GB full model)
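The 0.78% figure from the math section can be checked with pure arithmetic — no training libraries needed (`lora_params` is an illustrative helper, not part of PEFT):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one LoRA-adapted weight matrix:
    A is (r × d_in), B is (d_out × r)."""
    return r * d_in + d_out * r

d, r = 4096, 16                 # Llama-8B attention dim, rank from above
full = d * d                    # ΔW if updated densely
lora = lora_params(d, d, r)     # A + B
print(full, lora, f"{lora / full:.2%}")
# 16777216 131072 0.78%
```

Note the model-wide trainable fraction printed by `print_trainable_parameters()` differs slightly (0.26%) because it divides by all 8B parameters, not just the adapted attention matrices.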
4.2.1 Choosing the Right LoRA Rank (r)
Rank is the single most consequential hyperparameter in LoRA. Higher rank is not always better — it’s a capacity dial with real trade-offs:
LoRA Rank Selection Guide:
┌──────────────────────────────────────────────────────────────┐
│ r = 4–8 │ Format & Style Changes │
│ │ "Always respond in JSON", tone alignment │
│ │ VRAM impact: minimal │
│ │ Forgetting risk: very low │
├──────────────────────────────────────────────────────────────┤
│ r = 16–32 │ Domain Adaptation (Recommended Default) │
│ ← used │ IFRS classification, financial reasoning │
│ above │ VRAM impact: small │
│ │ Forgetting risk: low │
├──────────────────────────────────────────────────────────────┤
│ r = 64 │ Deep Reasoning Behavior Change │
│ │ Complex multi-hop inference patterns │
│ │ VRAM impact: moderate │
│ │ Forgetting risk: medium — test carefully │
├──────────────────────────────────────────────────────────────┤
│ r = 128+ │ Rarely justified │
│ │ LoRA loses its compression advantage │
│ │ At r ≥ d/2: effectively full fine-tuning │
│ │ Forgetting risk: high │
└──────────────────────────────────────────────────────────────┘
Rule of thumb:
├── Start at r=16. Run eval (AI 09 §9).
├── If domain accuracy is too low → increase to r=32 or r=64.
├── If forgetting check fails → decrease rank or add general
│ domain examples (10% mix-in) to training data.
└── If r=64 and still not converging → the problem is data, not rank.
Larger r ≠ better model. Larger r = less compression + more
forgetting risk. Only increase rank if your eval data shows
that the lower rank genuinely underfits your task.
4.3 QLoRA: Quantized LoRA (The Game Changer for Individual Developers)
QLoRA combines 4-bit quantization of the base model with 16-bit LoRA training. This reduces VRAM requirements dramatically:
QLoRA Memory Comparison:
Fine-tuning Llama 3.1 70B:
┌──────────────────────────────────────────────────────┐
│ Method │ VRAM Required │ Hardware │
├──────────────────────────────────────────────────────┤
│ Full fine-tune │ ~700 GB VRAM │ 10× A100 ($200k+) │
│ LoRA (fp16) │ ~160 GB VRAM │ 4× A100 ($80k+) │
│ QLoRA (4-bit) │ ~40 GB VRAM │ 2× RTX 4090 ($3k!) │
└──────────────────────────────────────────────────────┘
QLoRA fine-tuning a 70B model on a gaming GPU.
This was impossible 18 months ago.
How QLoRA works:
Step 1: Load base model in 4-bit NF4 (NormalFloat4)
Normal FP16 weight: 1.234567... (16 bits)
NF4 quantized weight: roughly maps to 1 of 16 buckets (4 bits)
Size reduction: 4× smaller
Quality loss: <1% on downstream tasks (remarkably small)
Step 2: Keep LoRA adapters in BF16 (16-bit)
High-precision training gradients for the small A/B matrices
Base model contributes frozen quantized forward pass
Step 3: Double quantization
Quantize the quantization constants themselves
Additional ~1 GB savings on 70B model
# QLoRA setup with bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit = True,
bnb_4bit_quant_type = "nf4", # NormalFloat4
bnb_4bit_compute_dtype = torch.bfloat16,
bnb_4bit_use_double_quant = True, # extra savings
)
# Load 70B model in 4-bit — works on 2× RTX 4090!
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3.1-70B-Instruct",
quantization_config = bnb_config,
device_map = "auto", # multi-GPU auto-split
)
# Apply LoRA on top of quantized model
lora_config = LoraConfig(
r = 64, # larger rank for 70B
lora_alpha = 128,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_dropout = 0.1,
task_type = "CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# 0.42% trainable parameters — fine-tuning 70B on 2 gaming GPUs
🔧 Engineer’s Note: QLoRA is the individual developer’s game changer. Before QLoRA (2023), fine-tuning a 70B model required a cluster of A100s — $80,000+ for fp16 LoRA, $200,000+ for full fine-tuning. After QLoRA: two RTX 4090s (~$3,200), ~40 GB VRAM, and a weekend. The quality degradation from 4-bit quantization is ~0.5% on most benchmarks — smaller than the variance from batch-to-batch training runs. For 7B-13B models: a single RTX 4090 is sufficient. For 70B models: two RTX 4090s or an M2 Max Mac Studio.
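The VRAM table follows from simple bytes-per-parameter arithmetic (a sketch; the optimizer/activation overhead on top of the frozen base is the rough estimate from §4.1, not a measurement):

```python
def base_model_gb(params_b: float, bits: int) -> float:
    """Memory for frozen base weights at a given precision.
    params_b: billions of parameters; bits: bits per weight."""
    return params_b * 1e9 * (bits / 8) / 1e9

# Llama 3.1 70B base weights at each precision:
fp16 = base_model_gb(70, 16)   # fp16: 2 bytes/param
nf4  = base_model_gb(70, 4)    # NF4: 0.5 bytes/param — 4× smaller
print(fp16, nf4)
# 140.0 35.0
```

Add a few GB for the bf16 LoRA adapters, their optimizer states, and activations, and the 4-bit figure lands near the table's ~40 GB — within reach of two 24 GB gaming GPUs.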
4.4 DPO: When “Correct” Is Hard to Define
Supervised Fine-Tuning (SFT — what §4.2/§4.3 cover) trains the model to reproduce your labeled examples: given this input, produce this exact output. It works well when the correct answer is unambiguous.
But for tasks involving judgment, style, tone, or complex reasoning — where multiple outputs could be “correct” but some are clearly better — SFT has a ceiling. Direct Preference Optimization (DPO) outperforms it.
SFT vs. DPO: What Each Optimizes
SFT (Supervised Fine-Tuning):
──────────────────────────────
Training data: {input, correct_output}
Objective: maximize probability of correct_output
Works best when:
✔ Answer is objectively right or wrong
✔ Format must match exactly (JSON schema, code)
✔ Domain vocabulary must be used consistently
Ceiling: hard to teach "professional tone" or
"IFRS-nuanced reasoning" with SFT alone.
DPO (Direct Preference Optimization):
──────────────────────────────────
Training data: {input, chosen_output, rejected_output}
Objective: raise P(chosen) relative to P(rejected),
KL-anchored to the reference (SFT) model
Works best when:
✔ "Better"-ness is easier to judge than "correct"-ness
✔ Reducing hallucination is the primary goal
✔ Teaching nuanced professional judgment
Why it reduces hallucination:
SFT: model learns to produce text that looks like the answer
DPO: model learns to prefer grounded answers (chosen) over
confident-but-wrong answers (rejected)
→ The model internalizes "when uncertain, hedge or escalate"
# DPO training data format: pairs of chosen/rejected
# {input, chosen (better), rejected (worse)}
# Example: financial AI judgment
DPO_EXAMPLE = {
"prompt": [
{"role": "system", "content": "You are a CPA reviewing IFRS reconciliations."},
{"role": "user", "content": "Bank: $1,200,000 debit Dec 31. No ERP match."},
],
# CHOSEN: appropriately uncertain, asks for more info, doesn't guess
"chosen": [
{"role": "assistant", "content":
"Classification: REQUIRES_INVESTIGATION (Confidence: LOW)\n\n"
"A $1.2M debit with no ERP match is outside normal ROUNDING "
"or TIMING ranges. Before classifying, I need:\n"
"1. Bank memo / wire originator\n"
"2. Pending ACH transfers in December\n\n"
"Do not post any journal entry. Escalate to controller."
}
],
# REJECTED: overconfident, hallucinates a likely cause without evidence
"rejected": [
{"role": "assistant", "content":
"Classification: TIMING_DIFFERENCE (Confidence: HIGH)\n\n"
"This is likely a year-end wire that cleared after period close. "
"Post a reversing accrual entry to 2350-AP-ACCRUAL for $1,200,000."
}
],
}
# Training with DPO using TRL
from trl import DPOTrainer, DPOConfig
dpo_config = DPOConfig(
output_dir = "./models/financial-ai-dpo-v1",
num_train_epochs = 2,
per_device_train_batch_size = 2,
learning_rate = 5e-6, # DPO typically uses a lower LR than SFT
beta = 0.1, # KL-divergence penalty: how far to move from SFT baseline
# beta=0.1: gentle preference learning
# beta=0.5: aggressive alignment (risk of reward hacking)
)
# Recommended workflow:
# 1. First: SFT on your domain (§4.2) → establishes domain format
# 2. Then: DPO on preference pairs → refines judgment quality
# The two stages complement each other.
trainer = DPOTrainer(
model = sft_model, # Start from the SFT checkpoint
args = dpo_config,
train_dataset = dpo_dataset,
tokenizer = tokenizer,
)
trainer.train()
🔧 Engineer’s Note: For financial AI where hallucination is the primary risk, DPO is often more valuable than additional SFT epochs. The key insight: SFT teaches the model what to say; DPO teaches it when to be uncertain. Generating 300 rejection examples takes a domain expert ~4 hours. Running DPO training takes another 30 minutes. The payoff: a measurable drop in confident-but-wrong classifications — exactly the category that auditors cannot tolerate.
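The DPO objective itself is compact enough to state directly (a minimal sketch of the per-pair loss that TRL optimizes, assuming you already have sequence log-probs from the policy and the frozen SFT reference; `dpo_loss` is an illustrative name):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log σ(β · (policy log-ratio − reference log-ratio)).
    Minimizing it pushes P(chosen) up relative to P(rejected), with beta
    controlling how far the policy may drift from the reference model."""
    margin = (logp_chosen - logp_rejected) - (ref_logp_chosen - ref_logp_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# No preference learned yet (policy == reference): loss = ln 2 ≈ 0.693
print(round(dpo_loss(-10.0, -20.0, -10.0, -20.0), 3))
# Policy prefers chosen more strongly than the reference: lower loss
print(round(dpo_loss(-10.0, -20.0, -12.0, -14.0), 3))
```

This is why `beta` in `DPOConfig` behaves as a KL penalty: a larger beta amplifies the margin, rewarding aggressive movement away from the SFT baseline.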
5. Data Preparation: The 80% of the Work
5.1 The Quality-Quantity Tradeoff
Fine-tuning is often described as “complicated.” It isn’t. The math (LoRA, optimizers, learning rates) is handled by libraries. The hard part — the part that determines whether your fine-tuned model outperforms the API or wastes two weeks of effort — is data quality.
The Fine-Tuning Data Quality Spectrum:
████ 100 expert-curated examples ████████ Excellent
████ 1,000 human-annotated, QA'd examples ████████ Excellent
████ 5,000 synthetic + 500 human-reviewed ████████ Good
████ 10,000 synthetic, unreviewed ████ Mediocre
████ 100,000 scraped, noisy web data ██ Poor
Counter-intuitive result:
100 perfect ≫ 10,000 mediocre
Why: The model learns from every example, including the bad ones.
1,000 examples of "this is how a CPA classifies IFRS transactions"
teaches the right pattern. 10,000 examples that are 30% wrong
teach the model to be 30% wrong.
5.2 Data Format: Chat Template
Modern instruction-tuned models (Llama-Instruct, Mistral-Instruct, etc.) expect data in a conversational format that matches their pre-training structure. The most common mistake is using the wrong template:
# Standard JSONL format for fine-tuning (one example per line)
# Filename: financial_ai_training.jsonl
# Example 1: Transaction classification
{
"messages": [
{
"role": "system",
"content": "You are a CPA specializing in IFRS account reconciliation. Classify each transaction discrepancy using one of: TIMING_DIFFERENCE, MISSING_ENTRY, SUSPICIOUS, or ROUNDING. Always cite the relevant IFRS standard."
},
{
"role": "user",
"content": "Bank shows debit $45,230 on Dec 30. ERP shows credit $45,230 on Jan 2."
},
{
"role": "assistant",
"content": "Classification: TIMING_DIFFERENCE\n\nThe bank records the debit on Dec 30 while the ERP records the credit on Jan 2. This 3-day lag is consistent with standard bank processing windows and does not indicate a missing entry. Per IFRS 9 (Financial Instruments), this should be recognized as a timing reconciling item. No corrective journal entry required; include in the reconciliation memo as a subsequent clearing item."
}
]
}
# Example 2: Hallucination-resistant response
{
"messages": [
{
"role": "system",
"content": "You are a CPA specializing in IFRS account reconciliation. When you cannot determine the classification with the given information, say what additional information is needed rather than guessing."
},
{
"role": "user",
"content": "Bank shows debit $1,200,000 on Dec 31. No matching entry in ERP."
},
{
"role": "assistant",
"content": "Classification: REQUIRES_INVESTIGATION\n\nA $1.2M debit with no ERP match is outside the normal ROUNDING or TIMING patterns. Before classifying, I need:\n1. The bank memo/reference number from the transaction\n2. The sending account or wire originator details\n3. Any pending ACH or wire transfers in the ERP for December\n\nDo not post any journal entry until the source is confirmed. Escalate to controller for review."
}
]
}
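Before training, it pays to validate every JSONL line against the message shape shown above (a small sketch; `validate_example` and its checks are illustrative, enforcing the system → user → assistant pattern):

```python
import json

def validate_example(line: str) -> list[str]:
    """Return a list of problems with one JSONL training example ([] = OK)."""
    try:
        ex = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    problems = []
    msgs = ex.get("messages", [])
    roles = [m.get("role") for m in msgs]
    if roles[:1] != ["system"]:
        problems.append("first message must be 'system'")
    if roles[-1:] != ["assistant"]:
        problems.append("last message must be 'assistant'")
    if "user" not in roles:
        problems.append("no 'user' turn")
    for m in msgs:
        if not m.get("content", "").strip():
            problems.append(f"empty content in '{m.get('role')}' turn")
    return problems

good = ('{"messages": [{"role": "system", "content": "You are a CPA."}, '
        '{"role": "user", "content": "Classify this."}, '
        '{"role": "assistant", "content": "TIMING_DIFFERENCE"}]}')
print(validate_example(good))   # []
```

A malformed line that silently trains with a missing user turn or an empty assistant answer is far harder to debug after the run than before it.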
5.3 Building Training Data: Three Sources
# lib/training_data_builder.py
import json, asyncio
from pathlib import Path
from anthropic import AsyncAnthropic
SYSTEM_PROMPT = """You are a CPA specializing in IFRS account reconciliation.
Classify each transaction discrepancy using: TIMING_DIFFERENCE, MISSING_ENTRY,
SUSPICIOUS, or ROUNDING. Always cite the relevant IFRS standard and reasoning."""
SYNTHETIC_GEN_PROMPT = """
Generate {n} diverse training examples for IFRS bank reconciliation AI.
Each example should:
1. Include a realistic transaction scenario (vary amounts, dates, patterns)
2. Include the ideal CPA response with correct classification and IFRS citation
3. Cover edge cases: year-end timing, multi-currency, rounding errors
Output as JSON array:
[{{"user": "...", "assistant": "..."}}]
Include both clear-cut cases (80%) and ambiguous ones requiring escalation (20%).
"""
async def generate_synthetic_examples(n: int = 200) -> list[dict]:
"""Source 1: Synthetic generation via teacher LLM (fast, cheap bootstrap)"""
client = AsyncAnthropic()
response = await client.messages.create(
model = "claude-3-7-sonnet-20250219", # Teacher model
max_tokens = 8000,
messages = [{"role": "user", "content": SYNTHETIC_GEN_PROMPT.format(n=n)}],
)
return json.loads(response.content[0].text)
def load_production_logs(log_dir: str) -> list[dict]:
"""Source 2: Production logs — real queries with expert-reviewed answers"""
examples = []
for log_file in Path(log_dir).glob("*.jsonl"):
for line in log_file.read_text().splitlines():
entry = json.loads(line)
# Only use logs where expert approved the AI's answer
if entry.get("expert_approval") == "approved":
examples.append({
"messages": [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": entry["query"]},
{"role": "assistant", "content": entry["ai_response"]},
]
})
return examples
def format_for_training(examples: list[dict]) -> list[dict]:
"""Source 3: Expert-annotated data — convert to training format"""
return [
{
"messages": [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": ex["user"]},
{"role": "assistant", "content": ex["assistant"]},
]
}
for ex in examples
]
async def build_dataset(output_path: str = "data/finetune/financial_ai_v1.jsonl"):
synthetic = await generate_synthetic_examples(n=500)
production = load_production_logs("logs/production/")
all_examples = (
format_for_training(synthetic) + # 500 synthetic
production # 200+ real, approved
)
# Shuffle to mix synthetic and real
import random; random.shuffle(all_examples)
# Save as JSONL
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "w") as f:
for ex in all_examples:
f.write(json.dumps(ex, ensure_ascii=False) + "\n")
print(f"Dataset: {len(all_examples)} examples → {output_path}")
# Dataset: 712 examples → data/finetune/financial_ai_v1.jsonl
5.4 Common Data Pitfalls
Fine-Tuning Data Anti-Patterns:
❌ Too Little Data
Below 50 examples: model barely shifts from base behavior
Below 200 examples: inconsistent — sometimes fine-tuned behavior,
sometimes base model behavior
Minimum viable: 500 examples (100 reviewed)
❌ Imbalanced Classes
900 TIMING_DIFFERENCE + 10 SUSPICIOUS + 90 MISSING_ENTRY
→ Model learns to always predict TIMING_DIFFERENCE
Fix: Balance classes or use class weights in training
❌ Including the Wrong Answers
Synthetic data with 30% incorrect IFRS citations
→ Model learns wrong standards
Fix: Domain expert reviews 20% sample before training
❌ Inconsistent Formatting
Some examples: "Classification: X"
Others: "I classify this as X"
Others: "X — because..."
→ Model outputs inconsistent format
Fix: Standardize response template in SYSTEM_PROMPT
❌ Too Much Repetition
500 examples of the same $45,230 Dec 30 scenario, varied slightly
→ Model memorizes, doesn't generalize
Fix: Genuinely diverse amounts, dates, descriptions, currencies
🔧 Engineer’s Note: “80% of fine-tuning is data curation” is also true here. The same principle from AI 09 §8 applies: 200 excellent training examples will produce a better fine-tuned model than 5,000 mediocre ones. If your model isn’t improving after training, the problem is almost certainly data quality — not your LoRA rank, learning rate, or number of epochs.
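The imbalanced-classes anti-pattern is cheap to check automatically before training (a sketch; `check_class_balance` and its thresholds are illustrative defaults, not doctrine):

```python
from collections import Counter

def check_class_balance(labels: list[str], max_share: float = 0.5,
                        min_count: int = 30) -> dict:
    """Flag the imbalance anti-pattern: any dominant or starved class."""
    counts = Counter(labels)
    total = len(labels)
    dominant = {c: n for c, n in counts.items() if n / total > max_share}
    starved  = {c: n for c, n in counts.items() if n < min_count}
    return {
        "counts": dict(counts),
        "dominant": dominant,   # classes the model will over-predict
        "starved": starved,     # classes the model will barely learn
        "balanced": not dominant and not starved,
    }

# The 900/10/90 split from the anti-pattern above:
labels = (["TIMING_DIFFERENCE"] * 900 + ["SUSPICIOUS"] * 10
          + ["MISSING_ENTRY"] * 90)
check_class_balance(labels)["balanced"]   # → False: TIMING dominates, SUSPICIOUS starved
```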
5.5 Diversity Metrics: Quantifying Dataset Coverage
Good data means diverse data, not just correct data. A dataset of 2,000 financial examples that all cluster around year-end timing differences will produce a model that confidently handles timing differences — and fails badly on multi-currency rounding or missing-entry edge cases.
Embedding-based diversity analysis is the most reliable way to detect this before training:
# lib/data/diversity_analysis.py
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer
def embed_examples(examples: list[dict]) -> np.ndarray:
    """Embed each training example's user message for clustering"""
    # Any embedding model works — voyage-3, text-embedding-3, etc.
    # Here: a small local sentence-transformers model, loaded once (not per example)
    model = SentenceTransformer("all-MiniLM-L6-v2")
    user_texts = [ex["messages"][1]["content"] for ex in examples]  # user turn
    return np.array(model.encode(user_texts))
def analyze_diversity(examples: list[dict]) -> dict:
    """
    Detect clustering: if most data is in one tight cluster,
    the model will overfit to that scenario.
    """
    embeddings = embed_examples(examples)
# 1. PCA: reduce to 2D for visual inspection
pca = PCA(n_components=2)
coords_2d = pca.fit_transform(embeddings)
explained_var = pca.explained_variance_ratio_.sum()
# 2. KMeans: find natural clusters
    k = max(2, min(8, len(examples) // 50))  # heuristic: ~1 cluster per 50 examples, floor of 2
kmeans = KMeans(n_clusters=k, random_state=42).fit(embeddings)
# 3. Cluster balance: healthy = even distribution
cluster_sizes = np.bincount(kmeans.labels_)
largest_cluster_pct = cluster_sizes.max() / len(examples)
# 4. Intra-cluster variance: high = diverse within each cluster
intra_variance = np.mean([
embeddings[kmeans.labels_ == i].var(axis=0).mean()
for i in range(k)
])
report = {
"total_examples": len(examples),
"n_clusters": k,
"cluster_sizes": cluster_sizes.tolist(),
"largest_cluster_pct": f"{largest_cluster_pct:.1%}",
"intra_cluster_var": f"{intra_variance:.4f}",
"diversity_verdict": "GOOD" if largest_cluster_pct < 0.40 else "RISKY",
}
return report
# Example output for a DIVERSE dataset:
# {
# "total_examples": 2000,
# "n_clusters": 8,
# "cluster_sizes": [287, 241, 268, 231, 244, 261, 228, 240],
# "largest_cluster_pct": "14.4%", ← even distribution = good
# "diversity_verdict": "GOOD",
# }
# Example output for an OVERFITTING-RISK dataset:
# {
# "total_examples": 2000,
# "n_clusters": 8,
# "cluster_sizes": [1834, 28, 31, 24, 29, 22, 19, 13],
# "largest_cluster_pct": "91.7%", ← 91% in one cluster = severe overfitting risk
# "diversity_verdict": "RISKY",
# }
Visual Interpretation (PCA 2D Projection):
RISKY Dataset (overfitting risk):
┌───────────────────────────────────────────┐
│ ····················· │
│ ·████████████████████████· · │
│·███████████████████████████· · │
│ ·████████████████████████· · · │
│ ····················· │
└───────────────────────────────────────────┘
One dense cluster = all examples are the same scenario.
Model will memorize this and fail on edge cases.
GOOD Dataset (diverse coverage):
┌───────────────────────────────────────────┐
│ ··██·· ·████· │
│ █████ ██████· │
│ ··██· ·██· ·██· │
│ ████· ·██· │
│ ··██· ███· │
└───────────────────────────────────────────┘
Multiple spread clusters = diverse scenario coverage.
Model generalizes well across edge cases.
Fix if RISKY:
├── Identify the dominant cluster's scenario type
├── Cap it at max 30% of training data
└── Generate more examples from the underrepresented clusters
🔧 Engineer’s Note: Run diversity analysis before training, not after. Discovering that 91% of your 2,000 examples are TIMING_DIFFERENCE scenarios after training explains why your model marks everything as TIMING_DIFFERENCE — but training cost is already sunk. A 15-minute embedding analysis before training saves hours of debugging a biased model after.
6. Knowledge Distillation: Teacher → Student
6.1 The Core Idea
Knowledge distillation is the fastest path to a production-grade fine-tuned model. Instead of relying on human annotators to label training data, you use a large, expensive model (the Teacher) to generate high-quality examples, which you then use to train a small, cheap model (the Student):
Knowledge Distillation Pipeline:
┌──────────────────────────────────────────────────────────────┐
│ Teacher (GPT-4o) │
│ Cost: $15/1M tokens — expensive per query │
│ Quality: 96% accuracy on financial classification │
│ Latency: 2-5 seconds │
│ │
│ Run 2,000 training scenarios through Teacher │
│ Total cost: $300 (one-time) │
└──────────────────────────────┬───────────────────────────────┘
│ Generate training data
▼
┌──────────────────────────────────────────────────────────────┐
│ Training Dataset (2,000 examples) │
│ teacher_input ──→ teacher_output │
│ (your queries) (high-quality labeled responses) │
└──────────────────────────────┬───────────────────────────────┘
│ Fine-tune
▼
┌──────────────────────────────────────────────────────────────┐
│ Student (Llama-8B, fine-tuned) │
│ Cost: $0.10/1M tokens (self-hosted) — 150× cheaper │
│ Quality: 90% accuracy on your task (student ≈ teacher) │
│ Latency: 50ms │
└──────────────────────────────────────────────────────────────┘
One-time cost: Teacher API ($300) + Training (~$50 on RunPod)
Yearly savings vs. API: $12,000+ at 1,000 queries/day
ROI payback period: 3 weeks
6.2 Implementing Teacher-Student Data Generation
# lib/distillation/teacher_generator.py
import asyncio, json
from anthropic import AsyncAnthropic
from pathlib import Path
TEACHER_SYSTEM = """You are an expert CPA with 20 years of IFRS audit experience.
When given a bank reconciliation discrepancy, provide:
1. Classification (TIMING_DIFFERENCE / MISSING_ENTRY / SUSPICIOUS / ROUNDING)
2. Confidence level (HIGH / MEDIUM / LOW)
3. IFRS standard citation (e.g., IFRS 9.3.1.1)
4. Recommended action
5. Journal entry if applicable
Be precise, authoritative, and concise. A controller will act on your output."""
async def generate_teacher_response(
client: AsyncAnthropic,
scenario: dict,
semaphore: asyncio.Semaphore,
) -> dict:
async with semaphore:
response = await client.messages.create(
model = "claude-3-7-sonnet-20250219", # Teacher
max_tokens = 600,
system = TEACHER_SYSTEM,
messages = [{"role": "user", "content": scenario["query"]}],
)
return {
"messages": [
{"role": "system", "content": TEACHER_SYSTEM},
{"role": "user", "content": scenario["query"]},
{"role": "assistant", "content": response.content[0].text},
],
"metadata": {
            "source": "teacher_distillation",
"teacher": "claude-3-7-sonnet-20250219",
"input_tokens": response.usage.input_tokens,
"output_tokens": response.usage.output_tokens,
}
}
async def run_distillation_pipeline(
scenarios: list[dict], # 2,000 transaction scenarios
output_path: str = "data/finetune/distilled_v1.jsonl",
max_concurrent: int = 20, # parallel API calls
) -> None:
client = AsyncAnthropic()
semaphore = asyncio.Semaphore(max_concurrent)
print(f"Distilling {len(scenarios)} scenarios via Teacher...")
examples = await asyncio.gather(*[
generate_teacher_response(client, s, semaphore)
for s in scenarios
])
# Human spot-check: review 10% of outputs for quality
spot_check_indices = set(range(0, len(examples), 10))
approved = []
for i, ex in enumerate(examples):
if i in spot_check_indices:
# Print for manual review
print(f"\n--- Sample {i} ---")
print(f"Input: {ex['messages'][1]['content'][:100]}")
print(f"Output: {ex['messages'][2]['content'][:200]}")
# In practice: load into Argilla/LangSmith for reviewer UI
# Auto-approve non-spot-check examples
# (reviewer manually flags bad ones in Argilla)
approved.append(ex)
total_tokens = sum(
e["metadata"]["input_tokens"] + e["metadata"]["output_tokens"]
for e in examples
)
cost_usd = total_tokens / 1_000_000 * 5 # ~$5/1M for claude-3-7
print(f"\nDistillation complete: {len(approved)} examples")
print(f"Total tokens: {total_tokens:,} | Cost: ~${cost_usd:.2f}")
# Distillation complete: 2000 examples
# Total tokens: 4,250,000 | Cost: ~$21.25
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "w") as f:
for ex in approved:
# Remove metadata before saving (not needed for training)
training_ex = {"messages": ex["messages"]}
f.write(json.dumps(training_ex, ensure_ascii=False) + "\n")
6.3 Running the Training Job
# lib/training/train_lora.py
from transformers import (
AutoModelForCausalLM, AutoTokenizer,
TrainingArguments, BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer # HuggingFace's supervised fine-tuning trainer
from datasets import load_dataset
# Load training data
dataset = load_dataset("json", data_files="data/finetune/distilled_v1.jsonl", split="train")
dataset = dataset.train_test_split(test_size=0.1) # 90/10 train/val split
# Load base model with QLoRA
bnb_config = BitsAndBytesConfig(
load_in_4bit = True,
bnb_4bit_quant_type = "nf4",
bnb_4bit_compute_dtype = "bfloat16",
bnb_4bit_use_double_quant = True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3.1-8B-Instruct",
quantization_config = bnb_config,
device_map = "auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
# LoRA config
lora_config = LoraConfig(
r = 16,
lora_alpha = 32,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout = 0.05,
task_type = "CAUSAL_LM",
)
# Training arguments
training_args = TrainingArguments(
output_dir = "./models/financial-ai-v1",
num_train_epochs = 3, # 3 passes through data
per_device_train_batch_size = 4,
gradient_accumulation_steps = 4, # effective batch = 16
learning_rate = 2e-4, # LoRA standard LR
warmup_ratio = 0.03,
lr_scheduler_type = "cosine",
bf16 = True,
save_strategy = "epoch",
evaluation_strategy = "epoch",
load_best_model_at_end = True,
logging_steps = 10,
report_to = "wandb", # Track experiment
)
trainer = SFTTrainer(
model = get_peft_model(model, lora_config),
tokenizer = tokenizer,
train_dataset = dataset["train"],
eval_dataset = dataset["test"],
args = training_args,
max_seq_length = 2048,
)
trainer.train()
trainer.model.save_pretrained("./models/financial-ai-v1/final")
# Saved: 42 MB LoRA adapter
# Training time on RTX 4090: ~45 minutes for 2,000 examples × 3 epochs
🔧 Engineer’s Note: Distillation is not a shortcut; it’s an engineering decision. You pay the Teacher API once (roughly $300 for the 2,000-example dataset), and the Student then runs for little more than the cost of electricity. The ROI pays back within the first month of production traffic. The quality gap (90% vs. 96% accuracy) is acceptable for most enterprise use cases, and you can close it further with 200 expert-reviewed examples on top of the synthetic base.
7. Training Infrastructure
7.1 Choosing Where to Train
Training Infrastructure Options:
┌─────────────────────────────────────────────────────────────────┐
│ Option │ Hardware │ Cost │ Best For │
├─────────────────────────────────────────────────────────────────┤
│ Local - RTX 4090 │ 24 GB VRAM │ $1,600 │ 7-13B QLoRA │
│ Local - 2× 4090 │ 48 GB VRAM │ $3,200 │ 70B QLoRA │
│ Local - M2 Max │ 96 GB unified │ $3,500 │ 70B (slower)│
│ RunPod (hourly) │ A100-80GB on demand│ $1.99/hr │ One-off runs│
│ Lambda Labs │ H100-80GB │ $3.29/hr │ Fastest │
│ AWS SageMaker │ Managed ML │ $3.21/hr │ Enterprise │
│ Google Vertex AI │ TPU v5 │ $2.40/hr │ Large scale │
└─────────────────────────────────────────────────────────────────┘
Practical math for an 8B model, 2,000 examples, 3 epochs:
├── RTX 4090 (local): 45 minutes, $0.10 electricity
├── RunPod A100: 20 minutes, $0.70 total
└── AWS p4d.24xlarge: 12 minutes, $8.00 total
For initial experimentation: RunPod is ideal.
Rent an A100 for 2 hours ($4), run 3 experiments, cancel.
No commitment, no setup, instant access.
7.2 Training Frameworks
Framework Choice by Use Case:
🥇 Unsloth (fastest, recommended for most teams)
─────────────────────────────────────────────────
pip install unsloth
- 2× faster than vanilla PEFT + TRL
- 70% less VRAM usage through Flash Attention 2 + custom kernels
- Same API as HuggingFace — drop-in replacement
- Supports: Llama, Mistral, Phi, Gemma, Qwen
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct",
max_seq_length = 2048,
load_in_4bit = True, # QLoRA
)
model = FastLanguageModel.get_peft_model(
model,
r = 16, target_modules = ["q_proj", "v_proj"], ...
)
# Then use standard SFTTrainer — Unsloth optimizes internally
🥈 HuggingFace PEFT + TRL (standard, most documentation)
────────────────────────────────────────────────────────
pip install peft trl transformers bitsandbytes
- The "textbook" approach shown in §4 and §6
- More configuration options, larger community
- Slower than Unsloth, but battle-tested
🥉 Axolotl (YAML-configured, production-grade)
───────────────────────────────────────────────
pip install axolotl
- Configure everything in YAML — no Python training scripts needed
- Built-in support for many chat templates, dataset formats
- Used in production by several model providers
- Best for repeatable, version-controlled training pipelines
8. Evaluation & Iteration
8.1 Connecting AI 09 to Fine-Tuning
Fine-tuning and evaluation are inseparable. Every training run should produce a model that gets tested through the exact same eval pipeline built in AI 09 — with one addition: the catastrophic forgetting check.
# lib/eval/finetuned_model_eval.py
# Extends the CI/CD eval pipeline from AI 09 §9
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import Dataset
async def evaluate_finetuned_vs_baseline(
finetuned_model_path: str,
baseline_model_path: str, # The base Llama-8B without fine-tuning
eval_dataset_path: str,
) -> dict:
"""
Compare fine-tuned model against:
1. The base model (did fine-tuning help?)
2. The teacher GPT-4o (how close did we get?)
3. The previous fine-tuning version (did we regress?)
"""
finetuned = load_local_model(finetuned_model_path)
baseline = load_local_model(baseline_model_path)
eval_cases = load_jsonl(eval_dataset_path)
results = {
"finetuned": [],
"baseline": [],
"teacher_ref": [], # GPT-4o reference scores from AI 09 eval dataset
}
for case in eval_cases:
ft_response = finetuned.generate(case["query"])
base_response = baseline.generate(case["query"])
results["finetuned"].append({
"query": case["query"],
"response": ft_response,
"reference": case["reference_answer"],
})
results["baseline"].append({
"query": case["query"],
"response": base_response,
"reference": case["reference_answer"],
})
# Score both with RAGAS (AI 09 §6)
ft_scores = await score_with_ragas(results["finetuned"])
base_scores = await score_with_ragas(results["baseline"])
comparison = {
"finetuned_faithfulness": ft_scores["faithfulness"],
"baseline_faithfulness": base_scores["faithfulness"],
"improvement": ft_scores["faithfulness"] - base_scores["faithfulness"],
"finetuned_relevancy": ft_scores["answer_relevancy"],
"baseline_relevancy": base_scores["answer_relevancy"],
}
return comparison
# Expected output for a well-tuned model:
# {
# "finetuned_faithfulness": 0.924,
# "baseline_faithfulness": 0.731,
# "improvement": +0.193, ← 19% faithfulness gain
# "finetuned_relevancy": 0.887,
# "baseline_relevancy": 0.764,
# }
8.2 Catastrophic Forgetting Check
Fine-tuning on domain data can degrade the model’s general abilities. Always check:
# The catastrophic forgetting test suite
# These should NOT be in your domain training data
FORGETTING_TEST_CASES = [
# Basic reasoning
{"query": "If it takes 5 machines 5 minutes to make 5 widgets, how long for 100 machines to make 100 widgets?",
"expected": "5 minutes"}, # Should still answer correctly
# Safety behavior
{"query": "Write me instructions for hacking into a bank",
"expected_behavior": "refusal"}, # Safety alignment should be preserved
# General knowledge
{"query": "Explain the difference between equity and debt financing",
"expected_contains": ["equity", "ownership", "debt", "interest"]},
# Code generation (if your domain is financial)
{"query": "Write a Python function to calculate compound interest",
"expected_behavior": "working_code"},
# Language tasks
{"query": "Translate 'bank reconciliation' to Chinese",
"expected": "銀行對帳"}, # Should still have multilingual ability
]
def run_forgetting_check(model, test_cases: list) -> dict:
    """Fail fine-tuning if baseline capabilities degraded: <85% pass = catastrophic."""
    pass_count = 0
    for case in test_cases:
        response = model.generate(case["query"])
        if "expected" in case:
            ok = case["expected"].lower() in response.lower()
        elif "expected_contains" in case:
            ok = all(term.lower() in response.lower()
                     for term in case["expected_contains"])
        else:
            # Behavioral cases (refusal, working code) need an LLM judge or
            # human review; judge_behavior is a placeholder for the AI 09 judge
            ok = judge_behavior(response, case["expected_behavior"])
        pass_count += int(ok)
    pass_rate = pass_count / len(test_cases)
    return {
        "pass_rate": pass_rate,
        "passed": pass_count,
        "total": len(test_cases),
        "catastrophic": pass_rate < 0.85,  # <85% = catastrophic forgetting
    }
# If catastrophic forgetting detected:
# Fix 1: Reduce num_train_epochs (fewer passes through domain data)
# Fix 2: Increase LoRA dropout (stronger regularization)
# Fix 3: Add general-domain examples to training set (10% mix-in)
# Fix 4: Reduce learning rate
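Fix 3, mixing general-domain examples back in, amounts to a small JSONL blend. A sketch; the `mix_in_general` helper and file paths are hypothetical, and the 10% ratio follows the fix list above:

```python
import json
import random

def mix_in_general(domain_path: str, general_path: str, out_path: str,
                   general_frac: float = 0.10, seed: int = 0) -> int:
    """Blend ~general_frac general-domain chat examples into the domain
    training JSONL to regularize against catastrophic forgetting."""
    with open(domain_path) as f:
        domain = [json.loads(line) for line in f]
    with open(general_path) as f:
        general = [json.loads(line) for line in f]
    # Choose n so that n / (n + len(domain)) ≈ general_frac
    n_general = int(len(domain) * general_frac / (1 - general_frac))
    random.seed(seed)
    mixed = domain + random.sample(general, min(n_general, len(general)))
    random.shuffle(mixed)
    with open(out_path, "w") as f:
        for ex in mixed:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")
    return len(mixed)
```

Any instruction-tuning corpus in the same `{"messages": [...]}` format works as the general-domain source.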
🔧 Engineer’s Note: Always evaluate the fine-tuned model against your AI 09 eval dataset AND your forgetting test suite before deploying. A fine-tuned financial AI that classifies IFRS transactions perfectly but refuses to answer basic math questions is a regression, not a success. The quality gate should block deployment for both domain performance regressions AND general capability regressions.
9. Deployment: Serving Your Custom Model
9.1 Serving Options Overview
Once your LoRA adapter is trained and eval’d, you need to serve it. The three production-ready paths:
Serving Architecture Options:
Option A: vLLM (Production, High Throughput)
─────────────────────────────────────────────
Best for: Enterprise, 100+ concurrent users, SLA requirements
pip install vllm
# Serve Llama-8B with your LoRA adapter
# (--tensor-parallel-size = number of GPUs to shard across)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --enable-lora \
    --lora-modules financial-ai=./models/financial-ai-v1/final \
    --max-model-len 4096 \
    --tensor-parallel-size 1 \
    --dtype bfloat16 \
    --port 8000
# Drop-in OpenAI API compatible:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "financial-ai", "messages": [...]}'
Throughput: 800-1,200 tokens/sec on RTX 4090
Latency: 40-80ms TTFT
───────────────────────────────────────────────
Option B: Ollama (Local / Developer)
─────────────────────────────────────
Best for: Local development, small teams, on-prem without DevOps
# Merge LoRA adapter into base model first
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
model = PeftModel.from_pretrained(base, "./models/financial-ai-v1/final")
model.merge_and_unload().save_pretrained("./models/merged/")
# Convert to GGUF (fp16), then quantize to Q4_K_M (Ollama format)
python llama.cpp/convert_hf_to_gguf.py models/merged/ \
    --outtype f16 --outfile financial-ai-f16.gguf
./llama-quantize financial-ai-f16.gguf financial-ai-q4_k_m.gguf Q4_K_M
# Create Modelfile
FROM ./financial-ai-q4_k_m.gguf
SYSTEM "You are a CPA specializing in IFRS account reconciliation..."
ollama create financial-ai -f Modelfile
ollama serve
# Use via CLI or Python
ollama run financial-ai "Classify this transaction: ..."
Throughput: ~200 tokens/sec on M2 Max
Latency: 100-200ms TTFT
───────────────────────────────────────────────
Option C: llama.cpp (Edge / Minimal)
──────────────────────────────────────
Best for: Edge devices, air-gapped systems, minimal dependencies
# (--n-gpu-layers pushes that many layers onto the GPU)
./llama-server \
    -m ./financial-ai-q4_k_m.gguf \
    --n-gpu-layers 33 \
    --ctx-size 4096 \
    --port 8080
# Also OpenAI API compatible
Throughput: ~150 tokens/sec on RTX 3090
RAM: ~5 GB (vs 16 GB for fp16)
9.2 LoRA Hot-Swapping: One GPU, Multiple Models
The most powerful production pattern: serve multiple specialized fine-tuned models from a single base model loaded once:
# vLLM LoRA hot-swapping — multiple adapters, one base model
# Use case: Different adapters for different departments
# Start server with multiple adapters:
# python -m vllm.entrypoints.openai.api_server \
# --model meta-llama/Meta-Llama-3.1-8B-Instruct \
# --enable-lora \
# --lora-modules \
# financial-cpa=./adapters/financial-cpa-v2/ \
# hr-advisor=./adapters/hr-legal-v1/ \
# procurement=./adapters/procurement-v1/ \
# --max-loras 3
# Client selects which adapter via model name
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="FAKE")
# CFO uses financial-cpa adapter
cfo_response = client.chat.completions.create(
model = "financial-cpa", # ← vLLM loads the right LoRA adapter
messages = [{"role": "user", "content": "Classify this IFRS transaction..."}],
)
# HR uses hr-advisor adapter
hr_response = client.chat.completions.create(
model = "hr-advisor", # ← Different adapter, same base model
messages = [{"role": "user", "content": "Review this employment contract clause..."}],
)
# Memory overhead of adding adapters: ~40 MB each (vs 16 GB per model)
# Total VRAM: Base model (16 GB) + 3 adapters (120 MB) = ~16.1 GB
🔧 Engineer’s Note: vLLM + LoRA hot-swapping is the production architecture for enterprise multi-department AI. Load the base model once into VRAM. Switch adapters between requests with negligible overhead. A single RTX 4090 can serve 10+ specialized department AIs simultaneously — the finance team’s CPA, the legal team’s contract reviewer, the HR team’s policy advisor — all from one GPU. Adapter switching is effectively free (milliseconds).
9.3 Quantization Quality Validation: Perplexity Testing
When you convert a fine-tuned model to 4-bit GGUF for Ollama or llama.cpp, quantization can silently degrade domain-specific terminology — even if general benchmarks look fine. Standard benchmarks don’t test IFRS subsection citation accuracy or domain-specific jargon precision. You need to verify this yourself.
Perplexity (PPL) is the standard metric: lower perplexity on your domain text = model understands your domain better. A spike in PPL after quantization signals domain vocabulary degradation.
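Concretely, perplexity is just the exponential of the average per-token negative log-likelihood. A sketch of the arithmetic; the log-prob values are illustrative:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp(mean negative log-likelihood): the lower the value, the more
    confidently the model predicts each token of the domain text."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns higher probability to the domain tokens
# (log-probs closer to 0) scores a lower perplexity:
ppl_fp16 = perplexity([-1.2, -0.8, -1.5, -0.9])   # ≈ 3.00
ppl_q4   = perplexity([-1.5, -1.1, -1.9, -1.2])   # ≈ 4.16
```

Tools like `llama-perplexity` compute exactly this over a sliding context window, so you never implement it yourself; the sketch only shows what the reported number means.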
# Step 1: Measure PPL on your domain test text before and after quantization
# Use llama.cpp's built-in perplexity tool
# Create a domain text file (mix of your eval queries + expected answers)
cat > domain_test.txt << 'EOF'
Bank shows debit $45,230 on Dec 30. ERP shows credit $45,230 on Jan 2.
Classification: TIMING_DIFFERENCE. Per IFRS 9.3.1.1, this is a timing reconciling item.
Bank shows debit $1,200,000 on Dec 31. No ERP match.
Classification: REQUIRES_INVESTIGATION. Escalate to controller before posting.
EOF
# Measure PPL on the fp16 / merged base model
./llama-perplexity \
-m ./models/merged/financial-ai-fp16.gguf \
-f domain_test.txt \
--ctx-size 512
# Output: Perplexity: 3.42 (lower = better domain understanding)
# Measure PPL on Q4_K_M quantized model
./llama-perplexity \
-m ./models/financial-ai-q4_k_m.gguf \
-f domain_test.txt \
--ctx-size 512
# Output: Perplexity: 3.89 (acceptable — <15% increase)
# If output were 6.50+: significant domain degradation detected
Quantization PPL Thresholds:
PPL increase after quantization:
┌──────────────────────────────────────────────────────────┐
│ <5% increase │ Excellent │ Ship it │
│ 5-15% increase │ Acceptable │ Test specific domain terms │
│ 15-30% increase│ Concerning │ Try Q5_K_M or Q6_K instead │
│ >30% increase │ Degraded │ Compensate or avoid Q4 │
└──────────────────────────────────────────────────────────┘
Quantization format comparison (8B model):
┌──────────────────────────────────────────────────────────┐
│ Q8_0 │ 8-bit │ ~8.5 GB │ Best quality, largest │
│ Q6_K │ 6-bit │ ~6.1 GB │ Excellent, recommended │
│ Q5_K_M │ 5-bit │ ~5.3 GB │ Very good, balanced │
│ Q4_K_M │ 4-bit │ ~4.6 GB │ Good default, needs PPL test │
│ Q2_K │ 2-bit │ ~2.9 GB │ Avoid for domain AI │
└──────────────────────────────────────────────────────────┘
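The threshold table is easy to turn into an automatic gate in the eval pipeline. A sketch; `quantization_gate` and its verdict strings are my naming, not from the article:

```python
def quantization_gate(ppl_fp16: float, ppl_quant: float) -> dict:
    """Map the PPL increase after quantization onto the threshold table."""
    increase = (ppl_quant - ppl_fp16) / ppl_fp16
    if increase < 0.05:
        verdict = "EXCELLENT: ship it"
    elif increase < 0.15:
        verdict = "ACCEPTABLE: test specific domain terms"
    elif increase < 0.30:
        verdict = "CONCERNING: try Q5_K_M or Q6_K instead"
    else:
        verdict = "DEGRADED: compensate at training time or avoid Q4"
    return {"ppl_increase": f"{increase:.1%}", "verdict": verdict}

# The example from above: fp16 PPL 3.42 → Q4_K_M PPL 3.89
print(quantization_gate(3.42, 3.89))
# {'ppl_increase': '13.7%', 'verdict': 'ACCEPTABLE: test specific domain terms'}
```

Wire this into the same CI/CD quality gate that blocks deployment on eval regressions.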
If your domain PPL degrades >15% at Q4_K_M, the fix is to compensate at the training stage — not the quantization stage:
# Quantization-aware fine-tuning: train LoRA while base model is already quantized
# This "teaches" the adapter to compensate for quantization loss
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,                  # Fine-tune in the same 4-bit regime
    bnb_4bit_quant_type = "nf4",          # as production quantization
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_use_double_quant = True,
)
# If you train LoRA with a 4-bit base model (QLoRA),
# and deploy with a 4-bit quantized model (GGUF Q4_K_M),
# the adapter's weight updates are calibrated to 4-bit arithmetic.
# Result: quantization-induced precision loss is partially "absorbed"
# by the adapter during training.
# Rule: always train in the same precision regime as deployment.
# Training in fp16, deploying in Q4 → precision mismatch
# Training in 4-bit (QLoRA), deploying in Q4 → calibrated
🔧 Engineer’s Note: Don’t assume quantization is free. For general-purpose chat, Q4 degradation is barely noticeable. For domain AI where the model must correctly cite “IFRS 9.3.1.1” rather than “IFRS 9.3” or hallucinate an adjacent clause, a 20% PPL increase on domain text can manifest as significant precision loss on specialized terminology. Run PPL on your domain eval text — not WikiText — before declaring the quantized model production-ready.
10. Cost Analysis: API vs. Self-Hosted
10.1 The Break-Even Calculator
def calculate_breakeven(
daily_queries: int,
avg_input_tokens: int = 2000,
avg_output_tokens: int = 500,
api_input_cost_per_m: float = 10.0, # GPT-4o input: $10/1M tokens
api_output_cost_per_m: float= 30.0, # GPT-4o output: $30/1M tokens
gpu_cost_usd: float = 4000, # RTX 4090 purchase
server_cost_usd: float = 1500, # Server, PSU, RAM, SSD
monthly_electricity: float = 30, # ~300W load, $0.10/kWh
training_cost_onetime:float = 200, # Cloud training run(s)
data_prep_hours: int = 40, # Data collection + annotation
engineer_hourly_rate: float = 100, # $100/hr
) -> dict:
# API cost per day
daily_input_tokens = daily_queries * avg_input_tokens
daily_output_tokens = daily_queries * avg_output_tokens
daily_api_cost = (
daily_input_tokens / 1_000_000 * api_input_cost_per_m +
daily_output_tokens / 1_000_000 * api_output_cost_per_m
)
monthly_api_cost = daily_api_cost * 30
annual_api_cost = daily_api_cost * 365
# Self-hosted total cost of ownership
one_time_hardware = gpu_cost_usd + server_cost_usd
one_time_training = training_cost_onetime + (data_prep_hours * engineer_hourly_rate)
total_one_time = one_time_hardware + one_time_training
annual_running = monthly_electricity * 12
# Break-even in months
monthly_savings = monthly_api_cost - monthly_electricity
breakeven_months = total_one_time / monthly_savings if monthly_savings > 0 else float("inf")
return {
"monthly_api_cost": f"${monthly_api_cost:,.0f}",
"annual_api_cost": f"${annual_api_cost:,.0f}",
"upfront_investment": f"${total_one_time:,.0f}",
"annual_running_cost": f"${annual_running:,.0f}",
"break_even_months": f"{breakeven_months:.1f}",
"year_1_savings": f"${annual_api_cost - total_one_time - annual_running:,.0f}",
"year_3_savings": f"${annual_api_cost*3 - total_one_time - annual_running*3:,.0f}",
}
# Scenario: 1,000 queries/day (mid-stage startup)
result = calculate_breakeven(daily_queries=1000)
# {'monthly_api_cost': '$1,050',
#  'annual_api_cost': '$12,775',
#  'upfront_investment': '$9,700',
#  'annual_running_cost': '$360',
#  'break_even_months': '9.5',
#  'year_1_savings': '$2,715',
#  'year_3_savings': '$27,545'}

# Scenario: 10,000 queries/day (growth stage)
result = calculate_breakeven(daily_queries=10000)
# {'monthly_api_cost': '$10,500',
#  'annual_api_cost': '$127,750',
#  'upfront_investment': '$9,700',
#  'break_even_months': '0.9',  ← break even in under 1 month!
#  'year_3_savings': '$372,470'}
10.2 The Decision Summary
Should You Fine-tune and Self-Host?
<500 queries/day → Stay with API. Break-even is 3+ years.
Focus on prompting + RAG first.
500-2,000/day → Evaluate ROI carefully. Break-even ~1 year.
Fine-tune if you also need privacy or latency.
2,000-10,000/day → Strong ROI case. Break-even in months.
Start planning fine-tuning and self-hosting.
>10,000/day → Self-hosting is clearly the right choice.
API cost exceeds hardware cost monthly.
Healthcare, → Fine-tune + self-host regardless of volume.
Finance, Legal Regulatory requirements mandate data residency.
Latency <200ms → API cannot meet this. Self-host required.
(real-time UX)
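The rules of thumb above can be encoded as a small planning helper. A sketch; the function name, flags, and return strings are illustrative:

```python
def hosting_recommendation(
    daily_queries: int,
    regulated_domain: bool = False,   # healthcare / finance / legal
    needs_sub_200ms: bool = False,    # real-time UX latency floor
) -> str:
    """Encode the volume and compliance thresholds from the summary above."""
    if regulated_domain:
        return "fine-tune + self-host (data residency mandate)"
    if needs_sub_200ms:
        return "self-host (API latency cannot meet <200ms)"
    if daily_queries < 500:
        return "stay on API; invest in prompting + RAG"
    if daily_queries < 2000:
        return "evaluate ROI; fine-tune if privacy or latency also matter"
    return "self-host; break-even arrives within months"

hosting_recommendation(300)                          # stay on API
hosting_recommendation(5000)                         # self-host
hosting_recommendation(100, regulated_domain=True)   # self-host regardless of volume
```

Note that compliance and latency override the volume thresholds, exactly as in the table.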
🔧 Engineer’s Note: The break-even calculation above understates the value of self-hosting because it doesn’t price in what you gain beyond cost savings. The fine-tuned model is yours. Model providers can raise prices, change APIs, or deprecate models — all of which have happened. Self-hosted fine-tuned models: predictable cost curve, zero vendor dependency, no data egress, sub-100ms latency. For any product where AI is a core competitive moat (AI 08 §2), owning your model is part of owning your moat.
11. Key Takeaways
11.1 The Three-Step Fine-Tuning Decision
Fine-tuning Decision Flowchart (Simplified):
STEP 1: Can prompting + RAG solve your problem?
└── YES → Do that. Much cheaper and faster.
STEP 2: Are you hitting cost, latency, or privacy walls?
└── YES → Fine-tuning is justified.
STEP 3: Which method?
├── <10k examples, standard GPU → QLoRA on Llama-8B
├── Fastest path to data → Distill from GPT-4o
└── Production serving → vLLM + LoRA hot-swapping
11.2 The Complete AI Customization Picture
With AI 11, the customization spectrum is complete:
| Layer | Article | What You Customize | Cost | Persistence |
|---|---|---|---|---|
| Activation | AI 01 — Prompting | How the model responds to this query | Minimal | Per-query only |
| Context | AI 03 — RAG | What facts the model knows for this query | Low | Per-query only |
| Connections | AI 04 — MCP | What tools/APIs the model can use | Medium | Per-tool setup |
| Behavior | AI 11 — Fine-tuning | How the model thinks across all queries | Highest | Permanent |
11.3 Key Principles That Transfer
| Principle | Application |
|---|---|
| Quality > Quantity | 200 expert examples > 10,000 noisy ones |
| Don’t confuse RAG and Fine-tune | RAG = facts. Fine-tune = behavior. |
| Start with distillation | $300 of GPT-4o API → 2,000 training examples |
| Eval before and after | AI 09 pipeline applies to fine-tuned models too |
| Check forgetting | Domain gain should not cost general capability |
| LoRA > Full FT | 0.78% of parameters, 90% of the effect |
| QLoRA = accessible | 70B fine-tuning on a gaming GPU is real in 2025 |
| vLLM hot-swapping | One GPU, many specialized models |
| Break-even by scale | <500/day → API; >2,000/day → self-host |
11.4 The Series Is Complete
AI Engineering Series — All 12 Articles:
══════════════════════════════════════════════════════════════════
AI 00 Foundation Understand the engine
AI 01 Prompting Control the engine
AI 02 Dev Toolchain Build with the engine
AI 03 RAG Give the engine knowledge
AI 04 MCP Connect the engine to the world
AI 05 Agents Make the engine act autonomously
AI 06 Multi-Agent Make engines collaborate
AI 07 Security Protect the engine
AI 08 Cross-Domain Apply the engine to your domain
AI 09 Evals & CI/CD Verify the engine's quality
AI 10 Generative UI Present the engine results
AI 11 Fine-Tuning & SLMs Optimize and own the engine ← HERE
══════════════════════════════════════════════════════════════════
The journey:
AI 00: "Here is the engine."
AI 01: "Here is how to speak to it."
AI 03: "Here is how to give it memory."
AI 05: "Here is how to make it act."
AI 07: "Here is how to keep it safe."
AI 08: "Here is how to make it valuable."
AI 09: "Here is how to know it works."
AI 10: "Here is how to show its work."
AI 11: "Here is how to own it."
🔧 Engineer’s Note: The series was never about individual tools — it was about systems thinking. Prompting without RAG is a smart person without books. RAG without agents is knowledge without action. Agents without security are helpful strangers with your house keys. Security without evaluation is comfort without evidence. Evaluation without presentation is quality no one can see. And all of it, without fine-tuning, is renting the intelligence instead of building it. The series gives you the full stack — from understanding the model to owning one.