Chapter 30: Fine-Tuning for Infrastructure
The sovereign-analyst model from Chapter 29 knows how to format incident runbooks and reason about Proxmox CLI commands, but it has never encountered arc_miss_plus_io_saturated, does not know that pve7 through pve14 are in Rack C on leaf_c, and cannot produce a runbook specific to the compound condition zfs_resilvering_under_cpu_pressure as defined in the Chapter 22 alert_condition/4 registry. Quantized Low-Rank Adaptation (QLoRA) corrects this by training a small adapter on JSONL pairs derived directly from the WAM's own knowledge base — topology facts, alert schemas, and remediation procedures — and merging the result back into a deployable GGUF without the model or its training data leaving the cluster's physical hardware.
30.1 The Mathematics of QLoRA
30.1.1 Why Full Fine-Tuning Is Impossible at 16GB
Full fine-tuning of an 8-billion-parameter model requires keeping the model weights in memory simultaneously with the optimizer state. The Adam optimizer, standard for transformer fine-tuning, maintains two additional tensors per parameter: a first-moment estimate (mean of gradients) and a second-moment estimate (variance of gradients). Both are maintained in 32-bit floating point to preserve numerical stability, regardless of the precision used to store the model weights.
The VRAM arithmetic for full fine-tuning in BFloat16:
Base weights: 8B × 2 bytes (BF16) = 16.00 GB
Gradient buffer: 8B × 2 bytes (BF16) = 16.00 GB
Adam first moment: 8B × 4 bytes (FP32) = 32.00 GB
Adam second moment: 8B × 4 bytes (FP32) = 32.00 GB
Activation memory: forward pass (8K context) ≈ 6.00 GB
─────────────────────────────────────────────────────────
Total: ≈ 102.00 GB
The RTX 4080 Super has 16 GB. Holding roughly 102 GB takes at least seven such cards (102 / 16 ≈ 6.4), before any allowance for multi-GPU communication buffers. A single consumer GPU cannot run full fine-tuning of this model.
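The arithmetic above can be reproduced with a short sketch. The function name and defaults are illustrative assumptions, not part of the chapter's tooling, and gigabytes are decimal (1e9 bytes) to match the table.

```python
import math

GB = 1e9  # decimal gigabytes, matching the table's arithmetic


def full_finetune_vram_gb(n_params, weight_bytes=2, grad_bytes=2,
                          adam_state_bytes=4, activation_gb=6.0):
    """Return (breakdown, total) in GB for weights, gradients and two Adam moments.

    activation_gb is taken from the table as a fixed estimate, not derived here.
    """
    breakdown = {
        "weights": n_params * weight_bytes / GB,
        "gradients": n_params * grad_bytes / GB,
        "adam_first_moment": n_params * adam_state_bytes / GB,
        "adam_second_moment": n_params * adam_state_bytes / GB,
        "activations": activation_gb,
    }
    return breakdown, sum(breakdown.values())


breakdown, total = full_finetune_vram_gb(8e9)
cards_needed = math.ceil(total / 16)  # 16 GB per RTX 4080 Super

for name, gb in breakdown.items():
    print(f"{name:20s} {gb:7.2f} GB")
print(f"{'total':20s} {total:7.2f} GB -> {cards_needed} cards of 16 GB")
```

The optimizer state dominates: the two FP32 moments alone account for 64 GB, which is why freezing the base weights removes most of the cost.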
30.1.2 QLoRA: Freezing Quantised Weights, Training Low-Rank Adapters
QLoRA (Quantized Low-Rank Adaptation) makes fine-tuning feasible on a single 16 GB GPU through two simultaneous techniques.
Quantised base weights. The 8B base model is loaded in 4-bit NormalFloat (NF4) format using the bitsandbytes library. NF4 is an information-theoretically optimal 4-bit quantisation for normally distributed weights, outperforming uniform 4-bit quantisation on downstream task quality. The base weights are frozen — their values are not updated during training. They occupy approximately 4.5 GB of VRAM and consume no gradient or optimizer memory.
Low-rank adapter matrices. For each selected weight matrix W₀ in the model (the attention projection matrices: q_proj, k_proj, v_proj, o_proj), QLoRA injects a parallel low-rank decomposition ΔW = B × A, where A ∈ ℝ^{r×d} and B ∈ ℝ^{d×r}. The rank r is a hyperparameter, typically 8 to 64. Only the matrices A and B have trainable parameters. The forward pass computes h = W₀x + α/r × BAx, where α (lora_alpha) is a scaling factor. The base model output and the adapter output are summed.
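As a rough illustration of the forward pass just described, the sketch below evaluates h = W₀x + (α/r)·BAx on toy matrices and checks that it matches a single multiplication by W₀ + (α/r)·BA. The sizes, seed, and helper names are assumptions chosen for readability; real adapters initialise B to zero and operate on GPU tensors.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def lora_forward(w0, a, b, x, alpha, r):
    """h = W0 x + (alpha / r) * B (A x): base path plus scaled adapter path."""
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / r
    return [h + scale * u for h, u in zip(base, update)]


def merge(w0, a, b, alpha, r):
    """W_merged = W0 + (alpha / r) * B A."""
    scale = alpha / r
    ba = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, ba)]


random.seed(0)
d, r, alpha = 6, 2, 4  # toy sizes; the chapter uses d = 4096, r = 16, alpha = 32
w0 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen, d x d
a = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]   # trainable, r x d
b = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]   # trainable, d x r
x = [random.gauss(0, 1) for _ in range(d)]

h_adapter = lora_forward(w0, a, b, x, alpha, r)
h_merged = matvec(merge(w0, a, b, alpha, r), x)
max_diff = max(abs(p - q) for p, q in zip(h_adapter, h_merged))
print(f"max difference between adapter path and merged weights: {max_diff:.2e}")
```

The two results agree up to floating-point rounding, which is the property that lets a trained adapter be folded into the base weights with no extra inference cost.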
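The adapter parameter count for rank r = 16 across all four attention projections in Llama-3-8B (32 layers, hidden dimension 4096). Llama-3-8B uses grouped-query attention with 8 key/value heads of dimension 128, so k_proj and v_proj map 4096 inputs to 1024 outputs, while q_proj and o_proj are 4096 × 4096:
Per projection per layer: r × d_in + d_out × r
q_proj, o_proj: 16 × 4096 + 4096 × 16 = 131,072 parameters each
k_proj, v_proj: 16 × 4096 + 1024 × 16 = 81,920 parameters each
Per layer: 2 × 131,072 + 2 × 81,920 = 425,984
32 layers: 32 × 425,984 = 13,631,488
Total adapter parameters: ~13.6M (0.17% of the 8B base). Treating all four projections as square gives the upper bound 4 × 32 × 131,072 = 16,777,216 (~16.8M, 0.21%); the VRAM budget that follows uses this upper bound.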
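The count can be checked with a few lines of Python. This is a sketch under one stated assumption: the projection shapes are those of Llama-3-8B's grouped-query attention (8 key/value heads of dimension 128, so k_proj and v_proj map 4096 to 1024). The helper names are illustrative.

```python
def lora_params(d_in, d_out, r):
    """Parameters in one adapter: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r


def adapter_total(shapes, r, n_layers):
    """Sum adapter parameters over all targeted projections and layers."""
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
    return n_layers * per_layer


# Simplified view: every projection treated as a square 4096 x 4096 matrix.
square = {name: (4096, 4096) for name in ("q_proj", "k_proj", "v_proj", "o_proj")}

# Assumed Llama-3-8B shapes (d_in, d_out) with grouped-query attention.
gqa = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

square_total = adapter_total(square, r=16, n_layers=32)
gqa_total = adapter_total(gqa, r=16, n_layers=32)
print(f"square assumption: {square_total:,}")
print(f"GQA shapes:        {gqa_total:,}")
```

The square assumption yields the 16,777,216 figure; with the narrower key and value projections the total is 13,631,488, so budgeting for 16.8M is slightly conservative.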
The revised VRAM arithmetic:
Frozen base weights (NF4): 8B × 0.5 bytes + quantisation constants ≈ 4.50 GB
Adapter weights (BF16): 16.8M × 2 bytes = 0.03 GB
Adapter gradients (BF16): 16.8M × 2 bytes = 0.03 GB
Adam first moment (FP32): 16.8M × 4 bytes = 0.07 GB
Adam second moment (FP32): 16.8M × 4 bytes = 0.07 GB
Activation memory (8K ctx): ≈ 5.00 GB
─────────────────────────────────────────────────────────────
Total: ≈ 9.70 GB
9.7 GB against a 16 GB VRAM budget leaves 6.3 GB of headroom — sufficient to run a batch size of 2 with gradient accumulation, without activating CUDA's unified memory fallback.
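A quick way to sanity-check the headroom figure is to sum the line items directly. The sketch below uses the table's own numbers (16.8M adapter parameters, 4.5 GB for the NF4 base, 5 GB of activations); the dictionary keys are illustrative names only.

```python
GB = 1e9
VRAM_BUDGET_GB = 16.0
ADAPTER_PARAMS = 16.8e6  # adapter size used in the table above

line_items_gb = {
    "frozen_base_nf4": 4.50,                      # table value, includes quantisation constants
    "adapter_weights_bf16": ADAPTER_PARAMS * 2 / GB,
    "adapter_grads_bf16": ADAPTER_PARAMS * 2 / GB,
    "adam_first_moment_fp32": ADAPTER_PARAMS * 4 / GB,
    "adam_second_moment_fp32": ADAPTER_PARAMS * 4 / GB,
    "activations_8k_ctx": 5.00,                   # table estimate, not derived here
}

total_gb = sum(line_items_gb.values())
headroom_gb = VRAM_BUDGET_GB - total_gb
print(f"total {total_gb:.2f} GB, headroom {headroom_gb:.2f} GB")
```

Everything that scales with the adapter sums to about 0.2 GB; the budget is dominated by the frozen base and the activations, so context length and batch size are the levers that matter.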
30.1.3 Hyperparameters: Rank, Alpha, and Dropout
Rank (r). The rank of the adapter matrices determines their expressiveness. Ranks as low as 4 or 8 can already capture useful weight updates; r = 64 gives the adapter more capacity at the cost of higher VRAM and training time, although the update is still low-rank rather than equivalent to full fine-tuning. For domain adaptation (teaching the model a specific vocabulary and factual schema without changing its fundamental reasoning capability), r = 16 is a common choice. Higher ranks risk overfitting to a small dataset; lower ranks may fail to capture domain-specific distinctions.
Alpha (lora_alpha). The alpha controls the effective learning rate of the adapter relative to the frozen base weights. The convention is lora_alpha = 2 × r, which sets the scaling factor α/r = 2. A higher ratio amplifies adapter updates; a lower ratio keeps them conservative. For fine-tuning on a small infrastructure dataset (hundreds to low thousands of examples), lora_alpha = 32 with r = 16 is appropriate: the adapter learns the domain vocabulary without overwriting the base model's grammatical and logical structure.
Dropout. The lora_dropout parameter applies dropout to the adapter activations during training, preventing the adapter from memorising individual training examples. lora_dropout = 0.05 (5% dropout) is appropriate for datasets below 5,000 examples. Higher dropout is counterproductive at small dataset sizes because the signal-to-noise ratio is already low; lower dropout risks rote memorisation of the training JSONL rather than generalisation of the domain schema.
30.2 The Build: Prolog to JSONL Dataset Curation
30.2.1 Extraction Architecture
The fine-tuning dataset must encode the precise factual relationships that the base model lacks: which alert conditions exist in alert_dispatcher.pl, what their compound logic means operationally, which nodes are in which racks, and what the correct expert runbook looks like for each alert. The ground truth for all of this lives in the WAM. The extraction script queries the WAM directly, ensuring the training data is byte-identical to the logic the cluster actually runs.
The output format is ChatML JSONL — one JSON object per line, each containing a messages array with system, user, and assistant roles. This is the instruction-tuning format expected by trl's SFTTrainer and the format the sovereign-analyst Modelfile uses at inference time, ensuring no format mismatch between training and serving.
30.2.2 export_training_data.pl
% File: /opt/logic-node/kb/export_training_data.pl
%
% Generates ChatML-formatted JSONL training pairs from the WAM knowledge base.
% Each pair teaches the model a specific alert schema, its Prolog ground truth,
% and the canonical expert runbook response.
%
% Usage:
% swipl -l alert_dispatcher.pl \
% -l proxmox_topology.pl \
% -l live_state.pl \
% -l export_training_data.pl \
% -g "export_training_data:run('/tmp/sovereign_training.jsonl'), halt"
:- module(export_training_data, [run/1]).
:- use_module(library(http/json)).
:- use_module(library(lists)).
:- use_module(alert_dispatcher).
:- use_module(proxmox_topology).
% ── System prompt ─────────────────────────────────────────────────────────────
% Identical to the Modelfile system prompt in Chapter 29 §29.3.4.
% Must be kept in sync: the training system prompt and the inference system
% prompt must match exactly or the model will encounter a domain shift between
% training and serving.
system_prompt_atom('You are a Proxmox infrastructure incident analyst \c
operating in a sovereign, air-gapped environment. You will be given \c
structured alerts from an automated monitoring system. Produce a concise \c
operational runbook for the on-call engineer.\n\n\c
Format your response as:\n\c
## Probable Cause\n\c
One to three sentences describing the most likely root cause.\n\n\c
## Immediate Actions\n\c
Numbered list of specific Proxmox diagnostic commands to run first.\n\n\c
## Escalation Criteria\n\c
Conditions under which this incident requires escalation beyond the runbook.\n\n\c
Use exact Proxmox CLI commands (pvesh, qm, pct, zpool) where applicable. \c
Do not fabricate metric values. Do not suggest actions that modify cluster \c
state autonomously.').
% ── Alert-condition training pairs ────────────────────────────────────────────
% alert_training_pair(+ConditionID, +NodeExample, -UserTurn, -AssistantTurn)
% Generates the user prompt and expert assistant response for one
% alert_condition/4 fact. NodeExample is a representative node atom
% from proxmox_topology:known_node/1 used to ground the example.
alert_training_pair(CondID, Node, UserTurn, AssistantTurn) :-
alert_condition(CondID, Severity, Description, _Goal),
proxmox_topology:known_node(Node),
format(atom(UserTurn),
'ALERT\n\
Node: ~w\n\
Condition: ~w\n\
Severity: ~w\n\
Detail: ~w\n\
Timestamp: 2026-03-10T03:12:44Z',
[Node, CondID, Severity, Description]),
runbook_response(CondID, Node, AssistantTurn).
% runbook_response(+ConditionID, +Node, -Response)
% Expert runbook for each registered alert_condition.
% These responses are authored by the operator and encode institutional
% knowledge — the specific CLI commands and thresholds that apply to
% this cluster's topology and SLAs.
runbook_response(cpu_steal_critical, Node, Response) :-
format(atom(Response),
'## Probable Cause\n\
CPU steal >= 40% on ~w indicates the hypervisor is sharing a physical CPU with \c
other VMs or host processes and being denied CPU time. Primary causes are NUMA \c
imbalance, an overcommitted physical host, or a runaway VM consuming all vCPU slots.\n\n\
## Immediate Actions\n\
1. Check steal per-vCPU: `pvesh get /nodes/~w/qemu --output-format json | jq .[].cpus`\n\
2. List CPU usage by VM: `qm list` followed by `qm monitor <vmid>` → `info cpuload`\n\
3. Check host steal directly: `vmstat 1 10` — look for non-zero `st` column\n\
4. Check NUMA locality: `numactl --hardware` on host ~w\n\
5. Identify top CPU consumer: `ps aux --sort=-%cpu | head -20` on host ~w\n\n\
## Escalation Criteria\n\
Escalate if steal remains >= 40% after VM eviction or if the steal source is \c
the Proxmox hypervisor process itself (indicates kernel scheduling bug or \c
hardware fault). Engage the Chapter 27 HA scheduler if the affected node is \c
sole host for any HA-tagged VM.',
[Node, Node, Node, Node]).
runbook_response(disk_latency_critical, Node, Response) :-
format(atom(Response),
'## Probable Cause\n\
Disk read latency >= 5ms on ~w indicates a hardware fault on the NVMe device, \c
an active ZFS resilver competing for I/O bandwidth, or a filesystem-level \c
bottleneck. At this threshold, VM disk I/O is already user-visible.\n\n\
## Immediate Actions\n\
1. Check NVMe health: `nvme smart-log /dev/nvme0 | grep -E "critical|error|wear"`\n\
2. Check ZFS pool status: `zpool status -v` on host ~w — look for resilver or DEGRADED\n\
3. Check I/O latency distribution: `zpool iostat -v 1 5` on host ~w\n\
4. Check current I/O waiters: `iostat -x 1 5 | grep -v ^$`\n\
5. Check ZFS ARC pressure: `arcstat 1 5` — low arc_hit_percent indicates thrashing\n\n\
## Escalation Criteria\n\
Escalate immediately if `zpool status` shows DEGRADED, FAULTED, or REMOVED. \c
Initiate Chapter 27 live migration of VMs off ~w if resilver is active and \c
latency exceeds 10ms. Replace NVMe if SMART indicates uncorrectable errors.',
[Node, Node, Node, Node]).
runbook_response(io_saturated_critical, Node, Response) :-
format(atom(Response),
'## Probable Cause\n\
NVMe I/O utilisation >= 95% on ~w means the device is at saturation and \c
request queuing is active. Causes: a single VM issuing burst random writes, \c
ZFS compression overhead on incompressible data, or snapshot-induced copy-on-write storms.\n\n\
## Immediate Actions\n\
1. Identify top I/O VM: `pvesh get /nodes/~w/qemu --output-format json | jq -r .[].diskwrite` — sort descending\n\
2. Check queue depth: `cat /sys/block/nvme0n1/queue/nr_requests`\n\
3. Check ARC eviction: `arcstat -f hit,miss,arcsz 1 5` — high miss rate → ARC undersized\n\
4. Check snapshot list: `zfs list -t snapshot | wc -l` — excessive snapshots cause COW pressure\n\
5. Throttle VM: `qm set <vmid> --ide0 ...,mbps_rd=200,mbps_wr=200`\n\n\
## Escalation Criteria\n\
Escalate if utilisation remains >= 95% after I/O throttle. Consider ZFS record \c
size tuning (64K for databases, 1M for bulk storage) to reduce amplification. \c
If NVMe shows write endurance < 10%, plan hardware replacement.',
[Node, Node]).
runbook_response(cpu_steal_plus_disk_degraded, Node, Response) :-
format(atom(Response),
'## Probable Cause\n\
Concurrent CPU steal >= 10% and disk latency >= 0.5ms on ~w indicate resource \c
contention where I/O completion is blocked by CPU scheduling delays. The SCSI \c
completion interrupt handler cannot run because the vCPU is stolen; disk \c
transactions queue behind their own completions.\n\n\
## Immediate Actions\n\
1. Correlate steal with latency timeline: `vmstat 1 30` — look for steal spikes \c
immediately preceding latency spikes\n\
2. Check VM vCPU pinning: `qm config <vmid> | grep cpu` — unpinned vCPUs are \c
vulnerable to steal\n\
3. Set I/O priority on critical VMs: `ionice -c 1 -n 0 -p $(pgrep qemu)`\n\
4. Check host process interference: `perf top -a` on host ~w for 30 seconds\n\
5. Evacuate lowest-priority VMs: consult Chapter 26 baseline scheduler for target\n\n\
## Escalation Criteria\n\
Escalate if steal + latency persist after VM evacuation. Compound condition \c
indicates the host is overcommitted at both CPU and I/O simultaneously — \c
a rack-level capacity failure, not a single-VM issue.',
[Node, Node]).
runbook_response(arc_miss_plus_io_saturated, Node, Response) :-
format(atom(Response),
'## Probable Cause\n\
ZFS ARC miss rate >= 20% concurrent with I/O utilisation >= 70% on ~w indicates \c
the ARC working set has been evicted — the pool is reading from disk for data \c
that was recently cached. Primary cause is ARC size reduction under memory \c
pressure from VM balloon expansion, or a workload shift to a dataset too large \c
for the current ARC allocation.\n\n\
## Immediate Actions\n\
1. Check ARC size: `cat /proc/spl/kstat/zfs/arcstats | grep -E "^c |^size|^min|^max"`\n\
2. Check ARC target: `arcstat -f arcsz,c,hit,miss 1 5`\n\
3. Check memory pressure: `free -h` on host ~w — is ARC being compressed?\n\
4. Temporarily increase ARC minimum: \c
`echo <bytes> > /sys/module/zfs/parameters/zfs_arc_min`\n\
5. Check which pool is missing: `zpool iostat -v | sort -k5 -rn | head -5`\n\n\
## Escalation Criteria\n\
Escalate if ARC miss rate does not recover within 10 minutes of increasing \c
arc_min. If memory is genuinely exhausted (free < 2GB), the Chapter 26 \c
scheduler must evict VMs from ~w before ZFS performance degrades further.',
[Node, Node, Node]).
runbook_response(zfs_resilvering_under_cpu_pressure, Node, Response) :-
format(atom(Response),
'## Probable Cause\n\
ZFS resilver (arc_miss >= 20%) concurrent with CPU steal >= 10% on ~w is \c
the highest-priority compound condition in the registry. The resilver is \c
reading from the mirror partner across the storage network; that I/O is \c
generating completion interrupts that cannot be processed because vCPUs are \c
stolen. The resilver will take significantly longer than its ETA estimate, \c
leaving the pool DEGRADED for an extended window.\n\n\
## Immediate Actions\n\
1. Confirm resilver active: `zpool status -v | grep -A5 resilver`\n\
2. Check resilver progress: `zpool status | grep scan` — note elapsed and remaining\n\
3. Throttle resilver to reduce I/O pressure: \c
`echo 10 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms` (default 27)\n\
4. Address CPU steal: evacuate lowest-priority VMs off ~w via Chapter 27 scheduler\n\
5. Monitor resilver ETA hourly: `watch -n 3600 ''zpool status | grep scan''`\n\
6. Do not reboot ~w during resilver — an interrupted resilver restarts from zero\n\n\
## Escalation Criteria\n\
This is already a critical compound condition. Escalate immediately if a second \c
disk fault is detected on the same pool (pool would enter FAULTED state with no \c
redundancy). Page the on-call storage engineer if resilver ETA exceeds 4 hours \c
while CPU steal remains active.',
[Node, Node, Node]).
% ── Topology training pairs ───────────────────────────────────────────────────
% topology_training_pair(-UserTurn, -AssistantTurn)
% Generates one training pair per rack teaching the model the
% cluster's physical layout.
topology_training_pair(UserTurn, AssistantTurn) :-
member(Rack-Nodes, [
rack_a - [pve1, pve2, pve3],
rack_b - [pve4, pve5, pve6],
rack_c - [pve7, pve8, pve9, pve10, pve11, pve12, pve13, pve14]
]),
Nodes = [FirstNode|_], % use the first node of the rack as the example
format(atom(UserTurn),
'Which physical rack contains node ~w, and what are all the nodes in that rack?',
[FirstNode]),
atomic_list_concat(Nodes, ', ', NodeList),
format(atom(AssistantTurn),
'Node ~w is in ~w. The complete node list for ~w is: ~w. \c
This rack constitutes one independent failure domain — a power or \c
top-of-rack switch failure affects only these nodes. The Chapter 27 \c
HA scheduler ensures no HA group places two replicas within the same rack.',
[FirstNode, Rack, Rack, NodeList]).
% ── JSONL output ──────────────────────────────────────────────────────────────
% write_jsonl_pair(+Stream, +SystemPrompt, +UserTurn, +AssistantTurn)
% Serialises one ChatML training example to Stream in JSONL format.
write_jsonl_pair(Stream, SysPrompt, User, Assistant) :-
Messages = [
json([role = system, content = SysPrompt]),
json([role = user, content = User]),
json([role = assistant, content = Assistant])
],
Obj = json([messages = Messages]),
with_output_to(string(Line), json_write(current_output, Obj, [width(0)])),
writeln(Stream, Line).
% ── Entry point ───────────────────────────────────────────────────────────────
run(OutputPath) :-
system_prompt_atom(SysPrompt),
setup_call_cleanup(
open(OutputPath, write, Stream),
(
% Alert-condition pairs: one example per (condition, node) pair.
% Sample up to three representative nodes (the first three known nodes)
% per condition to create variety without exhausting every permutation.
findall(CondID, alert_condition(CondID, _, _, _), CondIDs),
findall(Node, proxmox_topology:known_node(Node), AllNodes),
length(AllNodes, NNodes),
SampleSize is min(3, NNodes),
length(SampleNodes, SampleSize),
append(SampleNodes, _, AllNodes), % take first SampleSize nodes
forall(
( member(CondID, CondIDs),
member(Node, SampleNodes),
alert_training_pair(CondID, Node, User, Assistant)
),
write_jsonl_pair(Stream, SysPrompt, User, Assistant)
),
% Topology pairs: one per rack.
forall(
topology_training_pair(User2, Asst2),
write_jsonl_pair(Stream, SysPrompt, User2, Asst2)
),
% Count written pairs:
aggregate_all(count,
alert_condition(_, _, _, _),
NConditions),
TotalPairs is NConditions * SampleSize + 3,
format("Wrote ~w training pairs to ~w~n", [TotalPairs, OutputPath])
),
close(Stream)
).
30.2.3 Running the Export
root@logic-node-01:~# swipl \
-l /opt/logic-node/kb/alert_dispatcher.pl \
-l /opt/logic-node/kb/proxmox_topology.pl \
-l /opt/logic-node/kb/live_state.pl \
-l /opt/logic-node/kb/export_training_data.pl \
-g "export_training_data:run('/tmp/sovereign_training.jsonl'), halt"
Wrote 21 training pairs to /tmp/sovereign_training.jsonl
# Transfer to the AI VM (VLAN 40, air-gap preserved — no internet transit):
root@logic-node-01:~# rsync -avz --progress \
/tmp/sovereign_training.jsonl \
[email protected]:/opt/training/sovereign_training.jsonl
# Verify line count (one JSON object per line):
root@logic-node-01:~# wc -l /tmp/sovereign_training.jsonl
21 /tmp/sovereign_training.jsonl
# Verify a sample record is valid JSON:
root@logic-node-01:~# head -1 /tmp/sovereign_training.jsonl | python3 -m json.tool | head -8
{
"messages": [
{"role": "system", "content": "You are a Proxmox infrastructure..."},
{"role": "user", "content": "ALERT\nNode: pve1\nCondition: cpu_steal_critical..."},
{"role": "assistant", "content": "## Probable Cause\nCPU steal >= 40%..."}
]
}
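The manual checks above can also be scripted. The following is a minimal sketch of a validator for the exported file, assuming the record shape shown in the sample (a `messages` array with system, user, and assistant roles in that order). The function names are hypothetical and not part of the chapter's tooling.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_jsonl_line(line):
    """Return a list of problems found in one JSONL record (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list):
        return ["missing 'messages' array"]
    problems = []
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected role order: {roles}")
    for m in messages:
        if (not isinstance(m, dict)
                or not isinstance(m.get("content"), str)
                or not m["content"].strip()):
            problems.append("empty or non-string content")
    return problems


def validate_file(path):
    """Yield (line_number, problems) for every bad line in a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for n, line in enumerate(fh, start=1):
            problems = validate_jsonl_line(line)
            if problems:
                yield n, problems


if __name__ == "__main__":
    good = json.dumps({"messages": [
        {"role": "system", "content": "You are a Proxmox infrastructure incident analyst."},
        {"role": "user", "content": "ALERT\nNode: pve1\nCondition: cpu_steal_critical"},
        {"role": "assistant", "content": "## Probable Cause\n..."},
    ]})
    print(validate_jsonl_line(good))         # no problems expected
    print(validate_jsonl_line("{not json"))  # reports a JSON error
```

Running `list(validate_file("/tmp/sovereign_training.jsonl"))` before the transfer would catch a record that lost its assistant turn, which `wc -l` alone cannot detect.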
30.3 The Build: The Python Fine-Tuning Pipeline
30.3.1 Environment Setup
All Python fine-tuning runs inside the AI VM (ai-inference-01, VMID 200, Ollama suspended for training). The RTX 4080 Super is exclusively available.
# Inside ai-inference-01:
root@ai-inference-01:~# ollama stop sovereign-analyst # unload the Chapter 29 model from VRAM during training
root@ai-inference-01:~# pip install \
transformers==4.41.0 \
peft==0.11.1 \
trl==0.8.6 \
bitsandbytes==0.43.1 \
datasets==2.19.1 \
accelerate==0.30.0 \
torch==2.3.0 \
--extra-index-url https://download.pytorch.org/whl/cu121
# Verify GPU is visible:
root@ai-inference-01:~# python3 -c "import torch; print(torch.cuda.get_device_name(0))"
NVIDIA GeForce RTX 4080 SUPER
30.3.2 train_sovereign_lora.py
#!/usr/bin/env python3
# File: /opt/training/train_sovereign_lora.py
#
# QLoRA fine-tuning of Llama-3-8B-Instruct on the sovereign infrastructure
# dataset generated by export_training_data.pl.
#
# Base model: Meta-Llama-3-8B-Instruct (HuggingFace format, downloaded offline)
# Dataset: /opt/training/sovereign_training.jsonl
# Output: /opt/training/sovereign-lora-adapter/
#
# Requires CUDA 12.1+, ~10GB VRAM, 16GB+ system RAM for HF model loading.
import os
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
# ── Paths ─────────────────────────────────────────────────────────────────────
BASE_MODEL_PATH = "/opt/training/models/Meta-Llama-3-8B-Instruct"
DATASET_PATH = "/opt/training/sovereign_training.jsonl"
OUTPUT_DIR = "/opt/training/sovereign-lora-adapter"
FINAL_GGUF_DIR = "/opt/training/merged"
# ── 4-bit quantisation configuration ─────────────────────────────────────────
# NF4: Normal Float 4-bit — information-theoretically optimal for
# normally distributed model weights (see §30.1.2).
# double_quant: quantise the quantisation constants themselves — saves
# ~0.4 GB additional VRAM with negligible quality loss.
# compute_dtype: BF16 provides wider dynamic range than FP16 for activations;
# critical for numerical stability during backward pass through frozen layers.
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.bfloat16,
)
# ── LoRA adapter configuration ────────────────────────────────────────────────
# target_modules: the four attention projections in each transformer block.
# q_proj: query projection — learns which tokens to attend to
# k_proj: key projection — learns what to be attended to
# v_proj: value projection — learns what information to extract
# o_proj: output projection — learns how to combine head outputs
# These four matrices account for the majority of the model's world-model
# encoding. Adapting them teaches the model new factual associations
# (alert names → runbook steps, rack IDs → node lists) without touching
# the feed-forward layers that encode grammatical and syntactic structure.
#
# r=16: rank — see §30.1.3
# lora_alpha=32: effective LR scaling = lora_alpha/r = 2.0
# lora_dropout=0.05: 5% dropout on adapter activations (small dataset guard)
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
bias="none",
inference_mode=False,
)
# ── Load base model ───────────────────────────────────────────────────────────
print("[train] Loading base model in 4-bit NF4...")
model = AutoModelForCausalLM.from_pretrained(
BASE_MODEL_PATH,
quantization_config=bnb_config,
device_map="auto", # places all layers on cuda:0 (RTX 4080 Super)
trust_remote_code=False, # never execute remote code from model repo
torch_dtype=torch.bfloat16,
)
model.config.use_cache = False # required for gradient checkpointing
model.config.pretraining_tp = 1
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH, trust_remote_code=False)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # left-padded sequences cause NaN in BF16
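# Prepare the 4-bit model for training (upcasts the non-quantised 16-bit
# parameters to FP32 and enables input gradients, which gradient checkpointing
# needs when the base weights are frozen), then inject the LoRA adapter:
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model = get_peft_model(model, lora_config)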
model.print_trainable_parameters()
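# Expected output: trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695
# (k_proj and v_proj are 4096 -> 1024 under grouped-query attention, so the
# count is below the square-matrix estimate of 4 × 32 × 131,072.)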
# ── Dataset ───────────────────────────────────────────────────────────────────
# Load as HuggingFace Dataset from the JSONL file:
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")
def format_chatml(example):
    """Convert messages array to a single ChatML-formatted string."""
    parts = []
    for msg in example["messages"]:
        role = msg["role"]
        content = msg["content"]
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>")
    # No trailing "<|im_start|>assistant\n" is appended: the assistant message
    # above already begins with that header, and an empty header at the end
    # would leave the completion-only collator with no tokens to compute loss on.
    return {"text": "\n".join(parts)}
dataset = dataset.map(format_chatml, remove_columns=["messages"])
# Completion-only collator: computes loss only on assistant turns,
# not on system/user turns. Prevents the model from "learning" to
# generate system prompts or user alerts — it learns only to produce
# expert assistant responses given the structured input.
response_template = "<|im_start|>assistant\n"
collator = DataCollatorForCompletionOnlyLM(
response_template=response_template,
tokenizer=tokenizer,
)
# ── Training arguments ────────────────────────────────────────────────────────
# Single-GPU configuration for RTX 4080 Super (16GB VRAM).
#
# per_device_train_batch_size=1: micro-batch processed per forward/backward pass.
# With 21 training examples and batch_size=1, one epoch = 21 micro-batches.
# gradient_accumulation_steps=4: effective batch size = 1 × 4 = 4.
# Gradients accumulate over 4 micro-batches before each weight update, simulating
# a batch of 4 without requiring 4x VRAM. One epoch is therefore 21 / 4 ≈ 5.25
# optimizer steps.
# max_steps=100: counts optimizer steps, so 100 steps = 400 examples ≈ 19 epochs
# over the 21-example dataset. At this many passes the adapter largely memorises
# the examples (see §30.3.3).
# learning_rate=2e-4: standard for LoRA fine-tuning. Higher than full
# fine-tuning rates (typically 1e-5) because only the adapter trains;
# the base weights are frozen and cannot be destabilised.
# warmup_ratio=0.03: 3% of max_steps (3 steps) of linear LR warmup.
# Prevents instability on the first gradient steps.
# lr_scheduler_type="cosine": cosine annealing to zero over max_steps.
# Reduces the effective LR gradually as training progresses, smoothing
# convergence on a small dataset.
# fp16=False, bf16=True: BF16 activations match the NF4 compute dtype.
# FP16 risks overflow on large activation values; BF16 does not.
# gradient_checkpointing=True: trades compute for VRAM — recomputes
# activations during backward pass rather than storing them.
# Reduces activation memory by ~60% at a 30% training speed cost.
training_args = TrainingArguments(
output_dir=OUTPUT_DIR,
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
max_steps=100,
learning_rate=2e-4,
warmup_ratio=0.03,
lr_scheduler_type="cosine",
fp16=False,
bf16=True,
logging_steps=10,
save_steps=50,
save_total_limit=2,
gradient_checkpointing=True,
optim="paged_adamw_8bit", # 8-bit Adam states: ~75% less optimizer VRAM than FP32 moments
dataloader_num_workers=0, # inside VM; fork is unreliable with CUDA
report_to="none", # no wandb/mlflow — air-gapped environment
remove_unused_columns=False,
)
# ── Trainer ───────────────────────────────────────────────────────────────────
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
data_collator=collator,
args=training_args,
dataset_text_field="text",
max_seq_length=2048, # covers the longest alert + runbook pair
packing=False, # do not pack multiple examples; completions
# must stay aligned with response_template
)
print("[train] Starting QLoRA fine-tuning...")
trainer.train()
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"[train] Adapter saved to {OUTPUT_DIR}")
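To make the completion-only loss concrete, the sketch below shows the masking idea on plain lists of token ids. It is an illustration rather than trl's implementation, and it assumes the collator keys on the last occurrence of the response template, which is how trl 0.8.x appears to behave when no instruction template is supplied. The token ids are made up.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_last(haystack, needle):
    """Index of the last occurrence of the sub-list needle in haystack, or -1."""
    for start in range(len(haystack) - len(needle), -1, -1):
        if haystack[start:start + len(needle)] == needle:
            return start
    return -1


def mask_prompt(input_ids, response_template_ids):
    """Copy input_ids to labels, ignoring everything up to and including the template."""
    start = find_last(input_ids, response_template_ids)
    if start < 0:
        # Template missing: the example contributes nothing to the loss.
        return [IGNORE_INDEX] * len(input_ids)
    cut = start + len(response_template_ids)
    return [IGNORE_INDEX] * cut + input_ids[cut:]


# Hypothetical ids: 1-3 = system/user tokens, [90, 91] = response template,
# 7-9 = the assistant's runbook tokens.
ids = [1, 2, 3, 90, 91, 7, 8, 9]
labels = mask_prompt(ids, [90, 91])
print(labels)

# If the text ended with a second, empty assistant header, nothing would be left to learn:
labels_trailing = mask_prompt(ids + [90, 91], [90, 91])
print(labels_trailing)
```

Only the runbook tokens keep their labels, so gradient updates come exclusively from the assistant turn. The second call shows why the formatted text must end with the assistant's content rather than with a bare header.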
30.3.3 Monitoring Training
root@ai-inference-01:~# python3 /opt/training/train_sovereign_lora.py 2>&1 | tee /tmp/train.log
[train] Loading base model in 4-bit NF4...
trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695
{'loss': 1.9821, 'learning_rate': 0.000193, 'epoch': 1.90, 'step': 10}
{'loss': 1.3047, 'learning_rate': 0.000178, 'epoch': 3.81, 'step': 20}
{'loss': 0.8914, 'learning_rate': 0.000152, 'epoch': 5.71, 'step': 30}
{'loss': 0.5523, 'learning_rate': 0.000117, 'epoch': 7.62, 'step': 40}
{'loss': 0.3841, 'learning_rate': 0.000078, 'epoch': 9.52, 'step': 50}
{'loss': 0.2934, 'learning_rate': 0.000043, 'epoch': 11.43, 'step': 60}
{'loss': 0.2401, 'learning_rate': 0.000016, 'epoch': 13.33, 'step': 70}
{'loss': 0.2178, 'learning_rate': 0.000004, 'epoch': 15.24, 'step': 80}
{'loss': 0.2103, 'learning_rate': 0.000001, 'epoch': 17.14, 'step': 90}
{'loss': 0.2089, 'learning_rate': 0.000000, 'epoch': 19.05, 'step': 100}
[train] Adapter saved to /opt/training/sovereign-lora-adapter
Loss convergence from 1.98 to 0.21 over 100 steps indicates the adapter has learned the domain vocabulary. Loss below 0.3 on a 21-example dataset typically indicates memorisation of the specific training examples — acceptable here because the training examples are the ground truth of the system, not a sample from a larger distribution.
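The log can be checked against the training configuration with a small script. This sketch assumes the log lines are Python dict literals as printed above and that `max_steps` counts optimizer steps; the helper names are illustrative.

```python
import ast


def parse_loss_log(lines):
    """Extract (step, loss, epoch) tuples from dict-style Trainer log lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue  # skip banner lines such as "[train] ..."
        try:
            entry = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue
        if isinstance(entry, dict) and "loss" in entry:
            rows.append((entry.get("step"), entry["loss"], entry.get("epoch")))
    return rows


def epochs_completed(optimizer_steps, micro_batch, grad_accum, n_examples):
    """Each optimizer step consumes micro_batch * grad_accum examples."""
    return optimizer_steps * micro_batch * grad_accum / n_examples


sample = [
    "[train] Loading base model in 4-bit NF4...",
    "{'loss': 1.9821, 'learning_rate': 0.000193, 'epoch': 1.90, 'step': 10}",
    "{'loss': 0.2089, 'learning_rate': 0.000000, 'epoch': 19.05, 'step': 100}",
]
rows = parse_loss_log(sample)
for step, loss, epoch in rows:
    expected = epochs_completed(step, micro_batch=1, grad_accum=4, n_examples=21)
    print(f"step {step:3d}  loss {loss:.4f}  logged epoch {epoch}  expected {expected:.2f}")
```

With a micro-batch of 1 and 4 accumulation steps, 100 optimizer steps over 21 examples comes to about 19 epochs, which agrees with the `epoch` column in the log and is consistent with the memorisation noted above.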
30.4 Merging and Serving the Adapter
30.4.1 Merge: LoRA Adapter into BF16 Safetensors
The fine-tuning produced a LoRA adapter stored as PyTorch safetensors in /opt/training/sovereign-lora-adapter/. This directory contains only the adapter matrices (adapter_model.safetensors) and config files — not the full model. Before conversion to GGUF, the adapter must be mathematically merged back into a full-precision copy of the base model: W_merged = W₀ + α/r × B × A.
# Step 1: Merge adapter into full BF16 HuggingFace model.
# This produces the standard HuggingFace model directory layout
# that convert_hf_to_gguf.py expects.
root@ai-inference-01:~# python3 - << 'PYEOF'
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch
BASE = "/opt/training/models/Meta-Llama-3-8B-Instruct"
ADAPTER = "/opt/training/sovereign-lora-adapter"
MERGED = "/opt/training/merged/sovereign-analyst-v2-hf"
print("Loading PEFT model...")
model = AutoPeftModelForCausalLM.from_pretrained(
ADAPTER,
torch_dtype=torch.bfloat16,
device_map="cpu", # merge on CPU — avoids VRAM limit during merge
)
print("Merging adapter into base weights...")
model = model.merge_and_unload() # W_merged = W0 + (alpha/r) * B @ A
model.save_pretrained(MERGED, safe_serialization=True)
AutoTokenizer.from_pretrained(BASE).save_pretrained(MERGED)
print(f"Merged model saved to {MERGED}")
PYEOF
Loading PEFT model...
Merging adapter into base weights...
Merged model saved to /opt/training/merged/sovereign-analyst-v2-hf
30.4.2 Quantise to Q4_K_M GGUF
The merged BF16 HuggingFace model is now converted to GGUF and quantised to Q4_K_M using llama.cpp's conversion utilities. The llama.cpp installation on ai-inference-01 is the same binary used by the Chapter 28 llm-inference.service.
# Step 2: Convert HuggingFace model to unquantised GGUF (F16 intermediate).
root@ai-inference-01:~# python3 /opt/llm-inference/src/llama.cpp/convert_hf_to_gguf.py \
/opt/training/merged/sovereign-analyst-v2-hf \
--outtype f16 \
--outfile /opt/training/merged/sovereign-analyst-v2-f16.gguf
INFO:hf-to-gguf:Model successfully exported to /opt/training/merged/sovereign-analyst-v2-f16.gguf
# Step 3: Quantise F16 GGUF to Q4_K_M (same quantisation as the base model).
# llama-quantize is the llama.cpp quantisation binary.
root@ai-inference-01:~# /opt/llm-inference/bin/llama-quantize \
/opt/training/merged/sovereign-analyst-v2-f16.gguf \
/opt/ollama/models/blobs/sovereign-analyst-v2-Q4_K_M.gguf \
Q4_K_M
[ 1/ 291] blk.0.attn_norm.weight - [ 4096], type = f32, size = 0.016 MB
[ 2/ 291] blk.0.ffn_down.weight - [ 4096, 14336], type = f16, converting to q4_K ...
...
[291/ 291] output_norm.weight - [ 4096], type = f32, size = 0.016 MB
llama_model_quantize_internal: model size = 14985.72 MB
llama_model_quantize_internal: quant size = 4686.73 MB
# Step 4: Verify the output file:
root@ai-inference-01:~# ls -lh /opt/ollama/models/blobs/sovereign-analyst-v2-Q4_K_M.gguf
-rw-r--r-- 1 root root 4.6G sovereign-analyst-v2-Q4_K_M.gguf
# Compute SHA-256 and make the hash file immutable (§28.5.1 pattern):
root@ai-inference-01:~# sha256sum \
/opt/ollama/models/blobs/sovereign-analyst-v2-Q4_K_M.gguf \
> /opt/ollama/models/blobs/sovereign-analyst-v2-Q4_K_M.gguf.sha256
root@ai-inference-01:~# chattr +i \
/opt/ollama/models/blobs/sovereign-analyst-v2-Q4_K_M.gguf.sha256
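The immutable .sha256 file is only useful if something re-checks it before the model is loaded. The following is a minimal sketch of that check in Python, equivalent in spirit to sha256sum -c. The function names are hypothetical, and the sidecar is assumed to be in the standard sha256sum output format (hex digest, whitespace, file name).

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so a multi-GB GGUF never sits in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_against_sidecar(model_path, sidecar_path=None):
    """Return True if the model's digest matches the first field of the sidecar."""
    sidecar = Path(sidecar_path or f"{model_path}.sha256")
    expected = sidecar.read_text().split()[0].lower()
    return sha256_of(model_path) == expected
```

Calling verify_against_sidecar("/opt/ollama/models/blobs/sovereign-analyst-v2-Q4_K_M.gguf") before ollama create would catch a GGUF that was altered after the hash was recorded.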
30.4.3 Create the Ollama Model Alias and Verify
# Step 5: Create a new Modelfile pointing at the fine-tuned GGUF.
# The system prompt is identical — the adapter has been trained to comply with it.
root@ai-inference-01:~# cat > /opt/ollama/Modelfile-v2 << 'EOF'
FROM /opt/ollama/models/blobs/sovereign-analyst-v2-Q4_K_M.gguf
SYSTEM """You are a Proxmox infrastructure incident analyst operating in a \
sovereign, air-gapped environment. You will be given structured alerts from \
an automated monitoring system. Produce a concise operational runbook for \
the on-call engineer.
Format your response as:
## Probable Cause
One to three sentences describing the most likely root cause.
## Immediate Actions
Numbered list of specific Proxmox diagnostic commands to run first.
## Escalation Criteria
Conditions under which this incident requires escalation beyond the runbook.
Use exact Proxmox CLI commands (pvesh, qm, pct, zpool) where applicable. \
Do not fabricate metric values. Do not suggest actions that modify cluster \
state autonomously."""
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
PARAMETER num_predict 600
EOF
root@ai-inference-01:~# ollama create sovereign-analyst-v2 -f /opt/ollama/Modelfile-v2
transferring model data ✓
creating model layer ✓
writing manifest ✓
success
# Smoke test: query the fine-tuned model on a known alert condition:
root@ai-inference-01:~# ollama run sovereign-analyst-v2 \
"ALERT
Node: pve7
Condition: zfs_resilvering_under_cpu_pressure
Severity: critical
Detail: ZFS resilvering (arc_miss >= 20%) concurrent with CPU steal >= 10%
Timestamp: 2026-03-10T03:12:44Z"
## Probable Cause
Node pve7 is undergoing ZFS resilver while simultaneously experiencing CPU steal
above 10%. The resilver I/O generates completion interrupts that cannot be
processed because vCPUs are stolen by the hypervisor scheduler...
## Immediate Actions
1. Confirm resilver active: `zpool status -v | grep -A5 resilver`
2. Check resilver progress: `zpool status | grep scan`
3. Throttle resilver: `echo 10 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms`
4. Evacuate lowest-priority VMs off pve7 via Chapter 27 scheduler...
The fine-tuned model correctly identifies pve7 as a Rack C node and reproduces the diagnostic sequence from the training data, including the zfs_resilver_min_time_ms ZFS module parameter, which the base model would not have known to suggest.
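A smoke test read by eye does not scale past a handful of alert conditions. As a sketch, the three section headings required by the system prompt can be checked mechanically. The helper names below are hypothetical. The request uses Ollama's /api/generate endpoint on its default port 11434, which is assumed to be how the service is exposed on ai-inference-01.

```python
import json
import urllib.request

# Headings taken from the SYSTEM prompt in the Modelfile, in required order.
REQUIRED_HEADINGS = ["## Probable Cause", "## Immediate Actions", "## Escalation Criteria"]

def missing_headings(runbook):
    """Return the required headings that are absent or appear out of order."""
    missing, cursor = [], 0
    for heading in REQUIRED_HEADINGS:
        pos = runbook.find(heading, cursor)
        if pos == -1:
            missing.append(heading)
        else:
            cursor = pos + len(heading)
    return missing

def query_model(alert, model="sovereign-analyst-v2", host="http://127.0.0.1:11434"):
    """Send one alert to the local Ollama server and return the generated text."""
    body = json.dumps({"model": model, "prompt": alert, "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

A regression loop would call missing_headings(query_model(alert_text)) for each alert condition and treat any non-empty result as a failure.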
30.5 Sovereign Security: Data Localisation and IP Leakage
30.5.1 What Cloud Fine-Tuning Transfers
To fine-tune a model using a cloud provider's API — OpenAI fine-tuning, Anthropic's model customisation, or any SaaS training platform — the training dataset must be uploaded to the provider's infrastructure. For the sovereign cluster, the sovereign_training.jsonl file contains the following categories of corporate intellectual property:
The complete alert_condition/4 registry encodes the company's operational thresholds — the exact percentages at which cpu_steal, arc_miss_rate, and disk_latency trigger alerts. These thresholds represent accumulated operational knowledge: the company has learned, through incidents, that its workloads fail at these specific values. An attacker who knows these thresholds knows exactly how hard to push the cluster before an alert fires.
The proxmox_topology facts encode the physical layout of the cluster: which nodes exist, which racks they occupy, and how they are networked. This is a partial network map. Combined with the IP schema from VLAN configurations (192.168.100.0/24 management, 10.40.0.0/24 metrics), it constitutes a target list for a network adversary.
The runbook responses encode the exact CLI commands and kernel parameters the company's operators use to respond to incidents, including the sequence in which they run them and the thresholds they consider escalation-worthy. In an adversary's hands this is a response playbook: it reveals exactly what the company's operators and automated defences will do, and in what order, when under attack.
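These three categories can be made concrete before deciding where training happens. The following sketch scans each JSONL line as raw text and counts references to the subnets named above, pveN node names, and percentage thresholds. The regular expressions and the function name are illustrative assumptions. They are not part of the WAM, and they will miss anything phrased differently.

```python
import ipaddress
import re
from collections import Counter

# Subnets quoted in the text; extend to match the real VLAN schema.
SENSITIVE_NETS = [ipaddress.ip_network("192.168.100.0/24"),
                  ipaddress.ip_network("10.40.0.0/24")]
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
NODE_RE = re.compile(r"\bpve\d+\b")
THRESHOLD_RE = re.compile(r"(?:>=|<=|>|<)\s*\d+(?:\.\d+)?\s*%")

def audit_lines(lines):
    """Count addressing, topology, and threshold disclosures in JSONL lines."""
    counts = Counter()
    for line in lines:
        for match in IPV4_RE.findall(line):
            try:
                addr = ipaddress.ip_address(match)
            except ValueError:
                continue  # dotted string that is not a valid IPv4 address
            if any(addr in net for net in SENSITIVE_NETS):
                counts["internal_ip"] += 1
        counts["node_name"] += len(NODE_RE.findall(line))
        counts["threshold"] += len(THRESHOLD_RE.findall(line))
    return counts
```

Run over the lines of sovereign_training.jsonl, the counts give a rough measure of how much topology and threshold detail a single upload would disclose.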
30.5.2 The IP Forfeiture Mechanism
Uploading this dataset to a cloud provider does not merely expose the data to the provider's staff; it places the company's operational intelligence inside the provider's storage and training infrastructure. Depending on the agreement, a provider may retain uploaded data for abuse detection and safety evaluation, and some agreements also permit its use for model improvement. The terms vary, but the fundamental mechanism is the same: the moment the JSONL leaves the cluster's hardware, the company loses sole custody of the information it contains.
The intelligence extracted from the company's incidents — the correlation between arc_miss_plus_io_saturated and ZFS ARC working-set eviction, the specific zfs_resilver_min_time_ms parameter that throttles resilver I/O pressure — is now resident in a third-party system. The company does not know who has read it, which systems have processed it, or whether it will appear in a future model's training corpus.
30.5.3 Local QLoRA as a Physical Custody Guarantee
The QLoRA pipeline described here never transmits a byte of training data outside the cluster's hardware boundary. The base model weights are downloaded once via a trusted channel and stored on the air-gapped AI VM. The JSONL is generated by the WAM and transferred to the AI VM via the internal VLAN 40 network — no internet transit. The fine-tuning runs on the RTX 4080 Super inside the VM. The adapter weights and the merged GGUF are stored on local ZFS storage.
The resulting sovereign-analyst-v2-Q4_K_M.gguf is a model that contains the company's operational intelligence baked into its weights, but those weights are physically located on the company's hardware. The adapter is trained to generalise the schema rather than reproduce training data verbatim, yet a fine-tuned model can still surface what it has learned when queried (the smoke test in §30.4.3 returns the trained diagnostic sequence), so the protection rests on who can reach the model, not on the weights being opaque. The weights cannot be accessed by a cloud provider (there is no cloud provider). The training data cannot be subpoenaed from a third party (no third party has ever seen it).
This is the physical definition of data sovereignty applied to machine learning: the model's knowledge of the infrastructure is derived entirely from the infrastructure's own knowledge base, trained on hardware the company owns, and served from hardware the company controls. The Prolog WAM that generated the training data and the LLM that serves the inference requests both run on the same air-gapped subnet, so the intelligence loop is closed within the physical perimeter.