
Chapter 29: Proxmox PCIe Passthrough and AI Inference

Running the LLM inference engine directly on the Proxmox host OS — as Chapter 27's llm-inference.service did — conflates the hypervisor's privilege domain with the GPU workload's privilege domain: a CUDA driver panic or an out-of-memory kernel event in the GPU process stack can destabilise the host kernel that manages all fourteen hypervisor VMs. This chapter moves the inference workload into a dedicated Debian VM with raw, exclusive access to the RTX 4080 Super via IOMMU-enforced PCIe passthrough, deploys Ollama as the API-first inference runtime inside the VM, and builds a production-grade Go streaming client on logic-node-01 that replaces the Chapter 27 LLMClient with one that speaks Ollama's native protocol and handles VM reboots without blocking the orchestrator's alert pipeline.


29.1 IOMMU and VFIO Isolation Mechanics

29.1.1 IOMMU Groups and the Contamination Problem

The IOMMU (Input-Output Memory Management Unit) is the hardware mechanism that translates DMA addresses from PCIe devices into physical memory addresses. Without an IOMMU, a PCIe device can be programmed to read or write any physical memory address on the host — there is no hardware boundary between device address space and host RAM. With an IOMMU, each device's DMA transactions are filtered through a remapping table that constrains them to a specific set of physical pages. A device that attempts to access a physical address outside its permitted range generates an IOMMU fault and the transaction is aborted.

Proxmox PCIe passthrough uses this mechanism to hand a device — in this case, the RTX 4080 Super — to a specific VM. The IOMMU maps the VM's guest-physical address space to the host physical pages allocated for that VM, and the GPU's DMA transactions are constrained to those pages. The hypervisor's memory, other VMs' memory, and the host OS's kernel memory are all outside the GPU's permitted DMA range.

The complication is IOMMU groups. The kernel cannot always isolate devices individually: it isolates IOMMU groups, the smallest sets of devices whose DMA the platform can keep apart, which is determined by PCIe ACS (Access Control Services) support along the path to each device. If the RTX 4080 Super and its associated HDMI audio controller (10de:0bfa) are in the same IOMMU group as a USB controller that the Proxmox host is using, the GPU cannot be passed through on its own, because devices within one group can reach each other's DMA traffic without IOMMU mediation. The entire IOMMU group must be assigned to the VM, or no device in the group can be passed through.

%%{init: {"themeVariables": {"fontSize": "14px"}}}%%
flowchart LR
    GPU["RTX 4080 Super
10de:2704
PCIe Bus Master
DMA address issuer"]
    AUD["NVIDIA Audio
10de:0bfa
Same IOMMU Group 15
Co-assigned to VM"]
    IOMMU["IOMMU Hardware
Second-level page table
VM guest-phys → host-phys
Managed by Proxmox KVM only"]
    VMRAM["AI VM — VMID 200
16 GB allocated pages
GPU DMA: permitted
Kernel pages: permitted"]
    HOSTRAM["Host RAM
Proxmox kernel
Other VM pages
GPU DMA: FAULT → abort
Kernel pages: FAULT → abort"]
    FAULT["IOMMU Fault
DMAR interrupt raised
Transaction aborted
Host kernel log: DMAR fault"]

    GPU -->|"DMA request
(guest-physical addr)"| IOMMU
    AUD -->|"DMA request"| IOMMU
    IOMMU -->|"addr in VM's
remapping table?"| VMRAM
    IOMMU -->|"addr outside
VM's table"| FAULT
    FAULT -.->|"never reaches"| HOSTRAM

    style GPU fill:#1A2B4A,color:#FFFFFF
    style AUD fill:#1A2B4A,color:#FFFFFF
    style IOMMU fill:#4A3A1A,color:#FFFFFF
    style VMRAM fill:#1A6B3A,color:#FFFFFF
    style HOSTRAM fill:#6B1A1A,color:#FFFFFF
    style FAULT fill:#6B1A1A,color:#FFFFFF

On modern consumer motherboards with PCIe 4.0 or 5.0, discrete GPU slots typically land in their own IOMMU group. Verify this before configuring passthrough:

# On the Proxmox host (pve1, which carries the RTX 4080 Super):
root@pve1:~# for iommu_group in /sys/kernel/iommu_groups/*/devices/*; do
    echo "Group $(basename $(dirname $(dirname $iommu_group))): $(lspci -nns ${iommu_group##*/})"
done | grep -E "NVIDIA|VGA|Audio|10de"
Group 15: 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD103 [GeForce RTX 4080 SUPER] [10de:2704] (rev a1)
Group 15: 01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0bfa] (rev a1)

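Both the GPU (10de:2704) and its audio function (10de:0bfa) are in Group 15. Because the grep filter above hides non-NVIDIA devices, confirm that nothing else shares the group by listing /sys/kernel/iommu_groups/15/devices/, which should contain only 0000:01:00.0 and 0000:01:00.1. This is the clean case: both functions are assigned to the AI VM together.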

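If the GPU shares a group with other devices the host needs, first try a different physical slot or look for an ACS setting in the firmware. The fallback is the PCIe ACS override, a kernel command-line parameter (pcie_acs_override=downstream,multifunction), available because the Proxmox kernel carries the out-of-tree override patch, that makes the kernel treat each PCIe function as its own IOMMU group. It changes only the grouping the kernel reports, not the isolation the hardware actually provides, which is why it has the security implications detailed in §29.5. Use it only if the GPU is genuinely co-grouped with essential host devices.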

29.1.2 Enabling IOMMU on the Proxmox Host

# /etc/default/grub on pve1 (the host carrying the RTX 4080 Super)
# Edit GRUB_CMDLINE_LINUX_DEFAULT to add IOMMU parameters.
root@pve1:~# nano /etc/default/grub
# /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="Proxmox Virtual Environment"

# For Intel CPUs:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

# For AMD CPUs the kernel enables the IOMMU by default, and "on" is not a
# recognised value for amd_iommu, so only the passthrough flag is needed:
# GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"

# iommu=pt (passthrough mode): only devices assigned to VMs are remapped.
# Devices retained by the host bypass the IOMMU remapping table, reducing
# host memory access latency for non-passthrough devices. Without pt, every
# host DMA transaction incurs IOMMU translation overhead — measurable on
# NVMe storage and 10GbE NICs.
GRUB_CMDLINE_LINUX=""
GRUB_TERMINAL=console
root@pve1:~# update-grub
root@pve1:~# reboot

# After reboot — verify IOMMU is active:
root@pve1:~# dmesg | grep -E "IOMMU|iommu" | head -5
[    0.000000] DMAR: IOMMU enabled
[    0.000000] DMAR-IR: x2apic is not enabled by BIOS. Please enable x2apic in BIOS for better performance
[    1.234567] pci 0000:00:00.0: Adding to iommu group 0
[    1.234589] pci 0000:01:00.0: Adding to iommu group 15
[    1.234591] pci 0000:01:00.1: Adding to iommu group 15
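A provisioning script can confirm that the parameters actually reached the running kernel before it goes on to configure VFIO. The following is a minimal Go sketch under stated assumptions: iommuFlags is a hypothetical helper, the sample command line is invented, and on the host the input would be the contents of /proc/cmdline.

```go
package main

import (
	"fmt"
	"strings"
)

// iommuFlags reports whether a kernel command line carries intel_iommu=on
// and iommu=pt. It only inspects the string; it does not prove that the
// firmware has VT-d enabled.
func iommuFlags(cmdline string) (intelOn, passthrough bool) {
	for _, field := range strings.Fields(cmdline) {
		switch field {
		case "intel_iommu=on":
			intelOn = true
		case "iommu=pt":
			passthrough = true
		}
	}
	return intelOn, passthrough
}

func main() {
	// Invented sample; read /proc/cmdline on a real host instead.
	sample := "BOOT_IMAGE=/boot/vmlinuz root=/dev/mapper/pve-root ro quiet intel_iommu=on iommu=pt"
	on, pt := iommuFlags(sample)
	fmt.Printf("intel_iommu=on: %v, iommu=pt: %v\n", on, pt)
}
```

The dmesg output remains the authoritative check, since the flags can be present while the IOMMU is still disabled in firmware.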

29.1.3 VFIO Driver Stubbing

VFIO (Virtual Function I/O) is the kernel framework that exposes a PCIe device to a userspace process such as QEMU, with the IOMMU enforcing isolation. For PCIe passthrough to work, the GPU must be bound to the vfio-pci driver, not to the NVIDIA driver, before the Proxmox VM starts. If the NVIDIA driver claims the GPU first, vfio-pci cannot bind to it without an explicit unbind/rebind cycle.

The correct approach is to pre-bind the GPU to vfio-pci at boot via the kernel module's ids parameter, before nvidia loads:

# Load the VFIO modules on boot:
root@pve1:~# nano /etc/modules-load.d/vfio.conf
# /etc/modules-load.d/vfio.conf
vfio
vfio_iommu_type1
vfio_pci
root@pve1:~# nano /etc/modprobe.d/vfio.conf
# /etc/modprobe.d/vfio.conf
# Bind the RTX 4080 Super GPU and its audio function to vfio-pci at boot.
# ids= takes a comma-separated list of vendor:device pairs.
# 10de:2704 = AD103 [GeForce RTX 4080 SUPER]
# 10de:0bfa = NVIDIA audio controller (HDMI audio, co-grouped)
options vfio-pci ids=10de:2704,10de:0bfa

# Prevent the NVIDIA driver from loading on this host.
# The GPU is exclusively owned by the AI VM — the host never needs it.
# blacklist nvidia disables the NVIDIA kernel module on pve1.
# This is intentional: pve1 is a hypervisor host, not a workstation.
blacklist nouveau
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
root@pve1:~# update-initramfs -u -k all
root@pve1:~# reboot

# After reboot — verify vfio-pci owns the GPU:
root@pve1:~# lspci -nnk -d 10de:2704
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD103 [GeForce RTX 4080 SUPER] [10de:2704] (rev a1)
        Subsystem: ASUSTeK Computer Inc. Device [1043:8898]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

Kernel driver in use: vfio-pci confirms the GPU is held by the VFIO driver and is not accessible to the host OS. The NVIDIA driver is listed under Kernel modules only — it is available but not loaded.


29.2 The Build: Provisioning the AI VM via CLI

29.2.1 VM Configuration

The AI VM requires three non-standard parameters beyond a typical Debian VM: the q35 machine type (which provides a PCIe root complex rather than the legacy PCI bus of the older i440fx machine type, required for proper PCIe semantics), OVMF UEFI firmware (required for GPU UEFI support, since the RTX 4080 Super has a UEFI-only vBIOS), and the hostpci0 and hostpci1 assignments for the GPU and audio functions respectively.

# On pve1. VMID 200 is reserved for the AI VM in this deployment.
# Adjust storage target (local-lvm) and VMID as needed for your cluster.
root@pve1:~# VMID=200

# ── Step 1: Create the base VM ──────────────────────────────────────────────
root@pve1:~# qm create ${VMID} \
    --name ai-inference-01 \
    --description "Debian AI VM — RTX 4080 Super passthrough, Ollama inference" \
    --machine q35 \
    --bios ovmf \
    --cpu host \
    --sockets 1 \
    --cores 8 \
    --memory 16384 \
    --balloon 0 \
    --numa 1 \
    --hugepages any

# --machine q35:   PCIe root complex; required for GPU passthrough
# --bios ovmf:     UEFI firmware; required for RTX UEFI vBIOS initialisation
# --cpu host:      expose host CPU features to the VM; needed for AVX2/AVX-512
#                  used by llama.cpp's tokenisation and sampling paths
# --balloon 0:     disable memory ballooning; the VM holds a fixed 16GB
#                  allocation. With a passed-through device, VFIO pins all
#                  guest memory for DMA, so it could not be reclaimed anyway.
# --numa 1:        give the guest a NUMA topology (Proxmox requires this for
#                  --hugepages). On its own it does not place the VM on a
#                  particular host node; placement comes from the CPU
#                  affinity set in Step 7. On a dual-socket host the goal is
#                  to keep vCPUs and guest RAM on the socket whose PCIe root
#                  complex serves the x16 slot holding the RTX 4080 Super.
#                  Otherwise GPU DMA to RAM on the other socket crosses the
#                  UPI/Infinity Fabric interconnect, adding roughly
#                  50–100ns per access and reducing host-to-GPU copy
#                  bandwidth, which shows up as slower loads of the 4.5GB
#                  model. On a single-socket host there is only one node
#                  and nothing to align.
# --hugepages any: back the VM's 16GB with explicit (hugetlbfs) huge pages,
#                  1GB where the host supports them and 2MB otherwise,
#                  instead of 4KB pages. These are not transparent huge
#                  pages: they are reserved when the VM starts and the host
#                  cannot split or swap them. Fewer, larger pages reduce TLB
#                  misses and shrink the IOMMU page tables that the GPU's
#                  DMA transfers walk. If the host cannot reserve enough
#                  huge pages, the VM fails to start.

# ── Step 2: Allocate UEFI and disk storage ──────────────────────────────────
root@pve1:~# qm set ${VMID} \
    --efidisk0 local-lvm:1,efitype=4m,pre-enrolled-keys=0

root@pve1:~# qm set ${VMID} \
    --scsi0 local-lvm:64,discard=on,iothread=1,ssd=1 \
    --scsihw virtio-scsi-single

# 64GB system disk — sufficient for Debian + CUDA toolkit + model files
# iothread=1: dedicated I/O thread for the disk controller; prevents
#             disk I/O from contending with inference CPU threads

# ── Step 3: Attach the GPU (both IOMMU group members) ───────────────────────
# hostpci0: GPU function. pcie=1 presents the device on a PCIe root port
#           instead of a legacy PCI slot (this needs the q35 machine type);
#           x-vga=1 marks this as the primary VGA adapter
#           so the VM's UEFI can initialise the GPU for early-boot display.
root@pve1:~# qm set ${VMID} \
    --hostpci0 0000:01:00.0,pcie=1,x-vga=1

# hostpci1: Audio function — same IOMMU group, must be passed through together.
# No x-vga=1 here; this is not a display adapter.
root@pve1:~# qm set ${VMID} \
    --hostpci1 0000:01:00.1,pcie=1

# ── Step 4: Network — dual interface ────────────────────────────────────────
# net0: Management interface (VLAN 10 / 192.168.100.0/24)
#       Used for SSH administration and for the Go orchestrator to reach
#       the Ollama API on 192.168.100.30:11434 (fallback path)
root@pve1:~# qm set ${VMID} \
    --net0 virtio,bridge=vmbr0,tag=10

# net1: Metrics/inference VLAN 40 (10.40.0.0/24)
#       Primary path for Ollama API calls from logic-node-01.
#       OLLAMA_HOST will bind to 10.40.0.50 on this interface.
root@pve1:~# qm set ${VMID} \
    --net1 virtio,bridge=vmbr0,tag=40

# ── Step 5: Boot media and options ──────────────────────────────────────────
root@pve1:~# qm set ${VMID} \
    --ide2 local:iso/debian-12.9.0-amd64-netinst.iso,media=cdrom \
    --boot order=ide2

# ── Step 6: Display and agent ───────────────────────────────────────────────
# Serial console for headless operation after initial install:
root@pve1:~# qm set ${VMID} \
    --serial0 socket \
    --vga serial0 \
    --agent enabled=1,fstrim_cloned_disks=1

# ── Step 7: NUMA-aligned CPU pinning ────────────────────────────────────────
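# Determine which physical cores belong to Socket 0 (the socket whose PCIe
# root complex controls PCIe slot 1, i.e., the x16 slot with the RTX 4080 Super).
# The GPU's NUMA node is reported in sysfs (-1 means the platform exposes none):
root@pve1:~# cat /sys/bus/pci/devices/0000:01:00.0/numa_node
# Request the columns explicitly; the default --extended layout puts NODE and
# SOCKET in columns 2 and 3 and right-aligns the CPU column:
root@pve1:~# lscpu --extended=CPU,NODE,SOCKET | head -20
# Output columns: CPU  NODE  SOCKET
# Identify which CPU IDs are on SOCKET 0. Example: CPUs 0-7 and 16-23 on Socket 0.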

# Pin all 8 vCPUs to Socket 0 physical cores (adjust range per lscpu output):
root@pve1:~# qm set ${VMID} --affinity "0-7"
# This ensures all VM vCPU threads are scheduled exclusively on Socket 0
# physical cores, co-located with the PCIe root complex serving the GPU.

root@pve1:~# qm config ${VMID}
affinity: 0-7
agent: enabled=1,fstrim_cloned_disks=1
balloon: 0
bios: ovmf
boot: order=ide2
cores: 8
cpu: host
efidisk0: local-lvm:vm-200-disk-0,efitype=4m,pre-enrolled-keys=0,size=4M
hostpci0: 0000:01:00.0,pcie=1,x-vga=1
hostpci1: 0000:01:00.1,pcie=1
hugepages: any
ide2: local:iso/debian-12.9.0-amd64-netinst.iso,media=cdrom
machine: pc-q35-9.0
memory: 16384
name: ai-inference-01
net0: virtio=AA:BB:CC:DD:EE:01,bridge=vmbr0,tag=10
net1: virtio=AA:BB:CC:DD:EE:02,bridge=vmbr0,tag=40
numa: 1
ostype: l26
scsi0: local-lvm:vm-200-disk-1,discard=on,iothread=1,size=64G,ssd=1
scsihw: virtio-scsi-single
serial0: socket
sockets: 1
vga: serial0

29.2.2 Post-Install Configuration

After Debian installation and first boot, install the NVIDIA proprietary driver and CUDA toolkit inside the VM:

# Inside the AI VM (ai-inference-01):
root@ai-inference-01:~# apt-get update && apt-get install -y \
    linux-headers-amd64 \
    software-properties-common \
    curl wget gpg

# Add the NVIDIA CUDA repository for Debian 12:
root@ai-inference-01:~# curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/3bf863cc.pub \
    | gpg --dearmor -o /usr/share/keyrings/nvidia-cuda.gpg

root@ai-inference-01:~# echo "deb [signed-by=/usr/share/keyrings/nvidia-cuda.gpg] \
    https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/ /" \
    > /etc/apt/sources.list.d/nvidia-cuda.list

root@ai-inference-01:~# apt-get update && apt-get install -y \
    cuda-drivers-555 \
    cuda-toolkit-12-5

root@ai-inference-01:~# reboot

# After reboot — verify GPU is visible inside the VM:
root@ai-inference-01:~# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02              Driver Version: 555.58.02    CUDA Version: 12.5       |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080 SUPER  Off |   00000000:01:00.0 Off |                  N/A |
| 30%   38C    P8              18W / 320W |       1MiB / 16376MiB  |      0%      Default |
+-----------------------------------------------------------------------------------------+

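The GPU is visible at 00000000:01:00.0 inside the VM. That address is assigned by QEMU's virtual q35 PCIe topology, in which hostpci0 typically sits behind the first virtual root port; it matches the host address here by coincidence, not because the guest inherits the host's bus numbering. The 16376MiB total is the card's 16 GB (16384 MiB) less a small reservation held by the GPU firmware and driver, and the 1MiB in use shows that nothing on the host is consuming VRAM.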


29.3 The Build: Ollama and the System Prompt

29.3.1 Ollama Installation

Ollama is an inference runtime that manages GGUF model files, handles CUDA initialisation, and exposes both its native HTTP API (/api/chat, used in §29.4) and an OpenAI-compatible endpoint. Unlike a llama.cpp binary launched once per request, Ollama keeps the model resident in VRAM between requests, eliminating the 10–30 second model-load latency on every inference call.

# Inside the AI VM:
root@ai-inference-01:~# curl -fsSL https://ollama.com/install.sh | sh

# The install script creates /usr/local/bin/ollama and a systemd service,
# which it normally enables and starts straight away.
# Stop it until the network binding is configured.
root@ai-inference-01:~# systemctl stop ollama 2>/dev/null; true

29.3.2 Ollama Systemd Drop-In

The default Ollama service binds to 127.0.0.1:11434. The Go orchestrator on logic-node-01 reaches the AI VM over VLAN 40. A systemd drop-in overrides the bind address without modifying the upstream service unit file, a standard pattern that survives the unit being rewritten when the Ollama install script is re-run for an upgrade:

root@ai-inference-01:~# mkdir -p /etc/systemd/system/ollama.service.d
root@ai-inference-01:~# nano /etc/systemd/system/ollama.service.d/network.conf
# /etc/systemd/system/ollama.service.d/network.conf
#
# Overrides the default Ollama service to:
#   1. Bind to the VLAN 40 interface (10.40.0.50) for orchestrator access
#   2. Store model files under a dedicated directory
#   3. Serve one request at a time so VRAM use stays predictable
#   4. Keep the model resident in VRAM and enable Flash Attention

[Service]
# OLLAMA_HOST: bind to the VLAN 40 IP. The Go orchestrator on logic-node-01
# reaches this at http://10.40.0.50:11434/api/chat.
Environment="OLLAMA_HOST=10.40.0.50:11434"

# OLLAMA_MODELS: model storage directory. The 64GB system disk provides
# ample space; place models on a dedicated path for clarity.
Environment="OLLAMA_MODELS=/opt/ollama/models"

# OLLAMA_NUM_PARALLEL: one inference request at a time.
# Prevents concurrent VRAM allocation from exceeding 16GB.
Environment="OLLAMA_NUM_PARALLEL=1"

# OLLAMA_KEEP_ALIVE: how long to keep the model loaded in VRAM after the
# last request. -1 means never unload — the model stays warm for the
# duration of the service's lifetime. This eliminates load latency for
# the next alert, which may arrive seconds after the first.
Environment="OLLAMA_KEEP_ALIVE=-1"

# OLLAMA_FLASH_ATTENTION: enable Flash Attention for faster inference.
# Reduces memory bandwidth consumption for long contexts.
Environment="OLLAMA_FLASH_ATTENTION=1"
root@ai-inference-01:~# mkdir -p /opt/ollama/models
root@ai-inference-01:~# chown ollama:ollama /opt/ollama/models

root@ai-inference-01:~# systemctl daemon-reload
root@ai-inference-01:~# systemctl enable --now ollama

root@ai-inference-01:~# systemctl status ollama
● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled)
    Drop-In: /etc/systemd/system/ollama.service.d
             └─network.conf
     Active: active (running) ...

29.3.3 Model Pull and VRAM Verification

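# The ollama CLI is itself an API client and defaults to 127.0.0.1:11434,
# so point it at the address the service now binds to:
root@ai-inference-01:~# export OLLAMA_HOST=10.40.0.50:11434

# Pull the same model family used in Chapter 27:
root@ai-inference-01:~# ollama pull llama3.1:8b-instruct-q4_K_M

# A pull only downloads the weights. Send an empty prompt to load them into VRAM:
root@ai-inference-01:~# ollama run llama3.1:8b-instruct-q4_K_M ""

# Verify VRAM allocation once the model is resident:
root@ai-inference-01:~# nvidia-smi --query-gpu=memory.used,memory.free \
    --format=csv,noheader,nounits
5961, 10415   # 5.8 GB used, consistent with the Chapter 27 §27.1.2 prediction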

29.3.4 Defining the Model with the System Prompt

Ollama's Modelfile mechanism bakes the system prompt into a named model, ensuring the advisory persona is enforced at the engine level rather than relying on every Go caller to supply it correctly:

root@ai-inference-01:~# cat > /opt/ollama/Modelfile << 'EOF'
FROM llama3.1:8b-instruct-q4_K_M

# System prompt baked into the model alias.
# Ollama applies it to every request that does not carry its own system
# message. It sets the default persona; it is not a security boundary
# against prompt injection in user content.
SYSTEM """You are a Proxmox infrastructure incident analyst operating in a
sovereign, air-gapped environment. You will be given structured alerts from
an automated monitoring system. Produce a concise operational runbook for
the on-call engineer.

Format your response as:
## Probable Cause
One to three sentences describing the most likely root cause.

## Immediate Actions
Numbered list of specific Proxmox diagnostic commands to run first.

## Escalation Criteria
Conditions under which this incident requires escalation beyond the runbook.

Use exact Proxmox CLI commands (pvesh, qm, pct, zpool) where applicable.
Do not fabricate metric values. Do not suggest actions that modify cluster
state autonomously."""

PARAMETER temperature 0.2
PARAMETER num_ctx 8192
PARAMETER num_predict 600
EOF

root@ai-inference-01:~# ollama create sovereign-analyst -f /opt/ollama/Modelfile
transferring model data ✓
creating model layer ✓
writing manifest ✓
success

root@ai-inference-01:~# ollama list
NAME                                    ID              SIZE    MODIFIED
sovereign-analyst:latest                a3c814ba3d5e    4.9 GB  5 seconds ago
llama3.1:8b-instruct-q4_K_M            5af5eddcf696    4.9 GB  3 minutes ago

29.4 The Build: Go Streaming API Client

29.4.1 Architecture Decision: Streaming vs Blocking

The Chapter 27 LLMClient.Complete() method used a blocking HTTP call with a 45-second timeout — it waited for the complete response body before returning the runbook text. Ollama's /api/chat endpoint returns a newline-delimited JSON stream where each line is a partial response chunk: the first token arrives within ~200ms of the request, and subsequent tokens arrive at the model's generation rate (~50 tokens/second). With 600 max tokens, the complete runbook takes approximately 12 seconds to generate.

Streaming has two advantages over blocking for this use case. First, the Go orchestrator can begin publishing tokens to the SSE stream immediately — the operator sees the runbook being written in real time, rather than waiting 12 seconds for a blank panel to fill. Second, a streaming read with a per-chunk timeout is more resilient to GPU hangs: a hang produces no chunks, the per-chunk deadline fires, and the client retries or fails fast rather than waiting the full 45 seconds.

29.4.2 ollama_client.go

// File: /opt/logic-node/go/orchestrator/ollama_client.go
package main

import (
	"bufio"
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net"
	"net/http"
	"strings"
	"time"
)

// OllamaClient sends streaming inference requests to the Ollama HTTP server
// running in the ai-inference-01 VM on VLAN 40 (10.40.0.50:11434).
// It replaces the Chapter 27 LLMClient and is not backward-compatible:
// it uses Ollama's /api/chat protocol, not the OpenAI /v1/chat/completions
// endpoint. The model name must match an ollama create alias (§29.3.4).
type OllamaClient struct {
	baseURL    string       // "http://10.40.0.50:11434"
	model      string       // "sovereign-analyst"
	httpClient *http.Client
}

func NewOllamaClient() *OllamaClient {
	return &OllamaClient{
		baseURL: "http://10.40.0.50:11434",
		model:   "sovereign-analyst",
		httpClient: &http.Client{
			// No global timeout: streaming responses are read token-by-token,
			// and a global timeout would fire after N seconds regardless of
			// whether tokens are actively arriving. readStream() is
			// responsible for bounding the wait between chunks.
			Timeout: 0,
			Transport: &http.Transport{
				// Connection pool settings for a single target host.
				MaxIdleConns:        4,
				MaxIdleConnsPerHost: 4,
				// Connect timeout: how long to wait for the TCP handshake.
				// If the AI VM is rebooting, this fires quickly and the retry
				// middleware (§29.4.3) triggers exponential backoff.
				DialContext: (&net.Dialer{Timeout: 5 * time.Second}).DialContext,
			},
		},
	}
}

// ── Ollama API types ──────────────────────────────────────────────────────────
// The /api/chat endpoint accepts a messages array (OpenAI-style) and returns
// newline-delimited JSON. Each line is an OllamaChatChunk.

type OllamaChatRequest struct {
	Model    string              `json:"model"`
	Messages []OllamaChatMessage `json:"messages"`
	Stream   bool                `json:"stream"`
	Options  OllamaOptions       `json:"options,omitempty"`
}

type OllamaChatMessage struct {
	Role    string `json:"role"`    // "user" | "assistant" | "system"
	Content string `json:"content"`
}

type OllamaOptions struct {
	Temperature float64 `json:"temperature,omitempty"`
	NumPredict  int     `json:"num_predict,omitempty"`
}

// OllamaChatChunk is one line of the streaming NDJSON response.
// When Done is true, this is the final chunk and Message.Content may be empty.
type OllamaChatChunk struct {
	Model     string            `json:"model"`
	CreatedAt string            `json:"created_at"`
	Message   OllamaChatMessage `json:"message"`
	Done      bool              `json:"done"`
	// Final chunk only:
	TotalDuration   int64 `json:"total_duration,omitempty"`
	PromptEvalCount int   `json:"prompt_eval_count,omitempty"`
	EvalCount       int   `json:"eval_count,omitempty"`
}

// ── Core streaming method ─────────────────────────────────────────────────────

// StreamRunbook sends a runbook generation request and calls tokenFn for each
// received token as it arrives from the model. The complete runbook text is
// also returned as a string for callers that need the full result.
//
// tokenFn is called synchronously in the read loop — it must not block.
// Use a buffered channel or a non-blocking send to forward tokens to the
// SSE broker without stalling the stream reader.
func (c *OllamaClient) StreamRunbook(
	ctx context.Context,
	req RunbookRequest,
	tokenFn func(token string),
) (string, error) {
	userContent := buildUserPrompt(req)

	ollamaReq := OllamaChatRequest{
		Model: c.model,
		Messages: []OllamaChatMessage{
			{Role: "user", Content: userContent},
			// Note: no system message here. The system prompt comes from the
			// sovereign-analyst Modelfile (§29.3.4), which Ollama applies when
			// the request carries no system message of its own. Sending one
			// here would take precedence over the Modelfile default rather
			// than add to it, so the persona is kept in one place.
		},
		Stream: true,
	}

	body, err := json.Marshal(ollamaReq)
	if err != nil {
		return "", fmt.Errorf("marshal: %w", err)
	}

	httpReq, err := http.NewRequestWithContext(ctx, "POST",
		c.baseURL+"/api/chat", bytes.NewReader(body))
	if err != nil {
		return "", fmt.Errorf("create request: %w", err)
	}
	httpReq.Header.Set("Content-Type", "application/json")

	resp, err := c.httpClient.Do(httpReq)
	if err != nil {
		return "", fmt.Errorf("connect to Ollama at %s: %w", c.baseURL, err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		bodyBytes, _ := io.ReadAll(io.LimitReader(resp.Body, 2048))
		return "", fmt.Errorf("Ollama HTTP %d: %s", resp.StatusCode, bodyBytes)
	}

	return c.readStream(resp.Body, tokenFn)
}

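// readStream reads the NDJSON stream from the Ollama response body.
// Each line is a JSON-encoded OllamaChatChunk. The method accumulates
// chunk content into a string builder and calls tokenFn for each non-empty
// content fragment.
//
// Per-chunk deadline: http.Response.Body does not expose the underlying
// net.Conn, so SetReadDeadline is not available here. Instead a watchdog
// timer closes the body if no chunk arrives within 10 seconds, which
// unblocks the pending Read, and the method returns a timeout error. The
// timer is re-armed after every chunk. This distinguishes a slow model
// (tokens arriving at 5/s during complex reasoning) from a stalled GPU
// (no tokens for 10+ seconds).
func (c *OllamaClient) readStream(body io.Reader, tokenFn func(string)) (string, error) {
	const chunkTimeout = 10 * time.Second

	var sb strings.Builder
	scanner := bufio.NewScanner(body)

	// Allow lines up to 1 MiB (the Scanner's default cap is 64 KiB):
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)

	// Stall watchdog. StreamRunbook passes resp.Body, which is an io.Closer;
	// for a plain io.Reader the watchdog can only record the stall.
	stalled := make(chan struct{}, 1)
	watchdog := time.AfterFunc(chunkTimeout, func() {
		select {
		case stalled <- struct{}{}:
		default:
		}
		if closer, ok := body.(io.Closer); ok {
			closer.Close()
		}
	})
	defer watchdog.Stop()

	for scanner.Scan() {
		watchdog.Reset(chunkTimeout) // a chunk arrived: re-arm the deadline

		line := scanner.Bytes()
		if len(bytes.TrimSpace(line)) == 0 {
			continue // Skip blank lines (keepalive)
		}

		var chunk OllamaChatChunk
		if err := json.Unmarshal(line, &chunk); err != nil {
			return sb.String(), fmt.Errorf("parse chunk: %w (raw: %s)", err, line)
		}

		if chunk.Message.Content != "" {
			sb.WriteString(chunk.Message.Content)
			if tokenFn != nil {
				tokenFn(chunk.Message.Content)
			}
		}

		if chunk.Done {
			log.Printf("[Ollama] generation complete: %d tokens in %.2fs",
				chunk.EvalCount,
				float64(chunk.TotalDuration)/1e9,
			)
			break
		}
	}

	// Report a watchdog-triggered close as a timeout rather than as the
	// generic read error the closed body would otherwise produce.
	select {
	case <-stalled:
		return sb.String(), fmt.Errorf("stream read: no chunk within %s", chunkTimeout)
	default:
	}

	if err := scanner.Err(); err != nil {
		return sb.String(), fmt.Errorf("stream read: %w", err)
	}

	return strings.TrimSpace(sb.String()), nil
}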

// buildUserPrompt assembles the user-turn content from a RunbookRequest.
// Identical in structure to Chapter 27's GenerateRunbook user content,
// but no system prompt injection is needed here — it lives in the Modelfile.
func buildUserPrompt(req RunbookRequest) string {
	return fmt.Sprintf(
		"Alert: %s on node %s (severity: %s)\n"+
			"Node health: %s\n"+
			"Active anomalies: %s\n"+
			"Current routing path to node: %s\n"+
			"Timestamp: %d",
		req.AlertEvent.Condition,
		req.AlertEvent.Node,
		req.AlertEvent.Severity,
		req.NodeHealth,
		strings.Join(req.Anomalies, ", "),
		strings.Join(req.RecentPath, " → "),
		req.AlertEvent.Timestamp,
	)
}

29.4.3 Retry Middleware

The AI VM is a virtual machine: it can be rebooted, taken down when the Chapter 24 Actuator relocates workloads (a VM holding a passed-through PCIe device cannot be live-migrated, so any move means a stop and restart), or temporarily unavailable while the NVIDIA driver reinitialises after a GPU reset event. The OllamaClient must handle these transient failures without blocking the RunbookHandler goroutine and without dropping the alert on the floor.

// File: /opt/logic-node/go/orchestrator/ollama_client.go (continued)

// RetryConfig controls the exponential backoff retry behaviour.
type RetryConfig struct {
	MaxAttempts int           // total attempts including the first
	InitialWait time.Duration // wait before second attempt
	MaxWait     time.Duration // ceiling on wait duration
	Factor      float64       // backoff multiplier (e.g., 2.0 for doubling)
}

var DefaultRetryConfig = RetryConfig{
	MaxAttempts: 4,
	InitialWait: 2 * time.Second,
	MaxWait:     20 * time.Second,
	Factor:      2.0,
}

// StreamWithRetry wraps StreamRunbook with exponential backoff retry.
// Transient errors (network refused, timeout) are retried. Semantic errors
// (HTTP 400 Bad Request from malformed model name) are not retried.
//
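// Before each backoff wait, tokenFn receives a notice of the form
// "[inference unavailable, retrying in 2s...]" so the SSE stream shows
// activity rather than a blank panel. If an attempt fails mid-stream, tokens
// already delivered are not retracted, and the next attempt streams the
// runbook again from the start.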
func (c *OllamaClient) StreamWithRetry(
	ctx context.Context,
	req RunbookRequest,
	tokenFn func(string),
	cfg RetryConfig,
) (string, error) {
	var lastErr error
	wait := cfg.InitialWait

	for attempt := 1; attempt <= cfg.MaxAttempts; attempt++ {
		if ctx.Err() != nil {
			return "", fmt.Errorf("context cancelled before attempt %d: %w",
				attempt, ctx.Err())
		}

		result, err := c.StreamRunbook(ctx, req, tokenFn)
		if err == nil {
			return result, nil
		}

		lastErr = err

		// Non-retriable: HTTP 4xx client errors indicate a request problem
		// that will not resolve on retry.
		if isNonRetriable(err) {
			return "", fmt.Errorf("non-retriable error on attempt %d: %w",
				attempt, err)
		}

		if attempt == cfg.MaxAttempts {
			break
		}

		log.Printf("[Ollama] attempt %d/%d failed: %v — retrying in %s",
			attempt, cfg.MaxAttempts, err, wait)

		if tokenFn != nil {
			tokenFn(fmt.Sprintf("\n[inference unavailable, retrying in %s...]\n", wait))
		}

		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-time.After(wait):
		}

		// Exponential backoff with ceiling:
		wait = time.Duration(float64(wait) * cfg.Factor)
		if wait > cfg.MaxWait {
			wait = cfg.MaxWait
		}
	}

	return "", fmt.Errorf("Ollama unavailable after %d attempts: %w",
		cfg.MaxAttempts, lastErr)
}

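// isNonRetriable returns true for errors that should not be retried.
// Network errors and timeouts are retriable; the client errors listed below
// (HTTP 400, 401, 404, 422) are not. The match relies on StreamRunbook
// embedding "HTTP <code>" in its error text.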
func isNonRetriable(err error) bool {
	errStr := err.Error()
	return strings.Contains(errStr, "HTTP 400") ||
		strings.Contains(errStr, "HTTP 401") ||
		strings.Contains(errStr, "HTTP 404") ||
		strings.Contains(errStr, "HTTP 422")
}
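Matching on error text works only as long as the error message format never changes. A minimal sketch of a typed alternative follows, assuming StreamRunbook could return a dedicated error value for non-2xx responses. HTTPStatusError and isNonRetriableTyped are hypothetical names that do not appear in the chapter's code, and treating 408 and 429 as retriable is an assumption about the desired policy.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// HTTPStatusError is a hypothetical error type (not part of the chapter's
// OllamaClient) that carries the status code of a non-2xx response.
type HTTPStatusError struct {
	StatusCode int
}

func (e *HTTPStatusError) Error() string {
	return fmt.Sprintf("HTTP %d %s", e.StatusCode, http.StatusText(e.StatusCode))
}

// isNonRetriableTyped reports whether err wraps a 4xx status other than
// 408 and 429, which usually describe a transient condition.
func isNonRetriableTyped(err error) bool {
	var statusErr *HTTPStatusError
	if !errors.As(err, &statusErr) {
		return false // network errors and timeouts stay retriable
	}
	code := statusErr.StatusCode
	if code == http.StatusRequestTimeout || code == http.StatusTooManyRequests {
		return false
	}
	return code >= 400 && code < 500
}

func main() {
	wrapped := fmt.Errorf("stream runbook: %w", &HTTPStatusError{StatusCode: 404})
	fmt.Println(isNonRetriableTyped(wrapped))                           // true
	fmt.Println(isNonRetriableTyped(&HTTPStatusError{StatusCode: 503})) // false
	fmt.Println(isNonRetriableTyped(errors.New("connection refused")))  // false
}
```

Because errors.As walks the wrap chain, the classification survives the fmt.Errorf("...: %w", err) wrapping that StreamWithRetry applies.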

29.4.4 Circuit Breaker: WAM-Coordinated Backpressure

When StreamWithRetry exhausts its MaxAttempts and returns an error, the failure is more than a lost runbook: it signals that the AI VM is offline, and every subsequent generateAndPublish call will burn through its full retry sequence (roughly 74 seconds each under DefaultRetryConfig, assuming a 15-second ceiling per attempt) before failing. In an active incident with multiple alerts firing, this produces a goroutine pile-up: each alert spawns a generateAndPublish goroutine that blocks on connection timeouts, the RunbookHandler's goroutine pool fills, and the SSE pipeline stalls.
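The worst-case duration of one retry sequence can be derived from the retry configuration rather than estimated by hand. The sketch below repeats the backoff arithmetic of StreamWithRetry in isolation. backoffSchedule and worstCase are hypothetical helpers, and the 15-second per-attempt ceiling is an assumption taken from the 600-token, 40 tok/s estimate used in this chapter.

```go
package main

import (
	"fmt"
	"time"
)

// RetryConfig mirrors the struct defined earlier in this section.
type RetryConfig struct {
	MaxAttempts int
	InitialWait time.Duration
	MaxWait     time.Duration
	Factor      float64
}

// backoffSchedule returns the waits between attempts: MaxAttempts-1 entries,
// each multiplied by Factor and capped at MaxWait.
func backoffSchedule(cfg RetryConfig) []time.Duration {
	var waits []time.Duration
	wait := cfg.InitialWait
	for attempt := 1; attempt < cfg.MaxAttempts; attempt++ {
		waits = append(waits, wait)
		wait = time.Duration(float64(wait) * cfg.Factor)
		if wait > cfg.MaxWait {
			wait = cfg.MaxWait
		}
	}
	return waits
}

// worstCase assumes every attempt runs for perAttempt before failing.
func worstCase(cfg RetryConfig, perAttempt time.Duration) time.Duration {
	total := time.Duration(cfg.MaxAttempts) * perAttempt
	for _, w := range backoffSchedule(cfg) {
		total += w
	}
	return total
}

func main() {
	cfg := RetryConfig{
		MaxAttempts: 4,
		InitialWait: 2 * time.Second,
		MaxWait:     20 * time.Second,
		Factor:      2.0,
	}
	fmt.Println(backoffSchedule(cfg))           // [2s 4s 8s]
	fmt.Println(worstCase(cfg, 15*time.Second)) // 1m14s
}
```

Four attempts of 15 seconds plus waits of 2, 4, and 8 seconds give 74 seconds, which is the time each blocked goroutine can hold its slot during an outage.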

The correct response is to assert the AI VM's status into the WAM immediately after retry exhaustion, so the Prolog alert_dispatcher can skip runbook requests during a known outage:

% File: /opt/logic-node/kb/live_state.pl (addition to module exports)
%
% llm_status(+Status)
%   Dynamic fact asserted by the Go RunbookHandler after OllamaClient retry
%   exhaustion. Status is one of: online | offline.
%   Defaults to online (no fact present = assume reachable — fail-open for
%   advisory text generation).
%   The dispatcher checks this before dispatching runbook requests:
%
%     should_generate_runbook(Node, Condition) :-
%         \+ llm_status(offline),
%         alert_condition(Node, Condition, _, _).

:- dynamic llm_status/1.

The Go side asserts the fact immediately after retry exhaustion and schedules a recovery probe:

// In RunbookHandler — called when StreamWithRetry returns a terminal error:
func (h *RunbookHandler) markLLMOffline(ctx context.Context) {
	goal := `live_state:retractall(llm_status(_)),
	          live_state:assertz(llm_status(offline))`
	if _, err := h.pool.Dispatch(WorkItem{Goal: goal}, 2*time.Second); err != nil {
		log.Printf("[RunbookHandler] failed to assert llm_status(offline): %v", err)
	}
	log.Printf("[RunbookHandler] AI VM offline — runbook generation suspended")

	// Recovery probe: retry the Ollama health endpoint every 30 seconds.
	// When it returns HTTP 200, assert llm_status(online) and resume.
	go h.pollUntilRecovered(ctx)
}

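func (h *RunbookHandler) pollUntilRecovered(ctx context.Context) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Bind the probe to ctx so shutdown aborts an in-flight request.
			req, err := http.NewRequestWithContext(ctx, http.MethodGet,
				h.ollama.baseURL+"/api/tags", nil)
			if err != nil {
				log.Printf("[RunbookHandler] cannot build health probe: %v", err)
				return
			}
			resp, err := h.ollama.httpClient.Do(req)
			if err != nil {
				continue // still unreachable; probe again on the next tick
			}
			resp.Body.Close()
			if resp.StatusCode != http.StatusOK {
				continue
			}
			goal := `live_state:retractall(llm_status(_)),
			          live_state:assertz(llm_status(online))`
			if _, err := h.pool.Dispatch(WorkItem{Goal: goal}, 2*time.Second); err != nil {
				// Keep polling rather than report a recovery that the WAM
				// never recorded; llm_status(offline) is still asserted.
				log.Printf("[RunbookHandler] failed to assert llm_status(online): %v", err)
				continue
			}
			log.Printf("[RunbookHandler] AI VM recovered: runbook generation resumed")
			return
		}
	}
}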

The WAM's alert_dispatcher gates runbook dispatch on \+ llm_status(offline). Alerts continue to fire and the Chapter 24 Actuator continues to perform physical remediations — the LLM being offline does not degrade deterministic cluster management, only the advisory text overlay. When the AI VM recovers, pollUntilRecovered asserts llm_status(online) and the alert_dispatcher resumes runbook generation on the next firing alert without any operator intervention.
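During an incident, several generateAndPublish goroutines may exhaust their retries at nearly the same moment, and each would start its own recovery probe. A minimal sketch of a guard that lets only the first caller start the probe follows. offlineGate is a hypothetical helper that is not part of the chapter's RunbookHandler, and atomic.Bool requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// offlineGate is a hypothetical in-process mirror of llm_status/1. Its zero
// value means "online".
type offlineGate struct {
	offline atomic.Bool
}

// tripOffline returns true only for the caller that flips the state from
// online to offline; that caller is the one that starts the recovery probe.
func (g *offlineGate) tripOffline() bool {
	return g.offline.CompareAndSwap(false, true)
}

// markOnline is called by the recovery probe once the health check passes.
func (g *offlineGate) markOnline() { g.offline.Store(false) }

// isOffline lets callers skip inference without a round trip to the WAM.
func (g *offlineGate) isOffline() bool { return g.offline.Load() }

func main() {
	var gate offlineGate
	fmt.Println(gate.tripOffline()) // true: first failure starts the probe
	fmt.Println(gate.tripOffline()) // false: a probe is already running
	gate.markOnline()
	fmt.Println(gate.tripOffline()) // true: a new outage starts a new probe
}
```

In this arrangement markLLMOffline would return early when tripOffline reports false, so the WAM fact stays the authority for the Prolog dispatcher while the Go side avoids duplicate pollers.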

29.4.5 Integration with the SSE Pipeline

The RunbookHandler from Chapter 27 §27.4.2 is updated to use StreamWithRetry and to publish partial tokens to the SSE stream as they arrive:

// Updated generateAndPublish in RunbookHandler:
func (h *RunbookHandler) generateAndPublish(ctx context.Context, event AlertEvent) {
	anomalies, _ := h.fetchAnomalies(ctx, event.Node)
	path, _      := h.fetchLivePath(ctx, event.Node)

	req := RunbookRequest{
		AlertEvent: event,
		NodeHealth: event.Severity,
		Anomalies:  anomalies,
		RecentPath: path,
	}

	inferCtx, cancel := context.WithTimeout(ctx, 90*time.Second)
	defer cancel()
	// 90 seconds for streaming with retries:
	// - First attempt: up to 15s for 600 tokens at 40 tok/s
	// - Retry 1: 2s backoff + up to 15s
	// - Retry 2: 4s backoff + up to 15s
	// - Retry 3: 8s backoff + up to 15s
	// Total worst case: 15 + 17 + 19 + 23 = ~74s, which fits within the 90s ceiling

	// tokenFn publishes each token fragment to the SSE stream immediately.
	// The operator's browser receives and appends each fragment to the
	// runbook panel without waiting for the complete response.
	tokenFn := func(token string) {
		payload, _ := json.Marshal(map[string]string{
			"node":      event.Node,
			"condition": event.Condition,
			"fragment":  token,
		})
		h.broker.Publish(fmt.Sprintf(
			"event: runbook_fragment\ndata: %s\n\n", payload,
		))
	}

	runbook, err := h.ollama.StreamWithRetry(inferCtx, req, tokenFn, DefaultRetryConfig)
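	if err != nil {
		log.Printf("[RunbookHandler] runbook failed for %s/%s: %v",
			event.Node, event.Condition, err)
		// Trip the §29.4.4 circuit breaker only when the failure points to an
		// outage: request errors (4xx) and caller cancellation do not.
		// ctx must outlive this call, because the recovery probe inherits it.
		if !isNonRetriable(err) && ctx.Err() == nil {
			h.markLLMOffline(ctx)
		}
		payload, _ := json.Marshal(map[string]interface{}{
			"node": event.Node, "condition": event.Condition,
			"error": err.Error(),
		})
		h.broker.Publish(fmt.Sprintf(
			"event: runbook_error\ndata: %s\n\n", payload,
		))
		return
	}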

	// Publish the complete runbook as a final event alongside the fragments.
	// The dashboard uses runbook_complete to replace the fragment accumulator
	// with the final, trimmed text and mark the runbook panel as stable.
	payload, _ := json.Marshal(map[string]interface{}{
		"node":      event.Node,
		"condition": event.Condition,
		"runbook":   runbook,
	})
	h.broker.Publish(fmt.Sprintf(
		"event: runbook_complete\ndata: %s\n\n", payload,
	))
}
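The consumer side of this protocol is small: append every runbook_fragment, then let runbook_complete overwrite the accumulated text. The sketch below models that behaviour in Go for illustration only. The real dashboard runs in the browser, and runbookPanel and apply are hypothetical names; only the event names and JSON keys come from generateAndPublish above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// runbookPanel is a hypothetical stand-in for the dashboard's runbook panel.
type runbookPanel struct {
	text   strings.Builder
	stable bool
}

// apply consumes one SSE frame of the form "event: <name>\ndata: <json>\n\n".
func (p *runbookPanel) apply(frame string) error {
	var event, data string
	for _, line := range strings.Split(strings.TrimSpace(frame), "\n") {
		switch {
		case strings.HasPrefix(line, "event: "):
			event = strings.TrimPrefix(line, "event: ")
		case strings.HasPrefix(line, "data: "):
			data = strings.TrimPrefix(line, "data: ")
		}
	}
	var msg struct {
		Fragment string `json:"fragment"`
		Runbook  string `json:"runbook"`
	}
	if err := json.Unmarshal([]byte(data), &msg); err != nil {
		return fmt.Errorf("decode %q payload: %w", event, err)
	}
	switch event {
	case "runbook_fragment":
		p.text.WriteString(msg.Fragment)
	case "runbook_complete":
		// Replace the accumulator, discarding any retry notices it holds.
		p.text.Reset()
		p.text.WriteString(msg.Runbook)
		p.stable = true
	}
	return nil
}

func main() {
	p := &runbookPanel{}
	frames := []string{
		"event: runbook_fragment\ndata: {\"fragment\":\"Step 1\"}\n\n",
		"event: runbook_fragment\ndata: {\"fragment\":\": drain\"}\n\n",
		"event: runbook_complete\ndata: {\"runbook\":\"Step 1: drain the node\"}\n\n",
	}
	for _, f := range frames {
		if err := p.apply(f); err != nil {
			fmt.Println("frame rejected:", err)
		}
	}
	fmt.Printf("%q stable=%v\n", p.text.String(), p.stable)
}
```

Replacing rather than appending on runbook_complete matters because a retried attempt streams its fragments after the partial output and retry notice of the failed one.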

29.5 Sovereign Security: DMA Attack Mitigation

29.5.1 The DMA Attack Vector

PCIe passthrough hands a physical device to a VM with a degree of hardware intimacy that software isolation alone cannot provide. A PCIe device acting as a bus master can initiate DMA transactions to any physical address on the host memory bus without CPU involvement. In the absence of an IOMMU, a compromised VM that controls a passed-through device can instruct that device to read or write any location in the host's physical RAM: the hypervisor's kernel, another VM's memory, or the kernel page tables of the Proxmox management stack. This is not a theoretical attack. The "Thunderclap" research demonstrated practical DMA attacks from malicious peripherals against mainstream operating systems, including systems whose IOMMU protection was enabled but loosely configured, and "Rowhammer via GPU" work showed that a GPU can be used to corrupt memory it was never meant to control.

In the sovereign infrastructure context, the RTX 4080 Super in the AI VM presents a specific threat surface: the Ollama inference engine runs untrusted model weights. A sufficiently adversarial fine-tuned model could, in principle, attempt to influence the CUDA kernel to issue DMA transactions to host memory regions. Without IOMMU enforcement, the defences against model weight substitution (§27.5.1) would be insufficient protection: even a verified model running verified code could be exploited via CUDA kernel vulnerabilities to issue malicious DMA operations.

29.5.2 How IOMMU Enforcement Neutralises the Threat

When intel_iommu=on is active and the GPU is assigned to VMID 200, the IOMMU builds a second-level page table for the GPU's DMA address space. The Proxmox QEMU/KVM layer populates this table with mappings only for the physical pages allocated to VMID 200's 16 GB RAM. Every DMA transaction the GPU initiates is intercepted by the IOMMU hardware before it reaches the memory bus:

GPU DMA request → IOMMU hardware → address in VM's DMA remapping table?
                                           │
                     YES: remap to VM's physical page → DRAM access permitted
                      NO: IOMMU fault raised → DMA transaction aborted
                          kernel receives DMAR fault interrupt
                          Proxmox logs: DMAR[fault]: ...

The remapping table is managed exclusively by the Proxmox host kernel; the VM and the GPU have no mechanism to modify it. When the VM starts, QEMU registers the guest's memory with the host's VFIO driver, which asks the IOMMU driver to add mappings only for the VM's own pages. The GPU cannot request additional mappings. Because the IOMMU enforces this in hardware, nothing that the CUDA driver, the Ollama process, or the model weights instruct the GPU to do can direct its DMA outside the VM's allocated pages.

29.5.3 The ACS Override Risk

Section 29.1.1 mentioned pcie_acs_override=downstream,multifunction as a mechanism for splitting IOMMU groups where the GPU shares a group with host-essential devices. This parameter has a significant security implication: it bypasses the IOMMU's group isolation by forcing each PCIe function into its own group, regardless of whether the PCIe topology physically isolates them. If two devices share a PCIe switch that does not implement ACS, they can still issue peer-to-peer DMA to each other's address space — the IOMMU group boundary exists precisely to capture this physical coupling. Splitting the group with the override removes the software boundary while the physical coupling remains.
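Before considering the override at all, it is worth checking which devices actually share the GPU's group. On Linux, the kernel exposes group membership under /sys/kernel/iommu_groups/<group>/devices/. The sketch below reads that layout. groupMembers is a hypothetical helper, and the address 0000:01:00.0 assumes the GPU sits at 01:00.0 in PCI domain 0000.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// groupMembers returns the PCI addresses that share an IOMMU group with dev.
// root is normally "/sys/kernel/iommu_groups".
func groupMembers(root, dev string) ([]string, error) {
	matches, err := filepath.Glob(filepath.Join(root, "*", "devices", dev))
	if err != nil {
		return nil, err
	}
	if len(matches) == 0 {
		return nil, fmt.Errorf("device %s not found in any IOMMU group under %s", dev, root)
	}
	// The parent directory of the match lists every device in the group.
	entries, err := os.ReadDir(filepath.Dir(matches[0]))
	if err != nil {
		return nil, err
	}
	members := make([]string, 0, len(entries))
	for _, e := range entries {
		members = append(members, e.Name())
	}
	return members, nil
}

func main() {
	members, err := groupMembers("/sys/kernel/iommu_groups", "0000:01:00.0")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	// Anything beyond the GPU and its audio function shares the group.
	fmt.Println(members)
}
```

If the list contains only the GPU and its HDMI audio function, the group is already clean and the override has nothing to offer.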

In practice, pcie_acs_override is safe when the GPU is on a direct root port connection (which modern motherboards provide for x16 slots) because there is no peer-to-peer path at the physical layer regardless of the ACS configuration. It is unsafe when the GPU is behind a PCIe switch shared with host devices. The verification is:

root@pve1:~# lspci -vvv -s 01:00.0 | grep "ACS"
        ACSCtl: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-

SrcValid+, TransBlk+, ReqRedir+, and CmpltRedir+ all set means ACS is active and enforced for that function. ACS is normally implemented by root ports and switch downstream ports, so if the GPU's own function reports no ACSCtl line, run the same check against the root port above it. When these bits are set, ACS override is not needed and should not be used. An RTX 4080 Super on a direct PCIe 4.0 ×16 root port should show these bits set on that path; the IOMMU group isolation is then backed by both hardware and firmware, and the DMA attack vector is neutralised at both levels simultaneously.
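The same check can be automated so that a host audit fails loudly when any of the four bits is clear. The sketch below parses one ACSCtl line as printed by lspci -vvv. acsIsolated is a hypothetical helper, and it assumes the flag spelling shown in the output above.

```go
package main

import (
	"fmt"
	"strings"
)

// acsIsolated reports whether an lspci "ACSCtl:" line has SrcValid,
// TransBlk, ReqRedir, and CmpltRedir all enabled.
func acsIsolated(line string) bool {
	_, flags, found := strings.Cut(line, "ACSCtl:")
	if !found {
		return false // no ACS control register reported on this line
	}
	set := make(map[string]bool)
	for _, f := range strings.Fields(flags) {
		set[f] = true
	}
	for _, want := range []string{"SrcValid+", "TransBlk+", "ReqRedir+", "CmpltRedir+"} {
		if !set[want] {
			return false
		}
	}
	return true
}

func main() {
	good := "ACSCtl: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-"
	bad := "ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-"
	fmt.Println(acsIsolated(good)) // true
	fmt.Println(acsIsolated(bad))  // false
}
```

A missing ACSCtl line is treated as a failure, which is the conservative reading for a passthrough host.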