Chapter 26: High-Availability Bin-Packing
The Chapter 25 capacity_solver.pl answers a single question: can this set of VMs fit on this set of hosts given their RAM and CPU budgets? A production orchestrator must answer a harder one: can these VMs fit while ensuring that no two database replicas share a hypervisor, no two replicas share a rack's power domain, and no VM is placed on a host the Chapter 22 telemetry pipeline has already classified as degraded or critical? These are not post-hoc filters applied after packing. They are constraints posted before labeling, so that propagation prunes the search space as early as possible and an unsatisfiable request surfaces as an explicit infrastructure_exhausted exception (§26.5) rather than as a bare Prolog failure that the Go layer logs as a generic warning and otherwise ignores.
26.1 The Mathematics of Anti-Affinity
26.1.1 Anti-Affinity as a CLP(FD) Sum Constraint
The Chapter 25 placement matrix is a list of lists — PlacementMatrix[H][V] is a binary CLP(FD) variable that equals 1 if VM V is assigned to host H, and 0 otherwise. The capacity constraints posted by vm_capacity_check_multi/3 bound each row (per-host RAM and CPU), and the assignment constraints bound each column (each VM appears on exactly one host).
Anti-affinity between two VMs is a constraint on a pair of columns: for every host row H, the sum of the two placement variables in that row must be at most 1. If VM i and VM j are replicas, then for all hosts H:
PlacementMatrix[H][i] + PlacementMatrix[H][j] =< 1
This is the CLP(FD) translation of the English rule "no two replicas on the same host." The constraint is posted using sum/3 on a two-element list:
?- P_i = 1, P_j = 1,
   sum([P_i, P_j], #=<, 1).
false. % Both replicas on the same host: correctly detected as infeasible.
?- P_i = 1, P_j = 0,
   sum([P_i, P_j], #=<, 1).
P_i = 1,
P_j = 0. % Replica i on this host, replica j elsewhere: feasible.
?- [P_i, P_j] ins 0..1,
   sum([P_i, P_j], #=<, 1),
   P_i #= 1.
P_i = 1,
P_j = 0. % Propagation: placing P_i forces P_j = 0 on this host.
The third example demonstrates why posting the constraint before labeling is superior to checking after: the moment P_i = 1 is committed during search, the propagator for sum([P_i, P_j], #=<, 1) fires and immediately sets P_j = 0. No search over P_j is required. The anti-affinity constraint eliminates half the search tree for this host-replica pair at the propagation phase, not at the backtracking phase.
26.1.2 Rack-Level Anti-Affinity
Host-level anti-affinity prevents two replicas from sharing a hypervisor. Rack-level anti-affinity prevents them from sharing a physical failure domain — a power circuit, a top-of-rack switch, or a cooling zone. In the three-rack Proxmox topology established in Chapter 17, the failure domains map directly to leaf switches:
Rack A (leaf_a): pve1, pve2, pve3
Rack B (leaf_b): pve4, pve5, pve6
Rack C (leaf_c): pve7, pve8, pve9, pve10, pve11, pve12, pve13, pve14
Rack-level anti-affinity is the same sum constraint, applied at domain granularity: for each failure domain, sum all placement variables for both replicas across all hosts in that domain, and constrain the total to be at most 1.
If Rack_A_Hosts = [pve1, pve2, pve3] and the placement matrix rows for those hosts are [Row_1, Row_2, Row_3], then the rack-level anti-affinity constraint for VMs at column indices I and J is:
% Extract column I and column J from each rack row:
nth1(I, Row_1, P1i), nth1(J, Row_1, P1j),
nth1(I, Row_2, P2i), nth1(J, Row_2, P2j),
nth1(I, Row_3, P3i), nth1(J, Row_3, P3j),
% Sum all replica placements in rack A — at most 1 replica per rack:
sum([P1i, P1j, P2i, P2j, P3i, P3j], #=<, 1).
This is strictly stronger than host-level anti-affinity, provided every host belongs to some failure domain: a solution satisfying rack-level anti-affinity automatically satisfies host-level anti-affinity (if at most one replica is in the rack, at most one can be on any single host within it). In practice the two are posted together. Host-level anti-affinity is the minimum guarantee for any replica pair. Rack-level anti-affinity is added where the intent is to survive a full rack failure, and for a group of three or more replicas it is posted once for each pair in the group.
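The subsumption claim can be checked in isolation. The following is a minimal sketch, assuming SWI-Prolog's library(clpfd); the predicate name rack_sum_demo/1 and the fixed three-host layout are illustrative and not part of ha_scheduler.pl. It posts only the rack-level sum and then commits one replica to the first host.

```prolog
:- use_module(library(clpfd)).

% rack_sum_demo(-Vars)
% Six 0/1 variables stand in for the Rack A entries of two replica columns,
% in the order [P1i, P1j, P2i, P2j, P3i, P3j]. Hypothetical helper for
% illustration only.
rack_sum_demo(Vars) :-
    Vars = [P1i, _P1j, _P2i, _P2j, _P3i, _P3j],
    Vars ins 0..1,
    sum(Vars, #=<, 1),      % at most one replica anywhere in the rack
    P1i #= 1.               % commit replica i to the first host
```

Calling rack_sum_demo(Vs) binds Vs to [1,0,0,0,0,0]: the single rack sum zeroes the other replica's variable on the same host (the host-level case) as well as both replicas' variables on the remaining rack hosts, without any host-level constraint having been posted.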
26.1.3 HA Constraint Graph
%%{init: {"themeVariables": {"fontSize": "14px"}}}%%
flowchart LR
DB_A["VM: db-primary\nColumn index 0\nPlacement vars: P0_1..P0_8"]
DB_B["VM: db-replica\nColumn index 1\nPlacement vars: P1_1..P1_8"]
RACK_A["Rack A — leaf_a\npve1, pve2, pve3\nP0_1+P1_1 ≤ 1\nP0_2+P1_2 ≤ 1\nP0_3+P1_3 ≤ 1\nRack sum ≤ 1"]
RACK_B["Rack B — leaf_b\npve4, pve5, pve6\nP0_4+P1_4 ≤ 1\nP0_5+P1_5 ≤ 1\nP0_6+P1_6 ≤ 1\nRack sum ≤ 1"]
RACK_C["Rack C — leaf_c\npve7, pve8\nP0_7+P1_7 ≤ 1\nP0_8+P1_8 ≤ 1\nRack sum ≤ 1"]
OK["Valid placement\ndb-primary → pve2 (Rack A)\ndb-replica → pve5 (Rack B)\nDifferent hosts ✓\nDifferent racks ✓"]
FAIL["Invalid placement\ndb-primary → pve1 (Rack A)\ndb-replica → pve3 (Rack A)\nSame rack — sum=2 > 1\nPropagation: false"]
DB_A --->|"P0 column"| RACK_A
DB_A --->|"P0 column"| RACK_B
DB_A --->|"P0 column"| RACK_C
DB_B --->|"P1 column"| RACK_A
DB_B --->|"P1 column"| RACK_B
DB_B --->|"P1 column"| RACK_C
RACK_A --->|"constraint satisfied"| OK
RACK_B --->|"constraint satisfied"| OK
RACK_A --->|"sum=2 > 1"| FAIL
style DB_A fill:#1A2B4A,color:#FFFFFF
style DB_B fill:#1A2B4A,color:#FFFFFF
style RACK_A fill:#8B6914,color:#FFFFFF
style RACK_B fill:#8B6914,color:#FFFFFF
style RACK_C fill:#8B6914,color:#FFFFFF
style OK fill:#1A6B3A,color:#FFFFFF
style FAIL fill:#6B1A1A,color:#FFFFFF
26.2 Integrating Live Telemetry Constraints
26.2.1 The Health Gate
The Chapter 22 node_health/2 predicate derives a categorical status — nominal, degraded, or critical — for each hypervisor from live node_metric/4 facts. For the bin-packer, a degraded or critical host is not a candidate for new VM placement: placing a VM on a host that is already under CPU steal pressure or I/O saturation would worsen its condition and violate the SLA of the VM being placed.
The integration point is the constraint generation phase, before any variable domains are explored. For each host whose node_health/2 is not nominal, every placement variable in that host's row of the matrix is clamped to 0 by posting Row ins 0. This removes the host from the search space entirely — labeling will never assign a VM to it, and no backtracking will be wasted exploring partial assignments that include it.
The clamping mechanism uses ins/2, not #=/2 applied to each variable individually. Row ins 0 is equivalent to posting V #= 0 for every V in Row, but it operates on the domain level: it sets the domain of each variable to the single value {0} in one predicate call, triggering immediate propagation of any constraints that depend on those variables. For the assignment constraint (each VM on exactly one host), clamping an entire row to 0 propagates to every column: if host H cannot hold VM j, the column constraint for j must be satisfied entirely by the remaining rows.
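The column-side effect of clamping can be seen on a single VM column. This is a small sketch, assuming SWI-Prolog's library(clpfd); clamp_demo/2 is a hypothetical helper, and treating the first hosts in the column as the unhealthy ones is an arbitrary choice made for the example.

```prolog
:- use_module(library(clpfd)).
:- use_module(library(lists)).

% clamp_demo(+UnhealthyCount, -Column)
% Column holds one VM's placement variables across three hosts.
% The first UnhealthyCount entries are clamped to 0, as the health gate
% would do for the rows those entries belong to.
clamp_demo(UnhealthyCount, Column) :-
    Column = [_, _, _],
    Column ins 0..1,
    sum(Column, #=, 1),                 % assignment: VM on exactly one host
    length(Clamped, UnhealthyCount),
    append(Clamped, _, Column),         % prefix of the column
    Clamped ins 0.                      % health gate on those hosts
```

With two unhealthy hosts, clamp_demo(2, C) yields C = [0,0,1]: the assignment constraint forces the VM onto the only remaining host with no search. With all three clamped, clamp_demo(3, C) fails during posting, which is the propagation-level proof of infeasibility.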
26.2.2 Health Query at Constraint Time
The scheduler queries live_state:node_health/2 once per host during constraint posting, not during labeling. This is the correct phase: health status is a precondition on the search space, not a constraint on individual assignments within it. Querying during labeling would re-evaluate health on every backtrack — wasted work since health status does not change during a single scheduling solve (the ingestor runs on a 15-second ticker; a full placement solve completes in milliseconds).
% apply_health_constraints(+Hosts, +PlacementMatrix)
%
% For each host in Hosts whose node_health/2 is not nominal, clamps the
% corresponding row of PlacementMatrix to all-zero.
% Hosts: list of host(Name, RAM, CPU) terms, ordered
% PlacementMatrix: list of rows, one per host, in the same order as Hosts
apply_health_constraints(Hosts, PlacementMatrix) :-
pairs_keys_values(HostRowPairs, Hosts, PlacementMatrix),
maplist(apply_host_health_constraint, HostRowPairs).
apply_host_health_constraint(host(Name, _, _)-Row) :-
( live_state:node_health(Name, nominal)
-> true % Host is healthy — no domain restriction
; Row ins 0 % Host is degraded/critical/unknown — zero entire row
).
The if-then-else covers three cases:
node_health(Name, nominal) succeeds: the host is healthy, no constraint is added, and the row retains its 0..1 domain from vm_capacity_check_multi/3.
The host's status is degraded or critical: node_health(Name, nominal) fails, the else branch runs, and Row ins 0 clamps every variable in the row.
The host has no status at all: node_health(Name, _) has no solutions because no live metric facts have arrived from the Chapter 22 ingestor since the last restart, so node_health(Name, nominal) fails as well and Row ins 0 is applied. This is the conservative case: an unknown health state defaults to exclusion, not inclusion. A host that has never reported metrics cannot be assumed healthy.
26.2.3 Health-Aware Feasibility
The interaction between health exclusion and capacity constraints requires careful ordering. vm_capacity_check_multi/3 must be called first: it creates the placement matrix, declares all variable domains 0..1, and posts the capacity and assignment constraints. apply_health_constraints/2 is called second: it clamps unhealthy rows to 0 after the assignment constraints are in place. The ordering matters because post_single_assignment/2 needs to know the matrix structure before rows can be zeroed — and because clamping a row after the assignment constraint posts is what triggers the propagation that proves infeasibility when not enough healthy hosts remain.
Ordering contract:
1. vm_capacity_check_multi(Hosts, VMs, PlacementMatrix)
→ creates matrix, posts capacity + assignment constraints
2. apply_health_constraints(Hosts, PlacementMatrix)
→ clamps unhealthy rows, propagation fires immediately
3. apply_anti_affinity(PlacementMatrix, I, J) (repeated per pair)
→ posts sum constraints on column pairs
4. apply_rack_anti_affinity(Hosts, PlacementMatrix, I, J)
→ posts rack-domain sum constraints
5. labeling([ffc, down], AllVars)
→ searches the reduced space
Reversing steps 2 and 1 would require apply_health_constraints/2 to create the matrix itself, duplicating vm_capacity_check_multi/3's logic. Calling apply_anti_affinity/3 before apply_health_constraints/2 is harmless: propagation reaches the same fixpoint in either order. The only cost is wasted work, because anti-affinity propagators attached to a row that is about to be clamped to 0 are woken by the clamp even though the constraint they enforce is already satisfied.
26.3 The Build: ha_scheduler.pl
% File: /opt/logic-node/kb/ha_scheduler.pl
%
% High-Availability constraint layer over capacity_solver.pl.
% Extends the Chapter 25 bin-packing model with:
% - Host-level anti-affinity (no two replicas on the same hypervisor)
% - Rack-level anti-affinity (no two replicas in the same failure domain)
% - Live telemetry health gating (degraded/critical hosts excluded)
%
% This module NEVER calls labeling/2 directly — it only posts constraints.
% Callers must call labeling([ffc,down], AllVars) on the returned matrix.
% The single exception is schedule_cluster/5, which is the master predicate
% that drives the full solve including labeling.
:- module(ha_scheduler, [
apply_anti_affinity/3, % host-level anti-affinity for a VM pair
apply_rack_anti_affinity/4, % rack-level anti-affinity for a VM pair
apply_health_constraints/2, % zero out degraded/critical host rows
schedule_cluster/5, % master scheduling predicate
compute_migration_delta/4, % diff current state against target matrix
rack_members/2, % rack membership facts
failure_domains/1 % list of all defined failure domains
]).
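:- use_module(library(clpfd)).
:- use_module(library(lists)).
:- use_module(library(apply)).     % maplist/N, foldl/4
:- use_module(library(pairs)).     % pairs_keys_values/3, pairs_values/2
:- use_module(library(yall)).      % >>/N lambda expressions
:- use_module(library(error)).     % must_be/2
:- use_module(library(aggregate)).
:- use_module(capacity_solver).
:- use_module(live_state, [node_health/2]).
:- use_module(proxmox_topology, [known_node/1]).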
% ── Rack/failure-domain membership ───────────────────────────────────────────
%
% rack_members(+RackID, -Members)
% Returns the list of hypervisor atoms in the given failure domain.
% Failure domains mirror the leaf-switch groupings from Chapter 17 §17.3.1.
% pve9..pve14 occupy Rack C alongside pve7 and pve8.
rack_members(rack_a, [pve1, pve2, pve3]).
rack_members(rack_b, [pve4, pve5, pve6]).
rack_members(rack_c, [pve7, pve8, pve9, pve10, pve11, pve12, pve13, pve14]).
% failure_domains(-Domains)
% All defined failure domain identifiers.
failure_domains([rack_a, rack_b, rack_c]).
% ── Host-level anti-affinity ──────────────────────────────────────────────────
% apply_anti_affinity(+PlacementMatrix, +VMIdx1, +VMIdx2)
%
% Posts the host-level anti-affinity constraint between the VMs at column
% indices VMIdx1 and VMIdx2: for every host row, the sum of the two
% placement variables is at most 1.
%
% PlacementMatrix: list of host rows (each row is a list of CLP(FD) vars)
% VMIdx1, VMIdx2: 1-based column indices into each row
%
% This predicate is O(|Hosts|) in constraint posting cost. For a 14-host
% cluster it posts 14 sum/3 constraints. Each constraint fires its propagator
% the moment either variable in the pair is instantiated during labeling.
apply_anti_affinity(PlacementMatrix, VMIdx1, VMIdx2) :-
must_be(positive_integer, VMIdx1),
must_be(positive_integer, VMIdx2),
VMIdx1 \= VMIdx2,
maplist(post_row_anti_affinity(VMIdx1, VMIdx2), PlacementMatrix).
% post_row_anti_affinity(+I, +J, +Row)
% Posts sum([P_i, P_j], #=<, 1) for column positions I and J in Row.
%
% OPTIMISATION NOTE — transpose/2:
% apply_anti_affinity/3 uses nth1/3 inside maplist to extract column
% positions I and J from each row on every call. An alternative is to call
% transpose(PlacementMatrix, ColumnMatrix) once at the top of
% apply_anti_affinity/3, which makes the VM columns directly available as
% lists without per-row index lookups:
%
% apply_anti_affinity(PlacementMatrix, VMIdx1, VMIdx2) :-
% transpose(PlacementMatrix, Columns),
% nth1(VMIdx1, Columns, Col_i),
% nth1(VMIdx2, Columns, Col_j),
% maplist([P_i, P_j]>>sum([P_i, P_j], #=<, 1), Col_i, Col_j).
%
% This is syntactically cleaner and avoids two nth1/3 lookups per row.
% The same approach applies to host_domain_rows/6: transpose once, then
% select the two columns and filter by domain membership.
% transpose/2 itself costs O(N×M) per call, so the saving only materialises
% if the caller transposes once (for example in schedule_cluster/5) and
% passes the column lists to every anti-affinity pair. Transposing inside
% apply_anti_affinity/3, as written above, repeats the work for each pair.
% With a single shared transpose, a 14-host × 50-VM matrix with 10
% anti-affinity pairs avoids 14 × 10 × 2 = 280 nth1/3 list traversals per
% solve. The nth1 form is used here for pedagogical clarity: the column
% extraction pattern mirrors the Chapter 25 §25.3.3 matrix structure.
post_row_anti_affinity(I, J, Row) :-
nth1(I, Row, P_i),
nth1(J, Row, P_j),
sum([P_i, P_j], #=<, 1).
% ── Rack-level anti-affinity ──────────────────────────────────────────────────
% apply_rack_anti_affinity(+Hosts, +PlacementMatrix, +VMIdx1, +VMIdx2)
%
% Posts rack-level anti-affinity: for each failure domain, the sum of ALL
% placement variables for VMIdx1 and VMIdx2 across all hosts in that domain
% is at most 1.
%
% Hosts: ordered list of host(Name, RAM, CPU) terms matching
% the row order of PlacementMatrix
% PlacementMatrix: list of host rows
% VMIdx1, VMIdx2: 1-based VM column indices
%
% When every host in Hosts belongs to a failure domain, rack-level subsumes
% host-level, so posting both is redundant but harmless. The host-level
% constraints still matter for any host missing from rack_members/2, which
% no rack sum covers. The canonical call sequence from schedule_cluster/5
% posts both.
apply_rack_anti_affinity(Hosts, PlacementMatrix, VMIdx1, VMIdx2) :-
must_be(positive_integer, VMIdx1),
must_be(positive_integer, VMIdx2),
VMIdx1 \= VMIdx2,
failure_domains(Domains),
maplist(
post_domain_anti_affinity(Hosts, PlacementMatrix, VMIdx1, VMIdx2),
Domains
).
% post_domain_anti_affinity(+Hosts, +PlacementMatrix, +I, +J, +Domain)
%
% Collects all placement variables for columns I and J across rows that
% belong to Domain, then posts sum(..., #=<, 1).
% If the domain has no members in Hosts (e.g., a cluster subset is being
% scheduled), the sum is vacuously satisfied and no constraint is posted.
post_domain_anti_affinity(Hosts, PlacementMatrix, I, J, Domain) :-
rack_members(Domain, DomainMembers),
% Identify which row indices correspond to hosts in this domain:
host_domain_rows(Hosts, PlacementMatrix, DomainMembers, I, J, DomainVars),
( DomainVars = []
-> true % No cluster hosts in this domain — nothing to constrain
; sum(DomainVars, #=<, 1)
).
% host_domain_rows(+Hosts, +Matrix, +DomainMembers, +I, +J, -Vars)
% Extracts placement variables for columns I and J from rows whose host
% name appears in DomainMembers.
host_domain_rows(Hosts, Matrix, DomainMembers, I, J, Vars) :-
pairs_keys_values(HostRowPairs, Hosts, Matrix),
foldl(
{DomainMembers, I, J}/[Host-Row, Acc, NewAcc]>>(
Host = host(Name, _, _),
( member(Name, DomainMembers)
-> nth1(I, Row, P_i),
nth1(J, Row, P_j),
NewAcc = [P_i, P_j | Acc]
; NewAcc = Acc
)
),
HostRowPairs,
[],
Vars
).
% ── Health gating ─────────────────────────────────────────────────────────────
% apply_health_constraints(+Hosts, +PlacementMatrix)
%
% For each host whose live_state:node_health/2 is not nominal (or has no
% health facts at all), clamps its entire placement row to 0.
% Called AFTER vm_capacity_check_multi/3 has created the matrix and posted
% the assignment constraints, so that clamping propagates through the
% existing constraint network immediately.
apply_health_constraints(Hosts, PlacementMatrix) :-
pairs_keys_values(HostRowPairs, Hosts, PlacementMatrix),
maplist(apply_host_health_constraint, HostRowPairs).
apply_host_health_constraint(host(Name, _, _)-Row) :-
( live_state:node_health(Name, nominal)
-> true % Healthy — no restriction
; Row ins 0 % Degraded, critical, partitioned, or unknown — excluded
).
% ── Master scheduling predicate ───────────────────────────────────────────────
% schedule_cluster(+Hosts, +VMs, +AntiAffinityPairs, +RackAffinityPairs, -PlacementMatrix)
%
% The authoritative entry point for HA-aware VM placement.
%
% Hosts: list of host(Name, RAM, CPU) terms — ordered
% VMs: list of vm(Name, RAM, CPU) terms
% AntiAffinityPairs: list of (I, J) pairs of 1-based VM column indices
% that must not share a host
% RackAffinityPairs: list of (I, J) pairs that must not share a rack domain
% PlacementMatrix: output — labelled PlacementMatrix[H][V] ∈ {0,1}
%
% Throws infrastructure_exhausted(Reason) if no feasible placement exists
% under the posted constraints. See §26.5 for the exception contract.
%
% On success, PlacementMatrix is fully ground. The caller reads placements
% with:
% nth1(HostIdx, PlacementMatrix, HostRow),
% nth1(VMIdx, HostRow, 1)
% to determine which VM went to which host.
schedule_cluster(Hosts, VMs, AntiAffinityPairs, RackAffinityPairs, PlacementMatrix) :-
    must_be(list, Hosts),
    must_be(list, VMs),
    must_be(list, AntiAffinityPairs),
    must_be(list, RackAffinityPairs),
    % ── Phases 1-4: Constraint posting ──────────────────────────────────────
    % CLP(FD) never leaves a variable with an empty domain: when propagation
    % wipes a domain out, the posting goal itself fails. Inspecting domains
    % afterwards therefore cannot detect anything. Instead, the posting
    % phases run inside an if-then-else that converts such a failure into an
    % explicit infrastructure_exhausted exception.
    (   post_ha_constraints(Hosts, VMs, AntiAffinityPairs, RackAffinityPairs,
                            PlacementMatrix)
    ->  true
    ;   throw(infrastructure_exhausted(constraint_propagation_failed))
    ),
    % ── Phase 5: Labeling ────────────────────────────────────────────────────
    % ffc:  among the variables with the smallest domains, label the leftmost
    %       one that participates in the most constraints. Variables touched
    %       by anti-affinity constraints are therefore assigned first, which
    %       concentrates backtracking at the tightest constraints early.
    % down: try domain values in descending order, so 1 is tried before 0.
    %       Each step commits a VM to a host, and the exactly-one-host
    %       assignment constraint then zeroes the rest of that VM's column.
    append(PlacementMatrix, AllVars),
    (   labeling([ffc, down], AllVars)
    ->  true
    ;   throw(infrastructure_exhausted(no_feasible_assignment))
    ).

% post_ha_constraints(+Hosts, +VMs, +AntiAffinityPairs, +RackAffinityPairs, -PlacementMatrix)
% Internal helper: posts every constraint in the order required by §26.2.3.
% Fails if propagation proves the constraint system unsatisfiable.
post_ha_constraints(Hosts, VMs, AntiAffinityPairs, RackAffinityPairs, PlacementMatrix) :-
    % Phase 1: creates PlacementMatrix, declares all vars in 0..1, posts
    % per-host capacity constraints and per-VM single-assignment constraints.
    capacity_solver:vm_capacity_check_multi(Hosts, VMs, PlacementMatrix),
    % Phase 2: clamps degraded/critical/unknown host rows to all-zero, so
    % each VM column can only be satisfied by healthy hosts.
    apply_health_constraints(Hosts, PlacementMatrix),
    % Phase 3: host-level anti-affinity for each pair, on every row.
    maplist(
        {PlacementMatrix}/[(I,J)]>>apply_anti_affinity(PlacementMatrix, I, J),
        AntiAffinityPairs
    ),
    % Phase 4: rack-level anti-affinity for each pair, on every domain.
    maplist(
        {Hosts, PlacementMatrix}/[(I,J)]>>
            apply_rack_anti_affinity(Hosts, PlacementMatrix, I, J),
        RackAffinityPairs
    ).
26.3.1 schedule_cluster/5 Verification
# 3 VMs: a database primary, a replica, and an independent service VM.
# Anti-affinity: VMs 1 and 2 (db-primary, db-replica) must not share a host.
# Rack anti-affinity: VMs 1 and 2 must not share a rack.
# Health: pve3 is degraded (cpu_steal = 15.0).
root@logic-node-01:~# swipl \
-l /opt/logic-node/kb/proxmox_topology.pl \
-l /opt/logic-node/kb/live_state.pl \
-l /opt/logic-node/kb/capacity_solver.pl \
-l /opt/logic-node/kb/ha_scheduler.pl \
-g "
% Set up live health state:
get_time(Now), Ts is round(Now),
live_state:assert_node_metric(pve1, cpu_steal, 3.0, Ts),
live_state:assert_node_metric(pve1, disk_latency, 0.1, Ts),
live_state:assert_node_metric(pve1, arc_miss_rate,1.0, Ts),
live_state:assert_node_metric(pve1, disk_io_util, 15.0, Ts),
live_state:assert_node_metric(pve2, cpu_steal, 4.0, Ts),
live_state:assert_node_metric(pve2, disk_latency, 0.1, Ts),
live_state:assert_node_metric(pve2, arc_miss_rate,1.5, Ts),
live_state:assert_node_metric(pve2, disk_io_util, 20.0, Ts),
% pve3 is degraded: cpu_steal = 15.0 (above 10% threshold):
live_state:assert_node_metric(pve3, cpu_steal, 15.0, Ts),
live_state:assert_node_metric(pve3, disk_latency, 0.2, Ts),
live_state:assert_node_metric(pve3, arc_miss_rate,2.0, Ts),
live_state:assert_node_metric(pve3, disk_io_util, 25.0, Ts),
live_state:assert_node_metric(pve4, cpu_steal, 2.0, Ts),
live_state:assert_node_metric(pve4, disk_latency, 0.1, Ts),
live_state:assert_node_metric(pve4, arc_miss_rate,1.0, Ts),
live_state:assert_node_metric(pve4, disk_io_util, 10.0, Ts),
% Hosts (subset for readability — 4-node cluster):
Hosts = [host(pve1,32768,48000), host(pve2,32768,48000),
host(pve3,32768,48000), host(pve4,32768,48000)],
% VMs: db-primary (4GB), db-replica (4GB), svc-api (2GB):
VMs = [vm('db-primary',4096,2000), vm('db-replica',4096,2000), vm('svc-api',2048,1000)],
% Anti-affinity: VMs 1 and 2 (db-primary, db-replica):
AntiAffinityPairs = [(1,2)],
RackAffinityPairs = [(1,2)],
ha_scheduler:schedule_cluster(
Hosts, VMs, AntiAffinityPairs, RackAffinityPairs, Matrix),
% Print results:
Hosts = [H1, H2, H3, H4],
Matrix = [R1, R2, R3, R4],
pairs_keys_values(Pairs, [H1,H2,H3,H4], [R1,R2,R3,R4]),
maplist([host(N,_,_)-Row]>>(
include({}/[1]>>true, Row, Assigned),
length(Assigned, NVMs),
format('~w: row=~w (~w VMs placed)~n', [N, Row, NVMs])
), Pairs),
halt
"
pve1: row=[1,0,0] (1 VMs placed)
pve2: row=[0,0,1] (1 VMs placed)
pve3: row=[0,0,0] (0 VMs placed) ← degraded, excluded by health gate
pve4: row=[0,1,0] (1 VMs placed)
pve3's row is all-zero: the health gate clamped it before labeling began. db-primary (column 1) went to pve1 (Rack A) and db-replica (column 2) went to pve4 (Rack B). Because pve1, pve2, and pve3 all sit behind leaf_a, the Rack A sum constraint admits at most one of the two replicas, and pve4 is the only healthy host outside Rack A. Every valid solution therefore puts exactly one replica on pve4. Which Rack A host receives the other replica, and where svc-api lands, depends on the labeling order; the rows above show one placement that satisfies all posted constraints.
# Confirm that the rack constraint fires when only Rack A hosts are available:
root@logic-node-01:~# swipl \
-l /opt/logic-node/kb/proxmox_topology.pl \
-l /opt/logic-node/kb/live_state.pl \
-l /opt/logic-node/kb/capacity_solver.pl \
-l /opt/logic-node/kb/ha_scheduler.pl \
-g "
get_time(Now), Ts is round(Now),
forall(member(N,[pve1,pve2,pve3]),
forall(member(T-V, [cpu_steal-3.0, disk_latency-0.1,
arc_miss_rate-1.0, disk_io_util-15.0]),
live_state:assert_node_metric(N, T, V, Ts))),
Hosts = [host(pve1,32768,48000), host(pve2,32768,48000), host(pve3,32768,48000)],
VMs = [vm('db-primary',4096,2000), vm('db-replica',4096,2000)],
( catch(
ha_scheduler:schedule_cluster(Hosts, VMs, [(1,2)], [(1,2)], _),
infrastructure_exhausted(Reason),
format('PASS: infrastructure_exhausted(~w)~n', [Reason])
)
-> true
; writeln('FAIL: schedule_cluster succeeded — should have thrown')
),
halt
"
PASS: infrastructure_exhausted(no_feasible_assignment)
Three Rack A hosts, two replicas, rack-level anti-affinity (1,2): the assignment constraints require each replica to occupy exactly one host, and every available host is in Rack A, so the Rack A sum would have to equal 2 while the rack constraint caps it at 1. No single propagator sees this contradiction, because it only appears when the two column constraints and the rack constraint are combined. Constraint posting therefore succeeds, and it is labeling that exhausts the search space. The solver then throws infrastructure_exhausted(no_feasible_assignment) rather than returning false.
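A stripped-down model of this scenario shows where the contradiction is actually detected. The sketch below assumes SWI-Prolog's library(clpfd) and replaces the Chapter 25 capacity constraints with nothing at all; rack_a_only/1 is a hypothetical name, and only the two assignment constraints and the single rack sum are posted.

```prolog
:- use_module(library(clpfd)).

% rack_a_only(-Vars)
% Three hosts (rows a, b, c), two replicas (columns 1, 2), all hosts in one
% rack. Illustrative model only: capacity constraints are omitted.
rack_a_only(Vars) :-
    Vars = [A1, A2, B1, B2, C1, C2],
    Vars ins 0..1,
    sum([A1, B1, C1], #=, 1),   % replica 1 on exactly one host
    sum([A2, B2, C2], #=, 1),   % replica 2 on exactly one host
    sum(Vars, #=<, 1).          % at most one replica in the rack
```

Under the per-constraint bounds reasoning that library(clpfd) performs, rack_a_only(Vs) is expected to succeed with every variable still in 0..1, while rack_a_only(Vs), label(Vs) fails after trying every branch. That is the behaviour the no_feasible_assignment reason describes: infeasibility found by search, not by propagation.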
26.4 Migration Planning: The Delta Matrix
26.4.1 Current State Representation
The schedule_cluster/5 predicate produces a target state: a fully-ground PlacementMatrix describing where each VM should be. The Chapter 24 Actuator (§24.2) consumes migrate(VM, SourceHost, TargetHost) action terms. The bridge between them is the delta computation: a comparison of the current VM locations (from the Proxmox API via Chapter 24's ListVMs) against the target matrix to produce only the VMs that must move.
Current VM locations are represented as current_placement/2 dynamic facts, asserted by the Go scheduler before calling compute_migration_delta/4:
% current_placement(+VMName, +HostName)
% Asserted by the Go layer before compute_migration_delta/4 is called.
% One fact per VM currently running in the cluster.
% VMName and HostName are atoms matching vm(Name,...) and host(Name,...) terms.
:- dynamic current_placement/2.
26.4.2 compute_migration_delta/4
% File: /opt/logic-node/kb/ha_scheduler.pl (continued)
% compute_migration_delta(+Hosts, +VMs, +TargetMatrix, -Moves)
%
% Computes the list of live-migration actions required to move from the
% current placement (from current_placement/2 facts) to TargetMatrix.
%
% Hosts: ordered list of host(Name, RAM, CPU) terms
% VMs: ordered list of vm(Name, RAM, CPU) terms
% TargetMatrix: fully-ground PlacementMatrix from schedule_cluster/5
% Moves: list of migrate(VMName, SourceHost, TargetHost) terms,
% one per VM that must move. VMs already on their target host
% are omitted from Moves. VMs with no current_placement/2
% fact are treated as new (no source) and generate
% place(VMName, TargetHost) terms instead.
%
% Ordering guarantee: Moves is sorted by TargetHost atom so that the
% Chapter 24 Actuator can process migrations host-by-host, filling a
% target host before starting on the next.
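compute_migration_delta(Hosts, VMs, TargetMatrix, SortedMoves) :-
    must_be(list, Hosts),
    must_be(list, VMs),
    must_be(list, TargetMatrix),
    % Key each move by its target host, then keysort. sort/4 on argument 3
    % cannot be used here: place/2 terms have only two arguments.
    findall(Target-Move,
            ( compute_vm_move(Hosts, VMs, TargetMatrix, Move),
              move_target(Move, Target) ),
            KeyedMoves),
    keysort(KeyedMoves, SortedKeyed),
    pairs_values(SortedKeyed, SortedMoves).
% move_target(+Move, -TargetHost)
move_target(migrate(_, _, Target), Target).
move_target(place(_, Target), Target).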
% compute_vm_move(+Hosts, +VMs, +TargetMatrix, -Move)
% Generates one Move term per VM that requires action.
compute_vm_move(Hosts, VMs, TargetMatrix, Move) :-
% Identify which host the target matrix assigns this VM to:
nth1(HostIdx, TargetMatrix, HostRow),
nth1(VMIdx, HostRow, 1), % VM at column VMIdx is assigned here
nth1(HostIdx, Hosts, host(TargetHostName, _, _)),
nth1(VMIdx, VMs, vm(VMName, _, _)),
% Determine the current location:
( current_placement(VMName, CurrentHost)
-> % VM exists — does it need to move?
( CurrentHost \= TargetHostName
-> Move = migrate(VMName, CurrentHost, TargetHostName)
; fail % Already on target — no action needed
)
; % VM has no current placement — it is a new VM being placed for first time
Move = place(VMName, TargetHostName)
).
26.4.3 Delta Verification
root@logic-node-01:~# swipl \
-l /opt/logic-node/kb/proxmox_topology.pl \
-l /opt/logic-node/kb/live_state.pl \
-l /opt/logic-node/kb/capacity_solver.pl \
-l /opt/logic-node/kb/ha_scheduler.pl \
-g "
% Current placement: db-primary is on pve3 (degraded), db-replica on pve1:
assertz(ha_scheduler:current_placement('db-primary', pve3)),
assertz(ha_scheduler:current_placement('db-replica', pve1)),
% svc-api is new — no current_placement fact.
Hosts = [host(pve1,32768,48000), host(pve2,32768,48000),
host(pve4,32768,48000), host(pve5,32768,48000)],
VMs = [vm('db-primary',4096,2000), vm('db-replica',4096,2000), vm('svc-api',2048,1000)],
% Target matrix (pre-solved for clarity):
TargetMatrix = [[1,0,0],[0,0,1],[0,1,0],[0,0,0]],
% pve1 row: db-primary=1, db-replica=0, svc-api=0
% pve2 row: db-primary=0, db-replica=0, svc-api=1
% pve4 row: db-primary=0, db-replica=1, svc-api=0
% pve5 row: (empty)
ha_scheduler:compute_migration_delta(Hosts, VMs, TargetMatrix, Moves),
length(Moves, N),
format('Migration plan (~w actions):\\n', [N]),
maplist([M]>>(format(' ~w~n', [M])), Moves),
halt
"
Migration plan (3 actions):
 migrate(db-primary,pve3,pve1)
 place(svc-api,pve2)
 migrate(db-replica,pve1,pve4)
db-primary moves from pve3 (degraded, excluded by health gate) to pve1. db-replica moves from pve1 to pve4 — it was displaced from pve1 by the anti-affinity constraint that prevents it from sharing with db-primary. svc-api is placed fresh on pve2. The delta contains only the necessary moves: no action is generated for VMs already on their target host.
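When checking a delta by hand, it helps to list the target assignments first. The following is a minimal sketch of the nth1/3 read-out pattern described in the schedule_cluster/5 header comment; matrix_assignments/4 is a hypothetical helper and is not exported by ha_scheduler.pl.

```prolog
:- use_module(library(lists)).

% matrix_assignments(+Hosts, +VMs, +Matrix, -Pairs)
% Pairs is a list of VMName-HostName, one per 1 entry in the ground Matrix,
% in row-major order (host by host, then column by column).
matrix_assignments(Hosts, VMs, Matrix, Pairs) :-
    findall(VMName-HostName,
            ( nth1(HostIdx, Matrix, Row),
              nth1(VMIdx, Row, 1),
              nth1(HostIdx, Hosts, host(HostName, _, _)),
              nth1(VMIdx, VMs, vm(VMName, _, _)) ),
            Pairs).
```

For the Hosts, VMs, and TargetMatrix used in the verification above, this yields ['db-primary'-pve1, 'svc-api'-pve2, 'db-replica'-pve4], which can be compared line by line against the current_placement/2 facts.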
26.5 Sovereign Security: The Unsatisfiable State
26.5.1 Why Silent Failure Is Dangerous
CLP(FD) predicates fail (false) when their constraint system is unsatisfiable. In the SWI-Prolog REPL, false is a legible outcome. In the Go orchestration layer, it is a trap.
The Go pool.Dispatch(WorkItem{Goal: goal}, timeout) call returns a WorkResult. If the Prolog goal fails — rather than throwing an exception — WorkResult.Err is set to a generic failure error and WorkResult.Targets is an empty list. The Go caller checks result.Err != nil and logs a warning. The scheduler loop continues. No VM migration occurs. No operator alert fires. The cluster remains in its current state — which may include db-primary and db-replica both on the same hypervisor, the exact condition the HA constraint was supposed to prevent.
The failure is silent at the logic layer and invisible at the operations layer. The orchestrator has made a decision (no action) that it did not communicate (no exception, no alarm), and the reason (constraint infeasibility) was never recorded.
26.5.2 The infrastructure_exhausted Contract
schedule_cluster/5 never succeeds with an under-constrained result, and it never fails silently. If the constraint system is unsatisfiable — for any reason, including too few healthy hosts, an irresolvable anti-affinity conflict, or VMs whose aggregate resource requirements exceed all available healthy capacity — it throws:
throw(infrastructure_exhausted(Reason))
where Reason is one of:
no_feasible_assignment: labeling/2 exhausted all search branches without finding a ground assignment satisfying all constraints. This is the general case: each constraint is individually satisfiable, so propagation alone does not fail, but no assignment satisfies all of them at once.
constraint_propagation_failed: propagation during constraint posting emptied the domain of at least one variable, so a posting goal failed before labeling began. This is the fast path: the infeasibility was proven in polynomial time during constraint posting, not during exponential-time search.
The Go caller first checks the error returned by pool.Dispatch itself, then inspects the text of WorkResult.Err for the infrastructure_exhausted exception term:
result, err := pool.Dispatch(WorkItem{Goal: scheduleGoal}, 30*time.Second)
if err != nil {
    return nil, fmt.Errorf("scheduler dispatch: %w", err)
}
if result.Err != nil {
    // Check for infrastructure_exhausted exception:
    if strings.Contains(result.Err.Error(), "infrastructure_exhausted") {
        // Structured alert: cluster cannot satisfy HA constraints.
        // Trigger PagerDuty P1: human intervention required.
        s.broker.Publish(fmt.Sprintf(
            "event: infrastructure_exhausted\ndata: %s\n\n",
            result.Err.Error(),
        ))
        return nil, fmt.Errorf("HA scheduling infeasible: %w", result.Err)
    }
    return nil, fmt.Errorf("scheduler WAM error: %w", result.Err)
}
26.5.3 The Three Infeasibility Modes
Three distinct conditions produce infrastructure_exhausted, each requiring a different operational response:
Capacity exhaustion: Total VM resource demands exceed total healthy host capacity. Reason = no_feasible_assignment, with health-gated rows showing zero slack. The fix is capacity expansion or VM right-sizing. Diagnostic: remaining_capacity/4 from Chapter 25 §25.3.3, called on each healthy host, identifies the binding constraint.
Anti-affinity over-subscription: More replicas in an anti-affinity group than there are healthy failure domains. Three database replicas with rack-level anti-affinity require three distinct racks. A two-rack cluster with one rack degraded has one healthy rack. With three replicas and one rack, Reason = no_feasible_assignment: the anti-affinity constraint admits at most one replica on the single available rack, the assignment constraints require every replica to be placed somewhere, and the two cannot both hold. The fix is one of three: reduce the number of replicas, bring the number of healthy racks up to the number of replicas (here, restore the degraded rack and add a third), or relax the anti-affinity constraint from rack-level to host-level, consciously accepting reduced resilience.
Complete health exclusion: All hosts are degraded or critical. apply_health_constraints/2 clamps every row to 0. The assignment constraints require each VM to appear on at least one host. Reason = constraint_propagation_failed, detected immediately after Phase 2 of constraint posting, before labeling is attempted. This is the most severe condition: the cluster has no healthy nodes. The fix is not within the orchestrator's autonomous authority (the Chapter 24 Quorum Guard would have blocked evictions long before this state was reached). Human intervention is required.
During an active outage where all nodes are marginally over the cpu_steal threshold, the operator can temporarily edit node_health.pl to raise the critical threshold from 40% to 60%, then reload the KB via the /api/v1/kb/reload endpoint from Chapter 19. On the next call the WAM re-evaluates node_health/2 against the new boundaries, apply_health_constraints/2 finds some hosts nominal, and schedule_cluster/5 produces a best-effort placement that accepts degraded but not fully-failed nodes as targets. This is the decisive operational advantage of a logic-based ruleset over a hardcoded health filter: emergency threshold adjustment is a one-line text edit to a running production system, not a code change, rebuild, and deployment.
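The three modes can also be told apart with cheap arithmetic before or after a failed solve, which helps route the alert to the right runbook. The sketch below is an assumption-laden illustration, not the chapter's solver: the Host and VM structs, the rack labels, the diagnose function, and its return labels are all hypothetical. The aggregate capacity comparison is a necessary condition only, since a set of VMs can fit in total and still fail to pack onto individual hosts.

```go
package main

import "fmt"

// Host and VM are hypothetical Go mirrors of the host/3 and vm/3 terms,
// extended with a rack label and a health flag for this sketch.
type Host struct {
	Name, Rack string
	RAM, CPU   int // same units as the host/3 terms
	Healthy    bool
}

type VM struct {
	Name     string
	RAM, CPU int
}

// diagnose returns a coarse label for the likely infeasibility mode.
// rackReplicas is the size of the largest rack-level anti-affinity group.
func diagnose(hosts []Host, vms []VM, rackReplicas int) string {
	healthyRacks := map[string]bool{}
	ramCap, cpuCap := 0, 0
	for _, h := range hosts {
		if !h.Healthy {
			continue
		}
		healthyRacks[h.Rack] = true
		ramCap += h.RAM
		cpuCap += h.CPU
	}
	if len(healthyRacks) == 0 {
		return "complete_health_exclusion"
	}
	// Pigeonhole: each replica needs its own healthy rack.
	if rackReplicas > len(healthyRacks) {
		return "anti_affinity_oversubscription"
	}
	ramNeed, cpuNeed := 0, 0
	for _, v := range vms {
		ramNeed += v.RAM
		cpuNeed += v.CPU
	}
	if ramNeed > ramCap || cpuNeed > cpuCap {
		return "capacity_exhaustion"
	}
	return "undetermined" // feasible in aggregate; packing may still fail
}

func main() {
	// Two racks, one degraded, three replicas: the example from the text.
	hosts := []Host{
		{"pve1", "rack-a", 32768, 48000, true},
		{"pve2", "rack-b", 32768, 48000, false},
	}
	vms := []VM{{"db-1", 4096, 2000}, {"db-2", 4096, 2000}, {"db-3", 4096, 2000}}
	fmt.Println(diagnose(hosts, vms, 3)) // prints "anti_affinity_oversubscription"
}
```

The checks run from most to least severe, so a cluster with no healthy hosts is reported as complete health exclusion even though it also fails the other two tests.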
26.5.4 Infeasibility Trapping Verification
# All hosts degraded: verify infrastructure_exhausted is thrown, not false.
root@logic-node-01:~# swipl \
    -l /opt/logic-node/kb/proxmox_topology.pl \
    -l /opt/logic-node/kb/live_state.pl \
    -l /opt/logic-node/kb/capacity_solver.pl \
    -l /opt/logic-node/kb/ha_scheduler.pl \
    -g "
    get_time(Now), Ts is round(Now),
    % Degrade all hosts:
    forall(member(N, [pve1,pve2,pve3]),
           live_state:assert_node_metric(N, cpu_steal, 45.0, Ts)),
    Hosts = [host(pve1,32768,48000), host(pve2,32768,48000), host(pve3,32768,48000)],
    VMs = [vm('db-primary',4096,2000)],
    % Success and plain failure are both test failures; only the exception passes.
    catch(
        (   ha_scheduler:schedule_cluster(Hosts, VMs, [], [], _)
        ->  writeln('FAIL: schedule_cluster succeeded with no healthy hosts')
        ;   writeln('FAIL: schedule_cluster returned false (silent failure)')
        ),
        infrastructure_exhausted(Reason),
        format('PASS: infrastructure_exhausted(~w)~n', [Reason])
    ),
    halt
    "
PASS: infrastructure_exhausted(constraint_propagation_failed)
The solver does not return false. It does not silently succeed with an empty placement. It throws a structured exception that the Go layer can convert into a PagerDuty alert, an SSE infrastructure_exhausted event to the dashboard, and an entry in the structured operations log — giving the on-call engineer a complete audit trail from metric breach to scheduling failure to human escalation, with no silent gaps in the causal chain.