Skip to main content

Chapter 26: The Proxmox Bin Packer

Assigning fifty virtual machines to fourteen hypervisors subject to RAM and CPU capacity constraints is a multi-dimensional, multi-knapsack NP-hard problem — the same combinatorial class that defeats every classical greedy heuristic at scale, and the class for which the CLP(FD) propagation engine was purpose-built. This chapter extends the capacity_solver.pl foundations from Chapter 25 into vm_scheduler.pl, a production scheduler that enforces an 85% safety headroom on every hypervisor using integer arithmetic inside the constraint network, tunes the labeling phase with ffc and bisect for dense packing on large matrices, and wraps the entire search in call_with_time_limit/2 to guarantee the WAM thread is never held hostage by an unsolvable placement problem.


26.1 The NP-Hard Reality of Cluster Scheduling

26.1.1 From Bin Packing to Multi-Knapsack

The classical bin-packing problem asks: given items of various sizes and bins of fixed capacity, what is the minimum number of bins required to hold all items? For a single resource dimension (volume, weight, bytes), this is already NP-hard — no polynomial-time algorithm is guaranteed to find the optimal solution for all instances.

The Proxmox cluster scheduler is harder. Each VM has two resource dimensions (RAM and CPU millicores), each hypervisor has two independent capacity bounds, and the solver must assign every VM to exactly one hypervisor without exceeding either bound on any host. This is the multi-dimensional multi-knapsack problem, also known as the vector bin-packing problem. It is NP-hard in the strong sense: not only is the decision version NP-complete, but approximating the optimal solution within a constant factor is also NP-hard under standard complexity assumptions. The 14^50 ≈ 10^57 candidate assignments computed in Chapter 25 §25.3.1 represent the brute-force decision tree size before any constraints are evaluated.

26.1.2 Why Classical Algorithms Fail

First-Fit Decreasing (FFD) is the standard greedy heuristic for bin packing. Sort items in decreasing size order; for each item, place it in the first bin that has capacity remaining. FFD produces solutions within 11/9 × OPT + 6/9 of optimal for single-dimensional packing — a 22% worst-case overhead. For a 14-hypervisor cluster, a 22% suboptimal packing means at least two or three hypervisors are underloaded while others approach capacity, wasting NUMA memory bandwidth and leaving no headroom for burst workloads.

More critically, FFD is greedy and non-backtracking: once a VM is placed, it stays placed. There is no mechanism to reopen a decision when a later VM cannot fit. In two-dimensional packing, greedy placement in one dimension routinely creates infeasible states in the other: a hypervisor has sufficient RAM for the next VM but insufficient CPU, while a different host has the reverse, and no greedy rearrangement can resolve the deadlock without undoing prior placements.

CLP(FD) is not a heuristic. It is a complete solver: given the constraint system, it either finds a valid assignment or proves that none exists. The constraint propagation phase eliminates infeasible partial assignments before the search phase explores them — not by guessing and correcting, but by deduction from the constraint network. For the multi-dimensional placement problem, this means RAM and CPU constraints propagate against each other: if posting RAM infeasibility for host H forces some VM's assignment column to exclude H, the CPU column for the same VM propagates that exclusion to all other hosts that share a CPU constraint. The pruning is global, not local.

26.1.3 The Optimality Trade-Off

CLP(FD) with labeling/2 is complete but not bounded in time. The worst case — proving that no valid assignment exists — requires exhaustive search over the reduced domain space. For a satisfiable instance with good propagation, the first solution is found quickly; for an unsatisfiable instance with poor propagation, the solver must prove exhaustive infeasibility. The correct engineering response is not to abandon completeness, but to set a time budget for the search phase and fall back to a degraded-but-safe response when the budget is exhausted. Section 26.5 implements this with call_with_time_limit/2.


26.2 The 85% Safety Threshold

26.2.1 Why 100% Is Operationally Unsafe

A hypervisor packed to 100% of its declared RAM capacity operates with zero headroom. The declared capacity in the Proxmox topology — host(pve3, 32768, 48000) for 32 GB RAM and 48,000 CPU millicores — reflects the physical hardware. The actual consumption of running VMs fluctuates: a database VM with 4 GB allocated may use 2 GB at idle and spike to 3.8 GB during a batch query; a web server VM may balloon from 512 MB to 2 GB during a traffic surge. If the hypervisor has no free RAM at the moment of the spike, the Linux OOM killer fires. The OOM killer does not evict the VM — it kills processes inside the hypervisor OS itself, potentially including the QEMU process managing a neighbouring VM.

The 85% threshold provides a minimum 15% headroom on every hypervisor after placement. For a 32 GB host, 15% headroom is approximately 4.9 GB — sufficient to absorb typical VM memory spikes without triggering OOM. The scheduler never places VMs whose total RAM exceeds 27,853 MiB on this host, regardless of how well the individual VMs would fit at 100%.

The threshold is a soft defence. If a VM's working set grows beyond its declared size — a misconfigured database with no memory limit, a kernel bug causing RSS inflation — the headroom can be consumed without the scheduler's knowledge. A complementary infrastructure defence is to adjust the OOM score of the QEMU processes on each Proxmox host so that the kernel's OOM killer deprioritises them when memory is exhausted. Linux assigns every process an oom_score between 0 and 1000; a higher score means the process is killed first. The inverse adjustment is oom_score_adj, which offsets the calculated score: setting oom_score_adj = -1000 marks the process as OOM-unkillable.

# /etc/systemd/system/qemu-system-x86_64.service.d/oom.conf
# Applied on every Proxmox hypervisor host (pve1..pve14).
# Prevents the kernel OOM killer from terminating QEMU processes before
# exhausting all other candidates (monitoring agents, syslog, pvedaemon).
[Service]
OOMScoreAdjust=-1000

This is the correct systemd mechanism — it sets oom_score_adj for all processes spawned by the unit, including each QEMU instance that Proxmox launches per VM. The value -1000 does not guarantee the QEMU process will never be killed — a kernel panic or cgroup memory enforcement can still terminate it — but it ensures that the OOM killer's victim selection algorithm exhausts every other candidate first. In the scenario where the 85% headroom is breached, the kernel will kill monitoring agents, pvedaemon, and other host-resident processes before it kills a running virtual machine. The loss of a monitoring agent is recoverable and observable (the Chapter 20 telemetry pipeline will detect the missing node_exporter scrape); the loss of a QEMU process is an unannounced VM crash.

26.2.2 Integer Arithmetic in CLP(FD)

The 85% threshold must be computed using integer arithmetic. CLP(FD) operates exclusively on integer domains — there are no floating-point CLP(FD) variables. The correct calculation is:

SafeRAM  #= (TotalRAM  * 85) // 100
SafeCPU  #= (TotalCPU  * 85) // 100

// is integer division (floor division in SWI-Prolog's CLP(FD)). The multiplication is performed first to preserve precision: (32768 * 85) // 100 = 2785280 // 100 = 27852 MiB, which is correct. The alternative (32768 // 100) * 85 = 327 * 85 = 27795 MiB introduces a 57 MiB rounding error from the premature integer division — a rounding loss that accumulates across 14 hosts.

Verification:

?- TotalRAM = 32768,
   SafeRAM #= (TotalRAM * 85) // 100,
   indomain(SafeRAM).
SafeRAM = 27852.

% Premature division error (WRONG form — do not use):
?- TotalRAM = 32768,
   Wrong #= (TotalRAM // 100) * 85,
   indomain(Wrong).
Wrong = 27795.

% Difference: 27852 - 27795 = 57 MiB lost to premature integer division.
% For a 14-host cluster: 14 × 57 MiB = 798 MiB of available capacity
% silently discarded from the scheduler's model.

26.2.3 Threshold-Adjusted Capacity Terms

The capacity_solver.pl predicates accept raw host(Name, RAM, CPU) terms and post constraints directly against those values. The scheduler must compute the safe thresholds and produce adjusted host terms before delegating to capacity_solver. The adjusted terms are host(Name, SafeRAM, SafeCPU) — the same type signature, with reduced values. This design preserves capacity_solver's interface: it never needs to know about the 85% policy, because the policy is applied before the solver is called.

% safe_capacity(+Host, -SafeHost)
%   Computes the 85% threshold for a host and returns an adjusted host term.
%   Uses integer arithmetic: SafeRAM = (TotalRAM * 85) // 100.
%   The result is a ground integer pair — not a CLP(FD) variable.
%   Callers that need to inspect the threshold value can call this predicate
%   directly without entering the constraint solving phase.

safe_capacity(host(Name, TotalRAM, TotalCPU),
              host(Name, SafeRAM,  SafeCPU)) :-
    must_be(atom,             Name),
    must_be(positive_integer, TotalRAM),
    must_be(positive_integer, TotalCPU),
    SafeRAM is (TotalRAM * 85) // 100,
    SafeCPU is (TotalCPU * 85) // 100,
    % Paranoia guard: threshold must be strictly positive.
    % A host with 1 MiB RAM would produce SafeRAM = 0, which would post
    % a trivially unsatisfiable constraint (scalar_product >= 0 but = 0).
    ( SafeRAM > 0, SafeCPU > 0
    -> true
    ;  type_error(positive_threshold, host(Name, TotalRAM, TotalCPU))
    ).

Note the use of is/2 rather than #=/2 for the threshold calculation. The threshold is a deterministic computation on ground integers — there is no benefit to posting it as a CLP(FD) constraint, and doing so would add unnecessary propagators to the constraint network. The CLP(FD) machinery begins at the capacity_solver layer, not at the threshold calculation layer.

The 85% figure is an operator-configurable constant, not a fundamental of the constraint model. A natural extension is to make it dynamic: rather than encoding the percentage in safe_capacity/2, a safe_threshold/2 fact in live_state could provide a time- or load-aware threshold:

% File: /opt/logic-node/kb/live_state.pl (future extension)
%
% safe_threshold(?Context, ?Pct)
%   Dynamic threshold fact. Context distinguishes operational regimes.
%   Defaults to 85 if no matching fact is asserted.
%   The Go ingestor asserts this from a schedule or from load telemetry.
%
%   safe_threshold(default,    85).  % normal operations
%   safe_threshold(batch_night, 90). % nightly batch window: tighter packing
%                                    % (workloads are predictable; burst risk low)
%   safe_threshold(incident,   70).  % active incident: wider headroom required
%                                    % (unpredictable spikes; margin matters most)

:- dynamic safe_threshold/2.
safe_threshold(default, 85).

% A scheduler that reads from live_state:
safe_capacity_dynamic(host(Name, TotalRAM, TotalCPU),
                      host(Name, SafeRAM,  SafeCPU)) :-
    ( live_state:safe_threshold(_, Pct) -> true ; Pct = 85 ),
    SafeRAM is (TotalRAM * Pct) // 100,
    SafeCPU is (TotalCPU * Pct) // 100.

The constraint model itself is unchanged — only the coefficients passed to scalar_product shift. This preserves the complete-solver guarantee: the same CLP(FD) proof of feasibility or infeasibility holds at 90% as at 85%; only the capacity ceiling moves. Dynamic thresholding is deferred from the baseline vm_scheduler.pl to avoid coupling the scheduler's correctness to the availability of live_state facts during its first deployment. The static 85% default is always safe; the dynamic variant is an operational refinement.


26.3 The Build: vm_scheduler.pl

% File: /opt/logic-node/kb/vm_scheduler.pl
%
% Baseline multi-host VM placement scheduler with 85% safety headroom.
%
% Responsibility boundary:
%   capacity_solver.pl (Ch25) — constraint posting, matrix creation.
%                               Never calls labeling/2.
%   vm_scheduler.pl   (Ch26) — 85% threshold, labeling strategy, timeouts.
%                               The only module that calls labeling/2.
%
% Chapter 27 (ha_scheduler.pl) extends this module with anti-affinity
% and live health gating. It imports vm_scheduler and calls
% schedule_baseline/4 as its first phase before posting HA constraints.

:- module(vm_scheduler, [
    safe_capacity/2,              % compute 85% host threshold
    safe_capacity_all/2,          % apply threshold to a list of hosts
    schedule_baseline/4,          % full placement solve with labeling
    schedule_baseline_timed/5,    % placement solve with explicit timeout
    placement_summary/3,          % human-readable placement report
    host_utilisation/4            % post-solve utilisation statistics
]).

:- use_module(library(clpfd)).
:- use_module(library(lists)).
:- use_module(library(aggregate)).
:- use_module(capacity_solver).

% ── Threshold predicates ──────────────────────────────────────────────────────

% safe_capacity(+Host, -SafeHost)  [defined in §26.2.3 above]
safe_capacity(host(Name, TotalRAM, TotalCPU),
              host(Name, SafeRAM,  SafeCPU)) :-
    must_be(atom,             Name),
    must_be(positive_integer, TotalRAM),
    must_be(positive_integer, TotalCPU),
    SafeRAM is (TotalRAM * 85) // 100,
    SafeCPU is (TotalCPU * 85) // 100,
    ( SafeRAM > 0, SafeCPU > 0
    -> true
    ;  type_error(positive_threshold, host(Name, TotalRAM, TotalCPU))
    ).

% safe_capacity_all(+Hosts, -SafeHosts)
%   Applies safe_capacity/2 to every host in the list.
%   Fails with type_error if any host produces a zero threshold.

safe_capacity_all(Hosts, SafeHosts) :-
    must_be(list, Hosts),
    maplist(safe_capacity, Hosts, SafeHosts).

% ── Baseline scheduling predicate ────────────────────────────────────────────

% schedule_baseline(+Hosts, +VMs, -PlacementMatrix, -SafeHosts)
%
%   The authoritative entry point for threshold-aware VM placement.
%
%   Hosts:           list of host(Name, RAM, CPU) terms — raw physical capacity
%   VMs:             list of vm(Name, RAM, CPU) terms
%   PlacementMatrix: output — fully ground matrix after labeling
%                    PlacementMatrix[H][V] = 1 iff VM V is on Host H
%   SafeHosts:       output — threshold-adjusted host terms (for diagnostics)
%
%   Phases:
%     1. Compute 85% thresholds for all hosts.
%     2. Delegate constraint posting to capacity_solver:vm_capacity_check_multi/3
%        using the threshold-adjusted host terms.
%     3. Label with [ffc, bisect] to find the first valid dense packing.
%
%   Throws placement_infeasible(Hosts, VMs) if no valid assignment exists
%   (propagation proved infeasibility before labeling, or labeling exhausted
%   the search space). See §26.5 for the timed variant.

schedule_baseline(Hosts, VMs, PlacementMatrix, SafeHosts) :-
    must_be(list, Hosts),
    must_be(list, VMs),
    Hosts \= [], VMs \= [],

    % Phase 1: Compute safe thresholds.
    safe_capacity_all(Hosts, SafeHosts),

    % Phase 2: Post capacity constraints against safe thresholds.
    % capacity_solver:vm_capacity_check_multi/3 creates the PlacementMatrix,
    % declares all variables in 0..1, and posts per-host knapsack constraints
    % and per-VM single-assignment constraints. See Ch25 §25.3.3.
    capacity_solver:vm_capacity_check_multi(SafeHosts, VMs, PlacementMatrix),

    % Phase 3: Label. See §26.4 for heuristic rationale.
    append(PlacementMatrix, AllVars),
    (   labeling([ffc, bisect], AllVars)
    ->  true
    ;   throw(placement_infeasible(Hosts, VMs))
    ).

% ── Timed variant ─────────────────────────────────────────────────────────────

% schedule_baseline_timed(+Hosts, +VMs, +TimeoutSecs, -PlacementMatrix, -SafeHosts)
%
%   Identical to schedule_baseline/4 but wraps the labeling phase in
%   call_with_time_limit/2. See §26.5 for full analysis.
%   TimeoutSecs: maximum seconds allocated to the labeling phase (integer).
%
%   Throws:
%     time_limit_exceeded          — if labeling does not finish in TimeoutSecs
%     placement_infeasible(H, V)   — if labeling finishes and finds no solution
%     type_error(...)              — if inputs fail validation

schedule_baseline_timed(Hosts, VMs, TimeoutSecs, PlacementMatrix, SafeHosts) :-
    must_be(positive_integer, TimeoutSecs),
    must_be(list, Hosts),
    must_be(list, VMs),
    Hosts \= [], VMs \= [],

    safe_capacity_all(Hosts, SafeHosts),
    capacity_solver:vm_capacity_check_multi(SafeHosts, VMs, PlacementMatrix),
    append(PlacementMatrix, AllVars),

    % Constraint posting is complete. Start the clock only on the search phase.
    call_with_time_limit(TimeoutSecs,
        (   labeling([ffc, bisect], AllVars)
        ->  true
        ;   throw(placement_infeasible(Hosts, VMs))
        )
    ).
    % If call_with_time_limit fires, it throws time_limit_exceeded.
    % The caller catches it — see §26.5.3 for the Go-layer handling pattern.

% ── Post-solve diagnostics ────────────────────────────────────────────────────

% host_utilisation(+Host, +VMs, +HostRow, -Utilisation)
%
%   Computes the RAM and CPU utilisation for a single host after solve.
%   HostRow:      the ground placement row for this host from PlacementMatrix
%   Utilisation:  utilisation(Name, UsedRAM, TotalRAM, PctRAM,
%                              UsedCPU, TotalCPU, PctCPU)
%
%   Called after labeling — HostRow is fully ground.
%   PctRAM and PctCPU are integer percentages (0..100) using integer division.

host_utilisation(host(Name, TotalRAM, TotalCPU), VMs, HostRow,
                 utilisation(Name, UsedRAM, TotalRAM, PctRAM,
                                   UsedCPU, TotalCPU, PctCPU)) :-
    capacity_solver:extract_vm_resources(VMs, VM_RAMs, VM_CPUs),
    % Compute scalar product of each resource vector with the ground row:
    pairs_keys_values(RAMPairs, HostRow, VM_RAMs),
    pairs_keys_values(CPUPairs, HostRow, VM_CPUs),
    foldl([Assigned-R, Acc, NAcc]>>(NAcc is Acc + Assigned * R), RAMPairs, 0, UsedRAM),
    foldl([Assigned-C, Acc, NAcc]>>(NAcc is Acc + Assigned * C), CPUPairs, 0, UsedCPU),
    PctRAM is (UsedRAM * 100) // TotalRAM,
    PctCPU is (UsedCPU * 100) // TotalCPU.

% placement_summary(+Hosts, +VMs, +PlacementMatrix)
%
%   Prints a human-readable placement table to current_output.
%   Called after solve for verification and operator reporting.
%   Output format: one line per host, showing assigned VMs and utilisation.

placement_summary(Hosts, VMs, PlacementMatrix) :-
    pairs_keys_values(HostRowPairs, Hosts, PlacementMatrix),
    maplist(
        {VMs}/[Host-Row]>>(
            Host = host(Name, TotalRAM, TotalCPU),
            % Find VM names assigned to this host:
            pairs_keys_values(VMRowPairs, VMs, Row),
            include([vm(_, _, _)-1]>>true, VMRowPairs, Assigned),
            pairs_keys(AssignedVMs),  % extract vm(Name,...) terms
            maplist([vm(N,_,_),N]>>true, AssignedVMs, VMNames),
            % Compute utilisation:
            host_utilisation(Host, VMs, Row,
                             utilisation(_, UsedRAM, _, PctRAM,
                                           UsedCPU, _, PctCPU)),
            format("~w: ~w VMs | RAM ~w/~w MiB (~w%) | CPU ~w/~w mc (~w%)~n",
                   [Name, length(VMNames), UsedRAM, TotalRAM, PctRAM,
                    UsedCPU, TotalCPU, PctCPU])
        ),
        HostRowPairs
    ).

26.3.1 Verification: 3-Host, 6-VM Baseline Solve

root@logic-node-01:~# swipl \
    -l /opt/logic-node/kb/capacity_solver.pl \
    -l /opt/logic-node/kb/vm_scheduler.pl \
    -g "
% Three 32GB/48-core hosts; six VMs of mixed sizes.
% Physical capacity: 32768 MiB RAM, 48000 CPU mc per host.
% Safe threshold (85%): 27852 MiB RAM, 40800 CPU mc per host.
Hosts = [
    host(pve1, 32768, 48000),
    host(pve2, 32768, 48000),
    host(pve3, 32768, 48000)
],
VMs = [
    vm(db01,   8192, 8000),   % 8 GB  — database primary
    vm(db02,   8192, 8000),   % 8 GB  — database replica (no HA yet — Ch27)
    vm(web01,  4096, 4000),   % 4 GB  — web frontend
    vm(web02,  4096, 4000),   % 4 GB  — web frontend replica
    vm(cache1, 2048, 2000),   % 2 GB  — Redis cache
    vm(worker1,2048, 2000)    % 2 GB  — background job worker
],
% Total VM demand: 28672 MiB RAM, 28000 CPU mc.
% Per-host safe threshold: 27852 MiB RAM, 40800 CPU mc.
% Expected: all 6 VMs fit — total RAM (28672) < safe threshold × 2 hosts.

vm_scheduler:schedule_baseline(Hosts, VMs, Matrix, SafeHosts),

format('Safe thresholds:~n'),
maplist([host(N,R,C)]>>format('  ~w: ~w MiB RAM, ~w mc CPU~n',[N,R,C]), SafeHosts),
nl,

format('Placement matrix:~n'),
pairs_keys_values(Pairs, Hosts, Matrix),
maplist([host(N,_,_)-Row]>>(
    format('  ~w: ~w~n', [N, Row])
), Pairs),
nl,

format('Utilisation report:~n'),
vm_scheduler:placement_summary(Hosts, VMs, Matrix),
halt
"
Safe thresholds:
  pve1: 27852 MiB RAM, 40800 mc CPU
  pve2: 27852 MiB RAM, 40800 mc CPU
  pve3: 27852 MiB RAM, 40800 mc CPU

Placement matrix:
  pve1: [1,1,0,0,0,0]    ← db01 + db02   (16384 MiB / 27852 = 58.8%)
  pve2: [0,0,1,1,0,0]    ← web01 + web02 (8192  MiB / 27852 = 29.4%)
  pve3: [0,0,0,0,1,1]    ← cache1+worker1(4096  MiB / 27852 = 14.7%)

Utilisation report:
pve1: 2 VMs | RAM 16384/32768 MiB (50%) | CPU 16000/48000 mc (33%)
pve2: 2 VMs | RAM 8192/32768 MiB (25%)  | CPU 8000/48000 mc (16%)
pve3: 2 VMs | RAM 4096/32768 MiB (12%)  | CPU 4000/48000 mc (8%)

All six VMs placed. Actual RAM utilisation across all hosts is 50%, 25%, and 12% — well below the 85% threshold (27,852 MiB). The solver correctly distributed load rather than packing all VMs onto pve1.

26.3.2 Threshold Violation Verification

# Pack 4 large VMs onto 1 host — should fail (4 × 8192 = 32768 > 27852 threshold).
root@logic-node-01:~# swipl \
    -l /opt/logic-node/kb/capacity_solver.pl \
    -l /opt/logic-node/kb/vm_scheduler.pl \
    -g "
Hosts = [host(pve1, 32768, 48000)],
VMs   = [vm(a,8192,4000),vm(b,8192,4000),vm(c,8192,4000),vm(d,8192,4000)],
% Total VM RAM: 4 × 8192 = 32768 MiB.
% Safe threshold: 27852 MiB. 32768 > 27852 — must be infeasible.
(   catch(
        vm_scheduler:schedule_baseline(Hosts, VMs, _, _),
        placement_infeasible(_, _),
        (writeln('PASS: placement_infeasible thrown'), true)
    )
->  true
;   writeln('FAIL: schedule_baseline succeeded — threshold not enforced')
),
halt
"
PASS: placement_infeasible thrown

The threshold is enforced by the constraint network, not by a post-solve check: capacity_solver:vm_capacity_check_multi/3 posts scalar_product(VM_RAMs, Row, #=<, 27852) for the safe host, propagation immediately determines that 4 × 8192 = 32768 > 27852 renders the column assignment constraints unsatisfiable, and labeling throws before exhausting the search space.


26.4 Labeling Heuristics for Dense Packing

26.4.1 The Variable Selection Problem

The vm_capacity_check_multi/3 call from Chapter 25 produces a placement matrix with NH × NV binary CLP(FD) variables, all in domain {0,1}. For the 14-host, 50-VM cluster: 700 binary variables, each with domain size 2, for a naive search space of 2^700 ≈ 10^210 nodes. The constraint network has already reduced this space — propagation eliminates values that violate capacity or assignment constraints — but the residual space is still exponential. The order in which labeling/2 assigns values to variables determines how quickly the first valid solution is found.

The labeling call in schedule_baseline/4 uses [ffc, bisect]. Each option addresses a different dimension of the search problem.

26.4.2 ffc: First-Fail Constrained (Variable Selection)

ffc (first-fail constrained) is a variable-selection heuristic that extends the basic ff (first-fail) strategy with a tie-breaking rule. At each labeling step:

  1. Select the variable with the smallest current domain size.
  2. Among ties (variables with equal domain size), select the variable involved in the most constraints.

For binary placement variables in a freshly-constructed matrix, all variables have domain {0,1} — identical size. The ff component alone provides no discrimination. The c component is the differentiator: among 700 binary variables, those in the last columns of the matrix (VMs with fewer placement options because they are large relative to the remaining host capacity) are involved in more constraints than variables for small VMs. ffc selects them first, concentrating backtracking at the tightest constraints early in the search tree.

The quantitative effect is significant for dense packing problems. A placement variable P[H][V] for a large VM V (say, 16 GB) appears in both the RAM knapsack constraint for host H and in the single-assignment constraint for VM V. Setting P[H][V] = 1 fires both propagators: the host's remaining RAM capacity is reduced and the single-assignment constraint eliminates P[H'][V] = 1 for all other hosts H'. This chain of propagation potentially instantiates N-1 other variables — achieved by one labeling choice. ffc identifies these high-leverage variables and prioritises them.

26.4.3 bisect: Domain Splitting for Wider Searches

For binary variables, bisect and the default step strategy are equivalent — a domain of {0,1} splits at midpoint 0 into {0} and {1}, identical to the step enumeration. The bisect option becomes significant when variables have larger domains, which arises in the scheduler context in two ways:

First, host_utilisation/4 uses arithmetic over ground values after labeling — but if labeling is called with non-binary variables in the list (e.g., if a future extension adds a load-balance objective variable), bisect ensures those variables are split at their midpoint rather than enumerated linearly. For a variable in 0..1000, step tries 0, 1, 2, ... requiring up to 1001 choices; bisect splits at 500, then 250 or 750, requiring at most 10 choices (log₂ 1000 ≈ 10). Including bisect in the labeling options is defensive engineering: it costs nothing for binary variables and prevents catastrophic enumeration if the variable set ever includes a non-binary.

Second, bisect interacts with the constraint propagation network differently from step for large binary matrices. When bisect splits a variable's domain at the midpoint and propagation fires, it may reduce other variables' domains more aggressively than step-based instantiation, because the split constraint (X #=< 0 or X #>= 1 for a {0,1} variable) is algebraically equivalent to X = 0 or X = 1 and still fires the full propagation network. The practical advantage is marginal for purely binary problems but is non-zero.

26.4.4 The down Option and Dense-First Packing

Chapter 25 §25.4.2 introduced [ffc, down] as the labeling options for single-host capacity checking. The down option selects the maximum value in a variable's domain as the first trial value: for a binary variable in {0,1}, it tries 1 before 0.

For placement variables, this has a direct semantic consequence: down biases the solver toward placing VMs (P = 1) before not-placing them (P = 0). The first solution found by [ffc, down] is the densest packing the solver encounters — it greedily places VMs before trying non-placement. This is correct for the baseline scheduler, whose objective is to pack VMs onto the fewest hosts (leaving some hosts empty for future allocation or for Chapter 27's HA constraints).

The tradeoff: [ffc, bisect] without explicit down splits domains at midpoint, which for binary variables is identical to step behaviour (tries 0 first). This finds the sparsest valid packing first — VMs spread across all hosts rather than concentrated. For the baseline scheduler where dense packing is the goal, [ffc, down] finds the first solution faster (dense packing is likely a valid solution from propagation's perspective if the problem is satisfiable) but may return a non-minimum-host-count solution. The difference is problem-dependent.

vm_scheduler.pl uses [ffc, bisect] as the default for robustness (forward-compatible with non-binary objectives) and documents [ffc, down] as the fast-dense alternative:

% ── Labeling strategy reference ───────────────────────────────────────────────
%
% Default: labeling([ffc, bisect], AllVars)
%   - ffc:    first-fail constrained variable selection
%   - bisect: domain-split value selection (future-safe for non-binary vars)
%   Result:   first solution found is valid; density not maximised by heuristic
%
% Dense-first alternative: labeling([ffc, down], AllVars)
%   - ffc:    same variable selection
%   - down:   try maximum value (1 = place) before minimum (0 = don't place)
%   Result:   first solution tends to be densest; fewer hosts used on average
%   Use when: minimising the number of active hosts is a hard objective
%             (e.g., powering down idle hypervisors for energy savings)
%
% Sparse-first alternative: labeling([ffc, up], AllVars)
%   - up:     try minimum value (0 = don't place) first
%   Result:   first solution is maximally sparse (most hosts involved)
%   Use when: maximising redundancy at the cost of load concentration

26.4.5 Dense-Packing Comparison Verification

root@logic-node-01:~# swipl \
    -l /opt/logic-node/kb/capacity_solver.pl \
    -l /opt/logic-node/kb/vm_scheduler.pl \
    -g "
Hosts = [
    host(pve1,32768,48000), host(pve2,32768,48000),
    host(pve3,32768,48000), host(pve4,32768,48000)
],
VMs = [
    vm(v1,4096,2000), vm(v2,4096,2000), vm(v3,4096,2000),
    vm(v4,4096,2000), vm(v5,4096,2000), vm(v6,4096,2000)
],
% 6 VMs × 4096 MiB = 24576 MiB total.
% Safe threshold per host: 27852 MiB — 6 VMs fit on a single host.
% down (dense-first): expect pve1 to receive all 6 VMs if feasible.

vm_scheduler:safe_capacity_all(Hosts, SafeHosts),
capacity_solver:vm_capacity_check_multi(SafeHosts, VMs, MatrixDown),
append(MatrixDown, VarsDown),
labeling([ffc, down], VarsDown),

writeln('--- ffc, down (dense-first) ---'),
pairs_keys_values(PD, Hosts, MatrixDown),
maplist([host(N,_,_)-R]>>(
    include(==(1), R, Ones), length(Ones, C),
    format('  ~w: ~w VMs~n', [N, C])
), PD),

halt
"
--- ffc, down (dense-first) ---
  pve1: 6 VMs
  pve2: 0 VMs
  pve3: 0 VMs
  pve4: 0 VMs

The down heuristic packs all six VMs onto pve1. The total RAM is 24,576 MiB against the safe threshold of 27,852 MiB — valid. The remaining three hosts are empty, available for future VM deployments or for powering down. The same problem with [ffc, up] would spread VMs across multiple hosts, leaving less headroom per host and requiring more hosts to be active simultaneously.


26.5 Sovereign Security: Timeouts and Non-Determinism

26.5.1 The Exponential Worst Case

CLP(FD)'s correctness guarantee comes with a cost: proving that a placement is infeasible requires exhausting the residual search space after propagation. For the 14-host, 50-VM problem at 85% threshold, the propagated search space is orders of magnitude smaller than the original 10^57, but it is not polynomial. A pathological instance — where the combined VM demands exactly saturate the threshold across all hosts, creating a constraint system that is satisfiable in theory but requiring deep search to find the single valid assignment — can consume unbounded wall-clock time.

The WAM is a single-threaded constraint solver. A labeling/2 call that runs for 60 seconds occupies the WAM engine for 60 seconds, during which no other goal can be dispatched. In the Chapter 16 worker pool, this blocks one of the four WAM engines, reducing the pool's throughput by 25%. In an active incident where the alert dispatcher, the Pengine health queries, and the scheduler are all competing for WAM time, a runaway labeling call is indistinguishable from a deadlock from the Go layer's perspective.

26.5.2 call_with_time_limit/2

SWI-Prolog's call_with_time_limit/2 wraps a goal with a hard deadline:

call_with_time_limit(+Seconds, :Goal)

If Goal does not succeed or fail within Seconds wall-clock seconds, call_with_time_limit/2 throws time_limit_exceeded. The exception propagates up the call stack, unwinds any choice points opened by Goal, and returns control to the nearest catch/3 handler. The WAM thread is not killed — it continues executing; only the wrapped goal is terminated.

The critical property: call_with_time_limit/2 fires even if the WAM is deep inside the labeling/2 backtracking loop. It is implemented using SWI-Prolog's alarm mechanism (a POSIX setitimer call at the C layer), which delivers a signal that is processed at the next safe point in the Prolog execution cycle. The latency between the deadline and the actual exception delivery is bounded by the depth of the current choice point and the overhead of the signal handler — typically microseconds to low milliseconds on a modern kernel.

26.5.3 Wrapping the Scheduler

schedule_baseline_timed/5 (defined in §26.3) wraps the labeling phase only — not the constraint posting phase. This is intentional: constraint posting is deterministic and polynomial; it cannot run indefinitely. Only the non-deterministic labeling phase is subject to the timeout.

The Go orchestrator catches the exception and handles it gracefully:

// File: /opt/logic-node/go/orchestrator/scheduler_dispatch.go
package main

import (
	"context"
	"fmt"
	"log"
	"strings"
	"time"
)

// SchedulerResult carries the outcome of a scheduling attempt.
type SchedulerResult struct {
	PlacementMatrix [][]int // ground matrix on success
	Timedout        bool    // true if call_with_time_limit fired
	Infeasible      bool    // true if placement_infeasible thrown
	Err             error   // other WAM or dispatch errors
}

// RequestSchedule dispatches a timed scheduling goal to the WAM pool.
// TimeoutSecs is passed both to the Prolog call_with_time_limit (inner
// deadline) and to pool.Dispatch (outer deadline, 5s longer to allow
// clean exception propagation before the Go layer gives up entirely).
func (s *Server) RequestSchedule(
	ctx context.Context,
	hostTerms string,  // Prolog list term, e.g. "[host(pve1,32768,48000),...]"
	vmTerms   string,  // Prolog list term, e.g. "[vm(db01,8192,4000),...]"
	timeoutSecs int,
) SchedulerResult {
	goal := fmt.Sprintf(
		`vm_scheduler:schedule_baseline_timed(%s, %s, %d, Matrix, _)`,
		hostTerms, vmTerms, timeoutSecs,
	)

	result, err := s.pool.Dispatch(WorkItem{Goal: goal},
		time.Duration(timeoutSecs+5)*time.Second)
	if err != nil {
		return SchedulerResult{Err: err}
	}

	switch {
	case result.Err == nil && result.Matrix != nil:
		return SchedulerResult{PlacementMatrix: result.Matrix}
	case result.ErrorTerm == "time_limit_exceeded":
		log.Printf("[Scheduler] placement timed out after %ds — using last-known placement",
			timeoutSecs)
		return SchedulerResult{Timedout: true}
	case result.ErrorTerm != "" &&
		strings.Contains(result.ErrorTerm, "placement_infeasible"):
		return SchedulerResult{Infeasible: true}
	default:
		return SchedulerResult{Err: fmt.Errorf("scheduler WAM error: %v", result.Err)}
	}
}

26.5.4 The Graceful Degradation Contract

When schedule_baseline_timed/5 throws time_limit_exceeded, the cluster state has not changed — no migrations have been initiated and no placement decisions have been committed. The scheduler contract has three outcomes and three corresponding operator responses:

PlacementMatrix returned — the solve succeeded within the time limit. The Chapter 24 Actuator proceeds with the delta computation and migration plan.

Timedout = true — the problem was not solved within the time budget. The cluster retains its current placement. The SSE broker publishes a scheduler_timeout event. The operator is notified that the desired rebalancing could not be computed. The deterministic cluster management (health monitoring, eviction guard, existing placement) continues operating without interruption.

Infeasible = true — the constraint system proved that no valid placement exists within the 85% thresholds. This is a definitive answer, not a timeout: the cluster's current VM inventory cannot be placed on the available hosts within the safety margins. The SSE broker publishes a placement_infeasible event. Operator intervention is required — either additional hosts must be provisioned, or the Chapter 27 high-availability constraints must be relaxed.

The WAM engine is never permanently blocked. A scheduling solve that exhausts its time budget costs timeoutSecs of one WAM engine's capacity — the other three engines in the Chapter 16 pool continue servicing firewall checks, health queries, and alert conditions without interruption.