Chapter 21: The PromQL Oracle
The Chapter 20 VictoriaMetrics stack is an instrumented system: 14 hypervisors emit 1,200 metrics each every 15 seconds, and all of it lands in a columnar merge-tree on obs-01 with 12 months of retention — but raw ingest is not reasoning, and a time-series database that nobody queries is an expensive disk heater. This chapter builds the translation layer between stored telemetry and logical assertion: first as a manual discipline — teaching the operator to read CPU steal, ZFS latency, and ARC miss rate in PromQL directly before any automation is introduced — then as a Prolog meta-interpreter, the Oracle, that constructs type-safe, injection-hardened PromQL strings from ground terms and dispatches them against the VictoriaMetrics HTTP API, producing the structured JSON that Chapter 22 will parse and assert as node_metric/4 facts into the live WAM.
21.1 Functional Telemetry: The Anatomy of PromQL
21.1.1 Instant Vectors and Range Vectors
Every PromQL expression operates on one of two fundamental data types. An instant vector is a set of time-series samples evaluated at a single point in time — each element is a metric name with a label set and one floating-point value. A range vector is a set of time-series samples evaluated over a sliding window of time — each element is a metric name with a label set and a sequence of (timestamp, value) pairs spanning the window duration.
The distinction is not stylistic — it determines which functions are applicable and what the query engine returns.
Instant Vector — evaluated at a single timestamp t:
node_memory_MemAvailable_bytes{instance="pve3:9100"} → 8589934592.0
node_memory_MemAvailable_bytes{instance="pve7:9100"} → 4294967296.0
Query: node_memory_MemAvailable_bytes
Result type: instant vector
Applicable functions: arithmetic operators, label_replace(), topk(), bottomk()
Not applicable: rate(), irate(), increase(), delta()
Range Vector — a sliding window of samples:
node_cpu_seconds_total{mode="steal",instance="pve3:9100"}[5m]
→ [(t-300, 1847.2), (t-285, 1847.6), ..., (t-0, 1862.9)]
Query: node_cpu_seconds_total{mode="steal"}[5m]
Result type: range vector
Required input for: rate(), irate(), increase(), delta()
Not directly graphable — must be wrapped in a function first
Gauges (metrics whose value can rise or fall arbitrarily, such as node_memory_MemAvailable_bytes, node_filesystem_avail_bytes, or node_load1) are read directly as instant vectors. They require no rate calculation: the sample at time t is the measurement at time t. Querying node_memory_MemAvailable_bytes{instance="pve3:9100"} returns the most recently scraped number of bytes the kernel reported available, as of the query timestamp.
Counters — metrics that only increase, reset to zero on process restart, and represent a cumulative total, such as node_cpu_seconds_total, node_network_receive_bytes_total, or node_disk_reads_completed_total — are meaningless as instant vectors. The raw counter value at time t is the total accumulated since the process started; the number 1,862,900 for node_cpu_seconds_total{mode="steal"} tells you only that this CPU has been stolen for approximately 21 days of cumulative time, not what the steal rate is right now. Counter values only become operationally meaningful when wrapped in rate() or irate() over a range vector, which computes the per-second rate of increase across the window.
21.1.2 rate() vs irate() on Counters
rate(v[d]) computes the average per-second rate of increase of a counter over the range window d. In essence it takes the increase between the first and last samples in the window and divides by the elapsed time, after compensating for counter resets: a sample lower than its predecessor is treated as a restart from zero, so the increase keeps accumulating across the reset. Prometheus additionally extrapolates the result to the window boundaries; VictoriaMetrics does not. The result is a smoothed rate: a 5-minute rate() window averages out momentary spikes that occur within the 5-minute interval.
irate(v[d]) computes the per-second increase rate using only the last two samples in the range window, regardless of window size. The window size exists only to control how far back the query engine looks for those two samples; it does not affect the computation. The result is an instantaneous rate: it reflects the delta between the two most recent scrapes, which are 15 seconds apart in this cluster. Spikes that are averaged away in rate([5m]) are fully visible in irate([5m]).
Counter: node_network_receive_bytes_total{instance="pve3:9100",device="eth0"}
Samples in the last 5m window:
t-300s: bytes=109_837_204_480
t-285s: bytes=109_837_452_120
t-270s: bytes=109_837_712_840
...
t-15s: bytes=109_843_891_200
t-0s: bytes=109_844_148_820
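rate([5m]) = (109_844_148_820 - 109_837_204_480) / 300s
= 6_944_340 bytes / 300s
≈ 23_148 bytes/s
≈ 23.1 kB/s average over 5 minutes
irate([5m]) = (109_844_148_820 - 109_843_891_200) / 15s
= 257_620 bytes / 15s
≈ 17_175 bytes/s
≈ 17.2 kB/s at this instant
Here the windowed average is higher than the instantaneous value: the window contained heavier traffic earlier on that the last scrape pair no longer shows.
The two values diverge substantially during transient traffic bursts.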
rate([5m]) is appropriate for capacity dashboards and alert thresholds.
irate([5m]) is appropriate for detecting sub-minute throughput spikes.
For the sovereign cluster's network metrics, the choice depends on the signal being monitored. Sustained throughput trends that inform capacity planning use rate([5m]) — the smoothing is a feature, not a deficiency. Frame drop detection and burst detection use irate([1m]) or irate([5m]) — the instantaneous nature surfaces the spike before it vanishes into a 5-minute average.
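The contrast can be reproduced outside the query engine. The sketch below is a simplified model rather than the VictoriaMetrics implementation: it ignores counter resets and window-boundary handling, and the series (a steady 1,000 bytes/s followed by a 50,000 bytes/s burst in the final scrape interval) is hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, cumulative counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def simple_irate(samples: List[Sample]) -> float:
    """Per-second increase between the last two samples only."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)


# Hypothetical counter scraped every 15 s: steady 1,000 bytes/s for 285 s ...
samples: List[Sample] = [(float(t), 1000.0 * t) for t in range(0, 300, 15)]
# ... then a burst of 50,000 bytes/s during the final 15 s interval.
last_t, last_v = samples[-1]
samples.append((last_t + 15.0, last_v + 50_000.0 * 15.0))

print(f"rate  over 5m: {simple_rate(samples):.0f} bytes/s")   # 3450
print(f"irate over 5m: {simple_irate(samples):.0f} bytes/s")  # 50000
```

The burst is more than fourteen times larger in the irate-style value than in the windowed average, which is the trade-off the choice between the two functions rests on.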
21.1.3 CPU Steal: The Primary Oversubscription Signal
CPU steal (mode="steal" in node_cpu_seconds_total) is the time a virtual machine's vCPU spent waiting for a real CPU cycle that the hypervisor scheduler could not provide because the physical core was assigned to another VM. On a correctly provisioned Proxmox cluster, steal is near zero — a hypervisor with adequate physical cores relative to vCPU count delivers scheduled time on demand. Steal above 10% for more than two consecutive scrape intervals indicates the physical host is oversubscribed: more vCPUs are demanding CPU time than physical threads exist to serve them, and VMs are losing scheduled time to other tenants.
node_cpu_seconds_total is a per-CPU-per-mode counter. On a 32-thread Proxmox host there are 32 × 8 = 256 series, one per CPU for each of the eight modes (user, system, idle, steal, iowait, irq, softirq, nice). To compute the total steal percentage across all CPUs for a single hypervisor instance:
# Steal rate per CPU, per instance — raw irate:
irate(node_cpu_seconds_total{mode="steal"}[5m])
# Average across all CPUs of each instance. avg() is required because
# sum() gives total steal-seconds per second across CPUs, not a
# percentage. avg() gives fractional steal per CPU (0.0–1.0).
avg(irate(node_cpu_seconds_total{mode="steal"}[5m])) by (instance)
# As a percentage (0–100) for a single hypervisor:
avg(irate(node_cpu_seconds_total{mode="steal",instance="pve3:9100"}[5m])) by (instance) * 100
# Across all instances, one series per hypervisor (the cluster-wide steal heat map):
avg(irate(node_cpu_seconds_total{mode="steal"}[5m])) by (instance) * 100
sum(irate(node_cpu_seconds_total{mode="steal"}[5m])) by (instance) produces steal-seconds per second summed across all CPUs, which is a dimensionally consistent number but not a percentage. A 32-CPU host with all CPUs stealing 5% of time produces a sum() of 1.6 (32 × 0.05), not 5%. The avg() produces 0.05, which multiplied by 100 gives the operationally meaningful 5%. The Chapter 22 Prolog assertion cpu_steal_valid(Node, Value) validates Value in the range 0.0–100.0 and applies a delta guard; the query that produces Value must use avg(), not sum().
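The dimensional difference between the two aggregations is easy to check by hand. The following sketch mirrors the 32-CPU example with hypothetical per-CPU values; steal_percent is an illustrative helper name, not part of the Oracle.

```python
from typing import List


def steal_percent(per_cpu_fraction: List[float]) -> float:
    """Mean steal across CPUs, scaled to 0-100 (mirrors avg(...) * 100)."""
    return sum(per_cpu_fraction) / len(per_cpu_fraction) * 100.0


# Hypothetical 32-thread host with every CPU stealing 5% of its time.
per_cpu = [0.05] * 32

print(f"sum() analogue: {sum(per_cpu):.2f}")            # 1.60
print(f"avg() * 100   : {steal_percent(per_cpu):.1f}")  # 5.0
```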
irate([5m]) is correct for steal rather than rate([5m]) because steal is a symptom of momentary scheduling pressure. A 5-minute rate() average hides a 30-second spike where steal reaches 80%, which is precisely the condition that causes VM latency complaints, by averaging it with four and a half minutes of near-zero. The 5-minute window on irate() ensures the query engine can still find two samples if a scrape fails, while the rate itself is computed over the most recent 15-second interval.
21.2 Visual to Logical: Grafana Interpretation
21.2.1 The Visual Debugging Methodology
Before building the Oracle, the operator must develop the discipline of reading telemetry patterns manually, because the Oracle's query set is derived from what the operator already knows is meaningful to monitor. An Oracle that queries the wrong metrics with correct syntax is operationally useless.
The methodology for visual debugging has three steps: identify the symptom panel that shows the anomaly, identify the metric and label that panel's query is built from, then construct a PromQL query that expresses that anomaly as a scalar threshold crossable by the logic engine. The dashboard is the discovery layer; PromQL is the formalisation layer; Prolog is the assertion layer.
21.2.2 ZFS I/O Latency and ARC Miss Rate
The block devices beneath ZFS expose I/O timing through two related metric families from node_exporter's diskstats collector. node_disk_io_time_seconds_total is a counter of total time the device spent doing I/O (reads and writes combined). node_disk_read_time_seconds_total and node_disk_write_time_seconds_total are per-operation-type time counters. Dividing the rate of operation time by the rate of completed operations gives the average latency per operation.
# Total I/O utilisation (percentage of time the disk was busy) — a gauge-like
# interpretation of a counter rate:
rate(node_disk_io_time_seconds_total{instance="pve3:9100",device="sda"}[5m]) * 100
# Result: percentage of wall-clock time the device was performing I/O.
# At 100% the device is saturated — all available I/O slots are occupied.
# Healthy NVMe: < 30% sustained. Alert threshold: > 70% for > 2 minutes.
# Average read latency per operation:
rate(node_disk_read_time_seconds_total{instance="pve3:9100",device="sda"}[5m])
/
rate(node_disk_reads_completed_total{instance="pve3:9100",device="sda"}[5m])
# Result: seconds per read operation. Multiply by 1000 for milliseconds.
# Healthy ZFS NVMe: < 0.5ms average read latency.
# Alert threshold: > 5ms sustained (10× normal = likely hardware degradation).
# The diagnostic pair: io_time vs read_time separates utilisation from latency.
# High io_time + low read_latency = high throughput, healthy device.
# High io_time + high read_latency = saturation or hardware failure.
# Low io_time + high read_latency = firmware issue or intermittent fault.
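The three readings listed in the comments above can be written as a small decision function. This is a sketch under stated assumptions: the 70% utilisation and 5 ms latency cut-offs are the alert thresholds quoted in the listing, while classify_disk and the "nominal" label are hypothetical names introduced here.

```python
def classify_disk(io_util_pct: float, read_latency_ms: float,
                  util_high: float = 70.0, latency_high: float = 5.0) -> str:
    """Map the (utilisation, latency) pair to the readings listed above."""
    busy = io_util_pct > util_high
    slow = read_latency_ms > latency_high
    if busy and slow:
        return "saturation or hardware failure"
    if busy:
        return "high throughput, healthy device"
    if slow:
        return "firmware issue or intermittent fault"
    return "nominal"  # label not from the chapter; added for completeness


print(classify_disk(85.0, 0.3))   # high throughput, healthy device
print(classify_disk(85.0, 12.0))  # saturation or hardware failure
print(classify_disk(10.0, 12.0))  # firmware issue or intermittent fault
```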
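ZFS ARC (Adaptive Replacement Cache) miss rate is the share of cache lookups that could not be served from the ARC and fell through to L2ARC or disk. node_exporter exposes ARC statistics on Linux only through its zfs collector, so confirm that the collector is active on every hypervisor (pass --collector.zfs explicitly if the exporter's flags disable the defaults). With the collector active, the following metrics are available:
# ARC hit and miss counters (read from /proc/spl/kstat/zfs/arcstats on Linux):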
node_zfs_arc_hits_total
node_zfs_arc_misses_total
# ARC miss rate — the percentage of cache lookups that required a disk read:
rate(node_zfs_arc_misses_total{instance="pve3:9100"}[5m])
/
( rate(node_zfs_arc_hits_total{instance="pve3:9100"}[5m])
+ rate(node_zfs_arc_misses_total{instance="pve3:9100"}[5m])
) * 100
# Result: percentage of lookups that missed the ARC.
# A cold ARC after reboot: > 80% miss rate, declining over 30–60 minutes.
# Steady-state healthy: < 5% miss rate for a working set that fits in ARC.
# Alert threshold: > 20% sustained miss rate after warm-up indicates ARC
# undersized for the working set, or a read pattern that the ARC algorithm
# cannot optimise (fully random large-block reads).
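The miss-rate formula is a plain ratio, and the one edge case worth noting is a window with no lookups at all. The sketch below assumes the two inputs are the per-second rates produced by the rate() sub-expressions; arc_miss_pct is a hypothetical helper name.

```python
from typing import Optional


def arc_miss_pct(hits_per_s: float, misses_per_s: float) -> Optional[float]:
    """Miss percentage from per-second hit and miss rates.

    Returns None when there were no lookups in the window, since the
    ratio is undefined (zero divided by zero) in that case.
    """
    lookups = hits_per_s + misses_per_s
    if lookups == 0:
        return None
    return 100.0 * misses_per_s / lookups


print(arc_miss_pct(9500.0, 500.0))  # 5.0
print(arc_miss_pct(0.0, 0.0))       # None
```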
A Grafana panel showing a ZFS ARC miss rate spike without a corresponding io_time spike indicates a sudden change in access pattern — a new VM workload with a different read footprint, not a hardware failure. The visual debugging methodology distinguishes these: hardware failure produces correlated spikes across io_time, read_latency, and error_counts; workload change produces an isolated arc_miss spike with normal latency and zero error counts.
One dataset configuration detail affects the interpretability of node_disk_io_time_seconds_total for a VictoriaMetrics storage directory. By default, ZFS updates access time metadata (atime) on every file read and stores extended attributes (xattr) in separate hidden files. Both operations generate filesystem metadata I/O that the kernel's block layer counts toward io_time — the scrape includes the metadata write as device utilisation even though no metric data moved. The Chapter 20 ZFS dataset already sets atime=off and xattr=sa explicitly; the consequence for monitoring is that node_disk_io_time_seconds_total on vmpool/victoria-metrics reflects only actual data movement: merge flush writes, background compaction reads, and PromQL query reads. A node_disk_io_time_seconds_total reading on this dataset that rises from 15% to 60% is unambiguously a data I/O event — a compaction wave, a bulk query, or a write throughput spike — not filesystem housekeeping noise inflating the utilisation number. Without atime=off, the merge tree's frequent part file opens during query execution generate a constant trickle of atime update writes that superimpose a background noise floor on the metric, making threshold-based alerting less precise.
21.2.3 The Metric Pair Diagnostic for Hardware Failure
Hardware failure in a storage device produces a characteristic pattern across two distinct metric families that must be read together. node_disk_io_time_seconds_total measures time spent in I/O; node_disk_read_errors_total and node_disk_write_errors_total (via the diskstats collector) measure I/O errors returned by the device.
# I/O error rate — should be exactly zero on healthy hardware:
rate(node_disk_read_errors_total{instance="pve3:9100"}[5m])
rate(node_disk_write_errors_total{instance="pve3:9100"}[5m])
# Any non-zero value here is a hardware alert, not a threshold crossing.
# One read error per 5-minute window = pending NVMe sector reallocation.
# Sustained errors = imminent device failure.
# The diagnostic pair for hardware failure vs. workload saturation:
# Condition A: io_time high, errors zero → device is busy but healthy
# Condition B: io_time normal, errors > 0 → device is failing (write errors,
# ECC corrections not yet impacting throughput)
# Condition C: io_time high, errors > 0 → device is failing under load
# Condition D: io_time spikes, latency > 5ms, errors escalating → replace immediately
# ZFS adds a layer: even with device-level errors, ZFS resilver covers Condition B/C
# via RAID-Z or mirror redundancy. The Oracle must assert both the error condition
# AND the ZFS resilver status to correctly classify node health.
The Oracle constructed in §21.3 generates queries for both halves of this pair atomically — io_time and error_count are requested in the same query batch, and the Prolog rule storage_degraded/2 requires both values to be asserted before it produces a verdict. A node with elevated io_time but zero errors is classified storage_busy, not storage_degraded. A node with non-zero errors at any io_time level is classified storage_degraded immediately and referred to the human operator for physical inspection.
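The verdict logic described here can be sketched independently of Prolog. The function below is an illustration, not the storage_degraded/2 rule itself: the 70% utilisation cut-off is borrowed from the alert threshold in §21.2.2, and storage_verdict and the "storage_ok" label are hypothetical.

```python
from typing import Optional


def storage_verdict(io_util_pct: Optional[float],
                    error_rate: Optional[float],
                    util_high: float = 70.0) -> Optional[str]:
    """Classify a node from the io_time / error pair.

    Returns None until both values are present, mirroring the requirement
    that both halves of the pair be asserted before a verdict is produced.
    """
    if io_util_pct is None or error_rate is None:
        return None
    if error_rate > 0:
        return "storage_degraded"  # any error, at any utilisation level
    if io_util_pct > util_high:
        return "storage_busy"
    return "storage_ok"            # hypothetical label, not from the chapter


print(storage_verdict(85.0, 0.0))   # storage_busy
print(storage_verdict(12.0, 0.01))  # storage_degraded
print(storage_verdict(85.0, None))  # None
```

Checking the error rate before the utilisation reflects the ordering in the text: errors dominate, and utilisation only distinguishes busy from idle on hardware that reports none.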
21.3 The Build: The Prolog PromQL Generator
21.3.1 Design Constraints
The Oracle is a closed-vocabulary generator: it can only produce queries for metric names and label keys explicitly declared in its knowledge base. An undeclared metric name causes the predicate to fail at the known_metric/2 check before any string assembly begins, and an undeclared label key causes it to fail at the known_label/1 check. The resulting PromQL string is assembled from validated atoms only; no user-supplied string ever reaches the string concatenation layer.
The metric type declaration (counter or gauge) drives the wrapping function selection. A counter metric is always wrapped in irate() before being returned; a gauge metric is returned bare. The distinction is enforced structurally by separate clauses of build_metric_expr/5, not by a runtime conditional that could be bypassed.
21.3.2 Metric and Label Vocabulary
% File: /opt/logic-node/kb/promql_oracle.pl
:- module(promql_oracle, [
promql_query/4,
promql_query_range/5,
known_metric/2,
known_label/1
]).
% ── Metric vocabulary ─────────────────────────────────────────────────────────
%
% known_metric(+Name, +Type)
% Name: the exact metric name as scraped by node_exporter.
% Type: counter | gauge
%
% This is the closed vocabulary. promql_query/4 fails for any metric name
% not declared here. Adding a new metric requires an explicit knowledge base
% edit — it cannot be done at runtime via user input.
known_metric(node_cpu_seconds_total, counter).
known_metric(node_memory_MemAvailable_bytes, gauge).
known_metric(node_memory_MemTotal_bytes, gauge).
known_metric(node_disk_io_time_seconds_total, counter).
known_metric(node_disk_read_time_seconds_total, counter).
known_metric(node_disk_write_time_seconds_total, counter).
known_metric(node_disk_reads_completed_total, counter).
known_metric(node_disk_writes_completed_total, counter).
known_metric(node_disk_read_errors_total, counter).
known_metric(node_disk_write_errors_total, counter).
known_metric(node_network_receive_bytes_total, counter).
known_metric(node_network_transmit_bytes_total, counter).
known_metric(node_network_receive_drop_total, counter).
known_metric(node_network_transmit_drop_total, counter).
known_metric(node_zfs_arc_hits_total, counter).
known_metric(node_zfs_arc_misses_total, counter).
known_metric(node_zfs_arc_size, gauge).
known_metric(node_load1, gauge).
known_metric(node_load5, gauge).
known_metric(node_load15, gauge).
known_metric(up, gauge).
% ── Label vocabulary ─────────────────────────────────────────────────────────
%
% known_label(+Key)
% Only these label keys are permitted in generated label matchers.
% The value for each label key is validated separately by validate_label_value/2.
known_label(instance).
known_label(mode).
known_label(device).
known_label(job).
known_label(cpu).
% ── Permitted label values ────────────────────────────────────────────────────
%
% known_label_value(+Key, +Value)
% For closed-set label keys (mode, job), enumerate valid values.
% For open-set keys (instance, device, cpu), delegate to
% known_node/1 (Chapter 19) or a device vocabulary.
known_label_value(mode, steal).
known_label_value(mode, idle).
known_label_value(mode, user).
known_label_value(mode, system).
known_label_value(mode, iowait).
known_label_value(mode, irq).
known_label_value(mode, softirq).
known_label_value(mode, nice).
known_label_value(job, hypervisors).
known_label_value(job, 'victoria-metrics-self').
21.3.3 Label Matcher Construction
% build_label_matchers(+LabelList, -MatcherString)
% LabelList: list of Key=Value pairs of atoms, e.g. [instance='pve3:9100', mode=steal]
% MatcherString: the PromQL label selector body, e.g. "instance=\"pve3:9100\",mode=\"steal\""
%
% Fails if any Key is not in known_label/1.
% Fails if any Value fails validation for its Key.
% The empty list produces an empty string (bare metric, no label filter).
build_label_matchers([], "").
build_label_matchers([Key=Value | Rest], Result) :-
known_label(Key),
validate_label_value(Key, Value),
build_label_matchers(Rest, RestStr),
atom_string(Key, KeyStr),
atom_string(Value, ValStr),
( RestStr == ""
-> format(string(Result), "~w=\"~w\"", [KeyStr, ValStr])
; format(string(Result), "~w=\"~w\",~w", [KeyStr, ValStr, RestStr])
).
% validate_label_value(+Key, +Value)
% For closed-set keys: checks known_label_value/2.
% For 'instance': checks proxmox_topology:known_node/1 after stripping :PORT suffix.
% For 'device': checks known_device/1 (storage device vocabulary).
% For 'cpu': checks that Value is a non-negative integer atom.
validate_label_value(Key, Value) :-
known_label_value(Key, _),
!,
known_label_value(Key, Value).
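validate_label_value(instance, Value) :-
    !,
    atom_string(Value, VStr),
    (   sub_string(VStr, Before, 1, After, ":")
    ->  sub_string(VStr, 0, Before, _, NodeStr),
        sub_string(VStr, _, After, 0, PortStr),
        % The port must be a non-empty run of digits. Without this check,
        % arbitrary text after the colon would reach the query string.
        string_codes(PortStr, PortCodes),
        PortCodes \== [],
        forall(member(C, PortCodes), between(0'0, 0'9, C))
    ;   NodeStr = VStr
    ),
    atom_string(NodeAtom, NodeStr),
    proxmox_topology:known_node(NodeAtom).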
validate_label_value(device, Value) :-
!,
known_device(Value).
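validate_label_value(cpu, Value) :-
    !,
    % Accept decimal digits only. atom_number/2 would also accept forms
    % such as '0x10' or character-code literals, which must not pass.
    atom_codes(Value, Codes),
    Codes \== [],
    forall(member(C, Codes), between(0'0, 0'9, C)).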
% known_device/1 — storage device vocabulary.
% Enumerate the NVMe and SATA devices present in the cluster.
% Extends the same closed-vocabulary principle to device identifiers.
known_device(nvme0n1).
known_device(nvme1n1).
known_device(sda).
known_device(sdb).
21.3.4 Metric Expression Builder and Query Assembly
% build_metric_expr(+MetricName, +Type, +LabelStr, +Range, -Expr)
% Constructs the innermost PromQL expression for a metric, selecting
% the correct wrapping function based on Type.
%
% Counter: irate(metric{labels}[range])
% Gauge: metric{labels} (no range vector, no function)
%
% Range is an atom like '5m', '1m', '30s'.
% For gauges, Range is ignored — gauges do not require a range vector.
build_metric_expr(MetricName, counter, LabelStr, Range, Expr) :-
atom_string(MetricName, MStr),
atom_string(Range, RStr),
( LabelStr == ""
-> format(string(Expr), "irate(~w[~w])", [MStr, RStr])
; format(string(Expr), "irate(~w{~w}[~w])", [MStr, LabelStr, RStr])
).
build_metric_expr(MetricName, gauge, LabelStr, _Range, Expr) :-
atom_string(MetricName, MStr),
( LabelStr == ""
-> Expr = MStr
; format(string(Expr), "~w{~w}", [MStr, LabelStr])
).
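% promql_query(+Metric, +Labels, +Range, -QueryString)
% The primary public predicate of the Oracle.
%
% Metric: atom, must be in known_metric/2
% Labels: list of Key=Value pairs; each Key must be in known_label/1,
% each Value must pass validate_label_value/2
% Range: atom, the range window for counter metrics ('5m', '1m', '30s');
% must pass valid_range/1 even for gauges, where it is otherwise ignored
% QueryString: string, the complete, safe PromQL query expression
%
% Fails immediately if Metric is not in known_metric/2.
% Fails immediately if any label or the range fails validation.
% Never produces a QueryString from unvalidated input.
promql_query(Metric, Labels, Range, QueryString) :-
    known_metric(Metric, Type),
    valid_range(Range),
    build_label_matchers(Labels, LabelStr),
    build_metric_expr(Metric, Type, LabelStr, Range, QueryString).
% valid_range(+Range)
% Range must be digits followed by one unit letter (s, m, h, d, w, y), so
% that it cannot carry characters that would alter the query structure.
valid_range(Range) :-
    atom(Range),
    sub_atom(Range, Before, 1, 0, Unit),
    Before > 0,
    memberchk(Unit, [s, m, h, d, w, y]),
    sub_atom(Range, 0, Before, _, Digits),
    atom_codes(Digits, Codes),
    forall(member(C, Codes), between(0'0, 0'9, C)).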
% promql_query_range/5 — generates a query for /api/v1/query_range
% identical to promql_query/4 but the QueryString is intended for
% time-range queries. The range in the PromQL expression ([5m]) is the
% irate/rate window; the HTTP request's start/end parameters (set by
% the caller) control the time range of the query response.
promql_query_range(Metric, Labels, IrateWindow, _TimeRange, QueryString) :-
promql_query(Metric, Labels, IrateWindow, QueryString).
21.3.5 Aggregate Query Predicates
% cpu_steal_query(+Instance, -QueryString)
% Generates the canonical CPU steal percentage query for a single instance.
% Uses avg() by (instance) to normalise across all CPUs to a 0–100 percentage.
cpu_steal_query(Instance, QueryString) :-
promql_query(
node_cpu_seconds_total,
[instance=Instance, mode=steal],
'5m',
InnerExpr
),
format(string(QueryString),
"avg(~w) by (instance) * 100",
[InnerExpr]).
% memory_available_pct_query(+Instance, -QueryString)
% Generates a query for available memory as a percentage of total.
% Requires two sub-queries combined arithmetically.
memory_available_pct_query(Instance, QueryString) :-
promql_query(node_memory_MemAvailable_bytes, [instance=Instance], '5m', AvailExpr),
promql_query(node_memory_MemTotal_bytes, [instance=Instance], '5m', TotalExpr),
format(string(QueryString),
"(~w / ~w) * 100",
[AvailExpr, TotalExpr]).
% disk_io_utilisation_query(+Instance, +Device, -QueryString)
% Generates a query for disk I/O utilisation as a percentage of wall time.
disk_io_utilisation_query(Instance, Device, QueryString) :-
promql_query(
node_disk_io_time_seconds_total,
[instance=Instance, device=Device],
'5m',
Expr
),
format(string(QueryString), "~w * 100", [Expr]).
% disk_read_latency_query(+Instance, +Device, -QueryString)
% Generates a query for average read latency in milliseconds.
disk_read_latency_query(Instance, Device, QueryString) :-
promql_query(
node_disk_read_time_seconds_total,
[instance=Instance, device=Device],
'5m',
TimeExpr
),
promql_query(
node_disk_reads_completed_total,
[instance=Instance, device=Device],
'5m',
CountExpr
),
format(string(QueryString),
"(~w / ~w) * 1000",
[TimeExpr, CountExpr]).
% zfs_arc_miss_rate_query(+Instance, -QueryString)
% Generates a query for ARC miss rate as a percentage.
zfs_arc_miss_rate_query(Instance, QueryString) :-
promql_query(node_zfs_arc_misses_total, [instance=Instance], '5m', MissExpr),
promql_query(node_zfs_arc_hits_total, [instance=Instance], '5m', HitExpr),
format(string(QueryString),
"(~w / (~w + ~w)) * 100",
[MissExpr, HitExpr, MissExpr]).
21.3.6 Query String Caching for High-Frequency Polling
The Chapter 22 fact assertion loop calls the Oracle once per metric per node per polling cycle — at a 15-second scrape interval with 14 nodes and 6 metric families, that is 84 format/3 string constructions per cycle, 5.6 per second sustained. Each call traverses the vocabulary checks, builds the label matcher string, and constructs the final query string via format/3. The vocabulary checks against known_metric/2 and known_label/1 are O(1) fact lookups; format/3 itself is the dominant cost: it allocates a new string on the Prolog heap, performs character-level substitution for each ~w slot, and returns a fresh heap-allocated string on every call.
For the polling frequencies used in Chapter 22 this is not a bottleneck. At hypothetical sub-second polling rates — a Chapter 23 Pengine Harvester monitoring a degraded node at 1-second intervals — the repeated format/3 allocation for the same ground query becomes measurable. Since the labels are ground atoms, the same (Metric, Labels, Range) triple always produces the same QueryString; the computation is pure and referentially transparent. Caching the result on first call and returning it on subsequent calls eliminates all format/3 overhead for repeat queries:
% cached_promql(+Metric, +Labels, +Range, -QueryString)
% Returns a cached PromQL query string for the given ground arguments,
% computing and caching it on the first call.
%
% The cache is the dynamic predicate cached_promql_stored/4, keyed on the
% (Metric, Labels, Range) triple. Asserted clauses are not undone on
% backtracking and persist for the lifetime of the Prolog process; they
% do not survive a restart.
%
% Thread safety: dynamic predicates are shared between SWI-Prolog threads
% unless declared thread_local. Two threads racing to populate the same
% key may both call promql_query/4 and both assert the same string. The
% duplicate clause is harmless because the lookup clause commits to the
% first match; no locking is required.
:- dynamic cached_promql_stored/4.
cached_promql(Metric, Labels, Range, QueryString) :-
    cached_promql_stored(Metric, Labels, Range, QueryString),
    !.
cached_promql(Metric, Labels, Range, QueryString) :-
    promql_query(Metric, Labels, Range, QueryString),
    assertz(cached_promql_stored(Metric, Labels, Range, QueryString)).
% flush_promql_cache/0
% Clears all cached query strings. Must be called after any KB mutation
% that changes the node vocabulary or device vocabulary, since cached
% strings embedding node names or device names from the old vocabulary
% may reference nodes that have since been retracted.
%
% The Chapter 22 polling loop calls flush_promql_cache/0 on receipt of
% a kb_updated SSE event from the Go server (Chapter 19 §19.5.1) before
% the next polling cycle begins.
flush_promql_cache :-
retractall(cached_promql_stored(_, _, _, _)).
Lookups in cached_promql_stored/4 benefit from SWI-Prolog's clause indexing: with a ground Metric atom as first argument, the engine jumps to the clauses for that metric through a hash index rather than scanning every cached entry. The cut after the stored-result clause commits to the cached string on a hit, so the second clause, which calls promql_query/4 and asserts the result, runs only on the first call for a given key. flush_promql_cache/0 is a single retractall/1 call; it runs in O(N) where N is the number of cached entries, which is negligible for the 84 entries of a full polling cycle.
21.3.7 Oracle Query Verification
# Load the Oracle in SWI-Prolog and verify the generated query strings.
root@logic-node-01:~# swipl -l /opt/logic-node/kb/proxmox_topology.pl \
-l /opt/logic-node/kb/promql_oracle.pl \
-g "
promql_oracle:cpu_steal_query('pve3:9100', Q),
writeln(Q),
halt
"
avg(irate(node_cpu_seconds_total{instance="pve3:9100",mode="steal"}[5m])) by (instance) * 100
root@logic-node-01:~# swipl -l /opt/logic-node/kb/proxmox_topology.pl \
-l /opt/logic-node/kb/promql_oracle.pl \
-g "
promql_oracle:disk_read_latency_query('pve3:9100', nvme0n1, Q),
writeln(Q),
halt
"
(irate(node_disk_read_time_seconds_total{instance="pve3:9100",device="nvme0n1"}[5m]) / irate(node_disk_reads_completed_total{instance="pve3:9100",device="nvme0n1"}[5m])) * 1000
# Verify that an unknown metric causes immediate failure — no partial string produced:
root@logic-node-01:~# swipl -l /opt/logic-node/kb/proxmox_topology.pl \
-l /opt/logic-node/kb/promql_oracle.pl \
-g "
( promql_oracle:promql_query(malicious_exec, [], '5m', _)
-> writeln('FAIL: unknown metric accepted')
; writeln('PASS: unknown metric rejected')
),
halt
"
PASS: unknown metric rejected
21.4 The Build: VictoriaMetrics API Interface
21.4.1 The /api/v1/query and /api/v1/query_range Endpoints
VictoriaMetrics exposes a Prometheus-compatible HTTP query API on port 8428. Two endpoints are relevant to the Oracle's dispatch layer.
/api/v1/query evaluates a PromQL expression at a single timestamp. The response contains one result per time-series that matches the expression. This is the correct endpoint for the Chapter 22 fact assertion loop, which needs the current value of each metric at the time of the logic engine's polling cycle.
/api/v1/query_range evaluates a PromQL expression over a time range, returning a matrix of (timestamp, value) pairs for each matching time-series. This is the correct endpoint for the Chapter 23 Pengine Harvester's trend analysis, which needs to detect whether CPU steal has been elevated for more than two consecutive scrape windows. The Oracle's promql_query_range/5 predicate generates the query expression; the HTTP parameters (start, end, step) are set by the caller.
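Both endpoints take the PromQL expression as a URL-encoded query parameter. The sketch below shows one way a caller might assemble the two request URLs with Python's standard library; it is not the Oracle's dispatch code, the helper names are hypothetical, and the base address is the one used in this chapter's curl examples.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://10.40.0.2:8428"  # address taken from this chapter's curl examples


def instant_query_url(promql: str) -> str:
    """URL for /api/v1/query, evaluated at the server's current time."""
    return f"{BASE_URL}/api/v1/query?" + urlencode({"query": promql}, quote_via=quote)


def range_query_url(promql: str, start: int, end: int, step: int) -> str:
    """URL for /api/v1/query_range; start/end are Unix seconds, step is in seconds."""
    params = {"query": promql, "start": start, "end": end, "step": step}
    return f"{BASE_URL}/api/v1/query_range?" + urlencode(params, quote_via=quote)


steal = 'avg(irate(node_cpu_seconds_total{mode="steal"}[5m])) by (instance) * 100'
print(instant_query_url(steal))
print(range_query_url(steal, start=1741266900, end=1741267200, step=15))
```

Passing quote_via=quote percent-encodes spaces as %20 rather than +, and also encodes the braces, brackets, and double quotes that PromQL selectors contain.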
21.4.2 /api/v1/query Response Schema
# Instant query — current CPU steal for all hypervisors:
root@logic-node-01:~# curl -s \
"http://10.40.0.2:8428/api/v1/query?query=avg(irate(node_cpu_seconds_total%7Bmode%3D%22steal%22%7D%5B5m%5D))%20by%20(instance)%20*%20100" \
| python3 -m json.tool
{
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "instance": "pve1:9100"
                },
                "value": [
                    1741267200,
                    "1.234567890123456"
                ]
            },
            {
                "metric": {
                    "instance": "pve3:9100"
                },
                "value": [
                    1741267200,
                    "47.891234567890120"
                ]
            },
            {
                "metric": {
                    "instance": "pve7:9100"
                },
                "value": [
                    1741267200,
                    "3.012345678901234"
                ]
            }
        ]
    }
}
The response structure is fixed and documented here for Chapter 22's parser. The relevant fields: data.resultType is "vector" for the instant queries the Oracle issues against /api/v1/query and "matrix" for range queries against /api/v1/query_range. Each element of data.result is an object with a metric sub-object (containing the label key-value pairs that identify the time-series) and a value array of exactly two elements: a Unix timestamp in seconds and a string-encoded float. The float is always a string, never a JSON number. This is intentional in the Prometheus API: JSON has no representation for NaN, +Inf, and -Inf, so sample values are transferred as quoted strings.
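Chapter 22's parse_vm_response/3 predicate must therefore convert the string-encoded float explicitly, preferably with number_string/2, which accepts only numeric syntax, rather than with a bare read_term/2 call. atom_to_term/3 also works but parses arbitrary Prolog terms, which is broader than the input warrants. The special values NaN, +Inf, and -Inf are not covered by number_string/2 and need their own handling if the parser is to accept them.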
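To make the schema concrete, here is a minimal Python sketch of the same parsing step. It is not the Chapter 22 predicate: parse_instant_vector is a hypothetical name, and the body is a one-series excerpt of the response shown above.

```python
import json
from typing import Dict, Tuple


def parse_instant_vector(body: str) -> Dict[str, Tuple[float, float]]:
    """Map each instance label to (timestamp, value) from an instant-query response."""
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise ValueError("query failed")
    data = doc["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected vector, got {data['resultType']}")
    samples = {}
    for series in data["result"]:
        timestamp, raw_value = series["value"]
        # float() accepts the quoted decimal as well as "NaN", "+Inf" and "-Inf".
        samples[series["metric"]["instance"]] = (float(timestamp), float(raw_value))
    return samples


body = '''{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"instance": "pve3:9100"}, "value": [1741267200, "47.891234567890120"]}
]}}'''
print(parse_instant_vector(body))
```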
21.4.3 /api/v1/query_range Response Schema
# Range query — CPU steal for pve3 over the last 5 minutes at 15s resolution:
root@logic-node-01:~# curl -s \
"http://10.40.0.2:8428/api/v1/query_range?\
query=avg(irate(node_cpu_seconds_total%7Binstance%3D%22pve3%3A9100%22%2Cmode%3D%22steal%22%7D%5B5m%5D))%20by%20(instance)%20*%20100\
&start=1741266900\
&end=1741267200\
&step=15" \
| python3 -m json.tool
{
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {
                    "instance": "pve3:9100"
                },
                "values": [
                    [1741266900, "44.221098765432100"],
                    [1741266915, "45.891234567890120"],
                    [1741266930, "46.012345678901230"],
                    [1741266945, "46.334567890123450"],
                    [1741266960, "47.112345678901230"],
                    [1741266975, "47.445678901234560"],
                    [1741266990, "47.667890123456780"],
                    [1741267005, "47.778901234567890"],
                    [1741267020, "47.891234567890120"],
                    [1741267035, "47.889012345678900"],
                    [1741267050, "47.901234567890120"],
                    [1741267065, "47.890123456789010"],
                    [1741267080, "47.891234567890120"],
                    [1741267095, "47.892345678901230"],
                    [1741267110, "47.878901234567890"],
                    [1741267125, "47.891234567890120"],
                    [1741267140, "47.845678901234560"],
                    [1741267155, "47.901234567890120"],
                    [1741267170, "47.889012345678900"],
                    [1741267185, "47.891234567890120"],
                    [1741267200, "47.891234567890120"]
                ]
            }
        ]
    }
}
For matrix results, data.result[*].values is an array of [timestamp, valueString] pairs rather than the single value pair of the instant query. The Chapter 22 parser's parse_vm_range_response/3 predicate must iterate this array and produce a list of ts(Timestamp, Value) terms for the rate-of-change guard in cpu_steal_valid/2.
The pve3 output above shows a 21-sample sequence (one sample every 15 seconds from 1741266900 to 1741267200 inclusive) in which steal climbs from 44.2% to roughly 47.9% and holds there for the rest of the 5-minute window: 21 consecutive samples above the 40% cpu_steal_valid/2 delta floor. This is the ground truth the Chapter 22 logic engine will read and act on.
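The kind of check this sequence supports, counting consecutive samples above a floor, can be sketched in Go as follows. This is an illustration only: the real guard is the Prolog predicate cpu_steal_valid/2, and the names point, rangeResponse, parseRange, and longestRunAbove are hypothetical. The 39.0 sample in main is invented to show a run being reset and does not appear in the pve3 data.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// point is one decoded [timestamp, "value"] pair from a matrix result.
type point struct {
	TS    float64
	Value float64
}

type rangeResponse struct {
	Data struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Values [][2]interface{}  `json:"values"`
		} `json:"result"`
	} `json:"data"`
}

// parseRange returns the ordered samples for each instance label.
func parseRange(body []byte) (map[string][]point, error) {
	var resp rangeResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Data.ResultType != "matrix" {
		return nil, fmt.Errorf("expected matrix, got %q", resp.Data.ResultType)
	}
	out := make(map[string][]point)
	for _, series := range resp.Data.Result {
		inst := series.Metric["instance"]
		for _, pair := range series.Values {
			ts, okTS := pair[0].(float64) // JSON numbers decode to float64
			raw, okV := pair[1].(string)  // sample values are quoted strings
			if !okTS || !okV {
				return nil, fmt.Errorf("malformed sample for %s", inst)
			}
			v, err := strconv.ParseFloat(raw, 64)
			if err != nil {
				return nil, err
			}
			out[inst] = append(out[inst], point{TS: ts, Value: v})
		}
	}
	return out, nil
}

// longestRunAbove returns the length of the longest run of consecutive
// samples whose value is strictly greater than floor.
func longestRunAbove(pts []point, floor float64) int {
	best, cur := 0, 0
	for _, p := range pts {
		if p.Value > floor {
			cur++
			if cur > best {
				best = cur
			}
		} else {
			cur = 0
		}
	}
	return best
}

func main() {
	body := []byte(`{"status":"success","data":{"resultType":"matrix","result":[
	  {"metric":{"instance":"pve3:9100"},"values":[
	    [1741266900,"44.2"],[1741266915,"45.9"],[1741266930,"39.0"],[1741266945,"46.3"]]}]}}`)
	series, err := parseRange(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(longestRunAbove(series["pve3:9100"], 40.0))
}
```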
21.5 Sovereign Security: Query Sanitisation
21.5.1 The PromQL Injection Attack Surface
PromQL injection is the class of attack where attacker-controlled strings reach the label matcher construction layer of a query generator without vocabulary validation. The consequences are distinct from SQL injection — PromQL has no DROP TABLE or EXEC — but the information disclosure and denial-of-service surfaces are real.
The label matcher position in a PromQL query is structurally analogous to the WHERE clause in SQL. A naive template-based generator that interpolates a user-supplied node name directly into a query string:
% DANGEROUS: do not implement:
naive_cpu_query(Instance, QueryString) :-
    format(string(QueryString),
           "avg(irate(node_cpu_seconds_total{instance=\"~w\",mode=\"steal\"}[5m])) by (instance) * 100",
           [Instance]).
accepts any string as Instance. An attacker who controls the Instance argument — via an HTTP API that accepts a node name parameter, or via a compromised Go handler that forwards a request body field — can supply:
pve3:9100",job=~".*
which produces:
avg(irate(node_cpu_seconds_total{instance="pve3:9100",job=~".*",mode="steal"}[5m])) by (instance) * 100
The injected job=~".*" is a valid PromQL label matcher using the regex-match operator =~. This particular injection is benign: it adds a wildcard job matcher that selects all jobs. But the same technique allows:
# Cardinality explosion: force the query engine to match against every metric name:
",__name__=~".+
# Probe which job label values exist via a crafted label regex:
",job=~"(node_exporter|victoria-metrics)
# Terminate the label block early and shift the evaluation window:
"} offset 9999d # comment
The last example demonstrates label matcher termination via } injection: the injected } closes the label block prematurely, and the trailing offset 9999d shifts the query evaluation window 9,999 days into the past, returning no data (a silent denial of service) or triggering an error that leaks internal TSDB state.
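Because the text above names a Go handler as one possible entry point, the same breakout can be reproduced in Go. The sketch below is a deliberately unsafe illustration, not code from the cluster: naiveStealQuery is a hypothetical function that mirrors naive_cpu_query/2, and it exists only to show that the attacker's text becomes query structure.

```go
package main

import "fmt"

// naiveStealQuery interpolates instance into the query text unchecked.
// DANGEROUS: shown only to make the breakout visible; do not deploy.
func naiveStealQuery(instance string) string {
	return fmt.Sprintf(
		`avg(irate(node_cpu_seconds_total{instance="%s",mode="steal"}[5m])) by (instance) * 100`,
		instance)
}

func main() {
	// A legitimate value stays inside the quoted label value.
	fmt.Println(naiveStealQuery(`pve3:9100`))
	// The payload closes the quote early and adds a second matcher.
	fmt.Println(naiveStealQuery(`pve3:9100",job=~".*`))
}
```

The second call prints a query with three label matchers instead of two, even though the caller supplied only a single "value".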
21.5.2 The Oracle's Injection Immunity
The Oracle is immune to PromQL injection by construction, not by sanitisation. The distinction is architectural: sanitisation-based defences inspect a user-supplied string and attempt to remove dangerous characters or patterns — a perpetually incomplete approach because PromQL's grammar is rich enough that new injection vectors are discovered after sanitisation rules are written. Construction-based defences never accept arbitrary strings as label values; every component of the generated query is a ground Prolog term drawn from a declared vocabulary.
The validate_label_value(instance, Value) predicate in §21.3.3 does not check Value for the characters ", }, {, =, ~, #, or any other PromQL metacharacter. It does not need to: it calls proxmox_topology:known_node(NodeAtom) against the live WAM clause database. If NodeAtom is not a node declared in the topology KB, the predicate fails, so build_label_matchers/2 fails, so promql_query/4 fails. The payload 'pve3:9100",job=~".*' is a perfectly legal quoted atom in Prolog, but it does not unify with any known_node/1 fact, so it is rejected by lookup rather than by inspection. validate_label_value/2 never parses the atom's text, so there is no character-level rule for an attacker to evade.
The architecture that makes this work: every call site that invokes the Oracle constructs the Labels argument as a list of ground Prolog terms, not strings. In the Go integration layer, the dispatch from Go to the Prolog Oracle passes node names as atoms drawn from the vocab map — the same RWMutex-protected vocabulary built in Chapter 19 that only admits names validated against known_node/1. The injection surface does not exist because the data type at the Prolog boundary is an atom, not a string, and the atom's identity is its value — there is no parsing step at which metacharacters could be smuggled.
21.5.3 Regex Label Matchers and Closed Vocabulary Enforcement
VictoriaMetrics supports four label matching operators: = (exact equality), != (inequality), =~ (regex match), !~ (regex non-match). The Oracle generates only = (exact equality) matchers. The =~ and !~ operators are not generated by any Oracle predicate, because they require a string-valued regex pattern that cannot be safely constructed from a closed vocabulary of atoms without introducing a pattern-injection surface.
This is a deliberate restriction. The only queries that require regex matchers in the sovereign cluster's monitoring use cases are dashboard queries authored directly in Grafana by the operator, not machine-generated queries from the logic engine. The Oracle's output is consumed by the Chapter 22 assertion loop, which needs precise per-instance values. A query that returns all instances matching instance=~"pve[0-9]+" is useful for a Grafana panel but produces an unordered result set that the assertion loop must iterate — providing no value over the per-instance batch queries the Oracle already generates.
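The restriction to exact-equality matchers over a closed vocabulary can be sketched in Go as follows. This is an illustration of the principle, not a port of the Oracle: the allowed table and the buildMatchers function are hypothetical, and the vocabulary entries are example values only. The function performs no escaping and no character inspection; a value either is in the table or the build fails.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// allowed is a hypothetical closed vocabulary: label key -> permitted values.
var allowed = map[string]map[string]bool{
	"instance": {"pve1:9100": true, "pve3:9100": true, "pve7:9100": true},
	"mode":     {"steal": true, "idle": true, "iowait": true},
}

// buildMatchers renders labels as exact-equality (=) matchers only.
// Any key or value outside the vocabulary is an error.
func buildMatchers(labels map[string]string) (string, error) {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output order
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		values, ok := allowed[k]
		if !ok {
			return "", fmt.Errorf("unknown label key %q", k)
		}
		if !values[labels[k]] {
			return "", fmt.Errorf("value %q not in vocabulary for %q", labels[k], k)
		}
		parts = append(parts, fmt.Sprintf(`%s="%s"`, k, labels[k]))
	}
	return "{" + strings.Join(parts, ",") + "}", nil
}

func main() {
	good, err := buildMatchers(map[string]string{"instance": "pve3:9100", "mode": "steal"})
	fmt.Println(good, err)
	_, err = buildMatchers(map[string]string{"instance": `pve3:9100",job=~".*`})
	fmt.Println(err)
}
```

Since the only operator the function can emit is =, a regex matcher cannot appear in its output regardless of input.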
21.5.4 The Go Dispatch Guard
The Go layer provides a second, independent injection guard. The handleFirewallCheck and handleTopologyMutate handlers in Chapter 19 validate node names against knownTopologyNode() before constructing any Prolog goal string. The Oracle dispatch function in Chapter 22 applies the same pattern: node names are drawn from the Chapter 19 vocab map before being passed to the Prolog goal that calls promql_oracle:cpu_steal_query/2. The Go string that reaches the Prolog boundary is always a node name that has already cleared the known_node/1 WAM check at handleApproveNode time — it cannot be an attacker-supplied value injected at query time.
The defence is redundant by design: the Oracle's own validate_label_value/2 would reject an injected value even if the Go guard were absent, and the Go guard would reject an unknown node even if it somehow bypassed the Oracle's vocabulary check. Redundant defences at different abstraction layers are a deliberate architectural principle — neither layer trusts the other to have performed validation, so both perform it independently.
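The shape of the Go-side guard can be sketched as follows. The real vocab map and knownTopologyNode() are defined in Chapter 19 and are not reproduced here; the vocab type, its approve and known methods, and stealGoal are hypothetical stand-ins that assume a plain set of node names behind a sync.RWMutex. The goal text matches the cpu_steal_query/2 call used in the verification below.

```go
package main

import (
	"fmt"
	"sync"
)

// vocab is a stand-in for the Chapter 19 vocabulary map.
type vocab struct {
	mu    sync.RWMutex
	nodes map[string]struct{}
}

func newVocab() *vocab {
	return &vocab{nodes: make(map[string]struct{})}
}

// approve admits a node name; in the real system this would happen only
// after the name has cleared the known_node/1 check.
func (v *vocab) approve(name string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.nodes[name] = struct{}{}
}

// known reports whether name has been admitted.
func (v *vocab) known(name string) bool {
	v.mu.RLock()
	defer v.mu.RUnlock()
	_, ok := v.nodes[name]
	return ok
}

// stealGoal returns the Prolog goal text for a vetted node. A name that is
// not in the vocabulary never reaches string construction.
func (v *vocab) stealGoal(name string) (string, error) {
	if !v.known(name) {
		return "", fmt.Errorf("node %q is not in the vocabulary", name)
	}
	return fmt.Sprintf("promql_oracle:cpu_steal_query('%s', Q)", name), nil
}

func main() {
	v := newVocab()
	v.approve("pve3:9100")
	fmt.Println(v.stealGoal("pve3:9100"))
	fmt.Println(v.stealGoal(`pve3:9100",job=~".*`))
}
```

The membership test runs before any formatting, so the only strings that can be interpolated into the goal are names that were admitted earlier, at approval time.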
21.5.5 Oracle Security Verification
# Verify that a crafted injection string is rejected before query assembly:
root@logic-node-01:~# swipl -l /opt/logic-node/kb/proxmox_topology.pl \
    -l /opt/logic-node/kb/promql_oracle.pl \
    -g "
    InjectedInstance = 'pve3:9100\",job=~\".*',
    (   promql_oracle:promql_query(
            node_cpu_seconds_total,
            [instance=InjectedInstance, mode=steal],
            '5m',
            Q
        )
    ->  format('FAIL: injection produced query: ~w~n', [Q])
    ;   writeln('PASS: injection rejected by validate_label_value/2')
    ),
    halt
    "
PASS: injection rejected by validate_label_value/2
# Verify that a label-block terminator cannot be injected via the mode label:
root@logic-node-01:~# swipl -l /opt/logic-node/kb/proxmox_topology.pl \
    -l /opt/logic-node/kb/promql_oracle.pl \
    -g "
    (   promql_oracle:promql_query(
            node_cpu_seconds_total,
            [instance='pve3:9100', mode='steal\"} offset 9999d #'],
            '5m',
            Q
        )
    ->  format('FAIL: injection produced query: ~w~n', [Q])
    ;   writeln('PASS: mode value rejected by known_label_value/2')
    ),
    halt
    "
PASS: mode value rejected by known_label_value/2
# Confirm a valid, safe query is produced for a known node:
root@logic-node-01:~# swipl -l /opt/logic-node/kb/proxmox_topology.pl \
    -l /opt/logic-node/kb/promql_oracle.pl \
    -g "
    promql_oracle:cpu_steal_query('pve3:9100', Q),
    format('Query: ~w~n', [Q]),
    promql_oracle:zfs_arc_miss_rate_query('pve7:9100', Q2),
    format('Query: ~w~n', [Q2]),
    halt
    "
Query: avg(irate(node_cpu_seconds_total{instance="pve3:9100",mode="steal"}[5m])) by (instance) * 100
Query: (irate(node_zfs_arc_misses_total{instance="pve7:9100"}[5m]) / (irate(node_zfs_arc_hits_total{instance="pve7:9100"}[5m]) + irate(node_zfs_arc_misses_total{instance="pve7:9100"}[5m]))) * 100
The Oracle produces syntactically valid, semantically correct, injection-immune PromQL strings for every metric in its vocabulary. Chapter 22 dispatches these strings against the VictoriaMetrics API, parses the JSON response, and asserts node_metric/4 facts into the WAM — closing the instrumentation loop that began with the bare-metal scrape pipeline built in Chapter 20.