Chapter 20: Bare-Metal Telemetry

The logic engine built across Chapters 15–19 reasons with perfect fidelity over the rules it has been given, but it has no mechanism to know that pve3's CPU steal climbed to 94% at 14:32, that the NVMe write latency on storage1 doubled six minutes ago, or that a bonded uplink on leaf_b has been dropping 0.3% of frames since the last kernel update. Without that physical state, any routing or policy decision it makes is reasoning about a model of the cluster rather than the cluster itself. This chapter builds the instrumentation layer that closes that gap: VictoriaMetrics as the single-node time-series engine scraping node_exporter on every hypervisor every 15 seconds, the kernel tuning that absorbs the resulting ingest bursts on consumer NVMe, the hardened systemd units for both the sensor and the database, and the nftables rules that confine the ingest surface to the hypervisor VLAN while restricting Grafana visibility to the management VPN.

20.1 The Architecture of Sovereign Observability

20.1.1 Why Not Prometheus on Kubernetes

The standard enterprise answer to cluster monitoring in 2026 is Prometheus running inside the Kubernetes cluster it is monitoring, with its data stored on a PersistentVolume backed by the same storage subsystem it is observing. The architectural failure in this design is obvious once stated: when the cluster is sick, the monitoring system is sick with it. A storage performance regression that causes OOM evictions will evict the Prometheus pod. A network partition that isolates a rack will partition the monitoring data for that rack. A kernel bug that causes NVMe latency spikes will spike the write latency of the monitoring database that is supposed to detect the spike. The observer is entangled with the observed.

The sovereign architecture runs monitoring on a dedicated Linux VM with a fixed CPU and memory allocation that does not participate in the cluster's scheduling domain, does not share storage with any cluster workload, and does not depend on the cluster's control plane to stay running. The monitoring VM is operational when the cluster is down — which is precisely when monitoring data is most needed.

Prometheus is also the wrong tool for this workload on bare metal. Its storage model appends to per-series files; at 500 series per hypervisor across 14 Proxmox nodes, the scrape loop generates 7,000 concurrent file operations every 15 seconds. On a rotational disk this is untenable; on NVMe it is merely wasteful. VictoriaMetrics replaces the per-series file model with a columnar merge-tree structure that batches all incoming samples into an in-memory insert buffer, sorts and compresses them by metric name and timestamp, and flushes a single large sequential write to disk every few seconds. The result is one or two large sequential I/O operations per flush cycle regardless of series cardinality — a workload profile that NVMe is built for.

20.1.2 VictoriaMetrics Storage Architecture

VictoriaMetrics organises on-disk storage in monthly partitions. Each partition contains a set of immutable data parts — compressed columnar blocks, one column per metric, sorted by timestamp within each block. When new samples arrive they are held in a lockless in-memory insert buffer (rawRows). The background merger goroutine sorts, deduplicates, and compresses the buffer into small parts, then progressively merges small parts into large parts in a background merge tree — the architecture Prometheus borrows terminology from but does not implement. Large parts are the unit of long-term retention; when a part's timestamp range falls outside the -retentionPeriod window, the entire part is deleted with a single os.RemoveAll call rather than a per-point tombstone scan.
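
The buffer-then-merge cycle described above can be illustrated with a toy model. This is a minimal sketch, not VictoriaMetrics code: the function names flush_buffer and merge_parts are hypothetical, samples are plain (series, timestamp, value) tuples, and compression is left out entirely.

```python
import heapq

Sample = tuple[str, int, float]  # (series name, unix timestamp, value)


def flush_buffer(raw_rows: list[Sample]) -> list[Sample]:
    """Turn an unsorted insert buffer into one sorted, deduplicated part.

    When the same (series, timestamp) pair arrives twice, the last value wins.
    """
    latest: dict[tuple[str, int], float] = {}
    for series, ts, value in raw_rows:
        latest[(series, ts)] = value
    return [(series, ts, value) for (series, ts), value in sorted(latest.items())]


def merge_parts(*parts: list[Sample]) -> list[Sample]:
    """Merge already-sorted parts into one larger sorted part.

    heapq.merge streams the inputs, so no part is re-sorted from scratch.
    """
    return list(heapq.merge(*parts))


buffer = [("cpu", 30, 0.2), ("cpu", 15, 0.1), ("mem", 15, 5.0), ("cpu", 15, 0.3)]
small_part = flush_buffer(buffer)
large_part = merge_parts(small_part, flush_buffer([("cpu", 45, 0.4)]))
```

The point of the sketch is the I/O shape: each flush and each merge emits one sorted run, so the number of writes depends on the number of parts, not on the number of series.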

The practical consequences for this deployment:

Ingest rate:        14 nodes × 1,200 metrics/node × 1 sample/15s = 1,120 samples/s
Peak burst:         All nodes scraping simultaneously: ~16,800 samples in one second
Insert buffer:      In-memory, lockless — no disk I/O during scrape receipt
Flush interval:     ~5 seconds (configurable via -inmemoryDataFlushInterval)
Flush I/O profile:  1–3 sequential writes per flush, 64–256KB each
NVMe write IOPS:    ~2–3 per flush cycle (vs. 7,000 for Prometheus at same cardinality)
Compression ratio:  Gorilla + ZSTD on timestamp deltas: ~12:1 vs raw float64 stream
Retention storage:  1 year of 14-node telemetry at 1,200 series per node: ~18GB
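
The ingest and retention figures above follow from simple arithmetic, sketched here so the numbers can be re-derived when the node count or scrape interval changes. The bytes_per_sample value is an assumption chosen to be consistent with the ~18GB figure; it is not a measured compression result.

```python
def ingest_rate(nodes: int, series_per_node: int, scrape_interval_s: int) -> float:
    """Average samples per second across the whole fleet."""
    return nodes * series_per_node / scrape_interval_s


def retention_bytes(rate: float, days: int, bytes_per_sample: float) -> float:
    """On-disk size for `days` of data at a given compressed sample size."""
    return rate * days * 86_400 * bytes_per_sample


rate = ingest_rate(nodes=14, series_per_node=1_200, scrape_interval_s=15)
burst = 14 * 1_200  # worst case: every node scraped within the same second

# ASSUMPTION: 0.5 bytes per compressed sample (illustrative, not measured).
year_gb = retention_bytes(rate, days=365, bytes_per_sample=0.5) / 1e9

print(f"{rate:.0f} samples/s, burst {burst}, ~{year_gb:.1f} GB/year")
```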

20.1.3 Telemetry Flow Diagram

%%{init: {"themeVariables": {"fontSize": "14px"}}}%%
flowchart TD
    PVE1["pve1 — Hypervisor\nnode_exporter :9100\n1,200 metrics/scrape\nSystemd hardened unit\nnon-privileged node-exp user"]

    PVE2["pve2 — Hypervisor\nnode_exporter :9100\n1,200 metrics/scrape\nSystemd hardened unit\nnon-privileged node-exp user"]

    PVEN["pve3…pve14 — Hypervisors\nnode_exporter :9100\n1,200 metrics/scrape each\nIdentical hardened units"]

    VLAN["Metrics-Only VLAN 40\n10.40.0.0/24\nnftables: ingress from 10.40.0.0/24 only\nno route to cluster data plane\nno route to management VPN\ningest-only network segment"]

    VM["VictoriaMetrics — obs-01 VM\n10.40.0.2 (ingest, VLAN 40)\n192.168.100.5 (mgmt, VLAN 10)\nPort 8428: Prometheus-compatible ingest\nMerge-tree columnar storage\n-retentionPeriod=12\nZFS dataset: 128K recordsize"]

    GRAFANA["Grafana — obs-01 VM\nPort 3000: dashboard UI\nnftables: port 3000 reachable from\nManagement VPN only (10.99.0.0/24)\nDatasource: http://localhost:8428\nPrometheus-compatible query API"]

    LOGIC["Logic Node - logic-node-01\nChapter 21 integration target\nPolls /api/v1/query for node_metrics\nasserts node_metric/4 facts into WAM\nhealthy_node/1 guard reads live telemetry\nChapter 21 build"]

    PVE1 --->|"HTTP scrape every 15s"| VLAN
    PVE2 --->|"HTTP scrape every 15s"| VLAN
    PVEN --->|"HTTP scrape every 15s"| VLAN
    VLAN --->|"port 8428 ingest only"| VM
    VM --->|"localhost query API"| GRAFANA
    VM --->|"PromQL /api/v1/query"| LOGIC

    style PVE1 fill:#1A2B4A,color:#FFFFFF
    style PVE2 fill:#1A2B4A,color:#FFFFFF
    style PVEN fill:#1A2B4A,color:#FFFFFF
    style VLAN fill:#8B6914,color:#FFFFFF
    style VM fill:#5A1A6A,color:#FFFFFF
    style GRAFANA fill:#1A4070,color:#FFFFFF
    style LOGIC fill:#1A6B3A,color:#FFFFFF
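
The last edge in the diagram, the logic node polling /api/v1/query, relies on the Prometheus-compatible instant-query response format. The sketch below shows how such a response could be reduced to per-instance values. The helper names and the sample body are hypothetical, and how Chapter 21 turns the result into facts is not shown here.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> dict[str, float]:
    """Map each series' `instance` label to its current value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        series["metric"].get("instance", ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


# Hypothetical response body, shaped like a Prometheus instant-vector result.
SAMPLE_BODY = (
    '{"status":"success","data":{"resultType":"vector","result":['
    '{"metric":{"__name__":"node_load1","instance":"pve3"},'
    '"value":[1772811852,"7.42"]}]}}'
)

url = build_query_url("http://localhost:8428", "node_load1")
loads = parse_instant_vector(SAMPLE_BODY)
```

Sample values arrive as strings inside a [timestamp, value] pair, which is why the parser converts them with float() rather than trusting the JSON type.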

20.2 Kernel Tuning for Time-Series Workloads

20.2.1 `/etc/sysctl.d/99-observability.conf`

VictoriaMetrics holds a large number of concurrent TCP connections open — one per scrape target per scrape interval — and maps significant portions of its data directory into the process's virtual address space via mmap. The default kernel parameters are tuned for general-purpose workloads. A dedicated monitoring VM requires explicit tuning for high socket concurrency, deep I/O queues, and large memory-mapped file populations.

# /etc/sysctl.d/99-observability.conf
# Applied on obs-01 (VictoriaMetrics + Grafana VM).
# Load immediately: sysctl --system
# Verify a specific key: sysctl net.core.somaxconn

# ── Network: socket concurrency ────────────────────────────────────────────

# Maximum backlog for listen(2): the number of completed connections waiting
# to be accepted. Kernels before 5.4 default to 128, which is tight when
# Grafana panels and other API clients open connections in bursts; 5.4 and
# later already default to 4096. Set explicitly so the value is auditable.
net.core.somaxconn = 4096

# Maximum number of packets queued on the INPUT side before the kernel drops.
# VictoriaMetrics ingest is bursty at scrape intervals — all 14 nodes deliver
# their scrape payload within a 200ms window every 15 seconds.
net.core.netdev_max_backlog = 16384

# TCP receive and send buffer sizes.
# node_exporter scrape payloads are 200–800KB per node (uncompressed text format).
# 128KB default rmem is undersized; 16MB max allows the kernel to auto-tune
# per-socket based on RTT and available memory.
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# Increase the local port range for outbound connections (VictoriaMetrics
# scraping exporters). The default 32768–60999 allows 28,232 simultaneous
# outbound connections; the wider range prevents ephemeral port exhaustion
# during burst reconnections.
net.ipv4.ip_local_port_range = 1024 65535

# TCP FIN-WAIT-2 timeout: reduce from 60s to 15s.
# Stale FIN-WAIT-2 sockets from closed scrape connections accumulate at the
# default timeout; 15 seconds clears them before the next scrape cycle.
net.ipv4.tcp_fin_timeout = 15

# Enable TCP fast open for outbound connections (VictoriaMetrics -> exporters).
# Reduces connection latency on short-lived scrape TCP sessions.
# 3 = enable for both client and server.
net.ipv4.tcp_fastopen = 3

# ── Virtual memory: mmap population ────────────────────────────────────────

# VictoriaMetrics maps each on-disk data part into the process address space
# via mmap for zero-copy reads during queries. Each active data part consumes
# one or more VMAs (virtual memory areas). At 12 months retention × 14 nodes
# × 1,200 series, the number of active parts at any time reaches 8,000–12,000.
# The default vm.max_map_count of 65,530 covers that count, but the margin
# shrinks quickly when each part consumes several VMAs.
# 524,288 provides a safe ceiling without approaching the kernel's hard limit.
vm.max_map_count = 524288

# Kernel swappiness: bias strongly against swapping.
# If the VictoriaMetrics insert buffer is swapped to disk, ingest throughput
# collapses. The monitoring VM has dedicated RAM; there is no justification
# for swapping any of it to a storage device the monitoring system is watching.
vm.swappiness = 1

# Dirty page writeback: tune for large sequential I/O bursts.
# vm.dirty_ratio: percentage of total RAM that can be dirty before the writing
# process itself blocks. 20% allows VictoriaMetrics to buffer large merge
# outputs in the page cache before writeback.
# vm.dirty_background_ratio: percentage at which background writeback begins.
# 5% starts background writeback early, keeping dirty pages from accumulating.
vm.dirty_ratio = 20
vm.dirty_background_ratio = 5

# ── io_uring: async I/O for merge flushes ──────────────────────────────────

# io_uring_disabled: 0 = enabled for all processes. The key exists on
# Linux 6.6 and later; older kernels report it as unknown when
# sysctl --system runs.
# The setting only matters for binaries that issue io_uring calls. Confirm
# that the VictoriaMetrics build in use does so before relying on it: Go
# programs normally write through pwrite(2), in which case this key has no
# effect on merge flush latency.
# Default on recent kernels is 0 (enabled); explicit here for auditability.
kernel.io_uring_disabled = 0

# io_uring_group: -1 = all groups permitted.
# Restrict in production to the victoriametrics service group if the threat
# model requires it (see Section 20.5). Default permissive for this VM.
# kernel.io_uring_group = GID_OF_victoriametrics_user

# Apply immediately without reboot:
root@obs-01:~# sysctl --system
root@obs-01:~# sysctl vm.max_map_count net.core.somaxconn kernel.io_uring_disabled

vm.max_map_count = 524288
net.core.somaxconn = 4096
kernel.io_uring_disabled = 0
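
Checking three keys by hand is fine once; for the whole file, a small drift check is less error-prone. The following is a minimal sketch, assuming the simple `key = value` layout used in 99-observability.conf. The function names are hypothetical, and read_live assumes keys map to /proc/sys paths by replacing dots with slashes, which holds for every key in this file.

```python
from pathlib import Path


def parse_sysctl_conf(text: str) -> dict[str, str]:
    """Parse `key = value` lines, skipping blanks and # or ; comments."""
    settings: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Collapse runs of whitespace so "4096  87380" equals "4096 87380".
            settings[key.strip()] = " ".join(value.split())
    return settings


def sysctl_path(key: str) -> str:
    """Map a sysctl key to its /proc/sys file."""
    return "/proc/sys/" + key.replace(".", "/")


def read_live(key: str) -> str:
    """Read the running kernel's value for one key."""
    return " ".join(Path(sysctl_path(key)).read_text().split())


def find_drift(expected: dict[str, str], live: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {key: (wanted, found)} for every key whose live value differs."""
    drift = {}
    for key, want in expected.items():
        have = " ".join(live.get(key, "").split())
        if have != want:
            drift[key] = (want, have)
    return drift
```

On obs-01 this could be driven with `find_drift(wanted, {k: read_live(k) for k in wanted})`, where `wanted` is the parsed contents of the conf file; an empty result means the running kernel matches the file.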

20.2.2 ZFS Dataset Configuration for TSDB Workloads

The VictoriaMetrics storage directory sits on a ZFS dataset. ZFS recordsize is the fundamental I/O granularity — reads and writes are aligned to this boundary. The wrong recordsize for a TSDB workload imposes significant read amplification during queries and write amplification during compaction.

VictoriaMetrics produces two distinct I/O patterns that require different recordsize values:

WAL / insert buffer flushes:
  Pattern: sequential writes, 4KB–256KB per operation
  Ideal recordsize: 128K — one or two records per flush operation,
  maximising sequential throughput on NVMe, minimising metadata overhead.
  ZFS compression: lz4 — fast compression, high ratio on time-series deltas.

Long-term block storage (merged parts, query reads):
  Pattern: sequential reads of entire data parts (1MB–100MB)
  Ideal recordsize: 128K — aligned to VictoriaMetrics part read granularity,
  maximising ARC cache efficiency for hot time ranges.
  ZFS compression: zstd-3 — higher ratio than lz4 (12:1 vs 8:1 on metric data),
  acceptable decompression latency for query workloads (not write-hot path).

Both patterns favour 128K recordsize. The TSDB workload is the case ZFS's adaptive record size was designed for: large sequential I/O, never random 4KB reads (which would favour a 4K or 8K recordsize).

Two additional dataset properties matter for a merge-tree TSDB. atime=off is inherited from the pool default set in the zpool create command below, but is set explicitly on the dataset as well — VictoriaMetrics accesses data parts for reads constantly during queries, and access-time updates on every read would generate write traffic that competes with merge flushes on the same NVMe device.

redundant_metadata=most is a deliberate trade-off rather than extra protection. By default (redundant_metadata=all) ZFS stores an additional copy of every metadata block (dnode blocks, indirect blocks, the metadata for each data part), on top of whatever redundancy the pool layout provides. With redundant_metadata=most, ZFS keeps that additional copy for most metadata types but not all of them, which reduces the metadata written alongside each 128K sequential flush of metric data. The consequence is a larger worst case if a single on-disk block is corrupted and the mirror cannot repair it: many records of user data may become unreachable instead of at most one. For a monitoring dataset where re-ingestion is impossible (historical CPU steal metrics cannot be reconstructed from a destroyed index), that exposure is acceptable here only because the pool is mirrored; on a single-device pool, leave the property at its default of all.

# Create the ZFS pool and dataset for VictoriaMetrics storage.
# Replace 'nvme0n1' and 'nvme1n1' with the actual NVMe device names.

# Pool: mirror of two NVMe drives for single-disk fault tolerance.
# ashift=12: 4K sector alignment (standard for modern NVMe).
root@obs-01:~# zpool create \
    -o ashift=12 \
    -O atime=off \
    -O compression=lz4 \
    -O xattr=sa \
    -O dnodesize=auto \
    vmpool mirror nvme0n1 nvme1n1

# Dataset for VictoriaMetrics data — 128K recordsize, zstd compression.
# The lz4 pool default is overridden here to zstd-3 for the query-hot
# long-term block storage. The WAL flush path benefits from zstd's higher
# ratio on time-series delta encoding.
root@obs-01:~# zfs create \
    -o mountpoint=/var/lib/victoria-metrics \
    -o recordsize=128K \
    -o compression=zstd-3 \
    -o logbias=throughput \
    -o atime=off \
    -o xattr=sa \
    -o redundant_metadata=most \
    vmpool/victoria-metrics

# logbias=throughput: instructs ZFS to write synchronous data directly to the
# main pool rather than routing it through the ZIL (ZFS Intent Log).
# VictoriaMetrics manages its own flush durability; ZFS synchronous write
# guarantees are redundant overhead for this workload.
# atime=off: explicit on the dataset even though inherited from the pool.
# xattr=sa: store extended attributes in the dnode (inline), eliminating
# a separate metadata lookup per file access.
# redundant_metadata=most: keep an extra copy of most metadata types rather
# than all of them (the default), trading a little metadata redundancy for
# fewer metadata writes. Data blocks are unaffected; the mirror protects them.

# Verify all properties:
root@obs-01:~# zfs get recordsize,compression,logbias,atime,xattr,redundant_metadata vmpool/victoria-metrics

NAME                        PROPERTY           VALUE       SOURCE
vmpool/victoria-metrics     recordsize         128K        local
vmpool/victoria-metrics     compression        zstd-3      local
vmpool/victoria-metrics     logbias            throughput  local
vmpool/victoria-metrics     atime              off         local
vmpool/victoria-metrics     xattr              sa          local
vmpool/victoria-metrics     redundant_metadata most        local
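
The property table above is easy to check by eye for one dataset, but the same comparison can be scripted for use in a deployment check. This is a minimal sketch that parses the default `zfs get` table layout shown above; the function names are hypothetical, and EXPECTED simply restates the values set in the zfs create command.

```python
EXPECTED = {
    "recordsize": "128K",
    "compression": "zstd-3",
    "logbias": "throughput",
    "atime": "off",
    "xattr": "sa",
    "redundant_metadata": "most",
}


def parse_zfs_get(output: str) -> dict[str, str]:
    """Extract {property: value} from the NAME/PROPERTY/VALUE/SOURCE table."""
    props: dict[str, str] = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] != "NAME":
            props[fields[1]] = fields[2]
    return props


def mismatches(actual: dict[str, str], expected: dict[str, str] = EXPECTED) -> dict:
    """Return {property: (wanted, found)} for every property that differs."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }
```

For scripted use, `zfs get -H -o property,value` prints tab-separated rows without the header, which would simplify the parser further.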

20.3 Deploying the Sensors: `node_exporter`

20.3.1 Installation

# On each Proxmox hypervisor node: pve1 through pve14.
# node_exporter version: pin to a specific release for reproducible deployments.
# Replace VERSION with the current release from github.com/prometheus/node_exporter.

VERSION="1.8.2"
ARCH="linux-amd64"
TARBALL="node_exporter-${VERSION}.${ARCH}.tar.gz"

root@pve1:~# cd /tmp
root@pve1:~# wget -q \
    "https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/${TARBALL}"
root@pve1:~# wget -q \
    "https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/sha256sums.txt"

# Verify checksum before extracting (supply chain integrity).
# sha256sums.txt lists every release artifact; --ignore-missing skips the
# ones that were not downloaded.
root@pve1:~# sha256sum --ignore-missing --check sha256sums.txt
node_exporter-1.8.2.linux-amd64.tar.gz: OK

root@pve1:~# tar -xzf "${TARBALL}"
root@pve1:~# install -o root -g root -m 0755 \
    "node_exporter-${VERSION}.${ARCH}/node_exporter" \
    /usr/local/bin/node_exporter

# Create the non-privileged service user.
# --system: creates a system account (UID < 1000, no login shell, no home dir).
# --no-create-home: no home directory — the exporter reads /proc and /sys,
# not files in a home directory.
# --shell /usr/sbin/nologin: prevents interactive login.
root@pve1:~# useradd \
    --system \
    --no-create-home \
    --shell /usr/sbin/nologin \
    node-exp

20.3.2 Hardened systemd Unit

# /etc/systemd/system/node_exporter.service
# Deploy identically to all 14 Proxmox hypervisor nodes.
# After placement: systemctl daemon-reload && systemctl enable --now node_exporter

[Unit]
Description=Prometheus Node Exporter
Documentation=https://github.com/prometheus/node_exporter
After=network-online.target
Wants=network-online.target
# Start-rate limiting is a [Unit] setting; systemd ignores it in [Service].
StartLimitIntervalSec=60s
StartLimitBurst=5

[Service]
Type=simple
User=node-exp
Group=node-exp
# $$ is systemd's escape for a literal $ in ExecStart.
ExecStart=/usr/local/bin/node_exporter \
    --web.listen-address="0.0.0.0:9100" \
    --web.telemetry-path="/metrics" \
    --collector.filesystem.mount-points-exclude="^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+|var/lib/containers/.+)($$|/)" \
    --collector.netdev.device-exclude="^(veth|docker|br-|virbr)" \
    --no-collector.wifi \
    --no-collector.hwmon

Restart=on-failure
RestartSec=5s

ProtectSystem=strict
ProtectHome=true
PrivateTmp=true

CapabilityBoundingSet=
AmbientCapabilities=

SystemCallFilter=@system-service
SystemCallFilter=~@privileged ~@obsolete
SystemCallErrorNumber=EPERM

IPAddressAllow=10.40.0.0/24 127.0.0.1/8 ::1/128
IPAddressDeny=any

MemoryMax=128M
CPUQuota=20%
LimitNOFILE=65536

NoNewPrivileges=true
PrivateDevices=true
ProtectKernelModules=true
ProtectKernelTunables=true
ProtectKernelLogs=true
ProtectControlGroups=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictRealtime=true
RestrictSUIDSGID=true
RemoveIPC=true

[Install]
WantedBy=multi-user.target

Unit annotation:

ProtectSystem=strict mounts the entire file system hierarchy read-only for this unit, with the exception of the API file systems /dev, /proc, and /sys. The exporter reads /proc and /sys and has no legitimate need to write anywhere. A compromised binary cannot modify system files or host configuration.

CapabilityBoundingSet= (empty) means the process cannot acquire any Linux capability even via a setuid binary or capability-aware exploit. An attacker achieving RCE in the exporter has no CAP_NET_ADMIN, no CAP_SYS_PTRACE, no CAP_DAC_OVERRIDE.

SystemCallFilter=@system-service with ~@privileged ~@obsolete restricts the syscall surface to what a Go binary legitimately requires and denies the entire privileged syscall group (mount, setuid, etc.).

IPAddressAllow=10.40.0.0/24 127.0.0.1/8 ::1/128 restricts accepted connections to the metrics VLAN scraper. IPAddressDeny=any blocks all other sources. A compromised exporter cannot exfiltrate data to an external host.

StartLimitBurst=5 over StartLimitIntervalSec=60s enters failed state after five crashes in one minute, requiring manual intervention. This prevents a crash loop from consuming CPU on a degraded hypervisor.

# Deploy and verify on each hypervisor:
root@pve1:~# systemctl daemon-reload
root@pve1:~# systemctl enable --now node_exporter
root@pve1:~# systemctl status node_exporter

● node_exporter.service - Prometheus Node Exporter
     Loaded: loaded (/etc/systemd/system/node_exporter.service; enabled)
     Active: active (running) since 2026-03-06 15:44:12 UTC; 3s ago
   Main PID: 12847 (node_exporter)

# Verify metrics endpoint reachable from the monitoring VLAN:
root@pve1:~# curl -s http://10.40.0.11:9100/metrics | head -5
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 4.29e-05
go_gc_duration_seconds{quantile="0.25"} 5.2e-05
go_gc_duration_seconds{quantile="0.5"} 6.11e-05

# Verify capability set is empty:
root@pve1:~# grep CapBnd /proc/$(pgrep node_exporter)/status
CapBnd: 0000000000000000
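
The /metrics output shown above is the Prometheus text exposition format: comment lines starting with #, then one `name{labels} value` sample per line. The sketch below parses that shape, which is useful for spot-checking an exporter without a running database. It is deliberately minimal and assumes label values contain no escaped quotes; the function name is hypothetical.

```python
import re

# name, optional {labels}, then the sample value; a trailing timestamp is ignored.
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Return (metric name, labels, value) for every sample line."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank line, HELP or TYPE comment
        match = LINE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


scrape = """# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 4.29e-05
go_gc_duration_seconds{quantile="0.25"} 5.2e-05
node_load1 0.42
"""
samples = parse_metrics(scrape)
```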

20.4 The Build: VictoriaMetrics and Grafana

20.4.1 VictoriaMetrics Installation

# On obs-01 (the dedicated monitoring VM).
# VictoriaMetrics single-node binary — no operator, no cluster coordination,
# no Kubernetes. One binary, one storage directory, one port.

VERSION="1.101.0"
ARCH="linux-amd64"
TARBALL="victoria-metrics-${ARCH}-v${VERSION}.tar.gz"

root@obs-01:~# cd /tmp
root@obs-01:~# wget -q \
    "https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v${VERSION}/${TARBALL}"
root@obs-01:~# wget -q \
    "https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v${VERSION}/${TARBALL%.tar.gz}_checksums.txt"

# The checksum file is named after the archive without its .tar.gz suffix.
root@obs-01:~# sha256sum --ignore-missing --check "${TARBALL%.tar.gz}_checksums.txt"
victoria-metrics-linux-amd64-v1.101.0.tar.gz: OK

root@obs-01:~# tar -xzf "${TARBALL}"
root@obs-01:~# install -o root -g root -m 0755 \
    victoria-metrics-prod \
    /usr/local/bin/victoria-metrics

# Service user — same non-privileged pattern as node_exporter.
root@obs-01:~# useradd \
    --system \
    --no-create-home \
    --shell /usr/sbin/nologin \
    victoriametrics

# Ensure the ZFS dataset is owned by the service user.
root@obs-01:~# chown victoriametrics:victoriametrics /var/lib/victoria-metrics

20.4.2 `victoria-metrics.service`

# /etc/systemd/system/victoria-metrics.service

[Unit]
Description=VictoriaMetrics Time Series Database
Documentation=https://docs.victoriametrics.com
After=network-online.target
Wants=network-online.target
# Start-rate limiting is a [Unit] setting; systemd ignores it in [Service].
StartLimitIntervalSec=120s
StartLimitBurst=3

[Service]
Type=simple
User=victoriametrics
Group=victoriametrics

ExecStart=/usr/local/bin/victoria-metrics \
    -storageDataPath=/var/lib/victoria-metrics \
    -retentionPeriod=12 \
    -httpListenAddr=0.0.0.0:8428 \
    -promscrape.config=/etc/victoria-metrics/scrape.yml \
    -promscrape.configCheckInterval=30s \
    -insert.maxQueueDuration=30s \
    -search.maxConcurrentRequests=8 \
    -search.maxQueryDuration=60s \
    -memory.allowedPercent=60 \
    -loggerLevel=INFO \
    -loggerOutput=stderr

Restart=on-failure
RestartSec=10s

ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/lib/victoria-metrics

CapabilityBoundingSet=
AmbientCapabilities=
NoNewPrivileges=true

LimitNOFILE=1048576
LimitCORE=0

PrivateDevices=true
ProtectKernelModules=true
ProtectKernelTunables=true
ProtectKernelLogs=true
ProtectControlGroups=true
LockPersonality=true
RestrictRealtime=true
RestrictSUIDSGID=true
RemoveIPC=true
SystemCallFilter=@system-service
SystemCallFilter=~@privileged ~@obsolete
SystemCallErrorNumber=EPERM

[Install]
WantedBy=multi-user.target

Flag and directive annotation:

-retentionPeriod=12 retains 12 months of data. Parts older than 12 months are deleted in their entirety at the next background merge pass — no per-point tombstoning, no compaction scan.

-memory.allowedPercent=60 caps VictoriaMetrics in-memory structures at 60% of available RAM. The remaining 40% is available for the ZFS ARC cache, which accelerates query reads of recently-accessed data parts. Raising this above 60% starves the ARC and degrades query latency more than the larger insert buffer benefits ingest throughput at this cardinality.

-search.maxConcurrentRequests=8 caps concurrent PromQL queries. Grafana dashboards with 20+ panels issue queries in parallel on every refresh; without a ceiling, a single dashboard reload can saturate the query engine and block scrape ingest acknowledgements.

-promscrape.configCheckInterval=30s reloads scrape.yml every 30 seconds without a process restart. Adding a new hypervisor to the cluster means appending one target entry to scrape.yml; the change is live within 30 seconds.

LimitNOFILE=1048576 accommodates one file descriptor per mmap'd data part plus scrape connections. At 12 months × 14 nodes × ~60 parts/month the active mmap count reaches approximately 10,000. LimitCORE=0 disables core dumps; the process holds infrastructure topology data in memory and a core dump written to disk is uncontrolled data exfiltration.

20.4.3 Scrape Configuration `/etc/victoria-metrics/scrape.yml`

# /etc/victoria-metrics/scrape.yml
# Loaded by VictoriaMetrics at startup and reloaded every 30 seconds
# (see -promscrape.configCheckInterval in the service unit).
# This file is the complete inventory of scrape targets.
# Adding pve15: append one entry to the hypervisors targets
# and wait up to 30 seconds for the reload.

global:
  scrape_interval: 15s
  scrape_timeout:  10s
  # External labels are attached to every metric ingested from this
  # VictoriaMetrics instance. They allow disambiguation when federating
  # multiple VM instances in future or when querying from the logic engine.
  external_labels:
    cluster:     proxmox-sovereign
    environment: production

scrape_configs:
  - job_name: hypervisors
    static_configs:
      - targets:
          - "10.40.0.11:9100"   # pve1
          - "10.40.0.12:9100"   # pve2
          - "10.40.0.13:9100"   # pve3
          - "10.40.0.14:9100"   # pve4
          - "10.40.0.15:9100"   # pve5
          - "10.40.0.16:9100"   # pve6
          - "10.40.0.17:9100"   # pve7
          - "10.40.0.18:9100"   # pve8
          - "10.40.0.19:9100"   # pve9
          - "10.40.0.20:9100"   # pve10
          - "10.40.0.21:9100"   # pve11
          - "10.40.0.22:9100"   # pve12
          - "10.40.0.23:9100"   # pve13
          - "10.40.0.24:9100"   # pve14
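    relabel_configs:
      # Derive the hostname from the scrape address for use in dashboards.
      # Host numbers are the last octet minus 10, so two rules are needed:
      # 10.40.0.11–19 map to pve1–pve9 and 10.40.0.20–24 map to pve10–pve14.
      - source_labels: [__address__]
        regex: "10\\.40\\.0\\.1([1-9]):9100"
        target_label: instance
        replacement: "pve${1}"
      - source_labels: [__address__]
        regex: "10\\.40\\.0\\.2([0-4]):9100"
        target_label: instance
        replacement: "pve1${1}"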
    # Retain only the metrics required for the Chapter 21 logic integration.
    # Dropping unused metrics at ingest time reduces storage by ~40% and
    # improves query performance by reducing series cardinality.
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "node_cpu_seconds_total|node_memory_MemAvailable_bytes|node_memory_MemTotal_bytes|node_disk_read_bytes_total|node_disk_written_bytes_total|node_disk_io_time_seconds_total|node_disk_read_time_seconds_total|node_disk_write_time_seconds_total|node_network_receive_errs_total|node_network_transmit_errs_total|node_network_receive_drop_total|node_network_transmit_drop_total|node_load1|node_load5|node_load15|node_filesystem_avail_bytes|node_filesystem_size_bytes|node_exporter_build_info"
        action: keep

  - job_name: victoria-metrics-self
    static_configs:
      - targets: ["localhost:8428"]

Scrape configuration annotation:

external_labels are attached to every metric ingested from this VictoriaMetrics instance, enabling disambiguation when federating multiple instances or querying from the logic engine.

The relabel_configs rule rewrites the instance label from the scrape address, mapping the trailing octet to the host number (e.g. 10.40.0.13 → pve3), so dashboards show readable hostnames rather than IP addresses.
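
The address plan behind that mapping is that pveN lives at 10.40.0.(10+N). A short sketch makes the off-by-ten explicit and can be used to generate or validate the target list; the function names are hypothetical and the node count of 14 comes from the scrape configuration above.

```python
def instance_for(address: str) -> str:
    """Return the pveN hostname for a 10.40.0.x:9100 scrape address."""
    host, _, _port = address.rpartition(":")
    octets = host.split(".")
    if len(octets) != 4 or octets[:3] != ["10", "40", "0"]:
        raise ValueError(f"not a metrics VLAN address: {address}")
    number = int(octets[3]) - 10  # hosts start at .11, so pve1 = .11
    if not 1 <= number <= 14:
        raise ValueError(f"no hypervisor at {address}")
    return f"pve{number}"


def scrape_targets(count: int = 14) -> list[str]:
    """Generate the static target list for pve1..pve<count>."""
    return [f"10.40.0.{10 + n}:9100" for n in range(1, count + 1)]


targets = scrape_targets()
labels = [instance_for(target) for target in targets]
```

Any relabel rule that copies the octet verbatim would label 10.40.0.13 as pve13, so comparing the rule's output against instance_for is a cheap regression check.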

The metric_relabel_configs keep-list drops all metric families not required for the Chapter 21 logic integration. Dropping unused metrics at ingest time reduces storage by approximately 40% and improves query performance by reducing series cardinality. The keep regex is a single-line alternation to avoid YAML multi-line string parsing edge cases.
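
Relabel regexes in Prometheus-compatible scrapers are fully anchored, so a keep rule retains a metric only when the whole name matches one alternative. The sketch below reproduces that behaviour with re.fullmatch so the keep-list can be tested offline; KEEP_NAMES restates the list from scrape.yml and the function name is hypothetical.

```python
import re

KEEP_NAMES = [
    "node_cpu_seconds_total",
    "node_memory_MemAvailable_bytes",
    "node_memory_MemTotal_bytes",
    "node_disk_read_bytes_total",
    "node_disk_written_bytes_total",
    "node_disk_io_time_seconds_total",
    "node_disk_read_time_seconds_total",
    "node_disk_write_time_seconds_total",
    "node_network_receive_errs_total",
    "node_network_transmit_errs_total",
    "node_network_receive_drop_total",
    "node_network_transmit_drop_total",
    "node_load1",
    "node_load5",
    "node_load15",
    "node_filesystem_avail_bytes",
    "node_filesystem_size_bytes",
    "node_exporter_build_info",
]

# Joining with | yields the single-line alternation used in the config.
KEEP_REGEX = re.compile("|".join(KEEP_NAMES))


def is_kept(metric_name: str) -> bool:
    """True if the keep rule would retain this metric name."""
    return KEEP_REGEX.fullmatch(metric_name) is not None
```

A stray space or quote inside the pattern makes every alternative fail the full match, which silently drops all hypervisor metrics; running is_kept over a captured scrape catches that before deployment.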

The victoria-metrics-self job scrapes VictoriaMetrics' own /metrics endpoint, providing insert rate, query duration, storage size, merge duration, and error counts: the baseline for detecting when the monitoring system itself is degraded.

20.4.4 Grafana Installation and Datasource Configuration

# Install Grafana OSS on obs-01.
# Using the official APT repository for reproducible upgrades.

root@obs-01:~# apt-get install -y apt-transport-https software-properties-common
root@obs-01:~# wget -q -O - https://apt.grafana.com/gpg.key | \
    gpg --dearmor -o /usr/share/keyrings/grafana.gpg
root@obs-01:~# echo \
    "deb [signed-by=/usr/share/keyrings/grafana.gpg] https://apt.grafana.com stable main" \
    > /etc/apt/sources.list.d/grafana.list
root@obs-01:~# apt-get update && apt-get install -y grafana

# Grafana listens on port 3000 by default. The nftables rules in Section 20.5
# restrict port 3000 to the management VPN (10.99.0.0/24) only.
root@obs-01:~# systemctl enable --now grafana-server

# /etc/grafana/provisioning/datasources/victoriametrics.yaml
# Provisioned datasources are immutable from the Grafana UI: they cannot be
# edited or deleted by dashboard users. The datasource is defined in code,
# version-controlled, and deployed with the monitoring VM. This prevents
# a misconfigured UI change from breaking all dashboards silently.

apiVersion: 1

datasources:
  - name:      VictoriaMetrics
    type:      prometheus
    access:    proxy
    # VictoriaMetrics exposes a Prometheus-compatible query API on /api/v1.
    # Grafana communicates to it over localhost — no network hop.
    url:       http://localhost:8428
    isDefault: true
    editable:  false
    jsonData:
      timeInterval:                  "15s"
      queryTimeout:                  "60s"
      httpMethod:                    POST
      incrementalQuerying:           true
      incrementalQueryOverlapWindow: "10m"

Datasource annotation:

Provisioned datasources are defined in code and deployed with the monitoring VM. They cannot be edited or deleted from the Grafana UI, preventing a misconfigured dashboard change from breaking all panels silently.

timeInterval: "15s" must match scrape_interval in scrape.yml. Grafana uses this value to calculate the minimum meaningful resolution for rate() and increase() functions; a mismatch produces misleading graph resolution on fast-changing metrics such as CPU steal.

queryTimeout: "60s" caps individual PromQL queries at 60 seconds, matching -search.maxQueryDuration in the victoria-metrics service unit.

httpMethod: POST is required for long PromQL queries whose label matchers would exceed URL length limits when encoded as GET parameters.

incrementalQuerying: true instructs Grafana to fetch only the new time range on each panel refresh rather than re-fetching the full visible window. At a 15-second scrape interval this reduces query load on dashboard auto-refresh by 80–90%. incrementalQueryOverlapWindow: "10m" sets how far back each incremental fetch overlaps the data already held.

# Restart Grafana to load the provisioned datasource:
root@obs-01:~# systemctl restart grafana-server

# Verify datasource is live:
root@obs-01:~# curl -s \
    -u admin:admin \
    http://localhost:3000/api/datasources | \
    python3 -c "import sys,json; ds=json.load(sys.stdin); print(ds[0]['name'], ds[0]['url'])"
VictoriaMetrics http://localhost:8428

20.5 Security: Ingest Isolation

20.5.1 Network Topology

The monitoring VM (obs-01) has two network interfaces:

eth0  — 10.40.0.2/24     VLAN 40: Metrics-Only VLAN
                          Reachable from: 10.40.0.11–10.40.0.24 (hypervisors)
                          Purpose: VictoriaMetrics ingest port 8428 only
                          No route to cluster data plane (10.0.0.0/8)
                          No route to management VPN (10.99.0.0/24)

eth1  — 192.168.100.5/24 VLAN 10: Management network
                          Reachable from: 10.99.0.0/24 (management VPN)
                          Purpose: Grafana UI port 3000, SSH port 22
                          No route to cluster data plane

VictoriaMetrics binds its ingest and query API to 0.0.0.0:8428. Without firewall rules, a management VPN client could POST fabricated metrics to port 8428 and corrupt the time-series database — either as a DoS (inserting NaN values for node_cpu_seconds_total) or as a deception attack (inserting false metric values to manipulate the Chapter 21 logic engine's healthy_node/1 guard). The nftables rules below enforce that port 8428 is reachable only from the metrics VLAN and that Grafana is reachable only from the management VPN.

20.5.2 nftables Ruleset

# /etc/nftables.conf — complete ruleset for obs-01.
# Apply:   nft -f /etc/nftables.conf
# Persist: systemctl enable nftables

table inet filter {

    # ── Inbound connection tracking ─────────────────────────────────────────

    chain input {
        type filter hook input priority 0; policy drop;

        # Accept established and related connections (TCP state machine).
        ct state established,related accept
        # Accept loopback — VictoriaMetrics and Grafana communicate over
        # localhost; blocking loopback would break the datasource connection.
        iif lo accept

        # ICMP: accept echo-request for network diagnostics.
        # Rate-limit to 10/second to prevent ICMP flooding.
        ip  protocol icmp   icmp  type echo-request limit rate 10/second accept
        ip6 nexthdr  icmpv6 icmpv6 type echo-request limit rate 10/second accept

        # ── SSH: management interface only ──────────────────────────────────
        # SSH is available only on the management VLAN (eth1, 192.168.100.0/24).
        # An attacker on the metrics VLAN who compromises a scrape endpoint
        # cannot reach SSH on obs-01.
        iif eth1 ip saddr 192.168.100.0/24 tcp dport 22 ct state new accept

        # ── VictoriaMetrics ingest: metrics VLAN only ───────────────────────
        # Port 8428 accepts connections only from hypervisor nodes on VLAN 40.
        # The source range is /24; only 10.40.0.11–10.40.0.24 are populated,
        # but /24 is used to accommodate future hypervisor additions without
        # a rule change. A tighter source range or individual host rules are
        # appropriate if the metrics VLAN is shared with other devices.
        iif eth0 ip saddr 10.40.0.0/24 tcp dport 8428 ct state new accept

        # VictoriaMetrics also exposes port 8428 for Grafana queries, but
        # Grafana runs on localhost and reaches VM via loopback (accepted above).
        # No external access to port 8428 from the management network is needed
        # or permitted. Grafana is the query interface for humans; the logic
        # engine (Chapter 21) polls the PromQL API via localhost on obs-01 or
        # via an internal API gateway — not directly from the cluster network.

        # ── Grafana UI: management VPN only ─────────────────────────────────
        # Port 3000 is reachable only from the management VPN range.
        # Grafana does not implement mTLS or client certificate authentication
        # by default; the network layer is the primary access control boundary.
        iif eth1 ip saddr 10.99.0.0/24 tcp dport 3000 ct state new accept

        # Log and drop everything else, including any attempt to reach port 8428
        # from the management interface, or port 3000 from the metrics VLAN.
        # The policy drop at the chain level handles this; explicit log entries
        # aid incident investigation.
        log prefix "[nftables-DROP] " flags all drop
    }

    chain forward {
        # obs-01 is not a router. Drop all forwarded packets.
        type filter hook forward priority 0; policy drop;
    }

    chain output {
        # Allow all outbound traffic from obs-01.
        # VictoriaMetrics initiates scrape connections to 10.40.0.0/24:9100.
        # Grafana initiates connections to localhost:8428.
        # No outbound restriction is required on a dedicated monitoring VM
        # that has no route to the internet (VLAN topology enforces this).
        type filter hook output priority 0; policy accept;
    }
}

Ruleset annotation:

Port 8428 is accepted only on eth0 (VLAN 40, metrics network) from 10.40.0.0/24. A management VPN client on eth1 attempting to POST fabricated metrics to port 8428 will hit the policy drop before any application code runs.

Port 3000 is accepted only on eth1 from the management VPN range 10.99.0.0/24. An operator on the metrics VLAN cannot reach the Grafana UI.

Grafana reaches VictoriaMetrics via localhost:8428 — the loopback accept rule covers this. No external access to port 8428 from eth1 is needed or permitted.

log prefix "[nftables-DROP]" writes all dropped packet metadata to the kernel log (journalctl -k -g nftables-DROP). This provides the audit trail required to investigate connection attempts that originate outside the expected source ranges.
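Kernel log entries written by the log rule are flat KEY=value records, so they are easy to reduce to the three fields that matter during an investigation: inbound interface, source address, and destination port. The following is a minimal sketch of such a parser. The sample line is hypothetical (its addresses, MAC, and ports are invented for illustration), and the helper name parse_drop is not part of any tool used in this chapter.

```python
import re

# HYPOTHETICAL sample record. The KEY=value layout follows the kernel's
# netfilter log format, but every value here is invented.
SAMPLE = (
    "[nftables-DROP] IN=eth1 OUT= MAC=aa:bb SRC=10.99.0.7 DST=192.168.100.5 "
    "LEN=60 TTL=63 PROTO=TCP SPT=51234 DPT=8428 SYN"
)

# Matches KEY=value pairs; the value may be empty (for example "OUT=").
FIELD_RE = re.compile(r"(\w+)=(\S*)")


def parse_drop(line: str):
    """Return interface, source and destination port for a drop record.

    Lines without the "[nftables-DROP]" prefix yield None.
    """
    if "[nftables-DROP]" not in line:
        return None
    fields = dict(FIELD_RE.findall(line))
    dpt = fields.get("DPT", "")
    return {
        "iface": fields.get("IN", ""),
        "src": fields.get("SRC", ""),
        "dport": int(dpt) if dpt.isdigit() else None,
    }


print(parse_drop(SAMPLE))
```

In practice the input would come from journalctl -k -g nftables-DROP piped to the script line by line. A record with iface eth1 and dport 8428, as in the sample, is the signature of a management-side client probing the ingest port.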

# Apply and verify:
root@obs-01:~# nft -f /etc/nftables.conf

root@obs-01:~# nft list ruleset | grep -A3 "chain input"
    chain input {
        type filter hook input priority 0; policy drop;
        ct state established,related accept
        iif "lo" accept

# Enable nftables persistence across reboots:
root@obs-01:~# systemctl enable --now nftables

# Verify the active input chain:
root@obs-01:~# nft list chain inet filter input

table inet filter {
    chain input {
        type filter hook input priority filter; policy drop;
        ct state { established, related } accept
        iif "lo" accept
        ip protocol icmp icmp type echo-request limit rate 10/second burst 5 packets accept
        ip6 nexthdr ipv6-icmp icmpv6 type echo-request limit rate 10/second burst 5 packets accept
        iif "eth1" ip saddr 192.168.100.0/24 tcp dport 22 ct state new accept
        iif "eth0" ip saddr 10.40.0.0/24 tcp dport 8428 ct state new accept
        iif "eth1" ip saddr 10.99.0.0/24 tcp dport 3000 ct state new accept
        log prefix "[nftables-DROP] " flags all drop
    }
}
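The three accept rules for new TCP connections in the input chain can be read as a small decision table, which makes it easy to state the intended policy as executable expectations before probing the live firewall. The sketch below is a simplified model, not a replacement for testing with real packets: it ignores loopback, established connections, and ICMP, and the names RULES and new_tcp_allowed are hypothetical.

```python
from ipaddress import ip_address, ip_network

# (inbound interface, permitted source network, destination TCP port),
# mirroring the three "ct state new accept" rules in the input chain.
RULES = [
    ("eth1", ip_network("192.168.100.0/24"), 22),    # SSH
    ("eth0", ip_network("10.40.0.0/24"), 8428),      # VictoriaMetrics ingest
    ("eth1", ip_network("10.99.0.0/24"), 3000),      # Grafana UI
]


def new_tcp_allowed(iface: str, src: str, dport: int) -> bool:
    """True if a new TCP connection matches one of the accept rules."""
    addr = ip_address(src)
    return any(
        iface == rule_iface and addr in network and dport == port
        for rule_iface, network, port in RULES
    )


# A hypervisor on the metrics VLAN may push to the ingest port.
print(new_tcp_allowed("eth0", "10.40.0.13", 8428))   # True
# A management VPN client may not reach the ingest port.
print(new_tcp_allowed("eth1", "10.99.0.7", 8428))    # False
# A host on the metrics VLAN may not reach Grafana.
print(new_tcp_allowed("eth0", "10.40.0.13", 3000))   # False
```

Each expectation corresponds to a connection attempt that can be repeated against obs-01 from the relevant network; a mismatch between the model and the observed behaviour points at a rule that was edited or reordered.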

20.5.3 Ingest Integrity: Rejecting Fabricated Metrics

The nftables rules enforce network-layer source isolation. They do not authenticate the content of what arrives on port 8428. A legitimate scrape endpoint that is compromised — a hypervisor where an attacker has achieved local code execution — can POST arbitrary metric values to VictoriaMetrics from within the trusted source range:

# An attacker with access to pve3's network stack can do this:
curl -X POST http://10.40.0.2:8428/api/v1/import/prometheus \
     -d 'node_cpu_seconds_total{cpu="0",mode="steal",instance="pve3"} 9999999.0 1741267200000'

This inserts a fabricated cpu_steal value for pve3 that will cause the Chapter 21 logic engine to mark pve3 as unhealthy and stop routing traffic to it — a logic-level DoS executed through the telemetry pipeline.

Two controls bound this risk. First, VictoriaMetrics --maxLabelsPerTimeseries (default 30) and --maxLabelValueLen (default 256) cap the structural complexity of inbound data. Second, the scrape.yml metric_relabel_configs in Section 20.4.3 drops any metric not in the explicit keep list — VictoriaMetrics only retains the fourteen named metric families. A fabricated metric name outside that list is silently discarded before it reaches storage.

The remaining risk, a fabricated value for a legitimate metric name, is addressed by the Chapter 21 logic integration, which applies rate-of-change guards in the Prolog KB: cpu_steal_valid(Node, Value) succeeds only if Value is between 0.0 and 100.0 and the delta from the previous sample does not exceed 40 percentage points in a single 15-second window. A sample jumping from 12.3% to 9,999,999.0 fails the delta guard and is not asserted into the live KB. The monitoring stack is the data source; the logic engine is the final validation layer for values that will affect cluster routing decisions.
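The guard itself is small enough to state precisely. The sketch below is a Python analogue of the check described above, not the Chapter 21 Prolog predicate: it takes the previous sample as an explicit argument rather than looking it up by node, and the parameter names are assumptions made for illustration.

```python
def cpu_steal_valid(previous: float, value: float, max_delta: float = 40.0) -> bool:
    """Accept a CPU steal sample only if it is a plausible percentage.

    Two conditions must hold: the value lies in [0.0, 100.0], and it
    differs from the previous sample by at most max_delta points.
    """
    in_range = 0.0 <= value <= 100.0
    within_delta = abs(value - previous) <= max_delta
    return in_range and within_delta


print(cpu_steal_valid(12.3, 9_999_999.0))   # False: out of range and too large a jump
print(cpu_steal_valid(12.3, 50.0))          # True: 37.7-point rise, within bounds
print(cpu_steal_valid(12.3, 60.0))          # False: 47.7-point rise exceeds the delta
print(cpu_steal_valid(12.3, float("nan")))  # False: NaN fails the range comparison
```

The range check alone already rejects the fabricated sample; the delta check covers the subtler case of an in-range value chosen to look plausible but arriving as an implausibly sharp step.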

# Final validation: end-to-end scrape confirmed.
root@obs-01:~# curl -s \
    "http://localhost:8428/api/v1/query?query=up%7Bjob%3D%22hypervisors%22%7D" | \
    python3 -c "
import sys, json
r = json.load(sys.stdin)
results = r['data']['result']
print(f'Hypervisors reporting: {len(results)}/14')
for m in sorted(results, key=lambda x: x['metric']['instance']):
    print(f\"  {m['metric']['instance']}: up={m['value'][1]}\")
"

Hypervisors reporting: 14/14
  pve1:  up=1
  pve2:  up=1
  pve3:  up=1
  pve4:  up=1
  pve5:  up=1
  pve6:  up=1
  pve7:  up=1
  pve8:  up=1
  pve9:  up=1
  pve10: up=1
  pve11: up=1
  pve12: up=1
  pve13: up=1
  pve14: up=1
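The query string in the validation command is the PromQL selector up{job="hypervisors"} with its braces, equals sign, and quotes percent-encoded. Encoding by hand is error-prone, so the following sketch derives the URL with the standard library and summarises a response of the same shape. It makes no network call; the canned response and the helper name summarize are hypothetical and exist only to show the expected structure.

```python
from urllib.parse import quote

BASE = "http://localhost:8428/api/v1/query"
promql = 'up{job="hypervisors"}'

# safe="" forces every reserved character to be percent-encoded.
encoded = quote(promql, safe="")
url = f"{BASE}?query={encoded}"
print(url)


def summarize(response: dict, expected: int = 14) -> str:
    """Count the series returned by an instant query for `up`."""
    results = response["data"]["result"]
    return f"Hypervisors reporting: {len(results)}/{expected}"


# HYPOTHETICAL canned response with two series, for illustration only.
canned = {
    "data": {
        "result": [
            {"metric": {"instance": "pve1"}, "value": [1741267200, "1"]},
            {"metric": {"instance": "pve2"}, "value": [1741267200, "1"]},
        ]
    }
}
print(summarize(canned))
```

The first line printed is the same URL used in the curl command; the second reads "Hypervisors reporting: 2/14" for the two-series canned input.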

20.5.4 Forward Path: Distributed Logic Harvesters

The telemetry architecture in this chapter uses a single polling model: the VictoriaMetrics instance on obs-01 scrapes all 14 hypervisors on a 15-second cycle, and the Chapter 21 logic engine polls obs-01's PromQL API to assert node_metric/4 facts into the central WAM. This is the correct starting point — a single ingest point with a single authoritative TSDB is operationally simple and auditable.

At scale, a more capable architecture distributes the reasoning step closer to the data source using SWI-Prolog's Pengines (Pengine servers). Instead of the central orchestrator polling a flat metric stream and asserting every hypervisor's telemetry into one WAM, each hypervisor runs a lightweight Logic Harvester: a Pengine server consulting a local copy of the node_health.pl KB and evaluating alert_condition/2 against its own node's metrics only. The Harvester asserts facts to the central Orchestrator only when a condition crosses a threshold — a transition from healthy to degraded or degraded to critical — rather than on every 15-second scrape cycle.

The consequence is a push-based alert model where the central WAM receives pre-filtered, already-reasoned facts (node_alert(pve7, cpu_steal_critical, 94.2)) rather than raw metric streams. The Orchestrator's shortest_path/3 guard can act on the asserted alert immediately without a polling lag. The central TSDB retains the full raw metric history for Grafana dashboards and forensic queries; the Logic Harvesters exist only to accelerate the signal-to-decision path for cluster routing changes.
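The transition-only behaviour is the core of the Harvester idea and can be sketched independently of Pengines. The example below is a Python illustration, not the Chapter 23 implementation: the thresholds DEGRADED_AT and CRITICAL_AT are hypothetical (the chapter does not define them), the class and function names are invented, and unlike the description above it also emits a fact on recovery so that the central state can return to healthy.

```python
from typing import Optional, Tuple

# HYPOTHETICAL thresholds, in percent CPU steal.
DEGRADED_AT = 20.0
CRITICAL_AT = 80.0


def classify(steal_pct: float) -> str:
    """Map a CPU steal percentage to a coarse health state."""
    if steal_pct >= CRITICAL_AT:
        return "critical"
    if steal_pct >= DEGRADED_AT:
        return "degraded"
    return "healthy"


class Harvester:
    """Tracks one node and reports only state transitions."""

    def __init__(self, node: str) -> None:
        self.node = node
        self.state = "healthy"

    def observe(self, steal_pct: float) -> Optional[Tuple[str, str, float]]:
        """Return an alert tuple on a state change, otherwise None."""
        new_state = classify(steal_pct)
        if new_state == self.state:
            return None
        self.state = new_state
        return (self.node, f"cpu_steal_{new_state}", steal_pct)


harvester = Harvester("pve7")
samples = [3.1, 4.0, 25.0, 27.5, 94.2, 93.0]
alerts = [a for a in (harvester.observe(v) for v in samples) if a is not None]
print(alerts)
```

Six samples produce two facts: one for the move to degraded at 25.0 and one for the move to critical at 94.2. The four samples that leave the state unchanged never leave the node.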

This architecture is a Chapter 23 build. The VictoriaMetrics deployment in this chapter is its prerequisite: the Harvester's local KB validation (cpu_steal_valid/2) uses the same rate-of-change guards described in §20.5.3, and the Pengine server it runs is an extension of the CGO worker pool pattern from Chapter 16.