Chapter 13: Ingesting the Offline Archive
Every cloud-dependent knowledge pipeline has a single failure mode: the API goes away. OpenAI rate-limits the account. The Anthropic endpoint returns 503 during the incident that the AI was supposed to help resolve. The corporate network is partitioned. The data centre is isolated during a fibre cut. The infrastructure is under active attack and the organisation's first response is to pull the WAN uplinks. At exactly the moment the Logic Node's intelligence is most needed, the cloud API is least available.
Sovereign Survival is the engineering discipline of ensuring that the Logic Node's reasoning capabilities degrade gracefully — not catastrophically — under network isolation. The infrastructure continues to function on cached state. The oracle predicates continue to return correct results. The security analysis continues to detect anomalies. The knowledge that powers these capabilities lives on the same storage that hosts the infrastructure, not on a third-party API endpoint over a connection the operator does not control.
The mechanism for storing that knowledge is a local Knowledge Graph: a collection of Prolog facts derived from authoritative offline sources — Wikipedia ZIM dumps, RFC archives, vendor documentation packages — processed once, indexed by the WAM's built-in hash structure, and queryable in microseconds without any network round-trip. Building it requires processing gigabytes of structured text safely, without exhausting heap memory, without leaking file descriptors, and without asserting more facts than the WAM's memory budget can sustain.
Five properties define the offline Knowledge Graph as an operational requirement, not an optimisation.
1. Air-gapped intelligence is not a luxury feature — it is a survival requirement.
A Logic Node that cannot reason about ZFS pool recovery procedures during a network outage is less useful than a printed runbook. A Logic Node that can answer "what does zpool scrub do and when should I run it after replacing a vdev" from a local KB of 100,000 asserted facts — without any network dependency — is the engineering goal. The ZIM-to-JSONL pipeline exists to convert the offline Wikipedia archive into that KB. The pipeline runs once, off-peak, against a static dump. The KB persists across reboots via qsave_program/2 or incremental save predicates. Network availability at query time is irrelevant.
2. JSONL is the correct format for gigabyte-scale streaming.
A monolithic JSON array containing 6,000,000 Wikipedia articles is a single syntactic structure. Parsing it requires holding the entire file in memory simultaneously — the JSON parser cannot produce its first result until it has consumed the final ]. A JSONL file containing the same 6,000,000 articles as one JSON object per line is a sequence of independent records. The streaming parser produces one Dict per read_line_to_string/2 call, processes it, asserts or discards it, and advances to the next line. At any instant, exactly one article's Dict is live on the heap. The memory profile is O(1) in file size — identical to the lazy-list streaming model from Chapter 11, but for structured JSON rather than syslog text.
3. setup_call_cleanup/3 is the only acceptable stream lifecycle primitive.
A gigabyte-scale file operation that opens a stream and closes it in the same predicate body, connected by catch/3, does not guarantee stream closure on unexpected engine aborts, resource limit exceptions, or signals delivered between the catch handler returning and the cleanup code executing. setup_call_cleanup(Setup, Goal, Cleanup) is a WAM-level primitive — Cleanup is registered with the engine at the C layer when Setup succeeds, and is called unconditionally when Goal exits, fails, throws, or is interrupted by a signal. File descriptors do not leak under any termination condition the WAM can observe.
4. Selective assertion is the filter, not the database.
A 50GB Wikipedia dump contains articles on medieval French poetry, the filmography of actors born in 1963, and the taxonomy of beetles. None of this belongs in an infrastructure Knowledge Graph. Asserting it all would exhaust the WAM's heap in under an hour of processing and consume hundreds of gigabytes of memory. The ingestion pipeline is a filter: it reads every record, checks the article's topic tags against an infrastructure whitelist (linux, zfs, proxmox, networking, storage, security, cryptography, bsd, kubernetes, openssl), and asserts a structured tutorial_fact/3 fact only for matching records. Non-matching records are discarded immediately — no heap allocation beyond the single Dict that was parsed and rejected.
5. WAM JIT indexing makes the first argument of a fact the hash key.
SWI-Prolog's JIT indexer builds a hash table on the first argument of a predicate when the number of clauses exceeds a threshold (typically 10). With Topic as the first argument of tutorial_fact(Topic, Title, CleanText), a call such as tutorial_fact(zfs, _, _) locates its candidate clauses with one hash lookup, O(1), regardless of whether the KB contains 1,000 or 1,000,000 facts; enumerating the matches then costs time proportional to the number of zfs clauses, not to the size of the KB. A predicate with a compound first argument, an uninstantiated first argument, or a first argument that is identical across all clauses degrades to O(N) linear scan. Argument ordering in the tutorial_fact/3 schema is not arbitrary: it is the performance specification for the KB query path.
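The effect of binding the first argument can be tried on a scratch predicate before any real ingestion. The following is a minimal sketch, assuming a hypothetical demo_fact/3 that stands in for tutorial_fact/3; fill_demo/1 and count_topic/2 are illustration names, not part of archive_ingestor.pl.

```prolog
:- use_module(library(aggregate)).
:- use_module(library(lists)).

:- dynamic demo_fact/3.

%% fill_demo(+PerTopic): assert PerTopic facts for each of three topics.
%% demo_fact/3 is a hypothetical stand-in for tutorial_fact/3.
fill_demo(PerTopic) :-
    retractall(demo_fact(_, _, _)),
    forall(( member(Topic, [zfs, linux, dns]),
             between(1, PerTopic, I) ),
           ( format(string(Title), "~w article ~d", [Topic, I]),
             assertz(demo_fact(Topic, Title, "body")) )).

%% count_topic(+Topic, -N): the first argument is bound, so the clause
%% index on argument 1 can be used instead of a scan of every clause.
count_topic(Topic, N) :-
    aggregate_all(count, demo_fact(Topic, _, _), N).
```

After `fill_demo(1000)`, the query `count_topic(zfs, N)` binds N to 1000 out of 3,000 stored clauses. Wrapping the query in time/1 at different values of PerTopic is one way to observe how the cost tracks the number of matching clauses.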
Chapter Roadmap
| Section | Title | Focus |
|---|---|---|
| 13.1 | The Sovereign Archive | JSONL format, streaming vs. slurping, ZIM pipeline overview |
| 13.2 | File Descriptor Safety | setup_call_cleanup/3 dissection, catch/3 insufficiency proof |
| 13.3 | The Build: archive_ingestor.pl | JSONL streaming, JSON parsing, Markdown stripping, fact assertion |
| 13.4 | JIT Indexing for Massive Datasets | Argument ordering, hash triggering, query performance verification |
| 13.5 | Security: Heap Saturation and Triage | 50GB analysis, topic filter, memory budget arithmetic |
| Outcome | The Local Knowledge Graph | Verification checklist, Volume II closure |
13.1 The Sovereign Archive
13.1.1 ZIM Dumps and the JSONL Conversion
The Kiwix project distributes Wikipedia and other reference archives as ZIM files — a compressed, indexed binary format designed for offline browsing. A full English Wikipedia ZIM dump is approximately 85GB compressed, containing roughly 6.7 million articles. The ZIM format is not directly streamable by Prolog; the conversion step uses zimdump (from the libzim toolchain) to extract all articles as a JSONL file:
logicadmin@logic-node-01:~$ zimdump --dir=/var/data/archives/wikipedia-jsonl/ \
/var/data/archives/wikipedia_en_all_maxi_2026-01.zim
# Each article becomes one line of JSON:
# {"title":"ZFS","tags":["storage","linux","bsd","filesystem"],"text":"ZFS is a combined..."}
# {"title":"Proxmox VE","tags":["virtualization","linux","proxmox"],"text":"Proxmox Virtual..."}
# ...6,700,000 lines total, ~48GB uncompressed JSONL
logicadmin@logic-node-01:~$ wc -l /var/data/archives/wikipedia-jsonl/wikipedia_en.jsonl
6724891 /var/data/archives/wikipedia-jsonl/wikipedia_en.jsonl
logicadmin@logic-node-01:~$ du -sh /var/data/archives/wikipedia-jsonl/wikipedia_en.jsonl
48G /var/data/archives/wikipedia-jsonl/wikipedia_en.jsonl
13.1.2 JSONL vs. Monolithic JSON Array — The Memory Physics
A monolithic JSON array of 6.7 million articles:
[ {"title":"A", ...}, {"title":"B", ...}, ... {"title":"ZZZ", ...} ]
SWI-Prolog's json_read_dict/2 reads from a stream and does not buffer the raw file. A caller that first loads the entire file contents into an atom or string and then parses it (for example, read_file_to_string/3 followed by atom_json_dict/3) does buffer it. More critically: the top-level [] array is a single JSON value, returned as one Prolog list term. Fully parsing a 48GB JSON array into a Prolog list allocates:
6,700,000 articles
× average article size: 7,200 bytes JSON
= 48GB source
→ Prolog list: 6,700,000 cons cells at 16 bytes = 107MB for the spine alone
→ Each article dict: variable, but average 400 heap cells = 2.68B cells total
→ At 8 bytes/cell: ~21GB heap for the parsed structure
→ SWI-Prolog default stack limit: 1GB on 64-bit builds (configurable, but bounded by RAM)
→ Result: resource error, or an OOM kill if the limit is raised beyond physical RAM, long before 25% of the file is parsed
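The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch that takes the chapter's own assumptions as given (16 bytes per list cell, 8 bytes per heap cell, 400 cells per parsed article); monolithic_estimate/4 is a hypothetical helper name.

```prolog
%% monolithic_estimate(+Articles, +CellsPerDict, -SpineMB, -DictGB)
%% SpineMB: size of the list spine in megabytes (16 bytes per cell, assumed).
%% DictGB:  size of all parsed article dicts in gigabytes (8 bytes per cell, assumed).
monolithic_estimate(Articles, CellsPerDict, SpineMB, DictGB) :-
    SpineMB is Articles * 16 / 1.0e6,
    DictGB  is Articles * CellsPerDict * 8 / 1.0e9.
```

Calling `monolithic_estimate(6700000, 400, SpineMB, DictGB)` gives about 107.2 MB for the spine and about 21.44 GB for the parsed dicts, which is what makes the monolithic approach unworkable under any realistic stack limit.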
JSONL with read_line_to_string/2 in a streaming loop:
Each iteration:
read_line_to_string/2: allocates one string (~7,200 bytes)
atom_json_dict/3: allocates one Dict (~3,200 bytes)
topic filter check: no allocation (pattern match over list)
assertz/1 or discard: for discarded records — heap freed at next GC
for asserted records: fact is moved to permanent clause storage
Peak live heap per iteration: ~11,000 bytes
Steady-state after GC: ~300 bytes (stream state + accumulator)
The streaming model processes all 6.7 million articles. The monolithic model exhausts memory within the first few gigabytes of input.
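The per-iteration profile can be seen in a loop much smaller than the full ingestor. The following is a minimal sketch, assuming a hypothetical count_jsonl/3 that only classifies lines as valid or invalid JSON objects; it is not the chapter's ingestion code, and it keeps nothing but two counters between iterations.

```prolog
:- use_module(library(readutil)).
:- use_module(library(http/json)).

%% count_jsonl(+Path, -Valid, -Invalid)
%% Streams Path one line at a time. Only the current line and its Dict
%% are live; both become garbage as soon as the counters are updated.
count_jsonl(Path, Valid, Invalid) :-
    setup_call_cleanup(
        open(Path, read, Stream, [encoding(utf8)]),
        count_lines(Stream, 0, Valid, 0, Invalid),
        close(Stream)).

count_lines(Stream, V0, V, I0, I) :-
    read_line_to_string(Stream, Line),
    (   Line == end_of_file
    ->  V = V0, I = I0
    ;   (   catch(atom_json_dict(Line, Dict, []), _, fail),
            is_dict(Dict)                 % a JSON object, not a bare list or number
        ->  V1 is V0 + 1, I1 = I0
        ;   V1 = V0, I1 is I0 + 1
        ),
        count_lines(Stream, V1, V, I1, I) % last call: the loop runs in constant stack
    ).
```

Running it over a sample of the archive before a full ingestion is a cheap way to find out how many malformed lines the conversion step produced.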
13.1.3 Diagram: High-Throughput Ingestion Pipeline
%%{init: {"themeVariables": {"fontSize": "12px"}}}%%
flowchart TD
JSONL["JSONL File\n/var/data/archives/wikipedia_en.jsonl\n48GB, 6.7M lines\nOn disk — unread"]
SCC["setup_call_cleanup/3\nSetup: open(Path, read, Stream)\nGoal: process_archive_stream(Stream, Opts)\nCleanup: close(Stream)\nStream closed on ANY exit condition"]
READLINE["read_line_to_string(Stream, Line)\nOne line at a time — O(1) heap\n'end_of_file' terminates loop\nNo buffering ahead"]
JSON["atom_json_dict(Line, Dict, [])\nDict = json{title:T, tags:Tags, text:Text}\nOne Dict live at a time\nGC-eligible after this step"]
FILTER["topic_filter(Dict.tags, Opts.whitelist)\nmemberchk/2 over small whitelist\nFAIL path: discard Dict immediately\nPASS path: continue to stripping"]
STRIP["strip_markdown(Dict.text, CleanText)\nRemoves: ** __ [] () #\nProduces a Prolog string\nBounded output: max 8192 chars"]
INDEX["JIT Indexer\nFirst assertz triggers hash build\nWAM hashes on arg 1: Topic atom\nO(1) lookup after threshold"]
ASSERT["assertz(tutorial_fact(Topic, Title, CleanText))\nTopic: atom from closed whitelist\nTitle: string (never atom)\nCleanText: GC-eligible until asserted"]
DISCARD["Discard\nDict reference dropped\nHeap freed at next GC cycle\nNo Atom Table entry"]
JSONL --->|"phrase_from_file or open/3"| SCC
SCC --->|"stream open"| READLINE
READLINE --->|"one line string"| JSON
JSON --->|"Dict produced"| FILTER
FILTER --->|"topic match"| STRIP
FILTER --->|"no match"| DISCARD
STRIP --->|"CleanText bound"| INDEX
INDEX --->|"hash updated"| ASSERT
ASSERT --->|"loop continues"| READLINE
style JSONL fill:#1A2B4A,color:#FFFFFF
style SCC fill:#7A1A1A,color:#FFFFFF
style READLINE fill:#1A4070,color:#FFFFFF
style JSON fill:#1A4070,color:#FFFFFF
style FILTER fill:#8B6914,color:#FFFFFF
style STRIP fill:#1A4070,color:#FFFFFF
style INDEX fill:#2A5A2A,color:#FFFFFF
style ASSERT fill:#1A6B3A,color:#FFFFFF
style DISCARD fill:#3A3A3A,color:#FFFFFF
Reading the diagram: Red (setup_call_cleanup/3) is the lifecycle gate — stream opens once, closes unconditionally. Amber (filter) is the triage gate — non-matching records exit left into the discard path without touching the stripper or the KB. Green (assert) is the terminal state for qualifying records. The loop arrow from assertz back to read_line_to_string is the O(1)-memory steady state.
13.2 File Descriptor Safety
13.2.1 The Problem catch/3 Cannot Solve
The naive "safe" stream pattern:
% INSUFFICIENT — do not use for gigabyte-scale streams
process_with_catch(Path) :-
open(Path, read, Stream),
catch(
process_all_lines(Stream),
Error,
( close(Stream), throw(Error) )
),
close(Stream).
This pattern has four failure modes where Stream leaks:
Failure mode 1: Failure or a leftover choice point.
If process_all_lines/1 succeeds deterministically, execution reaches close(Stream) on the last line, and this case works. If process_all_lines/1 fails, catch/3 fails with it: the handler does not fire because nothing was thrown, the final close(Stream) is never reached, and Stream stays open. If process_all_lines/1 succeeds but leaves a choice point, the stream is closed after the first solution, and any later backtracking into process_with_catch/1 resumes process_all_lines/1 against a stream that is already closed.
Failure mode 2: Stack overflow inside process_all_lines/1.
ERROR: Stack limit exceeded is thrown as a Prolog exception. The catch/3 handler fires, calls close(Stream), and re-throws. This works — for stack overflows that the WAM catches before the C stack is involved. A C-level stack overflow that produces a signal rather than a Prolog exception bypasses catch/3 entirely. Stream is never closed.
Failure mode 3: Interrupt signal (SIGINT, SIGTERM).
A kill -TERM delivered to the SWI-Prolog process while process_all_lines/1 is executing suspends execution between two bytecode instructions. The catch/3 handler does not fire for OS signals unless the signal has been converted to a Prolog exception via on_signal/3. For SIGTERM with default handling, the process exits. Stream is leaked. The file descriptor table of the process is released by the OS at process exit — so the leak is not persistent — but any data buffered in Stream that has not been flushed is lost.
Failure mode 4: Resource exhaustion exception during catch/3 handler execution.
If close(Stream) itself throws — because the stream's underlying buffer flush fails with an I/O error — the re-thrown Error is discarded and replaced by the close/1 exception. The original error is lost. The stream may be in a partially-closed state.
13.2.2 setup_call_cleanup/3: The WAM-Level Guarantee
setup_call_cleanup(+Setup, +Goal, +Cleanup)
Setup is called first. If it fails, setup_call_cleanup/3 fails — Cleanup is never called. If Setup succeeds, Cleanup is registered at the C layer as a cleanup handler for Goal. Cleanup is then called when:
- Goal succeeds (deterministically or on the last solution)
- Goal fails
- Goal throws any Prolog exception, including stack overflow exceptions caught at the WAM boundary
- Goal is interrupted by a signal that the WAM converts to an exception
- The SWI-Prolog process is terminated via halt/0 or halt/1
The registration at the C layer is the critical distinction from catch/3. catch/3 is implemented as a Prolog meta-predicate with a choice point; its handler fires only when a Prolog throw propagates through that choice point. setup_call_cleanup/3 registers Cleanup in the WAM's cleanup chain — a C data structure that is walked unconditionally during engine shutdown, regardless of how Goal exits.
% CORRECT — guaranteed stream closure
process_safe(Path, Goal) :-
setup_call_cleanup(
open(Path, read, Stream, [encoding(utf8)]), % Setup
call(Goal, Stream), % Goal
close(Stream) % Cleanup: fires unconditionally
).
The one constraint: Cleanup should be a simple, deterministic goal. SWI-Prolog runs it as if wrapped in once/1: its success or failure is ignored and any choice points it leaves are discarded. An exception raised by Cleanup is not silently dropped, however; if Goal exited normally, that exception propagates to the caller. For a read-only stream there is nothing to flush, so close(Stream) is deterministic and is the correct cleanup goal.
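The difference between the two patterns shows up even when the goal merely fails. The following is a minimal sketch under stated assumptions: leaky_read/1, safe_read/1, still_open/3 and the opened/1 bookkeeping fact are hypothetical names introduced for the demonstration, and the body of each reader is just `fail`.

```prolog
:- dynamic opened/1.

%% leaky_read(+Path): the catch/3 pattern with a body that fails.
%% opened/1 records the handle so it can be inspected afterwards.
leaky_read(Path) :-
    open(Path, read, Stream),
    assertz(opened(Stream)),
    catch(fail, Error, ( close(Stream), throw(Error) )),
    close(Stream).                       % never reached when the body fails

%% safe_read(+Path): the same failing body under setup_call_cleanup/3.
safe_read(Path) :-
    setup_call_cleanup(
        ( open(Path, read, Stream), assertz(opened(Stream)) ),
        fail,
        close(Stream)).

%% still_open(:Reader, +Path, -Bool): run Reader on Path, then report
%% whether the stream it opened is still a valid open stream.
still_open(Reader, Path, Bool) :-
    retractall(opened(_)),
    ignore(call(Reader, Path)),
    opened(Stream),
    (   is_stream(Stream)
    ->  Bool = true, close(Stream)       % tidy up the leaked handle
    ;   Bool = false
    ).
```

Given any readable file, `still_open(leaky_read, File, B)` reports B = true and `still_open(safe_read, File, B)` reports B = false: the failing goal leaves the handle open in the first pattern and closed in the second.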
13.2.3 setup_call_catcher_cleanup/4: When the Exit Condition Matters
For ingestion pipelines that need to log the exit condition:
setup_call_catcher_cleanup(+Setup, +Goal, +Catcher, +Cleanup)
Catcher is unified with the exit condition before Cleanup fires:
- exit: Goal succeeded
- fail: Goal failed
- exception(Error): Goal threw Error
- external_exception(Id): interrupt from outside
process_with_audit(Path, Goal, Summary) :-
setup_call_catcher_cleanup(
open(Path, read, Stream, [encoding(utf8)]),
call(Goal, Stream),
Catcher,
( close(Stream),
log_stream_exit(Path, Catcher) )
),
Summary = catcher(Catcher).
log_stream_exit(Path, exit) :-
format("[OK] Stream closed normally: ~w~n", [Path]).
log_stream_exit(Path, exception(E)) :-
format("[WARN] Stream closed after exception ~w: ~w~n", [E, Path]).
log_stream_exit(Path, _) :-
format("[INFO] Stream closed: ~w~n", [Path]).
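The Catcher values can be observed in isolation, without any file involved. The following is a minimal sketch; outcome/2 and the global variable name last_catcher are hypothetical. The catcher is stored with nb_setval/2 because ordinary bindings made inside Cleanup are undone when Goal fails or throws.

```prolog
%% outcome(:Goal, -Catcher): run Goal once and report how it left,
%% as seen by the Catcher argument of setup_call_catcher_cleanup/4.
outcome(Goal, Catcher) :-
    nb_setval(last_catcher, none),
    catch(
        ignore(setup_call_catcher_cleanup(
                   true,                          % Setup: nothing to acquire
                   Goal,
                   C,
                   nb_setval(last_catcher, C))),  % Cleanup: record the catcher
        _,
        true),                                    % swallow the exception after recording
    nb_getval(last_catcher, Catcher).
```

`outcome(true, C)` yields C = exit, `outcome(fail, C)` yields C = fail, and `outcome(throw(oops), C)` yields C = exception(oops). SWI-Prolog additionally reports `!` when the remaining choice points of a non-deterministic Goal are cut, so a logging predicate such as log_stream_exit/2 benefits from a catch-all final clause.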
13.3 The Build: archive_ingestor.pl
13.3.1 Architecture
logicadmin@logic-node-01:~$ nano /opt/logic-node/kb/archive/archive_ingestor.pl
%% =============================================================================
%% FILE: /opt/logic-node/kb/archive/archive_ingestor.pl
%% PURPOSE: Stream-process ZIM-extracted JSONL files into tutorial_fact/3 KB.
%%
%% SECURITY CONTRACT:
%% — File streams opened ONLY via setup_call_cleanup/3.
%% — Article text stored as Prolog strings (never atoms — Atom Table protection).
%% — Article titles stored as Prolog strings.
%% — Topic tags normalised to atoms from a CLOSED WHITELIST only.
%% Atoms outside the whitelist are never interned.
%% — Raw JSON text is NEVER asserted. Only stripped, bounded CleanText.
%% — Text truncated at MAX_TEXT_CHARS before assertion.
%% — Records not matching the whitelist are discarded immediately.
%%
%% EXPORTS:
%% ingest_archive/2 — main entry point
%% ingest_archive/3 — with options
%% tutorial_fact/3 — queryable KB (asserted dynamically)
%% topic_fact_count/2 — count of asserted facts per topic
%% ingest_summary/1 — last ingestion run summary Dict
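%% default_opts/1 : default options Dict (called from the REPL in 13.3.2)
%% strip_markdown/2 : Markdown stripper (called from the REPL in 13.3.2)
%% =============================================================================
:- module(archive_ingestor, [
ingest_archive/2,
ingest_archive/3,
tutorial_fact/3,
topic_fact_count/2,
ingest_summary/1,
default_opts/1,
strip_markdown/2
]).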
:- use_module(library(http/json)).
:- use_module(library(error)).
:- use_module(library(aggregate)).
:- use_module('/opt/logic-node/kb/parsers/network_parser').
%% ---------------------------------------------------------------------------
%% DYNAMIC DECLARATIONS
%% ---------------------------------------------------------------------------
:- dynamic tutorial_fact/3.
%% tutorial_fact(+Topic, +Title, +CleanText)
%% Topic: atom from infrastructure_topic/1 closed vocabulary
%% Title: Prolog string — article title, max 512 chars
%% CleanText: Prolog string — markdown-stripped body, max 8192 chars
%%
%% INDEXING: WAM JIT indexes on arg 1 (Topic atom) after ~10 clauses.
%% Query: tutorial_fact(zfs, _, _) → O(1) hash lookup.
%% Query: tutorial_fact(_, _, _) → O(N) full scan — avoid in hot paths.
%%
%% The fact ordering — Topic first, Title second, Text last — is
%% intentional. See Section 13.4 for the indexing analysis.
:- dynamic ingest_summary/1.
%% ingest_summary(summary{...}) — replaced after each run
%% ---------------------------------------------------------------------------
%% CLOSED VOCABULARY: INFRASTRUCTURE TOPICS
%% ---------------------------------------------------------------------------
%% infrastructure_topic(+Atom): the complete set of KB topic atoms.
%% Pre-interned at load time. ONLY atoms from this set may become
%% the first argument of tutorial_fact/3.
infrastructure_topic(linux).
infrastructure_topic(zfs).
infrastructure_topic(proxmox).
infrastructure_topic(networking).
infrastructure_topic(storage).
infrastructure_topic(security).
infrastructure_topic(cryptography).
infrastructure_topic(bsd).
infrastructure_topic(kubernetes).
infrastructure_topic(openssl).
infrastructure_topic(nginx).
infrastructure_topic(ssh).
infrastructure_topic(firewall).
infrastructure_topic(dns).
infrastructure_topic(tls).
infrastructure_topic(filesystem).
infrastructure_topic(virtualization).
infrastructure_topic(backup).
infrastructure_topic(monitoring).
infrastructure_topic(logging).
%% topic_from_tag(+TagString, -TopicAtom)
%% Converts an article tag string to a topic atom IFF the tag is in the
%% closed vocabulary. Fails for any unknown tag — no new atom is created.
topic_from_tag(TagStr, TopicAtom) :-
string(TagStr),
atom_string(CandidateAtom, TagStr), % Tentative atom — may not be in vocabulary
infrastructure_topic(CandidateAtom), % Fail if not in vocabulary
TopicAtom = CandidateAtom.
%% If infrastructure_topic/1 fails, CandidateAtom was interned temporarily.
%% This is unavoidable with atom_string/2 — the atom exists for the duration
%% of this predicate call and is GC-eligible once it falls out of scope.
%% An alternative is to use string comparison against a string vocabulary:
%% topic_from_tag_safe(+TagStr, -TopicAtom)
%% Zero-atom version: checks against a string whitelist first.
%% Only calls atom_string/2 (which interns) after confirming the tag is valid.
topic_from_tag_safe(TagStr, TopicAtom) :-
infrastructure_topic_string(TagStr, TopicAtom).
infrastructure_topic_string("linux", linux).
infrastructure_topic_string("zfs", zfs).
infrastructure_topic_string("proxmox", proxmox).
infrastructure_topic_string("networking", networking).
infrastructure_topic_string("storage", storage).
infrastructure_topic_string("security", security).
infrastructure_topic_string("cryptography", cryptography).
infrastructure_topic_string("bsd", bsd).
infrastructure_topic_string("kubernetes", kubernetes).
infrastructure_topic_string("openssl", openssl).
infrastructure_topic_string("nginx", nginx).
infrastructure_topic_string("ssh", ssh).
infrastructure_topic_string("firewall", firewall).
infrastructure_topic_string("dns", dns).
infrastructure_topic_string("tls", tls).
infrastructure_topic_string("filesystem", filesystem).
infrastructure_topic_string("virtualization",virtualization).
infrastructure_topic_string("backup", backup).
infrastructure_topic_string("monitoring", monitoring).
infrastructure_topic_string("logging", logging).
%% infrastructure_topic_string/2 maps string → pre-interned atom.
%% No atom is created from tag data. The atom on the right-hand side of
%% each clause was interned when this file was compiled.
%% ---------------------------------------------------------------------------
%% OPTIONS
%% ---------------------------------------------------------------------------
default_opts(ingest_opts{
max_text_chars: 8192, % Truncate CleanText at this many characters
max_title_chars: 512, % Truncate title at this many characters
batch_gc_every: 10000, % Call garbage_collect/0 every N records processed
dry_run: false % If true: parse and filter but do not assertz
}).
%% ---------------------------------------------------------------------------
%% MARKDOWN STRIPPER
%% ---------------------------------------------------------------------------
%% strip_markdown(+RawText, -CleanText)
%% Removes common Markdown constructs from article text.
%% RawText: Prolog string. CleanText: Prolog string.
%%
%% Patterns removed (in order):
%% ![alt](url) → "" (images: discard entirely)
%% [text](url) → text (hyperlinks: keep anchor text)
%% **text** → text (bold)
%% __text__ → text (bold alternative)
%% *text* → text (italic)
%% _text_ → text (italic alternative)
%% `code` → code (inline code — keep content)
%% # Heading\n → Heading\n (ATX headings — strip marker)
%% \n\n+ → \n\n (collapse multiple blank lines)
%%
%% This structural approximation strips sufficient syntax for clean LLM
%% context retrieval. Edge cases (nested brackets, escaped characters)
%% are left intact — they do not affect downstream embedding quality.
%%
%% PERFORMANCE NOTE — library(pcre):
%% The sub_string/5 substitution chain below is portable but allocates
%% intermediate Prolog strings on the WAM heap for every matched token,
%% causing GC thrashing on long articles. See the performance note in
%% Section 13.3.3 for the library(pcre) re_replace/4 replacement that
%% delegates substitution to the C-level PCRE engine, eliminating the
%% intermediate string allocation entirely.
strip_markdown(Raw, Clean) :-
string(Raw),
% Apply the stripping rules in sequence. This portable version uses
% sub_string/5 and string concatenation only; the library(pcre)
% re_replace/4 variant is described in Section 13.3.3.
strip_images(Raw, S1),
strip_links(S1, S2),
strip_bold_star(S2, S3),
strip_bold_under(S3, S4),
strip_italic_star(S4, S5),
strip_italic_under(S5, S6),
strip_inline_code(S6, S7),
strip_atx_headings(S7, S8),
collapse_blank_lines(S8, Clean).
%% strip_images(+S, -S1): removes ![alt](url) patterns
strip_images(S, S1) :-
( sub_string(S, B, _, _, "!["),
sub_string(S, B, _, 0, FromBang),
sub_string(FromBang, Close, 1, _, ")") ->
sub_string(S, 0, B, _, Before),
AfterStart is Close + 1,
sub_string(FromBang, AfterStart, _, 0, After),
string_concat(Before, After, Tmp),
strip_images(Tmp, S1)
;
S1 = S
).
%% strip_links(+S, -S1): replaces [text](url) with text
strip_links(S, S1) :-
( sub_string(S, B, 1, _, "["),
sub_string(S, B, _, 0, After),
extract_link_text(After, LinkText, Tail) ->
sub_string(S, 0, B, _, Before),
strip_links(Tail, CleanTail), % Tail is shorter than S, so this terminates
atomics_to_string([Before, LinkText, CleanTail], S1)
;
S1 = S
).
%% extract_link_text(+S, -Text, -Tail): S starts with "["; Text is the
%% bracket content, Tail is whatever follows the optional "(url)".
extract_link_text(S, Text, Tail) :-
string_codes(S, [0'[ | Rest]),
phrase(bracket_content(TextCodes, AfterBracket), Rest),
string_codes(Text, TextCodes),
skip_paren_url(AfterBracket, TailCodes),
string_codes(Tail, TailCodes). % CONVERT: Codes -> String
%% bracket_content(-TextCodes, -Rest): DCG over code list. Collects the
%% text up to ']' and returns everything after the ']' in Rest.
bracket_content([], Rest) --> [0']], !, rest_of_input(Rest).
bracket_content([C|Cs], Rest) --> [C], bracket_content(Cs, Rest).
bracket_content([], []) --> [].
%% rest_of_input(-Rest): consumes and returns all remaining codes.
rest_of_input(Rest, Rest, []).
skip_paren_url([0'( | Rest], Tail) :-
!,
skip_to_close_paren(Rest, _, Tail).
skip_paren_url(S, S).
skip_to_close_paren([0') | Rest], [], Rest) :- !.
skip_to_close_paren([_ | Rest], [_ | Skipped], Tail) :-
skip_to_close_paren(Rest, Skipped, Tail).
skip_to_close_paren([], [], []).
%% Bold/italic stripping — iterative substitution using sub_string/5
%% Each predicate makes one pass, removing all occurrences of the pattern.
strip_bold_star(S, S1) :- strip_delimited(S, "**", "**", S1).
strip_bold_under(S, S1) :- strip_delimited(S, "__", "__", S1).
strip_italic_star(S, S1) :- strip_delimited(S, "*", "*", S1).
strip_italic_under(S, S1):- strip_delimited(S, "_", "_", S1).
strip_inline_code(S, S1) :- strip_delimited(S, "`", "`", S1).
%% strip_delimited(+S, +Open, +Close, -S1)
%% Replaces all Open...Close sequences with their inner content.
%% Iterates until no more occurrences remain.
strip_delimited(S, Open, Close, S1) :-
( sub_string(S, B, OLen, _, Open),
string_length(Open, OLen),
sub_string(S, End, CLen, _, Close),
string_length(Close, CLen),
End > B + OLen ->
InnerStart is B + OLen,
InnerLen is End - InnerStart,
sub_string(S, 0, B, _, Before),
sub_string(S, InnerStart, InnerLen, _, Inner),
AfterStart is End + CLen,
sub_string(S, AfterStart, _, 0, After),
atomics_to_string([Before, Inner, After], Tmp),
strip_delimited(Tmp, Open, Close, S1)
;
S1 = S
).
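%% strip_atx_headings(+S, -S1): removes leading # characters from heading lines
strip_atx_headings(S, S1) :-
split_string(S, "\n", "", Lines),
maplist(strip_heading_marker, Lines, Stripped),
join_lines(Stripped, S1).
%% strip_heading_marker(+Line, -Stripped): drops every leading '#'
strip_heading_marker(Line, Stripped) :-
string_codes(Line, Codes),
drop_leading_hashes(Codes, Rest),
string_codes(Stripped, Rest).
drop_leading_hashes([0'# | T], Rest) :- !, drop_leading_hashes(T, Rest).
drop_leading_hashes(Rest, Rest).
%% join_lines(+Lines, -S): concatenates Lines with "\n" between them,
%% restoring the separators consumed by split_string/4.
join_lines([], "").
join_lines([Line], Line) :- !.
join_lines([Line | Lines], S) :-
join_lines(Lines, Rest),
atomics_to_string([Line, "\n", Rest], S).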
collapse_blank_lines(S, S1) :-
% Replace three or more consecutive newlines with two
( sub_string(S, B, _, A, "\n\n\n") ->
sub_string(S, 0, B, _, Before),
AfterStart is B + 3,
string_length(S, TotalLen),
AfterLen is TotalLen - AfterStart,
sub_string(S, AfterStart, AfterLen, 0, After),
string_concat(Before, "\n\n", Tmp),
string_concat(Tmp, After, Tmp2),
collapse_blank_lines(Tmp2, S1)
;
S1 = S
).
%% ---------------------------------------------------------------------------
%% RECORD PROCESSING
%% ---------------------------------------------------------------------------
%% process_record(+Line, +Opts, +Counts, -NewCounts)
%% Parses one JSONL line, filters, strips, asserts.
%% Counts: counts{processed:N, asserted:M, rejected:E, errors:K}
process_record(Line, Opts, Counts, NewCounts) :-
Processed is Counts.processed + 1,
(
% Parse JSON line to Dict
catch(
atom_json_dict(Line, RawDict, [tag(json)]),
_JsonError,
fail
),
% Extract fields — fail if required fields absent
is_dict(RawDict, json),
RawTitle = RawDict.get(title),
RawText = RawDict.get(text),
RawTags = RawDict.get(tags),
is_list(RawTags),
% Find first matching infrastructure topic
find_topic(RawTags, TopicAtom)
->
% Topic match: strip markdown and assert
string_length(RawText, TextLen),
MaxChars = Opts.max_text_chars,
( TextLen > MaxChars ->
sub_string(RawText, 0, MaxChars, _, TruncText)
;
TruncText = RawText
),
strip_markdown(TruncText, CleanText),
% Title: truncate and store as string (NEVER atom)
MaxTitle = Opts.max_title_chars,
( string_length(RawTitle, TLen), TLen > MaxTitle ->
sub_string(RawTitle, 0, MaxTitle, _, TitleStr)
;
TitleStr = RawTitle
),
(
Opts.dry_run = true ->
true % Dry run: parse and filter but do not assert
;
assertz(tutorial_fact(TopicAtom, TitleStr, CleanText))
),
Asserted is Counts.asserted + 1,
NewCounts = Counts.put(_{processed:Processed, asserted:Asserted})
;
% No matching topic or parse error — discard
% Record is not retained; heap freed at next GC
Rejected is Counts.rejected + 1,
NewCounts = Counts.put(_{processed:Processed, rejected:Rejected})
).
%% find_topic(+Tags, -TopicAtom)
%% Finds the first tag in Tags that matches an infrastructure topic.
%% Uses topic_from_tag_safe/2 — no atoms created from non-matching tags.
find_topic([Tag | _], TopicAtom) :-
topic_from_tag_safe(Tag, TopicAtom), !.
find_topic([_ | Rest], TopicAtom) :-
find_topic(Rest, TopicAtom).
%% ---------------------------------------------------------------------------
%% STREAM PROCESSOR
%% ---------------------------------------------------------------------------
%% process_archive_stream(+Stream, +Opts, -FinalCounts)
%% Iterates over all lines in Stream, processing each record.
%% Calls garbage_collect/0 every Opts.batch_gc_every records.
process_archive_stream(Stream, Opts, FinalCounts) :-
init_counts(InitCounts),
process_stream_loop(Stream, Opts, InitCounts, FinalCounts).
init_counts(counts{processed:0, asserted:0, rejected:0, errors:0}).
process_stream_loop(Stream, Opts, Counts, FinalCounts) :-
read_line_to_string(Stream, Line),
( Line = end_of_file ->
FinalCounts = Counts
;
process_record(Line, Opts, Counts, NewCounts),
% Periodic GC to prevent heap fragmentation during long ingestion
( 0 =:= NewCounts.processed mod Opts.batch_gc_every ->
garbage_collect,
format("[~w] processed: ~w, asserted: ~w~n",
[NewCounts.processed, NewCounts.processed, NewCounts.asserted])
; true ),
process_stream_loop(Stream, Opts, NewCounts, FinalCounts)
).
%% ---------------------------------------------------------------------------
%% ENTRY POINTS
%% ---------------------------------------------------------------------------
%% ingest_archive(+FilePath, -Summary)
%% Main entry point with default options.
ingest_archive(FilePath, Summary) :-
default_opts(Opts),
ingest_archive(FilePath, Opts, Summary).
%% ingest_archive(+FilePath, +Opts, -Summary)
%% Full entry point with configurable options.
ingest_archive(FilePath, Opts, Summary) :-
must_be(atom, FilePath),
must_be(dict, Opts),
get_time(StartTime),
% setup_call_cleanup guarantees stream closure under ALL exit conditions
setup_call_cleanup(
open(FilePath, read, Stream, [encoding(utf8), buffer_size(65536)]),
process_archive_stream(Stream, Opts, Counts),
close(Stream)
),
get_time(EndTime),
ElapsedSecs is EndTime - StartTime,
% Evaluate the rate here: a dict value is stored as written, not computed
Rate is Counts.processed / max(1.0, ElapsedSecs),
Summary = ingest_summary{
file: FilePath,
processed: Counts.processed,
asserted: Counts.asserted,
rejected: Counts.rejected,
elapsed: ElapsedSecs,
rate: Rate
},
% Replace any previous summary
retractall(ingest_summary(_)),
assertz(ingest_summary(Summary)),
format("[DONE] Ingested ~w: ~w records processed, ~w asserted, ~w rejected~n",
[FilePath, Counts.processed, Counts.asserted, Counts.rejected]).
%% topic_fact_count(+Topic, -Count)
%% Returns the number of asserted tutorial_fact/3 clauses for Topic.
topic_fact_count(Topic, Count) :-
infrastructure_topic(Topic),
aggregate_all(count, tutorial_fact(Topic, _, _), Count).
13.3.2 REPL Demonstration
?- use_module('/opt/logic-node/kb/archive/archive_ingestor').
true.
% Dry run: parse and filter without asserting — verify the pipeline
?- default_opts(O),
Opts = O.put(dry_run, true),
ingest_archive('/var/data/archives/wikipedia_en.jsonl', Opts, Summary).
[10000] processed: 10000, asserted: 847
[20000] processed: 20000, asserted: 1614
...
[DONE] Ingested ...wikipedia_en.jsonl: 6724891 records processed,
143221 asserted, 6581670 rejected
Summary = ingest_summary{
processed: 6724891,
asserted: 143221, % 2.1% pass rate — 97.9% discarded by topic filter
rejected: 6581670,
elapsed: 4821.3, % ~80 minutes for 48GB JSONL
rate: 1394.8 % records/second
}.
% Real ingestion run
?- ingest_archive('/var/data/archives/wikipedia_en.jsonl', Summary).
[DONE] ...
Summary = ingest_summary{asserted: 143221, ...}.
% Query the KB
?- tutorial_fact(zfs, Title, _).
Title = "ZFS" ;
Title = "ZFS (Wikipedia)" ;
Title = "zpool" ;
Title = "ZFS on Linux" ;
... % O(1) hash lookup on topic atom 'zfs'
% Count facts per topic
?- topic_fact_count(zfs, N).
N = 847.
?- topic_fact_count(linux, N).
N = 31204.
% Retrieve specific article text
?- tutorial_fact(zfs, "zpool", Text),
sub_string(Text, 0, 200, _, Preview).
Preview = "zpool is the top-level administrative command for ZFS storage pools...".
% Strip markdown on a sample string
?- strip_markdown("## Overview\nZFS uses **copy-on-write** semantics. See [zpool(8)](man/zpool) for details.", Clean).
Clean = " Overview\nZFS uses copy-on-write semantics. See zpool(8) for details.".
13.3.3 Production Stripper: library(pcre) and re_replace/4
The sub_string/5 substitution chain in strip_markdown/2 allocates one intermediate Prolog string per matched token per pass. An 8,192-character article containing 40 bold tokens (**...**), 25 links ([...](...)) and 12 headings produces approximately 77 intermediate string allocations before CleanText is bound. Across 143,221 asserted articles, this is ~11 million transient string allocations. They are not all live at once, but they accumulate as garbage between GC cycles unless batch_gc_every is set aggressively low.
library(pcre) exposes re_replace/4, which delegates the entire substitution to the C-level PCRE2 engine. The substitution happens in a single C function call on the raw string bytes. No intermediate Prolog strings are allocated. The result is one new Prolog string per re_replace/4 call — the final output — regardless of how many tokens were matched.
%% strip_markdown_pcre(+RawText, -CleanText)
%% Production variant: uses library(pcre) re_replace/4 for C-level substitution.
%%
%% Each re_replace/4 call:
%% - Runs entirely in C on the raw string bytes
%% - Allocates exactly one output Prolog string
%% - Creates zero intermediate heap allocations for matched tokens
%%
%% Compared to the sub_string/5 chain: ~8× fewer string allocations per article.
%% At 143k articles: ~9.5M fewer transient heap objects, so one GC cycle
%% per N records instead of several.

%% Load library(pcre) only where it is installed, so that the portable
%% fallback stays reachable on systems without libpcre2.
:- if(exists_source(library(pcre))).
:- use_module(library(pcre)).
pcre_available.
:- else.
pcre_available :- fail.
:- endif.

strip_markdown_pcre(Raw, Clean) :-
    string(Raw),
    % Images: ![alt](url) → "" (discard entirely)
    re_replace("!\\[([^\\]]*)\\]\\([^)]*\\)"/g, "", Raw, S1),
    % Links: [text](url) → text (keep anchor text)
    re_replace("\\[([^\\]]*)\\]\\([^)]*\\)"/g, "$1", S1, S2),
    % Bold: **text** or __text__ → text
    re_replace("\\*\\*([^*]*)\\*\\*"/g, "$1", S2, S3),
    re_replace("__([^_]*)__"/g, "$1", S3, S4),
    % Italic: *text* or _text_ → text (after bold to avoid partial matches)
    re_replace("\\*([^*]*)\\*"/g, "$1", S4, S5),
    re_replace("_([^_]*)_"/g, "$1", S5, S6),
    % Inline code: `code` → code
    re_replace("`([^`]*)`"/g, "$1", S6, S7),
    % ATX headings: ^#{1,6} → "" (remove marker, keep heading text)
    re_replace("^#{1,6} "/mg, "", S7, S8),
    % Collapse 3+ newlines to 2
    re_replace("\\n{3,}"/g, "\n\n", S8, Clean).

%% Dispatch: use PCRE if it was loaded, otherwise the sub_string/5 chain
strip_markdown(Raw, Clean) :-
    (   pcre_available
    ->  strip_markdown_pcre(Raw, Clean)
    ;   strip_markdown_portable(Raw, Clean)   % sub_string/5 chain (portable)
    ).
The PCRE patterns are applied in the same order as the sub_string/5 chain: images before links (to prevent partial ![ matches being processed as links), bold before italic (to prevent ** being matched as two * tokens). The "/g" flag enables global replacement, so all occurrences are replaced in one pass. The "/mg" flag on the heading pattern additionally enables multiline mode so ^ matches at each line start rather than only at the string start.
Whether library(pcre) is available is established once, when the file is loaded, and strip_markdown/2 dispatches on the result. Loading the library unconditionally would raise an error on systems without it and make the fallback unreachable. Systems without libpcre2 fall back to the portable implementation with no API change.
13.4 JIT Indexing for Massive Datasets
13.4.1 How the WAM Indexes Dynamic Predicates
SWI-Prolog's clause database uses a two-level indexing scheme for dynamic predicates:
Level 1: First-argument index (JIT):
When the number of clauses for a predicate exceeds the indexing_threshold (default: 10), the WAM's JIT indexer builds a hash table on the first argument. Each distinct first-argument value maps to the list of clauses sharing that value. A call with a bound first argument requires one hash lookup to find the candidate clause list, then matches within that list linearly.
Level 2: Additional JIT indexes:
For call patterns that the first-argument index does not serve well, SWI-Prolog can build further hash tables on demand, on another argument or on a combination of arguments. These are created automatically at call time, when the indexer judges an argument selective enough; no directive or library import is required. Because their creation depends on the observed call pattern and on clause statistics, the cost model in this chapter relies only on the first-argument index and treats every other access path as a linear scan.
For tutorial_fact(Topic, Title, Text) with 143,221 asserted clauses:
% After ingestion — JIT index is built automatically
?- predicate_property(tutorial_fact(_,_,_), number_of_clauses(N)).
N = 143221.
% Query performance comparison:
%
% tutorial_fact(zfs, _, _) — Topic bound, first argument
% → Hash lookup: O(1) to find zfs bucket
% → Linear scan within bucket: O(N_zfs) where N_zfs = 847
% → Total: effectively O(1) for topic-scoped queries
%
% tutorial_fact(_, "zpool", _) — Title bound, second argument only
% → No first-argument index applies
% → Full linear scan: O(143,221) clauses
% → Avoid in hot paths — use topic-scoped queries
% Confirm index structure
?- predicate_property(tutorial_fact(_,_,_), indexed([1])).
true. % First argument is indexed
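The cost difference between the two call patterns can be measured rather than assumed. The sketch below is illustrative only: bench_fact/3, populate_bench/1 and timed_solutions/3 are hypothetical names, and the synthetic facts stand in for tutorial_fact/3 so that the real KB is not touched.

```prolog
:- use_module(library(lists)).
:- use_module(library(aggregate)).

:- dynamic bench_fact/3.
:- meta_predicate timed_solutions(0, -, -).

%% populate_bench(+N)
%% Asserts N synthetic facts spread evenly over ten fixed topic atoms.
populate_bench(N) :-
    retractall(bench_fact(_, _, _)),
    Topics = [zfs, linux, bsd, tls, storage, security, networking,
              filesystem, cryptography, kubernetes],
    forall(between(1, N, I),
           (   Idx is I mod 10,
               nth0(Idx, Topics, Topic),
               format(string(Title), "title-~d", [I]),
               assertz(bench_fact(Topic, Title, "text"))
           )).

%% timed_solutions(:Goal, -Count, -Seconds)
%% Enumerates every solution of Goal and reports the solution count
%% together with the CPU time the enumeration used.
timed_solutions(Goal, Count, Seconds) :-
    statistics(cputime, T0),
    aggregate_all(count, Goal, Count),
    statistics(cputime, T1),
    Seconds is T1 - T0.
```

After populate_bench(100000), compare timed_solutions(bench_fact(zfs, _, _), C1, S1) with timed_solutions(bench_fact(_, "title-500", _), C2, S2). The timings depend on the machine and on the SWI-Prolog release, so run both on the target host before fixing a schema.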
13.4.2 Argument Ordering as Performance Specification
The tutorial_fact/3 schema is tutorial_fact(Topic, Title, Text). This ordering is the performance contract:
% FAST — first argument bound: O(1) hash lookup
tutorial_fact(zfs, Title, Text)
% FAST — first and second bound: O(1) hash lookup, O(N_topic) title scan
tutorial_fact(zfs, "zpool", Text)
% SLOW — only second argument bound: O(N_total) scan
tutorial_fact(_, "zpool", Text)
% VERY SLOW — only third argument bound: O(N_total) scan + text comparison
tutorial_fact(_, _, Text) % Text comparison on 143k string values
A schema with arguments in a different order would change the performance profile:
% Alternative schema: tutorial_fact(Title, Topic, Text)
% — Title-first would make title lookups O(1) but topic queries O(N)
% — Title uniqueness is high (most titles are distinct) — small bucket sizes
% — But topic queries are the hot path for the RAG pipeline
% — Topic-first is correct for the access pattern
The decision is documented as a constraint, not left implicit:
%% ---------------------------------------------------------------------------
%% INDEX CONTRACT
%% tutorial_fact/3 argument order is fixed by query performance requirements.
%%
%% Hot query paths (must be O(1)):
%% tutorial_fact(+Topic, _, -) — retrieve all facts for a topic
%% tutorial_fact(+Topic, +Title, -) — retrieve specific article by topic+title
%%
%% Cold query paths (acceptable O(N) — only used during offline maintenance):
%% tutorial_fact(_, +Title, -) — find topic for a given title
%% tutorial_fact(_, _, +Text) — full-text search (use a separate index)
%%
%% DO NOT change the argument order without re-evaluating every query predicate
%% in the codebase that calls tutorial_fact/3.
%% ---------------------------------------------------------------------------
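One way to make the contract enforceable rather than advisory is to route hot-path callers through an accessor that rejects an unbound topic. This is a sketch under that assumption: topic_fact/3 is a hypothetical name, and the dynamic declaration is repeated here only so the fragment loads on its own.

```prolog
:- use_module(library(error)).

:- dynamic tutorial_fact/3.

%% topic_fact(+Topic, ?Title, -Text)
%% Hot-path accessor: refuses to run unless Topic is a bound atom, so a
%% caller cannot fall onto the O(N) cold path by accident.
topic_fact(Topic, Title, Text) :-
    must_be(atom, Topic),       % throws if Topic is unbound or not an atom
    tutorial_fact(Topic, Title, Text).
```

A call such as topic_fact(_, "zpool", Text) raises an instantiation error instead of silently scanning the whole KB; offline maintenance code that needs the cold paths keeps calling tutorial_fact/3 directly.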
13.4.3 Forcing the JIT Index and Verifying Hash Distribution
%% force_index_rebuild/0
%% Forces the JIT indexer to rebuild the tutorial_fact/3 index.
%% Call after bulk assertz operations to ensure the index reflects
%% the final clause count rather than the threshold-triggered partial index.
force_index_rebuild :-
functor(F, tutorial_fact, 3),
predicate_property(F, dynamic),
% Accessing the predicate with a bound first argument triggers index build
( tutorial_fact(zfs, _, _) -> true ; true ),
format("[OK] tutorial_fact/3 index rebuilt~n").
%% verify_index_distribution/0
%% Reports fact count per topic to verify balanced hash distribution.
%% A heavily skewed distribution (one topic dominates) is normal for Wikipedia
%% but worth monitoring — very large buckets slow within-bucket linear scans.
verify_index_distribution :-
forall(
infrastructure_topic(Topic),
(
topic_fact_count(Topic, Count),
format(" ~w: ~w facts~n", [Topic, Count])
)
).
?- verify_index_distribution.
linux: 31204 facts
networking: 18847 facts
security: 16203 facts
storage: 9841 facts
filesystem: 8934 facts
cryptography: 7612 facts
bsd: 6891 facts
tls: 5234 facts
kubernetes: 4821 facts
zfs: 847 facts
...
true.
% The linux bucket (31,204 entries) leaves a long candidate list to scan.
% For hot linux queries, bind Title as well as Topic, or maintain a separate
% title-keyed lookup predicate as described in Exercise 13.4.
13.5 Security: Heap Saturation and Triage
13.5.1 The 50GB Wikipedia Dump — Memory Arithmetic
Removing the topic filter changes the memory requirement by almost two orders of magnitude. The filtered run is shown first, then the case where all 6,700,000 articles are asserted:
143,221 infrastructure articles asserted in the filtered run:
  Average CleanText length: 4,200 chars
  Average TitleStr length: 48 chars
  Per fact memory:
    tutorial_fact/3 clause header: ~64 bytes
    Topic atom reference: 8 bytes (pointer to pre-interned atom)
    Title string: 48 bytes + overhead = ~80 bytes
    CleanText string: 4,200 bytes + overhead = ~4,232 bytes
    Total per fact: ~4,384 bytes
  Total for 143,221 facts: ~628 MB ← acceptable
All 6,700,000 articles asserted (no filter):
  Average raw text length: 7,200 chars (unstripped)
  Per fact memory: ~7,680 bytes (larger raw text)
  Total for 6,700,000 facts: ~51.5 GB ← OOM kill
At 51.5GB of clause data, before any transient parsing allocations are counted, the unfiltered run does not fit in the memory of a 32GB node and leaves little headroom even with 64GB. A process capped at 32GB, for example, would get through roughly 62% of the file (32 ÷ 51.5) before being OOM-killed, leaving a partial KB in an inconsistent state: some topics have all their articles, others have none.
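The arithmetic above is easy to re-run for a different corpus. A minimal sketch, with per_fact_bytes/3 and kb_bytes/3 as hypothetical helper names; the constants (64-byte clause header, 8-byte atom reference, 32 bytes of overhead per string) are the estimates used in the table, not measured values.

```prolog
%% per_fact_bytes(+TitleChars, +TextChars, -Bytes)
%% Clause header (64) + topic atom reference (8) + two strings, each
%% carrying 32 bytes of overhead on top of its characters.
per_fact_bytes(TitleChars, TextChars, Bytes) :-
    Bytes is 64 + 8 + (TitleChars + 32) + (TextChars + 32).

%% kb_bytes(+Facts, +BytesPerFact, -Total)
%% Total retained bytes for Facts clauses of the given average size.
kb_bytes(Facts, BytesPerFact, Total) :-
    Total is Facts * BytesPerFact.

% ?- per_fact_bytes(48, 4200, B), kb_bytes(143221, B, Total).
% B = 4384, Total = 627880864.          % ~628 MB
%
% ?- kb_bytes(6700000, 7680, Total).
% Total = 51456000000.                  % ~51.5 GB
```

Replacing the averages with figures sampled from a dry run gives a budget for the actual archive before any fact is asserted.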
If the article title is stored as an atom:
  6,700,000 unique article titles
  × average 48 bytes per atom
  = 321.6 MB of Atom Table growth, on top of the clause storage
An atom referenced by an asserted clause cannot be reclaimed by atom GC
while that clause exists, so this memory is held for the lifetime of the KB.
Retracting the facts does not release it immediately either: the atoms stay
in the table until a later atom GC pass finds them unreferenced.
A title stored as a string carries no such cost. It is freed together with
the clause that holds it.
13.5.2 Pre-Ingestion Environment Configuration
Two environment-level settings should be established before any full-corpus ingestion run. Neither affects the correctness of the current single-threaded loop; both are forward-looking preparations that prevent architectural rework in Volume III.
SWIPL_WORKER_THREADS: thread pool sizing.
SWI-Prolog's internal thread pool, used by library(thread) and the concurrent execution primitives introduced in Chapter 14, is sized by SWIPL_WORKER_THREADS at startup. The default is 4. Setting it to the number of physical CPU cores at process start allows the Volume III concurrent ingestion architecture (where separate threads handle JSONL reading, JSON parsing, topic filtering, and assertz) to saturate the CPU without spawning threads beyond the hardware's parallel capacity.
# Set before launching the Logic Node: add to the [Service] section of
# /etc/systemd/system/logic-node.service
logicadmin@logic-node-01:~$ grep -A5 "\[Service\]" /etc/systemd/system/logic-node.service
[Service]
# Sized to the core count of this host
Environment="SWIPL_WORKER_THREADS=16"
Environment="SWIPL_STACK_LIMIT=2G"
ExecStart=/usr/bin/swipl -g main -t halt /opt/logic-node/main.pl
...
# cpu_count reports the CPUs SWI-Prolog detects on this host, not the pool
# size; use it to cross-check the value set above:
?- current_prolog_flag(cpu_count, N).
N = 16.
The current process_stream_loop/4 is single-threaded, so SWIPL_WORKER_THREADS has no effect on its execution. The value is recorded in the service unit now so that the Chapter 14 concurrent rewrite does not require a systemd unit file change under operational conditions.
ZFS archive dataset: recordsize=1M for sequential read workloads.
The JSONL archive at /var/data/archives/ is a pure sequential read workload: the ingestion loop reads each line once, top-to-bottom, with no random access, no seeks, and no writes. The ZFS default recordsize=128k is a general-purpose compromise, not a setting tuned for large sequential reads. For a 48GB sequential file, recordsize=128k requires 393,216 block reads (48GB ÷ 128KB). At recordsize=1M, the same file requires 49,152 block reads: an 8× reduction in block I/O operations and ARC pressure.
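The block counts can be reproduced at the toplevel. This is a small sketch with a hypothetical helper, block_reads/3; it assumes "48GB" means 48 × 1024³ bytes, as the figures above imply, and it counts logical records, ignoring the effect of compression.

```prolog
%% block_reads(+FileBytes, +RecordBytes, -Reads)
%% Number of records a single sequential pass must fetch, rounded up.
block_reads(FileBytes, RecordBytes, Reads) :-
    Reads is (FileBytes + RecordBytes - 1) // RecordBytes.

% ?- File is 48 * 1024**3,
%    R128 is 128 * 1024, block_reads(File, R128, A),
%    R1M  is 1024 * 1024, block_reads(File, R1M, B),
%    Ratio is A // B.
% A = 393216, B = 49152, Ratio = 8.
```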
# Create a dedicated ZFS dataset for the archive store.
#   recordsize=1M          8× fewer block reads than the 128k default for a 48GB sequential file
#   compression=lz4        JSONL text compresses roughly 3:1, so 48GB is stored as ~16GB
#   atime=off              no access-time updates during reads, which eliminates metadata writes
#   primarycache=metadata  ARC caches metadata only; data blocks stream through uncached
# (Comments cannot follow a trailing backslash, so they are listed here.)
logicadmin@logic-node-01:~$ zfs create \
    -o recordsize=1M \
    -o compression=lz4 \
    -o atime=off \
    -o primarycache=metadata \
    data-pve-node-01/archives
logicadmin@logic-node-01:~$ zfs set mountpoint=/var/data/archives \
    data-pve-node-01/archives
# Verify the settings took effect
logicadmin@logic-node-01:~$ zfs get recordsize,compression,atime,primarycache \
data-pve-node-01/archives
NAME PROPERTY VALUE SOURCE
data-pve-node-01/archives recordsize 1M local
data-pve-node-01/archives compression lz4 local
data-pve-node-01/archives atime off local
data-pve-node-01/archives primarycache metadata local
The primarycache=metadata setting deserves attention. During ingestion, the JSONL file is read sequentially, so the ARC's block caching provides no benefit (each block is read once and never re-read). Caching the JSONL data blocks would evict blocks belonging to other datasets and let the ARC grow at the expense of memory the Prolog process needs for its heap, degrading query performance while the ingestion is running. primarycache=metadata retains the ZFS metadata (directory entries, attribute records) in the ARC but lets the data blocks stream through without occupying ARC space.
After ingestion completes and the JSONL file is no longer being read, the primarycache setting can be restored to all (the default) if the archive is also used for other read workloads.
13.5.3 The Three-Layer Triage Strategy
The ingestion pipeline enforces three sequential filters before a record reaches assertz:
Filter 1: Topic whitelist. find_topic/2 checks article tags against infrastructure_topic_string/2. Non-matching records are discarded after Dict parsing. At a 2.1% pass rate, this eliminates 97.9% of records before any text processing occurs.
Filter 2: Text truncation. CleanText is bounded at max_text_chars (default 8,192). An article on "Linux kernel" may be 80,000 characters; only the first 8,192 characters are retained. For KB query purposes, the first 8,192 characters of a well-structured Wikipedia article contain the definition, primary properties, and key commands, while the tail is typically references and bibliography.
Filter 3: Markdown stripping. The raw text contains **bold**, [links](urls), # headings, and ![images](src). Stripping these removes approximately 15–20% of the character count. The stripped text is more compact and contains no URL strings that would otherwise be allocated as large string objects.
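The first two filters can be sketched on a single parsed record as follows. Everything here is illustrative: sketch_topic/2 is a hypothetical stand-in for infrastructure_topic_string/2, triage/3 and truncate_text/3 are hypothetical names, and Markdown stripping is left out because it is covered by strip_markdown/2.

```prolog
:- use_module(library(lists)).

%% sketch_topic(+TagString, -TopicAtom)
%% Static whitelist table: topic atoms exist at load time, so no atom
%% is ever created from archive data.
sketch_topic("zfs", zfs).
sketch_topic("storage", storage).
sketch_topic("linux", linux).

%% triage(+Record, +MaxChars, -Fact)
%% Fails (record rejected) unless one of the record's tags is whitelisted.
triage(Record, MaxChars, fact(Topic, Title, Text)) :-
    get_dict(tags, Record, Tags),
    member(Tag, Tags),
    sketch_topic(Tag, Topic),
    !,                                   % first matching tag wins
    get_dict(title, Record, Title),
    get_dict(text, Record, Raw),
    truncate_text(Raw, MaxChars, Text).

%% truncate_text(+Raw, +MaxChars, -Text)
%% Keeps at most the first MaxChars characters of Raw.
truncate_text(Raw, MaxChars, Text) :-
    string_length(Raw, Len),
    (   Len =< MaxChars
    ->  Text = Raw
    ;   sub_string(Raw, 0, MaxChars, _, Text)
    ).
```

The whitelist test comes first because it is the cheapest check and rejects most records; the text is only touched for records that survive it.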
%% memory_budget_check/0
%% Verifies that the current heap usage is within acceptable bounds
%% before beginning a full ingestion run.
%% Prevents starting a run that cannot complete.
memory_budget_check :-
statistics(heapused, HeapUsed),
statistics(heaplimit, HeapLimit),
HeapFreeBytes is HeapLimit - HeapUsed,
% Require at least 2GB free for the ingestion run
RequiredBytes is 2 * 1024 * 1024 * 1024,
( HeapFreeBytes >= RequiredBytes ->
format("[OK] Heap budget: ~w MB free (~w MB limit)~n",
[HeapFreeBytes // 1048576, HeapLimit // 1048576])
;
FreeMB is HeapFreeBytes // 1048576,
RequiredMB is RequiredBytes // 1048576,
throw(error(
insufficient_heap(FreeMB, RequiredMB),
context(memory_budget_check/0,
'Insufficient heap for ingestion run — free space below 2GB minimum')
))
).
13.5.4 Periodic GC and the Steady-State Profile
The streaming loop calls garbage_collect/0 every batch_gc_every records (default: 10,000). Without periodic GC:
Each iteration allocates:
one Line string: ~7,200 bytes
one RawDict Dict: ~3,200 bytes
one CleanText string: ~4,200 bytes (for matching records)
intermediate strip strings: ~2,000 bytes
At 10,000 records without GC:
Peak transient heap: ~166 MB (10,000 × 16,600 bytes average)
With GC every 10,000 records:
GC reclaims all transient allocations from rejected records
GC reclaims intermediate strip strings
Only asserted tutorial_fact/3 clauses survive
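Retained growth per batch: ~0.9 MB at the 2.1% pass rate (~213 facts × 4,384 bytes)
Upper bound if every record in the batch is asserted: ~43 MB (10,000 × 4,384 bytes)
Plus a few MB of transient data for the record currently being processed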
The batch_gc_every parameter is tunable. On a server with 64GB RAM allocated to the Logic Node, batch_gc_every = 50000 is appropriate — fewer GC pauses, same memory safety. On a server with 8GB allocated, batch_gc_every = 5000 reduces peak transient heap.
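The shape of a loop with a periodic collection point looks roughly like this. It is a reduced sketch, not the chapter's process_stream_loop/4: count_lines_with_gc/3 is a hypothetical name, and it only counts lines where the real loop would parse and filter each one.

```prolog
:- use_module(library(readutil)).

%% count_lines_with_gc(+Stream, +Every, -Count)
%% Reads Stream line by line and forces a garbage collection after
%% every Every lines.
count_lines_with_gc(Stream, Every, Count) :-
    count_lines_with_gc_(Stream, Every, 0, Count).

count_lines_with_gc_(Stream, Every, N0, Count) :-
    read_line_to_string(Stream, Line),
    (   Line == end_of_file
    ->  Count = N0
    ;   N1 is N0 + 1,
        (   N1 mod Every =:= 0
        ->  garbage_collect      % reclaim the lines read since the last pass
        ;   true
        ),
        count_lines_with_gc_(Stream, Every, N1, Count)
    ).
```

The recursive call is the last goal of the clause, so the loop runs in constant stack space; the Every argument plays the role of batch_gc_every.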
13.5.5 The Atom Table Discipline Under Ingestion
The ingestion pipeline has one code path where atom creation from external data is unavoidable: atom_json_dict/3 itself interns JSON object keys as atoms. For a record {"title":"ZFS","tags":["storage"],"text":"..."}, the keys title, tags, and text become atoms. These three atoms are already interned at load time (they appear in process_record/4 as literal atoms title, tags, text). No new atoms are created from keys, provided every record carries only these three keys; a record with additional keys would intern one atom per previously unseen key name.
Values are strings in the Dict — atom_json_dict/3 with the default options produces string values, not atoms. The title value "ZFS" is a Prolog string. It becomes a tutorial_fact/3 argument as a string. It is never interned.
The topic atom (zfs, linux, etc.) is produced by infrastructure_topic_string/2 — a static table that maps a string to a pre-interned atom. No new atom is created.
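The claim about value types can be checked directly on one line of input. A minimal sketch, with record_value_types/3 as a hypothetical helper name; it assumes atom_json_dict/3 is called with an empty option list, as described above.

```prolog
:- use_module(library(http/json)).
:- use_module(library(lists)).

%% record_value_types(+Line, -TitleIsString, -TagsAreStrings)
%% Parses one JSONL line and reports whether the title and every tag
%% arrived as Prolog strings rather than atoms.
record_value_types(Line, TitleIsString, TagsAreStrings) :-
    atom_json_dict(Line, Dict, []),
    get_dict(title, Dict, Title),
    get_dict(tags, Dict, Tags),
    (   string(Title)
    ->  TitleIsString = true
    ;   TitleIsString = false
    ),
    (   forall(member(Tag, Tags), string(Tag))
    ->  TagsAreStrings = true
    ;   TagsAreStrings = false
    ).
```

For the sample record {"title":"ZFS","tags":["storage"],"text":"..."} both flags should come back true; a false flag means an option such as value_string_as_atom(true) has been switched on somewhere upstream.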
% Verify atom discipline during ingestion
:- meta_predicate with_atom_count_check(0).
with_atom_count_check(Goal) :-
aggregate_all(count, current_atom(_), Before),
call(Goal),
aggregate_all(count, current_atom(_), After),
Delta is After - Before,
( Delta =:= 0 ->
format("[PASS] Atom Table stable: 0 new atoms from ingestion~n")
;
format("[WARN] ~w new atoms created during ingestion~n", [Delta])
).
% Usage:
?- with_atom_count_check(
ingest_archive('/var/data/archives/test_100.jsonl', _)).
[PASS] Atom Table stable: 0 new atoms from ingestion
true.
13.6 Outcome: The Local Knowledge Graph
13.6.1 The Conceptual Transition
Volume II opened with Chapter 10's assertion that the Logic Node must parse external text into typed Prolog terms or reject it cleanly. Chapter 13 closes that loop: the largest possible external text corpus — the entire Wikipedia archive — has been ingested, filtered to infrastructure-relevant content, and transformed into 143,221 structured tutorial_fact/3 facts that are queryable in O(1) time, consume under 700MB of heap, and require zero network connectivity to access.
The Logic Node now has two knowledge sources: the operator-authored KB from Volume I (structured, authoritative, small) and the archive-derived KB from Chapter 13 (encyclopedic, descriptive, large). Volume VI will combine them — the operator KB provides the authoritative state of the infrastructure, the archive KB provides the explanatory context that makes the Logic Node's answers comprehensible to engineers who are not already expert in every component they operate.
| Cloud-dependent AI knowledge | Local Knowledge Graph |
|---|---|
| Available only with network connectivity | Available air-gapped — no network required |
| Latency: 50–500ms per API call | Latency: O(1) hash lookup, < 1ms |
| Cost: per-token API charges accumulate | Cost: one-time ingestion run; zero query cost |
| Data leaves the infrastructure perimeter | Data never leaves the machine |
| Subject to provider rate limits and outages | Subject only to local heap availability |
| Cannot be extended with custom documentation | Extend by adding JSONL records to the archive |
| Article content may change without notice | Static dump — content is frozen at ingest time |
13.6.2 Verification Checklist
?- use_module('/opt/logic-node/kb/archive/archive_ingestor').
true.
% 1. setup_call_cleanup/3 guarantees stream closure
?- setup_call_cleanup(
open('/dev/null', read, S),
true, % Goal that succeeds immediately
close(S)).
true. % ✓ Stream opened and closed in one call
% 2. Topic filter rejects non-infrastructure topics
?- find_topic(["history", "medieval", "france"], _Topic).
false. % ✓ No matching topic — record discarded
?- find_topic(["storage", "linux", "filesystem"], Topic).
Topic = storage. % ✓ First matching tag: 'storage' atom
% 3. No atoms created from unknown tags
% find_topic/2 fails for non-matching tags, so it is wrapped in an
% if-then-else to let the query reach the second atom count.
?- aggregate_all(count, current_atom(_), N1),
   ( find_topic(["history", "medieval"], _) -> true ; true ),
   aggregate_all(count, current_atom(_), N2),
   Delta is N2 - N1,
   Delta =:= 0.
true. % ✓ No atoms created from non-matching tag strings
% 4. Markdown stripping produces clean text
?- strip_markdown("**ZFS** uses _copy-on-write_ semantics.", Clean).
Clean = "ZFS uses copy-on-write semantics.". % ✓
?- strip_markdown("See [zpool(8)](man:zpool) for details.", Clean).
Clean = "See zpool(8) for details.". % ✓
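% 5. Text is bounded at max_text_chars
% (assumes process_record/4 takes one raw JSONL line, as in the call below,
%  and that the "zfs" tag passes the topic filter)
?- default_opts(O),
   length(Codes, 20000), maplist(=(0'x), Codes),    % 20,000-character body
   string_codes(Body, Codes),
   atom_json_dict(Line, _{title:"LongTest", tags:["zfs"], text:Body}, [as(string)]),
   process_record(Line, O, counts{processed:0,asserted:0,rejected:0,errors:0}, _),
   tutorial_fact(zfs, "LongTest", Text),
   string_length(Text, Len),
   Len =< O.max_text_chars.
% ✓ when the query succeeds; afterwards remove the test fact with
%   retractall(tutorial_fact(zfs, "LongTest", _)).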
% 6. tutorial_fact/3 first-argument index is active
?- predicate_property(tutorial_fact(_,_,_), indexed([1])).
true. % ✓ JIT index on arg 1
% 7. O(1) topic query — timing check
?- get_time(T1),
findall(Title, tutorial_fact(zfs, Title, _), Titles),
get_time(T2),
Elapsed is T2 - T1,
Elapsed < 0.01. % ZFS bucket: 847 facts, should complete in < 10ms
true. % ✓
% 8. ingest_summary/1 is updated after ingestion
?- ingest_summary(S), S.asserted > 0.
true. % ✓ Summary persists after run
% 9. Atom Table stable across ingestion
?- with_atom_count_check(
ingest_archive('/var/data/archives/test_1000.jsonl', _)).
[PASS] Atom Table stable: 0 new atoms from ingestion
true.
% 10. memory_budget_check/0 — heap free before run
?- memory_budget_check.
[OK] Heap budget: 58241 MB free (65536 MB limit)
true.
13.6.3 Volume II Closure
Volume II established the complete Knowledge Acquisition pipeline for the Logic Node:
Chapter 10 — DCGs as sovereign parsing primitives. Difference lists, compiler expansion, semantic actions, network address parsers, the parsing trust boundary. The Logic Node gained the ability to transform untrusted character streams into typed Prolog terms without Atom Table side effects.
Chapter 11 — Streaming system logs. library(pure_input), phrase_from_file/2, error recovery, auth_event{} Dicts, brute-force detection as a declarative rule. The Logic Node gained the ability to reason over infinite text streams in O(1) RAM.
Chapter 12 — Bidirectional configuration management. Reversible DCGs, diff-friendly ASTs, nginx_reversible.pl, HA compliance verification, pre-write proofs. The Logic Node gained the ability to mutate the external world from internal logic, with syntactic validity guaranteed by construction.
Chapter 13 — Offline archive ingestion. setup_call_cleanup/3, JSONL streaming, Markdown stripping, JIT indexing, heap budget management. The Logic Node gained encyclopedic knowledge of its subject domain — available air-gapped, at O(1) query latency, with zero per-query cost.
Volume III opens with Chapter 14: Concurrent Logic — thread_create/3, thread_send_message/2, the concurrent query handler, and the mutex-protected write protocol for nginx_write_file/2. The Knowledge Acquisition infrastructure built in Volume II will run as background threads: the log ingestor streaming auth events, the archive query handler serving RAG lookups, and the config mutation pipeline executing under with_mutex/2 to prevent concurrent write conflicts on the same config file.
Exercises
Exercise 13.1 — setup_call_catcher_cleanup/4 Audit Wrapper
Extend ingest_archive/3 to use setup_call_catcher_cleanup/4 instead of setup_call_cleanup/3. The Catcher argument should be logged to /opt/logic-node/logs/ingest_audit.log with a timestamp (Unix epoch float from get_time/1), the FilePath, and the exit condition. Verify that the audit log entry is written when: (a) ingestion completes normally, (b) ingestion is interrupted by ^C (SIGINT converted to Prolog exception), and (c) a malformed JSONL file causes atom_json_dict/3 to throw on the first line.
Exercise 13.2 — Full CommonMark Heading Stripper
The strip_atx_headings/2 implementation strips only one leading # character. ATX headings can have up to six # characters (######). Write strip_atx_headings_full/2 as a proper DCG over a line's character codes that matches any sequence of 1–6 # codes at line start, followed by at least one space, and replaces the ###... prefix with nothing. The DCG should handle all six heading levels and be bidirectional: the generator should round-trip a stripped heading back to the #-prefixed form if the heading level is stored as a metadata integer in the AST.
Exercise 13.3 — Incremental Archive Update
ingest_archive/3 always starts from the beginning of the JSONL file. For a daily Wikipedia dump update that adds only new articles as appended lines, re-processing the entire file wastes hours. Implement ingest_archive_incremental/3 that reads and stores the byte offset of the last successfully processed line (via stream_property(Stream, position(Pos)) and seek/4) in a checkpoint file /opt/logic-node/kb/archive/checkpoint.pl. On subsequent runs, seek/4 positions the stream at the checkpoint before processing begins. Verify that running the incremental ingestor on the same file twice produces the same tutorial_fact/3 count as one full run.
Exercise 13.4 — Secondary Index for Title Lookup
tutorial_fact(_, "zpool", _) requires an O(143k) full scan. Add a secondary index predicate tutorial_title_index(Title, Topic) that is asserted alongside each tutorial_fact/3 during ingestion and allows O(1) title lookup:
tutorial_title_index("zpool", zfs). % title → topic
% then: tutorial_fact(Topic, "zpool", Text) via two O(1) lookups
Measure the ingestion time overhead of asserting the secondary index alongside the primary fact. Verify that tutorial_fact_by_title/3:
%% tutorial_fact_by_title(+Title, -Topic, -Text)
tutorial_fact_by_title(Title, Topic, Text) :-
    tutorial_title_index(Title, Topic),
    tutorial_fact(Topic, Title, Text).
executes in O(1) time regardless of KB size.
Exercise 13.5 — KB Persistence Across Restarts
tutorial_fact/3 facts are lost when the SWI-Prolog process exits. Implement save_kb/1 and load_kb/1:
save_kb(+FilePath) % Writes all tutorial_fact/3 clauses to FilePath as Prolog terms
load_kb(+FilePath) % Reads and asserts all terms from FilePath
Use with_output_to/2 and write_canonical/1 for the save, and read_term/3 in a setup_call_cleanup/3 loop for the load. Verify that save_kb/load_kb round-trips all 143,221 facts with identical Topic, Title, and CleanText values. Compare the load time from a saved .pl file against re-running ingest_archive/3 from the JSONL source — the saved file should load in seconds, not hours.
Further Reading
- SWI-Prolog Manual: setup_call_cleanup/3, https://www.swi-prolog.org/pldoc/man?predicate=setup_call_cleanup/3 (definitive specification of the WAM-level cleanup guarantee)
- SWI-Prolog Manual: setup_call_catcher_cleanup/4, https://www.swi-prolog.org/pldoc/man?predicate=setup_call_catcher_cleanup/4
- SWI-Prolog Manual: atom_json_dict/3, https://www.swi-prolog.org/pldoc/man?predicate=atom_json_dict/3 (JSON ↔ Dict conversion, value type handling)
- SWI-Prolog Manual: Dynamic predicates and indexing, https://www.swi-prolog.org/pldoc/man?section=dynamic-index (JIT indexer behaviour, index/1 directive, deep indexing)
- Kiwix / OpenZIM: ZIM file format specification, https://wiki.openzim.org/wiki/ZIM_file_format (binary format for offline Wikipedia archives)
- Kiwix: zimdump utility, https://github.com/openzim/zim-tools (ZIM → JSONL extraction tool used in this chapter's pipeline)
- JSONL specification, https://jsonlines.org/ (minimal spec for newline-delimited JSON)
- MacKay, D.J.C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press. Appendix C. Available free at http://www.inference.org.uk/mackay/itila/ (offline-first reference distribution model)
- Baader, F. & Nipkow, T. (1998). Term Rewriting and All That. Cambridge University Press. Chapter 4: term indexing strategies relevant to the WAM's first-argument hash scheme.