
Data Acquisition

The Mechanics of Acquisition: Targeted Ingestion and Tool Selection

Establishing a 30TB "Silo" requires a disciplined approach to how data enters the network. In an environment where the open web is being progressively walled off, the "where" and "how" of data acquisition become as critical as the storage itself. We move away from the erratic nature of live browsing toward a structured, automated ingestion pipeline that utilizes high-speed mirrors and specialized archival tools. The objective is to secure the highest-fidelity copy of information with the least amount of digital "noise," ensuring that the final archive remains a lean, searchable asset rather than an unmanageable data swamp.

High-Speed Mirrors and the Kiwix Infrastructure

For foundational datasets like Wikipedia, the most efficient method of ingestion is through the Kiwix mirror network. Rather than attempting to scrape millions of individual live pages—which is slow, prone to errors, and likely to trigger rate-limiting—we download pre-packaged ZIM files from dedicated high-bandwidth mirrors. These mirrors, such as mirrors.dotsrc.org or your.org, provide the 2025 snapshots in a single, compressed format that includes all text, images, and internal metadata.

By using mirrors located in regions with robust backbone connectivity (such as Denmark or North America), we can maximize the rate at which data lands on the 1TB SSD scratch disk on Pear. These mirrors act as the "Wholesale Distribution Centers" of the digital world: the transfer still runs over HTTP or rsync, but a single request for a 100GB file replaces the per-request overhead of fetching millions of small pages, so verified human knowledge moves directly into our silo in large blocks. This is the first step in creating a "Temporal Baseline": securing a bit-perfect copy of the consensus web before it is subjected to the recursive erosion of the flattening.

Precision Archiving with wget: The Surgeon’s Tool

While Kiwix handles the "macro" level of general knowledge, wget serves as the primary tool for targeted, site-specific archiving. This command-line utility is indispensable for its ability to perform "Recursive Retrieval"—following every link on a specific technical domain to create a local, offline-capable mirror. For a home lab operator, wget is the surgeon's tool: it allows us to surgically extract the documentation of a legacy software project or the archives of a specialized security forum (such as GRC's Security Now) without the bloat of a modern browser.

A typical archival command for our silo involves specific flags like --mirror, which turns on recursive retrieval with timestamping so that repeat runs fetch only what has changed, and --convert-links, which rewrites the internal URLs so the site remains fully functional even when the Pear host is entirely offline. By carefully setting the --domains and --no-parent flags, we prevent the "runaway crawl" that can occur when a scraper accidentally follows a link to a giant platform like Wikipedia, ensuring that our 30TB ceiling is respected. This is "High-Resolution" archiving: it preserves the original layout and navigation of the site, providing the human context that is often lost in raw text extraction.
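
The flags described above can be assembled in a small helper. This is a minimal sketch, not the exact command used on Pear: the function name, the example URL, and the /scratch/mirrors path are hypothetical, and the extra flags (--adjust-extension, --page-requisites, --wait, --directory-prefix) are our own assumptions about what a polite, offline-browsable mirror usually needs.

```python
from urllib.parse import urlparse


def build_wget_mirror_command(start_url: str, dest_dir: str, wait_seconds: int = 1) -> list[str]:
    """Return an argv list for a single-domain wget mirror (hypothetical helper)."""
    host = urlparse(start_url).hostname
    if not host:
        raise ValueError(f"cannot determine host from {start_url!r}")
    return [
        "wget",
        "--mirror",                # recursion plus timestamping, unlimited depth
        "--convert-links",         # rewrite links so the copy browses offline
        "--adjust-extension",      # save HTML pages with an .html suffix
        "--page-requisites",       # also fetch the CSS and images pages depend on
        "--no-parent",             # never climb above the starting directory
        f"--domains={host}",       # restrict followed links to this host
        f"--wait={wait_seconds}",  # pause between requests
        f"--directory-prefix={dest_dir}",
        start_url,
    ]


if __name__ == "__main__":
    # Hypothetical target and staging path, for illustration only.
    cmd = build_wget_mirror_command("https://docs.example.org/manual/", "/scratch/mirrors")
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run. Note that wget already stays on the starting host during recursion unless host spanning is enabled, so --domains acts here as an explicit safety net rather than the only barrier.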

ArchiveBox: The Comprehensive Archival Manager

For modern, complex websites that rely heavily on JavaScript and dynamic content, wget is often insufficient. In these cases, we deploy ArchiveBox, a powerful, self-hosted web archiving suite. ArchiveBox acts as a "manager of managers," orchestrating multiple tools (including wget, Chromium headless, and single-file exporters) to create a multi-layered snapshot of a webpage.

When we ingest a URL into ArchiveBox, it doesn't just save the text; it creates a PDF, a static HTML snapshot, a "SingleFile" archive, and even a screenshot. This redundancy is our primary defense against "AI Washing." If a government or corporate entity later alters the live page, we have multiple independent formats—some of which are image-based and thus harder to "wash" through text-based AI—to prove the original state of the data. Furthermore, ArchiveBox organizes these snapshots into a searchable, web-based UI, turning a folder of raw files into a functional, private library that mirrors the logic of the original web.

The Scratch Disk Triage: From Ingestion to the Vault

The final stage of the methodology involves the 1TB SATA SSD scratch disk on Pear. All incoming data from wget, rsync (for Kiwix), and ArchiveBox is initially landed here. This "Staging Area" is where the critical work of verification and triage occurs. We use this high-speed flash storage to perform checksum validations (SHA-256) to ensure the download was not corrupted and to run preliminary "extraction tests" to see if the data is high-quality or just digital noise.
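
The checksum step can be sketched as follows. This assumes the mirror publishes a SHA-256 value alongside each download, either as a bare hex string or in the common "hash  filename" layout; the function names are our own and purely illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large archives never sit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path: Path, published: str) -> bool:
    """Compare a file against a published checksum line.

    Accepts either a bare hex digest or a "digest  filename" line
    (an assumption about how the mirror formats its checksum files).
    """
    tokens = published.split()
    if not tokens:
        return False
    return sha256_of_file(path) == tokens[0].lower()
```

A file that fails this comparison stays on the scratch disk and is downloaded again rather than being promoted to the vault.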

Once a dataset is verified as "Clean 2025 Baseline," it is moved from the temporary 1TB SSD staging area to the permanent 16TB IronWolf ZFS array. This migration is the final act of "Freezing" the data. By separating the high-intensity random I/O of the download and extraction phase from the sequential storage phase, we protect the mechanical health of our IronWolf drives and ensure that our primary 32TB vault remains an orderly, high-performance repository of the human-primary world.

 

The Offline Transformation: Decoding the Cloud-Native Archive

A pivotal challenge in the Silo Strategy is the fundamental mismatch between the architecture of global data repositories and the requirements of a localized home lab. Many of our primary data sources—most notably OpenAlex—were never intended for simple "point-and-click" local use. These datasets were designed to live within the high-bandwidth, horizontally scalable environment of the AWS Cloud. When we transition this data from an Amazon S3 bucket to the Pear archival array, we are not merely moving files; we are performing a complex "extraction and re-contextualization" that presents its own set of technical hurdles.

The Cloud-Native Compression Barrier

Repositories like OpenAlex are distributed as massive, multi-part snapshots consisting of thousands of gzip-compressed JSON Lines (.gz / .jsonl) files. In the AWS environment, these files are meant to be queried in place by distributed query services like Amazon Athena or Redshift, which can read thousands of small files in parallel. For a home lab with a single host, however, the same layout is just thousands of fragments that have to be processed one after another.

The first challenge is the sheer scale of the expansion: the OpenAlex snapshot, for instance, arrives as approximately 330 GB of compressed data, but it expands to over 1.6 TB once decompressed. This roughly 5x expansion factor immediately stresses the available storage on the 4TB SSD rpool or the 1TB scratch disk, since the expanded data alone exceeds the scratch disk's capacity. It necessitates a "streaming" approach where data is decompressed, processed, and moved to the IronWolf vault in a continuous loop to prevent a storage overflow.
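
A minimal sketch of the streaming idea, assuming each snapshot part is a gzip file holding one JSON object per line: records are parsed straight out of the compressed stream, so the expanded text never has to exist on the scratch disk. The function names are illustrative, not part of any OpenAlex tooling.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def stream_records(gz_path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per line without unpacking to disk."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def expansion_factor(compressed_gb: float, expanded_gb: float) -> float:
    """Ratio of decompressed size to compressed size."""
    return expanded_gb / compressed_gb


if __name__ == "__main__":
    # Figures quoted in the text: about 330 GB compressed, about 1.6 TB expanded.
    print(f"expansion factor: {expansion_factor(330, 1600):.1f}x")
```

Because stream_records is a generator, memory use stays bounded by the size of a single record, and each part file can be deleted or archived as soon as its records have been processed.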

The Challenge of Local Indexing and Search

Once the data is extracted, we face the "Searchability Gap." On the live web, OpenAlex provides a powerful API and a web interface that allow for near-instant searching across 250 million publications. Once that data is sitting as raw JSON lines on the Pear host, it is essentially a "digital pile of bricks"—it is all there, but it is impossible to find anything without an index.

Building a local index for 1.6 TB of text data is a resource-intensive process that requires significant CPU cycles from the Proxmox host. Unlike a cloud provider that can spin up a thousand virtual machines to index the data in minutes, our silo must do this "line by line." We have to transform these nested, complex JSON objects into a flattened, searchable format (like CSV or a relational database schema) just to make the archive functional. This "Processing Tax" is the price we pay for cognitive sovereignty; we are rebuilding the search infrastructure of a billion-dollar company inside a single home server.
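
The flattening step might look like the sketch below, which loads a handful of fields from each nested record into SQLite. The field names (id, title, publication_year, doi, authorships) follow our understanding of the OpenAlex work schema and should be checked against the snapshot actually downloaded; the table layout and function names are assumptions chosen for illustration.

```python
import sqlite3
from typing import Iterable


def flatten_work(work: dict) -> tuple:
    """Reduce one nested work record to a flat row (assumed field names)."""
    authors = "; ".join(
        (entry.get("author") or {}).get("display_name") or ""
        for entry in (work.get("authorships") or [])
    )
    return (
        work.get("id"),
        work.get("title"),
        work.get("publication_year"),
        work.get("doi"),
        authors,
    )


def load_works(conn: sqlite3.Connection, works: Iterable[dict]) -> int:
    """Insert flattened rows and return the total row count in the table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS works ("
        "id TEXT PRIMARY KEY, title TEXT, publication_year INTEGER, "
        "doi TEXT, authors TEXT)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_works_year ON works (publication_year)"
    )
    # A generator keeps memory flat when `works` is itself a stream.
    conn.executemany(
        "INSERT OR REPLACE INTO works VALUES (?, ?, ?, ?, ?)",
        (flatten_work(w) for w in works),
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM works").fetchone()[0]
```

SQLite is used here only because it ships with Python and needs no server; the same flattening would feed CSV output or a larger database equally well.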

Bypassing the "API Gatekeeper"

The reason we accept these challenges is to eliminate our dependence on the API Gatekeeper. Most researchers rely on the OpenAlex REST API, which is "free" but subject to rate limits and the eventual risk of "AI Washing" at the source. By downloading the raw snapshot, we are taking the "Ground Truth" as it exists today.

We are moving away from the "Cloud-First" model, where you only see what the API allows you to see, and toward a "Data-First" model where we own the entire stack. Even if OpenAlex were to go dark tomorrow or hide its best data behind a paywall, our local 1.6 TB extraction remains a permanent, unalterable asset. The difficulty of the extraction is a one-time investment that secures a lifetime of filter-free access to the global scholarly record.

The Scalability Wall: Beyond the Limits of Kiwix-Serve

As the project transitions from a small collection of essential wikis to a massive, multi-terabyte archival repository, we encounter a significant technical bottleneck: the scalability limits of the standard Kiwix-serve software. While Kiwix is an unparalleled tool for delivering a single 100GB Wikipedia ZIM file to a local network, it was fundamentally designed as a "reader" or a lightweight server for a handful of archives. When tasked with managing and serving a library of over 270 individual ZIM files totaling nearly 2 TB of data, the architecture begins to struggle under the weight of its own metadata.

The Problem with "Massive Library" Indexing

The primary issue with using Kiwix-serve for an archive of this scale lies in its library management and search indexing. In its default configuration, Kiwix-serve must scan and initialize the internal metadata and search indices for every ZIM file in its path upon startup.

  • Memory and Startup Latency: Loading hundreds of large archives creates a massive spike in memory usage and a "startup hang" that can last minutes or even cause the service to crash.

  • Search Fragmentation: Standard Kiwix-serve is often restricted in its ability to perform "Cross-ZIM" full-text searches across hundreds of files simultaneously. In many cases, the search is scoped only to the currently open ZIM file, making it impossible to find a specific technical term that might be scattered across 50 different StackExchange or Reddit archives without manually opening each one.

  • Performance Degradation: Users have documented that if a single ZIM file in a large library is "slow" (due to a high compression ratio or being stored on mechanical platters), it can bottleneck the response time for the entire server, leading to 504 Gateway Timeouts and an unresponsive UI.

Moving Toward the "ZIM-to-Markdown" Extraction

Because the goal of Project Codex is not just to "browse" the data but to make it a functional, searchable asset for future AI training, we must move beyond the browser-based interface of Kiwix. The ZIM format is a highly optimized, compressed "blob." While great for saving space, it is a "black box" for traditional indexing tools like grep, Elasticsearch, or LLM ingestion scripts.

To access the data in a truly meaningful way, we have to treat the ZIM files as Raw Sources rather than final destinations. This necessitates an extraction layer—a workflow where we programmatically "unpack" the ZIM files into organized directories of Markdown (.md) files. This transformation offers several advantages:

  1. Direct Indexing: Local search tools can index the raw text files much faster than they can probe into a compressed ZIM.

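  2. Greppable Knowledge: Standard command-line tools such as grep can search the extracted text directly, with no ZIM-aware software in the path. A full scan at this scale is bounded by disk throughput, so interactive queries still benefit from an index.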

  3. AI Readiness: LLMs cannot "read" a ZIM file, but they can easily ingest a folder of Markdown files.
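
The conversion half of that extraction layer can be sketched with the standard library alone. The example below assumes the article HTML has already been read out of the ZIM file as a string (for instance with a ZIM reader library, which is not shown here) and handles only a few tags; a real pipeline would likely use a dedicated HTML-to-Markdown converter. All names are illustrative.

```python
import re
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Collect text from HTML, mapping a few tags to Markdown markers."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    SKIPPED = {"script", "style"}  # content that is never article text

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag in ("p", "div", "ul", "ol"):
            self.parts.append("\n\n")
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.HEADINGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_markdown(html: str) -> str:
    """Convert an HTML string to rough Markdown text."""
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # allow at most one blank line
    return text.strip() + "\n"
```

Each converted article would then be written to its own .md file, giving search tools and ingestion scripts plain text to work with.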

Alternative Tools for Massive Archives

For those who still need a high-performance web interface for these hundreds of files without the extraction overhead, we must look at more advanced tools. One solution is the use of SearXNG or Zimply as a meta-search layer. By pointing a local instance of SearXNG at our Kiwix-serve instance, we can aggregate results from multiple ZIMs into a single, unified search page.

Additionally, for the most demanding parts of the archive, we may need to deploy a Model Context Protocol (MCP) server specifically for OpenZIM. Tools like openzim-mcp allow an AI assistant to "browse" your local ZIM directory as a tool, effectively giving your local LLM an "offline brain" that can query the 1.9 TB of data directly without needing a traditional web UI. This shifts the focus from "human reading" to "machine intelligence," which is the ultimate objective of the Sea Of Fate silo.