
The Sea of Fate Storage proposal

The Physical Foundation: Pear and the Archival Array

The practical execution of the Silo Strategy is anchored by the physical hardware of the host Pear. While the broader project is conceptualized as a "Digital Silo," its survival depends on the specific storage characteristics of this machine. By utilizing a multi-tiered storage approach on Pear, we create a system that can simultaneously handle high-speed data ingestion and long-term bit-perfect preservation.

Proxmox and the ZFS Advantage

Pear operates on the Proxmox virtualization platform, which provides native integration with the ZFS file system. For an archival project, ZFS is the essential choice due to its self-healing nature. Unlike traditional RAID, ZFS stores an end-to-end checksum for every block, so it can detect "bit rot" (the silent corruption of data that can occur over years of storage) and automatically repair it from the pool's redundancy. On Pear, this helps ensure that the 2025 baseline remains an exact, unaltered copy of the human-primary web.
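The detect-and-repair idea can be illustrated with a toy model. This is not how ZFS is implemented internally; it is a minimal sketch assuming two redundant copies of one block and a SHA-256 checksum stored separately from the data. The function names `checksum` and `read_with_repair` are hypothetical.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Return a SHA-256 hex digest for a block of data."""
    return hashlib.sha256(block).hexdigest()


def read_with_repair(copies: list[bytearray], expected: str) -> bytes:
    """Return the first copy matching the stored checksum and
    overwrite any corrupted copies with it (toy self-healing)."""
    good = next((bytes(c) for c in copies if checksum(bytes(c)) == expected), None)
    if good is None:
        raise IOError("all copies corrupted; cannot repair")
    for c in copies:
        if checksum(bytes(c)) != expected:
            c[:] = good  # repair the bad copy in place
    return good


original = b"2025 baseline block"
stored_sum = checksum(original)
copies = [bytearray(original), bytearray(original)]
copies[0][0] ^= 0x01  # simulate a single flipped bit ("bit rot")

data = read_with_repair(copies, stored_sum)
```

On the RaidZ1 vault the repair data comes from parity rather than from a literal second copy, while on a mirrored pool it is a second copy as modeled here. In both cases the checksum is what tells the system which data to trust.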

The High-Capacity Vault: IronWolf RaidZ1

The "heavy lifting" of the silo is performed by a dedicated pool of three 16TB Seagate IronWolf drives configured in RaidZ1.

  • Capacity and Redundancy: This configuration provides approximately 32TB of nominal usable archival space (roughly 29 TiB as reported by the operating system, before ZFS overhead) while allowing the system to survive the total failure of any single drive without data loss.

  • Optimized for Archives: The IronWolf series is specifically engineered for 24/7 NAS environments, featuring rotational vibration sensors that maintain performance in multi-drive arrays. This pool is designated as the permanent home for large-scale ZIM files and scholarly snapshots.
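The capacity figure follows from simple arithmetic, sketched below. The calculation deliberately ignores ZFS metadata, padding, and reserved space, so the space actually available on Pear will be somewhat lower than the number it prints.

```python
DRIVE_TB = 16   # nominal drive size in decimal terabytes
DRIVES = 3      # drives in the RaidZ1 vdev
PARITY = 1      # RaidZ1 reserves one drive's worth of space for parity

usable_tb = (DRIVES - PARITY) * DRIVE_TB    # nominal usable space in TB
usable_tib = usable_tb * 10**12 / 2**40     # the same space in binary TiB

print(f"{usable_tb} TB nominal = {usable_tib:.1f} TiB")
# prints "32 TB nominal = 29.1 TiB"
```

The gap between decimal terabytes and binary tebibytes is the main reason a "32TB" pool is better planned for as a 30TB budget.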

The Ingestion Pipeline: SSD Scratch and Rpool

To maintain the efficiency of the silo, the hardware is partitioned to separate "working" data from "archived" data.

  • The Scratch Disk (1TB SATA SSD): This disk serves as the primary ingestion point. Large-scale downloads and data extractions are high-intensity operations that can bottleneck mechanical drives. By using the 1TB SSD as a "scratch" area, the system can rapidly ingest, verify, and decompress data before moving the finalized, "clean" archives to the IronWolf vault.

  • The Rpool (mirrored pair of 4TB SSDs): The primary system pool hosts the Proxmox environment and the virtual machines. Keeping the operating systems on this separate, high-speed pool ensures that the management layer remains responsive even when the archival pool is under heavy load. Although these SSDs are not especially high-endurance models, they should last a reasonable length of time if the volume of data written is kept fairly small; with 4TB of capacity each, they should stay well within their rated total bytes written (TBW).
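The scratch-to-vault handoff described above could be sketched as follows. This is an illustration under stated assumptions, not the project's actual tooling: the function names `sha256_of` and `promote` are hypothetical, and it assumes the publisher of each archive supplies a SHA-256 checksum to verify against.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives never sit fully in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def promote(scratch_file: Path, vault_dir: Path, expected_sha256: str) -> Path:
    """Move a verified file from scratch to the vault; refuse on mismatch."""
    expected = expected_sha256.lower()
    if sha256_of(scratch_file) != expected:
        raise ValueError(f"checksum mismatch for {scratch_file.name}")
    vault_dir.mkdir(parents=True, exist_ok=True)
    target = vault_dir / scratch_file.name
    shutil.move(str(scratch_file), str(target))
    # Re-hash after the move to confirm the file on the vault is intact.
    if sha256_of(target) != expected:
        raise IOError(f"post-move verification failed for {target}")
    return target
```

A call might look like `promote(Path("/scratch/incoming/example.zim"), Path("/vault/zim"), "<published sha256>")`, where both paths are placeholders. Because the scratch disk and the vault are separate filesystems, `shutil.move` performs a copy followed by a delete, which is why the second hash check is worth its cost.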

Hardware-Level Cognitive Sovereignty

By keeping this entire stack—from the ZFS checksums on the IronWolf drives to the high-speed processing on the SSDs—within the local home lab, the project achieves true independence. The information is not a "service" provided by a third party; it is a physical asset. This hardware-level sovereignty is the final layer of defense against the "flattening" and "washing" of the digital world, providing a local, immutable ground truth that exists outside the reach of the live web's volatility.

ZFS on Proxmox: Performance and Efficiency Best Practices

This video provides an in-depth look at how ZFS operates within Proxmox, covering the essential tuning and design choices that ensure long-term data integrity for large-scale archival pools like ours.

Processing Constraints and the Data-First Mandate

A significant challenge in the current phase of the project is the disparity between data acquisition and computational power. While the home lab is equipped for high-capacity storage, the hardware required for large-scale AI training, specifically high-VRAM GPU clusters, remains the main bottleneck. However, the Silo Strategy dictates that the "fuel" (data) is more perishable than the "engine" (processing). Therefore, the initial stages are defined by a data-first mandate: we prioritize the collection and preservation of human-primary information now, even if the local capability to process or "train" on that data is not yet fully realized.

The Storage Ceiling: Selective Archiving

With approximately 32TB of nominal usable space on the Pear archival array, or roughly 30TB in practice, the project faces a hard "storage ceiling." In an era where the total volume of the human-primary web is measured in petabytes, 30TB represents a focused, high-value extraction rather than a blind mirror of the internet. Because additional 16TB IronWolf drives are prohibitively expensive for a standard home lab budget, the project must transition from a general collector to a selective curator. Every gigabyte allocated must be justified by its "informational density" and its resistance to the flattening and washing of the live web.

Prioritizing Informational Density

To maximize the utility of the 30TB silo, we focus on sources that offer the highest ratio of knowledge-to-bytes.

  • Encyclopedic ZIMs: A 100GB Wikipedia ZIM contains a large share of human historical and scientific consensus in a highly compressed, text-heavy format. This is among the highest-density data available.

  • Technical Documentation: PDF and Markdown repositories of hardware schematics, software manuals, and security archives (like GRC) provide "tacit knowledge" that is often the first to be paywalled or AI-washed.

  • Academic Metadata: Catalogs like OpenAlex allow us to store the "map" of human discovery (who discovered what, and when) using relatively little space compared to the raw video or image data that dominates the modern web.

The current state of the home lab presents a stark contrast between massive storage potential and restricted computational throughput. While the 32TB ZFS array on Pear provides a formidable "vault" for human knowledge, the hardware required to perform high-parameter LLM training or complex inference remains a significant financial and technical hurdle. However, the logic of the Silo Strategy dictates that this "Processing Gap" is a temporary state of affairs, whereas the "Data Gap" (the window in which clean, human-primary information can be harvested) is rapidly closing.

If we wait for the day when high-VRAM GPUs are affordable enough for the average home lab to run 70B+ parameter models locally, the "fuel" required to power those models will have already been "AI-washed," paywalled, or corrupted by the recursive feedback loops of the flattening. Therefore, the immediate mandate is not the use of the data but its preservation. We are effectively "front-loading" the hardest part of the AI revolution by securing the raw intelligence of the 2025 baseline now, accepting that our current ability to query or train on this data is primitive compared to what it will be in five to ten years.

The 30TB Ceiling: The Necessity of Aggressive Curatorial Triage

Operating within a 30TB usable storage limit in an era of petabyte-scale data growth requires a shift from a "collector" mindset to a "curator" mindset. Because expanding the IronWolf array with additional 16TB drives is currently prohibitively expensive, we cannot afford the luxury of mirroring the entire internet. Instead, we must practice a form of "Archival Triage," where every gigabyte allocated to the Pear ZFS pool is vetted for its long-term survival value and informational density. This means prioritizing "Seed Data": the foundational texts, technical schematics, and encyclopedic snapshots that contain the highest ratio of human signal to digital noise. A 100GB Wikipedia ZIM file, for example, offers a higher density of stable human consensus than ten terabytes of uncurated social media scrapes. This selective pressure forces us to be disciplined; we are not building a junk drawer of the internet, but a high-resolution "Time Capsule." If a dataset does not provide a unique, pre-wash perspective or critical technical "tacit knowledge," it is rejected in favor of leaner, more essential archives that fit within our physical 30TB envelope.
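One way to make this triage concrete is to rank candidates by value per gigabyte and fill the available space greedily. The sketch below is only an illustration: the `value` scores are hypothetical curator judgments, the sizes are made up apart from the 100GB Wikipedia figure, and a greedy pass is a simple heuristic rather than an optimal packing.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    size_gb: float
    value: float  # hypothetical curator-assigned score, higher is better

    @property
    def density(self) -> float:
        """Value per gigabyte: the 'informational density' proxy."""
        return self.value / self.size_gb


def select(candidates: list[Candidate], budget_gb: float) -> list[Candidate]:
    """Greedy pick: highest density first, skipping anything that won't fit."""
    chosen, used = [], 0.0
    for c in sorted(candidates, key=lambda c: c.density, reverse=True):
        if used + c.size_gb <= budget_gb:
            chosen.append(c)
            used += c.size_gb
    return chosen


candidates = [
    Candidate("wikipedia-zim", 100, 90),
    Candidate("social-media-scrape", 10_000, 20),
    Candidate("tech-docs", 500, 60),
]
# Assume 5 TB of the vault is still free for new acquisitions.
picked = select(candidates, budget_gb=5_000)
print([c.name for c in picked])  # ['wikipedia-zim', 'tech-docs']
```

The ten-terabyte scrape is rejected both because it does not fit and because its density is far below the other candidates, which mirrors the comparison made in the paragraph above.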

Information Hoarding as a Hedge Against Digital Obsolescence

The act of "hoarding" data in a "cold storage" state is a strategic hedge against the inevitable march of digital obsolescence and the "Dead Internet" phenomenon. We anticipate a future where the live web is so heavily populated by synthetic agents and "laundering" filters that it will be impossible to distinguish genuine human thought from a corporate-optimized simulation. By filling our available 30TB now, we are creating an "Off-Chain" record that does not rely on the persistence of the original hosts or the benevolence of future data owners. This "Cold Storage" approach is based on the historical precedent of the "Monastic Scriptorium," where knowledge was preserved in isolated silos during periods of external chaos, only to be "rediscovered" and utilized when the infrastructure to support it was finally rebuilt. In our context, the "rebuilding" is the future drop in GPU costs and the advancement of local, open-weights AI. When that day arrives, the Sea of Fate silo will be one of the few remaining sources of "Pre-Wash" reality, providing the clean training data needed to build an AI that actually understands the world as it was, rather than as the "washers" want us to believe it was.

Strategic Deletion and Triage

Operating within a 30TB limit requires a continuous process of "archival triage." As new snapshots become available, older or lower-priority data may need to be pruned to make room for higher-fidelity sources. This selective pressure ensures that the Sea of Fate silo does not become a cluttered "data swamp," but remains a lean, high-integrity library of the 2025 baseline. This disciplined approach to storage management is the final component of the Silo Strategy, ensuring that our limited physical resources are always working to preserve the most critical fragments of the human-primary world.
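Pruning can be planned before anything is deleted. The sketch below assumes each holding carries a hypothetical numeric priority (higher means keep longer) and returns the list of items to remove, lowest priority first, until enough space is freed. The function name `prune_plan` and the sample holdings are illustrative only.

```python
def prune_plan(holdings: dict[str, tuple[float, int]], needed_gb: float) -> list[str]:
    """Return names to delete, lowest priority first, until needed_gb is freed.

    holdings maps name -> (size_gb, priority); higher priority means keep longer.
    """
    plan, freed = [], 0.0
    for name, (size_gb, _priority) in sorted(holdings.items(), key=lambda kv: kv[1][1]):
        if freed >= needed_gb:
            break
        plan.append(name)
        freed += size_gb
    if freed < needed_gb:
        raise ValueError("not enough prunable data to free the requested space")
    return plan


holdings = {
    "wikipedia-2025": (100, 10),
    "old-forum-scrape": (800, 2),
    "superseded-snapshot": (300, 1),
}
plan = prune_plan(holdings, needed_gb=1_000)
print(plan)  # ['superseded-snapshot', 'old-forum-scrape']
```

The function only produces a plan and never touches the disk, so a human curator can review the list before any archive is actually removed.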