What SSCA Actually Is
Most compression works at the character or bit level. SSCA works at the meaning level — replacing words, phrases, and semantic concepts with compact symbols drawn from a 247-primitive universal lookup table, then cascading seven additional specialized layers on top. The result is lossless compression that scales with data repetition and domain specificity, not just entropy.
The inventor, R. Claude Armstrong, is an 80-year-old independent engineer from Everett, Washington, with 75+ years of pattern recognition experience across industrial crane systems, wastewater treatment, welding processes, and medical equipment maintenance. SSCA grew from observing how mission-critical systems — including the Apollo Guidance Computer — achieved remarkable efficiency through symbolic abstraction rather than raw compute power.
The core insight: language and structured data are massively redundant at the meaning level. The words "rapid," "fast," "quick," and "swift" all represent the same mental image. Enterprise logs repeat the same structural template millions of times per day. Legal documents restate the same clause dozens of times per contract. SSCA exploits each of these redundancy types with a dedicated layer.
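The Layer 1 idea can be sketched in a few lines. The dictionary and § symbols below are illustrative stand-ins, not SSCA's actual 550-entry table; the point is that a bijective word-to-symbol map is both shorter and exactly reversible:

```python
# Minimal sketch of dictionary-based symbol substitution (the Layer 1 idea).
# The entries and symbols here are illustrative, not SSCA's real table.
DICT = {"the": "§1", "building": "§2", "office": "§3", "drove": "§4"}
INV = {v: k for k, v in DICT.items()}  # inverse map makes decompression lossless

def compress(text: str) -> str:
    # Replace each known word with its symbol; unknown words pass through.
    return " ".join(DICT.get(w, w) for w in text.split())

def decompress(text: str) -> str:
    return " ".join(INV.get(t, t) for t in text.split())

s = "he drove to the office building"
c = compress(s)                 # "he §4 to §1 §3 §2"
assert decompress(c) == s       # round trip is byte-for-byte lossless
assert len(c) < len(s)          # every symbol is shorter than its word
```

Because the map is one-to-one, this stays lossless; collapsing true synonyms onto one symbol (the "same mental image" case) is what the later, semantically lossless layers handle.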
The 8-Layer Compression Stack
Layers 1–4 form the open-source foundation. Layers 5–7 are proprietary. Layer 8 is an optional specialty module for literary/marketing content.
| Layer | Name | Mechanism | Status |
|---|---|---|---|
| L1 | Symbol Substitution (open) | 550-entry word/phrase → §symbol dictionary. 20–30% on general text. | ✓ Production |
| L2 | Contextual Compression (open) | Bigrams, trigrams, collocations (~200 patterns). +7–12% additional. | ✓ Production |
| L3 | Hierarchical Abstraction (open) | Hypernym substitution, redundant modifier removal. +10–15%. Semantically lossless. | ⚠ Bug — see §04 |
| L4 | Predictive Inference (open) | Omission map: drops highly predictable words, stores a position map. +8–15%. | ✓ Production |
| L5 | Data-Driven Dictionary (proprietary) | Learns optimal symbols from the corpus. Built-in domain detection. +10–25%. | ✓ Production |
| L6 | Template Repetition (proprietary) | Detects structural templates (logs, forms). Stores the template once plus variables. | ✓ Production |
| L7 | Cross-Reference (proprietary) | Replaces repeated long strings with short §RN reference IDs. +10–25%. | ✓ Production |
| L8 | Metaphor Compression (specialty) | 25+ metaphor families (Lakoff/Johnson). Useful for literary/marketing content only. | Optional module |
Each layer operates on the output of the previous one. The DNA/P³ router at the front of the pipeline is designed to detect domain, data type, and whether content is already compressed, then route accordingly before any layer processing begins. (The router is specified but not yet built; see the status notes below.)
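The cascade structure can be sketched as an ordered list of (compress, decompress) pairs, applied forward and then unwound in reverse. The two toy layers below are stand-ins, not the real SSCA layers; the shape of the pipeline is the point:

```python
# Sketch of the cascade: each layer is a (compress, decompress) pair that
# operates on the previous layer's OUTPUT; decompression runs in reverse order.
# These two layers are trivial stand-ins, not the real SSCA implementations.
layers = [
    (lambda s: s.replace("the ", "§T "), lambda s: s.replace("§T ", "the ")),
    (lambda s: s.replace("§T quick", "§TQ"), lambda s: s.replace("§TQ", "§T quick")),
]

def compress(text: str) -> str:
    for enc, _ in layers:
        text = enc(text)        # layer N consumes layer N-1's output
    return text

def decompress(text: str) -> str:
    for _, dec in reversed(layers):
        text = dec(text)        # undo layers in reverse order
    return text

s = "the quick fox jumped over the lazy dog"
assert compress(s) == "§TQ fox jumped over §T lazy dog"
assert decompress(compress(s)) == s   # the full stack stays lossless
```

Note how the second layer matches against the first layer's symbols (`§T quick`), which is exactly why per-layer gains compound multiplicatively rather than additively.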
The Numbers, Honestly Stated
The original per-layer READMEs stated cumulative compression claims as if each layer's gain applied to the original input, making the percentages additive. That is mathematically incorrect: each layer compresses only what the previous layers left behind, so gains are multiplicative. Below is the corrected picture, verified by stacking analysis:
[Chart: Cumulative Compression — Claimed vs Mathematically Honest]
The log and legal claims are defensible because those data types are genuinely and measurably highly repetitive — L6 and L7 were purpose-built for them. The general text claims need recalibration. This correction matters for patent filings and investor discussions: claims that can be disproved by running the code on real data are a liability, not a strength.
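The arithmetic behind the correction, with illustrative per-layer percentages (not SSCA's measured figures):

```python
# Why per-layer gains multiply rather than add: each layer compresses only
# the bytes the previous layers left behind. Percentages are illustrative.
gains = [0.25, 0.10, 0.12, 0.10]    # per-layer fractional reductions

additive = sum(gains)               # the incorrect "sum the percentages" reading

remaining = 1.0
for g in gains:
    remaining *= (1 - g)            # each layer shrinks only what remains
multiplicative = 1 - remaining      # honest cumulative reduction

print(f"additive claim: {additive:.1%}")        # 57.0%
print(f"multiplicative: {multiplicative:.1%}")  # 46.5%
```

A four-layer stack claimed at 25% + 10% + 12% + 10% = 57% actually delivers 1 − (0.75 × 0.90 × 0.88 × 0.90) ≈ 46.5%, a gap large enough to matter in a patent filing.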
What Works, What Needs Work
This is an honest assessment based on reading all eight layer implementations, running the code, and checking the math. The goal is to show engineers exactly what they're walking into.
Layers 1, 2, 4, 5, 6, 7
All run correctly. APIs are clean. Cascade integration is straightforward. Losslessness verified. L6 on server log data is genuinely impressive.
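The Layer 6 result on logs is easy to see in miniature. This is a sketch of the technique on synthetic log lines, not the proprietary implementation: store the structural template once, plus the per-line variable tuples.

```python
# Sketch of the Layer 6 idea (template + variables) on synthetic log lines.
# Illustrative only; the real L6 detects templates instead of hardcoding one.
import re

logs = [
    "2024-05-01 10:00:01 INFO user=alice action=login",
    "2024-05-01 10:00:02 INFO user=bob action=logout",
    "2024-05-01 10:00:05 INFO user=carol action=login",
]

# One template stored once, plus the variable values for each line.
template = "{} {} INFO user={} action={}"
pattern = re.compile(r"(\S+) (\S+) INFO user=(\S+) action=(\S+)")
variables = [pattern.match(line).groups() for line in logs]

# Lossless: template + variables reproduce every original line exactly.
restored = [template.format(*v) for v in variables]
assert restored == logs
```

The fixed text ("INFO user= action=") is stored once no matter how many million lines repeat it, which is why this layer's gains grow with repetition rather than with entropy.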
Layer 3 — Symbol Format Bug
Symbol format §H:building (11 chars) is longer than the word it replaces (8 chars), producing negative compression. Two-line fix needed.
Layer 8 — Scope Clarification
Solid implementation but mislabeled as a core layer. Contributes near-zero on enterprise data types. Should be an optional specialty module.
DNA/P³ Router
Specified in architecture and flowchart, but not yet implemented as a standalone module. Currently layers run serially. The router is the critical next build.
OCR / PDF Pre-Processing
Scanned document ingestion, PDF text extraction, image/text stream separation, and DocID tagging are architected but not implemented.
Patent Provisional Filings
Three provisionals covering the first three data efficiency parameters of the 8-layer stack. Architecture documentation extensive.
The Layer 3 bug, for those who want to see it:
```text
# Actual output from running layer3.py against real sentences

IN:  he drove his sedan to the office building
OUT: he drove his §H:car to the §H:building building    (-14.6%)
# Bug 1: §H:building (11 chars) > building (8 chars) → file gets LARGER
# Bug 2: office → §H:building collides with the next word 'building';
#        it decompresses to "the building building" — grammatically broken

IN:  the laptop is on the desk
OUT: the §H:computer is on the §H:furniture    (-52.0%)
# Negative compression: the symbol is longer than the word it replaced.

# The fix: numbered symbols, which L1/L2 already use correctly
§H:building → §H14    (4 chars vs 11 — always shorter than the source)
§CAT:big    → §C3     (3 chars)
```
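A sketch of what the fix could look like. The hypernym mappings and symbol IDs below are hypothetical (only §H14 for "building" comes from the notes above); the lookahead skips any substitution that would land next to an identical literal word. As with Layer 3 generally, this is semantically rather than byte-for-byte lossless.

```python
# Sketch of the proposed Layer 3 fix: numbered compact symbols plus a
# one-token lookahead. Mappings and symbol IDs are illustrative.
HYPERNYMS = {"sedan": "car", "laptop": "computer", "office": "building"}
SYMBOLS = {"car": "§H7", "computer": "§H9", "building": "§H14"}

def compress_l3(text: str) -> str:
    words = text.split()
    out = []
    for i, w in enumerate(words):
        hyper = HYPERNYMS.get(w)
        nxt = words[i + 1] if i + 1 < len(words) else None
        # Lookahead: skip the substitution if the hypernym would collide
        # with the literal next word ("office building" → keep "office").
        if hyper and hyper != nxt:
            out.append(SYMBOLS[hyper])   # numbered symbol, always short
        else:
            out.append(w)
    return " ".join(out)

print(compress_l3("he drove his sedan to the office building"))
# → "he drove his §H7 to the office building"  (office kept: collision)
print(compress_l3("the laptop is on the desk"))
# → "the §H9 is on the desk"  (now shorter than the input, not longer)
```

Both failure modes from the transcript above are closed: symbols are shorter than their source words, and "the building building" can no longer be produced.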
The Work That Needs Doing
These are ordered by foundation-first logic, not complexity. A small team of senior engineers could move through this stack systematically.
01. Fix Layer 3 Symbol Format
Replace the verbose §H:term format with numbered compact symbols (§H14, etc.). Add a single-token lookahead to prevent collisions with adjacent identical words. Estimated: 1–2 days for an experienced Python dev.
02. Build the DNA/P³ Router
The domain classifier that sits in front of the entire stack. Detects data type via a magic-number / header / entropy scan, routes to the appropriate tier parsers, and implements a bypass for pre-compressed content. This is the critical-path item: without it, the stack can't self-configure.
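A sketch of the detection step, assuming conventional heuristics (magic-number prefixes plus a Shannon-entropy scan over the byte histogram). The real DNA/P³ design may differ; the 7.5 bits/byte bypass threshold is an assumption, not a specified value.

```python
# Sketch of a front-of-pipeline router: magic-number check first, then an
# entropy scan to flag already-compressed content for bypass. Heuristics
# and threshold are assumptions, not the actual DNA/P³ specification.
import math
from collections import Counter

MAGIC = {b"\x1f\x8b": "gzip", b"PK\x03\x04": "zip", b"%PDF": "pdf"}

def shannon_entropy(data: bytes) -> float:
    # Bits per byte over the observed byte histogram (max 8.0).
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def route(data: bytes) -> str:
    for magic, kind in MAGIC.items():
        if data.startswith(magic):
            return f"bypass:{kind}"         # already compressed/packaged
    if shannon_entropy(data) > 7.5:         # near-random bytes: don't recompress
        return "bypass:high-entropy"
    return "compress:text-stack"            # hand off to the layer stack

assert route(b"%PDF-1.7 example") == "bypass:pdf"
assert route(b"the quick brown fox " * 50) == "compress:text-stack"
```

Running compressed input through the stack wastes cycles and can grow the file, so the bypass path is as important as the routing itself.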
03. Build the OCR/PDF Pre-Processing Pipeline
Scanned document ingestion → PDF text extraction → image/text stream separation → DocID tagging for stream reunion at the destination. Connects the "left pipeline" shown in the architecture flowchart to the main compression stack.
04. Calibrate and Validate Compression Claims
Run all layers against representative corpora for each data type (logs, legal docs, medical records, general text). Produce reproducible benchmark results. Replace the current additive percentage claims with verified multiplicative figures. This output becomes the patent and investor evidence base.
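A minimal shape for that harness. The corpora here are synthetic placeholders, and zlib stands in for the SSCA stack purely so the sketch runs; the real harness would load representative files and call the full pipeline.

```python
# Sketch of a reproducible benchmark harness. zlib is a stand-in compressor
# and the corpora are synthetic placeholders; swap in the SSCA pipeline and
# real per-domain datasets to produce the actual evidence base.
import zlib

corpora = {
    "logs": b"2024-05-01 INFO user=alice action=login\n" * 1000,
    "general": b"The quick brown fox jumps over the lazy dog. " * 50,
}

def ratio(data: bytes) -> float:
    # Fractional size reduction: 0.90 means the output is 10% of the input.
    compressed = zlib.compress(data)    # replace with the full SSCA stack
    return 1 - len(compressed) / len(data)

for name, data in corpora.items():
    print(f"{name}: {ratio(data):.1%} reduction on {len(data)} bytes")
```

Even this toy version shows the expected shape: the repetitive log corpus compresses far harder than general prose, which is the pattern the recalibrated claims need to document per data type.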
05. Integration Layer: Full Pipeline Orchestration
Wire all seven core layers, the router, and the OCR pipeline into a single callable interface with proper error handling, logging, and configuration. Define the production API surface. Prepare for external pilot deployment.
Why Compression at This Level Matters Now
Data infrastructure cost — storage, transmission, compute, cooling — is the limiting constraint for AI companies, cloud providers, and any organization operating at scale. The "four walls" that define every large data operation are:
Energy Consumption
Less data stored and transmitted means less I/O, less compute, meaningfully lower power draw across massive infrastructure.
Hardware Requirements
Storage and memory that don't need to be purchased, racked, or maintained, because the data is smaller to begin with.
Infrastructure Costs
Bandwidth, data center space, cooling — all scale with data volume. Compression is a multiplier on all of them simultaneously.
Cooling Systems
One of the fastest-growing infrastructure costs in AI. Fewer compute cycles on smaller data reduce thermal output directly.
SSCA's approach — domain-aware, meaning-level compression that improves with data repetition — is particularly well-suited to the workloads that drive the highest infrastructure costs: AI training logs, legal document archives, medical record systems, and structured telemetry at scale.
Interested in Contributing?
This is an independent inventor project seeking experienced engineers or a CS team for development partnership, pilot validation, or research collaboration. Provisional patents filed. Architecture documentation extensive. The code is real.