Architecture
A compiler for hostile, underspecified legal deltas.
LawVM's architecture follows from the structural properties of law itself. The essay derives the necessity; this page describes the result.
The compiler model
A jurisdiction frontend is a phased compiler with explicit contracts:
- Acquire and archive source artifacts
- Parse amendment text into a clause surface
- Extract payloads and normalize source-locally
- Elaborate against live legal state (snapshot-pure)
- Lower to canonical typed operations
- Replay over a base state
- Materialize point-in-time text
- Adjudicate against oracle or witness surfaces
The pipeline separates lowering, target resolution, replay, and divergence accounting so that mismatches are inspectable rather than collapsed into a single opaque failure state.
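A minimal sketch of that separation, assuming a toy state model (a dict from section label to text) and a single operation encoded as a dict. The names `Stage`, `StageReport`, `PipelineResult`, and `run_pipeline` are illustrative and not taken from LawVM itself; the point is only that each stage files its own report, so a failure can be attributed to one stage.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    LOWERING = "lowering"
    TARGET_RESOLUTION = "target_resolution"
    REPLAY = "replay"
    DIVERGENCE = "divergence"


@dataclass
class StageReport:
    stage: Stage
    ok: bool
    detail: str = ""


@dataclass
class PipelineResult:
    text: dict
    reports: list = field(default_factory=list)

    def first_failure(self):
        # The earliest failing stage, or None if every stage passed.
        return next((r for r in self.reports if not r.ok), None)


def run_pipeline(base: dict, op: dict, oracle: dict) -> PipelineResult:
    state = dict(base)  # never mutate the caller's base state
    reports = []

    # Lowering: only known actions become operations (hypothetical vocabulary).
    known = op.get("action") in {"replace", "repeal"}
    reports.append(StageReport(Stage.LOWERING, known,
                               "" if known else f"unknown action {op.get('action')!r}"))
    if not known:
        return PipelineResult(state, reports)

    # Target resolution: the addressed section must exist in the live state.
    resolved = op["target"] in state
    reports.append(StageReport(Stage.TARGET_RESOLUTION, resolved,
                               "" if resolved else f"no such target {op['target']!r}"))
    if not resolved:
        return PipelineResult(state, reports)

    # Replay: apply the operation to the copied state.
    if op["action"] == "repeal":
        del state[op["target"]]
    else:
        state[op["target"]] = op["text"]
    reports.append(StageReport(Stage.REPLAY, True))

    # Divergence accounting: list every key where replay and oracle disagree.
    diffs = sorted(k for k in state.keys() | oracle.keys()
                   if state.get(k) != oracle.get(k))
    reports.append(StageReport(Stage.DIVERGENCE, not diffs, ", ".join(diffs)))
    return PipelineResult(state, reports)
```

With this shape, an unresolved target and an oracle mismatch are different values of `first_failure().stage` rather than the same generic error.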
Two planes
LawVM operates on two simultaneous planes:
Semantic plane: source artifacts → clause surface → payload surface → elaborated intent → canonical effects → timelines → PIT materialization. This is the path from raw legal text to point-in-time state.
Epistemic plane: parse witnesses → observations → obligations → adjudications → claims → evidence bundle. This is the path that records why the result should be trusted — what was observed, what was inferred, what was recovered, and what remains unresolved.
Both planes run together. A replay result without its epistemic trail is not a LawVM result.
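One way to read that rule is as a type-level constraint: the result type cannot be constructed without evidence. The sketch below assumes hypothetical names (`Observation`, `LawVMResult`) and a flat list of observations in place of the full witness, obligation, and adjudication chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    kind: str  # assumed vocabulary: "observed", "inferred", "recovered", "unresolved"
    note: str


@dataclass(frozen=True)
class LawVMResult:
    pit_text: str                       # semantic plane: point-in-time text
    evidence: tuple                     # epistemic plane: why to trust it

    def __post_init__(self):
        # Refuse to build a result that carries no epistemic trail.
        if not self.evidence:
            raise ValueError("replay result has no epistemic trail")

    def unresolved(self):
        return [o for o in self.evidence if o.kind == "unresolved"]
```

A caller that only has text and no evidence has to produce something other than a `LawVMResult`, which keeps bare replay output from circulating as a finished result.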
Three hard waists
The architecture has three stable interfaces that must not be bypassed:
- Clause surface: the first stable representation of amendment meaning. A typed AST for amendment instruction language.
- Payload surface: the amendment body after source-local normalization, before live-state-dependent meaning recovery. This is where the source text stops being raw and starts being structured, but meaning recovery against the current statute state has not yet happened.
- Canonical execution: replay is designed to consume typed canonical execution artifacts, not raw amendment XML. If meaning cannot be resolved before this boundary, it should become a finding, failed operation, or explicit strict-mode barrier rather than an implicit replay mutation.
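The three waists can be sketched as three types with a one-way path between them. Everything here is an assumption made for illustration: the class names, the single-element path, and the `target-unresolved` finding code. The part that mirrors the text is that elaboration returns either a canonical operation or a finding, and replay rejects anything that is not a canonical operation.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Clause:           # waist 1: typed instruction, e.g. "amend 12 §"
    action: str
    target: str


@dataclass(frozen=True)
class Payload:          # waist 2: normalized body, not yet checked against live state
    clause: Clause
    text: str


@dataclass(frozen=True)
class CanonicalOp:      # waist 3: the only thing replay accepts
    action: str
    path: tuple
    text: str


@dataclass(frozen=True)
class Finding:
    code: str
    message: str


def elaborate(payload: Payload, live_state: dict) -> Union[CanonicalOp, Finding]:
    path = (payload.clause.target,)
    # Anything but an insert must address something that exists right now.
    if payload.clause.action != "insert" and path not in live_state:
        return Finding("target-unresolved",
                       f"{payload.clause.target} not in live state")
    return CanonicalOp(payload.clause.action, path, payload.text)


def replay(state: dict, op: CanonicalOp) -> dict:
    if not isinstance(op, CanonicalOp):
        raise TypeError("replay consumes canonical operations only")
    new_state = dict(state)
    if op.action == "repeal":
        new_state.pop(op.path, None)
    else:
        new_state[op.path] = op.text
    return new_state
```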
Why deterministic parsing works
Legislative amendment language is a semi-formal instruction language embedded in ordinary human legal text. In Finland, the Lainkirjoittajan opas (legislative drafting guide) specifies that the johtolause (amendment preamble) follows a fixed pattern: kumotaan (repeal), muutetaan (amend), lisätään (insert), in prescribed order, ending with seuraavasti:. Target addresses follow hierarchical conventions: lain 12 §:n 2 momentti (section 12, subsection 2 of the act).
This structure is parseable with conventional grammars. LawVM uses parsing expression grammars (PEGs) to extract clause structure, target addresses, action types, and payload boundaries from raw amendment text. It needs no LLMs, no statistical models, no training data, and no requirement that legislators rewrite law as code. The parser is deterministic: the same input always produces the same parse.
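To make the claim concrete, here is a deliberately small stand-in. It uses a single regular expression rather than a PEG, and it covers only a simplified, synthetic clause shape assembled from the phrases quoted above (one verb, `lain`, a section, an optional `momentti`). The `Clause` type and `parse_clause` function are hypothetical; real preambles carry statute names, numbers, and lists of targets that this sketch ignores.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Verb glosses as given in the text above.
ACTIONS = {"kumotaan": "repeal", "muutetaan": "amend", "lisätään": "insert"}

CLAUSE_RE = re.compile(
    r"^(?P<verb>kumotaan|muutetaan|lisätään)\s+lain\s+"
    r"(?P<section>\d+)\s*§"
    r"(?::n\s+(?P<subsection>\d+)\s+momentti)?"   # optional "§:n N momentti"
    r"(?:\s+seuraavasti:)?$"                      # optional closing formula
)


@dataclass(frozen=True)
class Clause:
    action: str
    section: int
    subsection: Optional[int]


def parse_clause(text: str) -> Optional[Clause]:
    """Return a typed clause, or None when the text is outside this toy grammar."""
    m = CLAUSE_RE.match(text.strip())
    if m is None:
        return None
    sub = m.group("subsection")
    return Clause(ACTIONS[m.group("verb")],
                  int(m.group("section")),
                  int(sub) if sub else None)
```

The function has no hidden state, so repeated calls on the same string return equal values. Text outside the grammar yields `None` rather than a guess, leaving the decision about recovery to the caller.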
Prior work already parsed amendment language with conventional tools. Arnold-Moore parsed Tasmanian amendments with ATN parsers in the 1990s. Ogawa et al. formalized Japanese amendment sentences. Italian and Polish researchers have extracted amendment metadata with rule-based methods. LawVM pushes the claim further: ordinary amendment streams can be compiled into auditable point-in-time legal state with typed operations, timelines, provenance, and classified residuals.
Where grammars break, LawVM falls back to quirks-mode recovery with explicit provenance. Where they succeed, the result is deterministic, inspectable, and reproducible.
No silent repairs
Historical legislation contains malformed XML, editorial shortcuts, implicit targets, stale consolidated witnesses, and source defects. LawVM cannot avoid recovery logic. The governing rule is that recovery must be owned.
If a rule deletes, moves, relabels, reroutes, widens, narrows, or otherwise changes legal state, it must have a stable rule id, emit a typed observation or finding, preserve before/after evidence, and be rejectable where strict mode requires it. A heuristic may be necessary; an invisible heuristic is a compiler bug.
This is the boundary between LawVM and ordinary scraping. The system is allowed to say "I recovered this because the source had this shape." It is not allowed to make the tree look cleaner without leaving an evidence trail.
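A sketch of what "owned" could look like for one recovery rule. The rule itself (collapsing a section label that the source repeats, as in `12 § 12 § ...`), its id `Q-DUP-LABEL`, and the type names are all invented for illustration. What the sketch takes from the text is the contract: stable rule id, typed observation, before/after evidence, and rejection under strict mode.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObservation:
    rule_id: str
    reason: str
    before: str
    after: str


class StrictModeViolation(Exception):
    pass


# Hypothetical source defect: the section label appears twice in a row.
DUP_LABEL = re.compile(r"^(\d+ §) \1 ")


def recover_duplicate_label(text: str, *, strict: bool, log: list) -> str:
    repaired = DUP_LABEL.sub(r"\1 ", text)
    if repaired == text:
        return text  # nothing to recover, nothing to record
    if strict:
        # Strict mode rejects the change instead of applying it.
        raise StrictModeViolation("Q-DUP-LABEL would change the text")
    log.append(RecoveryObservation(
        rule_id="Q-DUP-LABEL",
        reason="section label repeated in source",
        before=text,
        after=repaired,
    ))
    return repaired
```

The function cannot return changed text without either appending to `log` or raising, which is the property the rule above asks for.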
Strict mode and quirks mode
LawVM serves two worlds:
Quirks mode is for the historical corpus. Real legislative text is full of omitted context, editorial shortcuts, inconsistent numbering, source encoding oddities, and amendments that only make sense against a specific live consolidated witness. Quirks mode uses recovery heuristics — but marks every recovery path with provenance. It never pretends inferred structure was explicit in source.
Strict mode is for a future where law is authored to compile cleanly. Every amendment is structurally unambiguous, every target is explicitly addressable, every action is typed, every temporal effect is explicit. Strict mode forbids: target guessing, hidden insertion anchoring, fallback whole-section replacement, ambiguous omission expansion, silent date estimation.
The target publication model keeps law human-readable while also emitting canonical machine-readable state/change artifacts alongside the human text. Strict mode is the compilation target for that future. Quirks mode is the recovery compiler for the past.
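The two modes can be pictured as one policy function over the same set of recovery kinds. The enum members below are the five items strict mode forbids, as listed above; the `Mode`, `Verdict`, and `judge` names are assumptions made for this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    STRICT = "strict"
    QUIRKS = "quirks"


class Recovery(Enum):
    TARGET_GUESSING = "target guessing"
    HIDDEN_INSERTION_ANCHORING = "hidden insertion anchoring"
    FALLBACK_WHOLE_SECTION_REPLACEMENT = "fallback whole-section replacement"
    AMBIGUOUS_OMISSION_EXPANSION = "ambiguous omission expansion"
    SILENT_DATE_ESTIMATION = "silent date estimation"


@dataclass(frozen=True)
class Verdict:
    accepted: bool
    provenance: tuple = ()   # quirks mode: recoveries that were used and marked
    barriers: tuple = ()     # strict mode: recoveries that blocked compilation


def judge(mode: Mode, used: list) -> Verdict:
    labels = tuple(r.value for r in used)
    if mode is Mode.STRICT and labels:
        return Verdict(False, barriers=labels)
    return Verdict(True, provenance=labels)
```

The same amendment can therefore compile in quirks mode with a provenance trail and fail in strict mode with a list of barriers, without the two modes needing separate recovery code.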
Frontend / kernel boundary
The shared kernel is jurisdiction-agnostic: canonical legal-address and tree model, operation vocabulary, replay execution, timeline semantics, materialization, structural invariants.
Frontends are jurisdiction-local: source acquisition, parsing conventions, drafting idioms, payload extraction, elaboration rules, source pathology, oracle comparison.
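As a rough illustration of the boundary, the kernel below knows only about operations, addresses, and dates, while the frontend knows only how to turn one source format into operations. The pipe-delimited line format, `ToyFrontend`, `Op`, and `materialize` are all hypothetical; a real frontend would sit behind the parsing and elaboration phases described earlier.

```python
from dataclasses import dataclass
from datetime import date
from typing import Protocol


@dataclass(frozen=True)
class Op:
    action: str          # "replace" or "repeal" in this toy vocabulary
    address: tuple
    effective: date
    text: str = ""


class Frontend(Protocol):
    """Jurisdiction-local: turns source text into kernel operations."""
    def lower(self, source: str) -> list: ...


def materialize(base: dict, ops: list, as_of: date) -> dict:
    """Jurisdiction-agnostic kernel: replay ops in force on `as_of` over a copy of base."""
    state = dict(base)
    in_force = sorted((o for o in ops if o.effective <= as_of),
                      key=lambda o: o.effective)
    for op in in_force:
        if op.action == "repeal":
            state.pop(op.address, None)
        else:
            state[op.address] = op.text
    return state


class ToyFrontend:
    """Hypothetical source format: one op per line, 'action|section|YYYY-MM-DD|text'."""
    def lower(self, source: str) -> list:
        ops = []
        for line in source.strip().splitlines():
            action, section, day, text = line.split("|", 3)
            ops.append(Op(action, (section,), date.fromisoformat(day), text))
        return ops
```

Swapping in another jurisdiction would mean writing another class with a `lower` method; `materialize` would not change.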
The important design question is never "can we extract something useful?" It is: what is the smallest honest executable claim for this jurisdiction, and what source family makes that claim defensible?
Beyond Layer 0
LawVM is deliberately narrow. It computes what the legal text says at a point in time. It does not compute what the law means, how it is applied in practice, or what it costs. Those are higher layers:
| Layer | Question | Scope |
|---|---|---|
| L0: LawVM | What does the text say? | Text-state compilation, provenance, timelines |
| L1: Legal views | Which view to run? | Territorial, commencement, transitional overlays |
| L2: Interpretation | What do authorities say it means? | Court holdings, guidance, doctrine |
| L3: Praxis | How is it actually applied? | Enforcement, institutional behavior |
| L4: Reasoning | What follows for this fact pattern? | Compliance, simulation, argument |
| L5: Products | What can users do? | Search, Q&A, drafting assistants |
Upper layers attach claims to L0 anchors without mutating the text-state kernel. LawVM is designed as a substrate: stable identities, span-level anchoring, explicit provenance, overlay hooks.
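A small sketch of what attaching without mutating might look like. It assumes the L0 state is exposed as a read-only mapping from address to text, and that a claim points at a character span within one address; `Anchor`, `Claim`, `anchored_text`, and the sample sentence are illustrative only.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class Anchor:
    address: str   # stable identity of a provision in L0
    start: int     # span start (inclusive)
    end: int       # span end (exclusive)


@dataclass(frozen=True)
class Claim:
    layer: str     # e.g. "L2" for interpretation
    anchor: Anchor
    note: str


def anchored_text(l0, anchor: Anchor) -> str:
    """Resolve a claim's anchor against the L0 text state."""
    return l0[anchor.address][anchor.start:anchor.end]


# Read-only view: upper layers can read L0 but cannot assign into it.
_text = {"12 § 2 mom.": "The permit is valid for five years."}
L0 = MappingProxyType(_text)

_start = _text["12 § 2 mom."].index("five years")
claim = Claim("L2", Anchor("12 § 2 mom.", _start, _start + len("five years")),
              "hypothetical court holding on the validity period")
```

Claims live in their own collection and carry only an address and a span, so discarding or revising an upper layer leaves the L0 mapping untouched.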
Downstream examples: Lakikartta joins the legal graph to budget data (92k statutes, €500B in budget weights, PageRank/Katz/DebtRank centrality). MeV mechanism tests analyze whether government bills' mechanisms produce their stated goals. These are separate projects that demonstrate what becomes possible once L0 text-state compilation is reliable.