Parsing Amendment Language with Conventional Grammars
Conventional parsers still work when the drafting system is structured enough.
The claim
Legislative amendment language is not free-form natural language. It is a semi-formal instruction language embedded in ordinary human legal prose. In jurisdictions with mature drafting guidance, the amendment preamble and operative text follow conventions tight enough for conventional grammars to parse deterministically.
LawVM demonstrates this on the Finnish alpha corpus: hundreds of replay cases and thousands of amendment acts, parsed with PEG grammars into typed clause structures. No LLMs, no statistical models, no training data, and no rewrite of legislation into a programming language. The parser is deterministic: the same input always produces the same parse.
What prior work already exists
Parsing amendment language with conventional tools is not new. Real prior art exists:
- Arnold-Moore (1990s) parsed Tasmanian amendment legislation using ATN (augmented transition network) parsers as part of the EnAct system. This is the earliest serious work on automatic amendment processing.
- Ogawa, Inagaki, Toyama (2008) formalized Japanese amendment sentences with regular expressions and achieved high accuracy in a bounded corpus.
- Bolioli et al. presented a legislative grammar for explicit text amendment in Italian law.
- Spinosa et al. described NLP-based metadata extraction for Italian legal text consolidation covering repeal, substitution, and integration.
- Polish work presented at ICAIL 2021 framed amendment identification in statutory law as an automatic extraction task.
These systems show that amendment parsing and automated consolidation have a serious history. LawVM's claim is not that parsing amendment language is new. The stronger claim is that ordinary amendment streams can be compiled end-to-end into auditable legal text-state: deterministic parsing, typed target resolution, replay over provision timelines, source/oracle divergence typing, and open empirical evaluation on the Finnish corpus.
In short: parsing is table stakes. The contribution is an open replay compiler that carries parsed amendment instructions through typed operations, temporal materialization, and explicit residual evidence.
Why Finnish amendment language is unusually parseable
Finnish legislative drafting follows the Lainkirjoittajan opas (legislative drafting guide), which prescribes the structure of amendment preambles (johtolause) with considerable precision:
- Actions follow a prescribed order: kumotaan (repeal), muutetaan (amend), lisätään (insert)
- The preamble ends with seuraavasti: ("as follows:")
- Target addresses use hierarchical structural paths: lain 12 §:n 2 momentti
- Multi-part amendments list all affected provisions in the preamble before the operative text
- The operative text follows the preamble in section order
This is not accidental. Legislative drafting conventions evolved to make amendment meaning deterministic for human readers. The same properties that make amendments readable by lawyers make them parseable by grammars.
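The target-address convention alone illustrates how far a short pattern goes. The sketch below (a hypothetical pattern and field names, not LawVM's actual grammar) parses a structural path like "lain 12 §:n 2 momentti" into typed parts:

```python
import re

# Hypothetical pattern for a Finnish structural path such as
# "lain 12 §:n 2 momentti" ("subsection 2 of section 12 of the act").
# The section label may carry a letter suffix, as in "12 a §".
ADDRESS = re.compile(
    r"(?:lain\s+)?(?P<section>\d+\s*\w*)\s*§(?::n)?"
    r"(?:\s+(?P<momentti>\d+)\s+momentti)?"
)

def parse_address(text):
    """Extract a section number and optional subsection (momentti)."""
    m = ADDRESS.search(text)
    if not m:
        return None
    return {"section": m.group("section").strip(),
            "momentti": m.group("momentti")}

parse_address("lain 12 §:n 2 momentti")  # -> {"section": "12", "momentti": "2"}
```

The same pattern handles letter-suffixed sections such as "12 a §", where the momentti part is simply absent.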
The johtolause is, in effect, a domain-specific instruction language. It has a grammar. LawVM exploits that grammar with a PEG parser that extracts clause structure, target addresses, action types, and payload boundaries.
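The clause structure of a johtolause can be sketched with a few lines of Python. This is a deliberately minimal illustration, not LawVM's PEG grammar: it splits a preamble into (action, target-text) clauses using the fixed action verbs and the closing seuraavasti: as delimiters.

```python
import re

# The three action verbs the drafting guide prescribes, mapped to
# English operation names (illustrative mapping).
ACTIONS = {"kumotaan": "repeal", "muutetaan": "amend", "lisätään": "insert"}

# One clause: an action verb, then its target list, up to the next
# action verb or the closing "seuraavasti:".
CLAUSE = re.compile(
    r"(kumotaan|muutetaan|lisätään)\s+(.*?)"
    r"(?=kumotaan|muutetaan|lisätään|seuraavasti:)",
    re.DOTALL,
)

def parse_preamble(text):
    """Split an amendment preamble into (action, raw_target_text) pairs,
    trimming the connective "ja" between clauses."""
    return [(ACTIONS[verb], target.strip().removesuffix(" ja"))
            for verb, target in CLAUSE.findall(text)]

preamble = ("muutetaan lain 12 §:n 2 momentti ja "
            "lisätään lakiin uusi 12 a § seuraavasti:")
parse_preamble(preamble)
# -> [("amend", "lain 12 §:n 2 momentti"), ("insert", "lakiin uusi 12 a §")]
```

The real parser of course does much more (payload boundaries, nested target lists, edge cases), but the deterministic core is exactly this: a fixed verb set and a fixed closing marker.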
What LawVM adds beyond extraction
Parsing is the first step. The compiler pipeline continues:
- Clause surface — typed AST of amendment instructions
- Payload extraction — amendment body text isolated and normalized
- Elaboration — meaning recovery against the live statute state (what does "2 momentti" mean given the current structure of this section?)
- Canonical operations — typed operations (replace, repeal, insert, renumber, text-replace) with resolved targets
- Replay — operations applied to statute tree
- Timeline construction — provision versions organized temporally
- Materialization — point-in-time text produced
- Residual classification — divergences from oracle typed and explained
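The later stages can be sketched minimally. The types and the flat address map below are hypothetical simplifications (LawVM operates on a statute tree), but they show the shape of canonical operations, replay, and point-in-time materialization:

```python
import datetime
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Op:
    """A canonical amendment operation with a resolved target address."""
    kind: str                      # "replace" | "repeal" | "insert"
    target: str                    # resolved path, e.g. "12§/2mom"
    payload: Optional[str] = None  # new text, if any

def replay(statute: dict, ops: list) -> dict:
    """Apply canonical operations in order; never mutate the input state."""
    state = dict(statute)
    for op in ops:
        if op.kind in ("replace", "insert"):
            state[op.target] = op.payload
        elif op.kind == "repeal":
            state.pop(op.target, None)
        else:
            raise ValueError(f"unknown operation kind: {op.kind}")
    return state

v1 = {"12§/1mom": "Old first subsection.",
      "12§/2mom": "Old second subsection."}
v2 = replay(v1, [Op("replace", "12§/2mom", "New second subsection."),
                 Op("insert", "12a§/1mom", "Inserted new section.")])

# Timeline: effective date -> statute state. Materialization picks the
# latest version in force at the query date.
timeline = {datetime.date(2020, 1, 1): v1, datetime.date(2023, 6, 1): v2}

def materialize(timeline: dict, on: datetime.date) -> dict:
    return timeline[max(d for d in timeline if d <= on)]
```

Because replay is pure (each version is a new value), every historical version remains available for materialization and for comparing against the oracle text.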
The parse is necessary but not sufficient; the compiler pipeline and evidence surface are the contribution.
Where grammars break
Not all amendment text is cleanly parseable. LawVM encounters:
- Implicit scope inheritance — "3, 4, 6 ja 7 luku" where bare numbers inherit the trailing "luku" (chapter). A flat parser sees numbers, not chapters.
- Omission semantics — <hcontainer name="omission"/> in XML means different things in different contexts (unchanged prefix, unchanged suffix, unchanged middle).
- Flattened list items — numbered items encoded as sibling subsections instead of paragraph children.
- Content-only continuations — split paragraph text across XML elements without labels.
- Body-root fallbacks — generic preambles that don't name specific targets, forcing whole-section inference.
For these cases, LawVM uses quirks-mode recovery: heuristic normalization with explicit provenance. The architectural rule is that every recovery affecting legal text, structure, target resolution, or timeline selection must be named, witnessed, and rejectable in strict mode. Much of the engineering work in LawVM is turning historical ad hoc repairs into owned, typed recovery rules.
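A recovery rule of this kind can be sketched for the scope-inheritance case. The rule name and provenance fields below are hypothetical, but the pattern is the point: the heuristic returns its evidence alongside its output, so strict mode can reject it.

```python
import re

def expand_inherited_scope(phrase):
    """Expand a target list like "3, 4, 6 ja 7 luku" so bare numbers
    inherit the trailing unit word. Returns (targets, provenance) so the
    recovery is witnessed rather than silent. Illustrative sketch only."""
    m = re.fullmatch(r"([\d,\sja]+?)\s*(luku|§|momentti)", phrase.strip())
    if not m:
        return [phrase], None  # no recovery applicable
    numbers = re.findall(r"\d+", m.group(1))
    targets = [f"{n} {m.group(2)}" for n in numbers]
    provenance = {"rule": "scope-inheritance",
                  "input": phrase, "output": targets}
    return targets, provenance

expand_inherited_scope("3, 4, 6 ja 7 luku")
# -> (["3 luku", "4 luku", "6 luku", "7 luku"], {...provenance...})
```

In strict mode, a pipeline consuming this would refuse any parse whose provenance list is non-empty; in quirks mode it would accept the expansion and carry the provenance record forward as evidence.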
Why this matters now
The current moment in legal AI is dominated by LLM-based approaches. Large language models can do many things well, but multi-step amendment replay is not one of them. Prior et al. (NLLP 2025) showed that exact-match rates degrade badly on longer amendment chains, and that LLM consolidation outputs must remain drafts requiring expert review.
Deterministic parsing and symbolic replay are not competing with LLMs for the same task. They are solving a different problem: exact, reproducible, auditable text-state compilation where every step is inspectable and every failure is typed. That is infrastructure, not generation.
The fact that conventional parsers still work for this — that you do not need a neural model to extract the structure of a johtolause — is itself a finding worth stating plainly. Legislative amendment language was designed to be unambiguous. Conventional tools can exploit that design. The surprising thing is not that it works, but that so few systems have tried.