# Artifacts
Methodology, corpus, reproducibility.
## Benchmark methodology
LawVM benchmarks compare replay output against real-world publication surfaces. For Finland, that means comparing replayed point-in-time text against the Finlex editorial consolidation.
Two metrics:
- Levenshtein text distance: character-level normalized edit distance. Mean: 0.65%.
- Structural section error: section-level structural divergence. Mean: 4.25%.
Some divergences mean LawVM is right and Finlex is wrong. The residual taxonomy (15 root cause categories) classifies each mismatch so that evaluation is not a single number but a typed evidence surface.
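The text metric above can be sketched as a normalized edit distance. This minimal pure-Python implementation is illustrative only; the actual benchmark code and any text normalization it applies before comparison are not shown here and are assumptions:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def normalized_distance(replayed: str, oracle: str) -> float:
    """Character-level distance scaled to [0, 1]; 0.0065 would read as 0.65%."""
    if not replayed and not oracle:
        return 0.0
    return levenshtein(replayed, oracle) / max(len(replayed), len(oracle))
```

For statute-length texts a production implementation would use an optimized library rather than this O(len(a)·len(b)) sketch.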
## Corpus definition
690 statutes curated from 3,591 amended Finnish statutes. Curation criteria (all structural, no temporal filtering):
- Base statute XML exists in the archived source corpus
- XML is parseable and contains section structure
- Oracle consolidated XML exists with non-empty body
- All amendment texts available in the archive
- At least one amendment
Decade span: 1920s–2020s. Amendment counts per statute range from 1 to 238. The corpus is not hand-picked for success; it is curated for replayability. The curation script is `scripts/curate_corpus.py`.
## Current benchmark snapshot
| Metric | Value |
|---|---|
| Statutes | 690 |
| Mean Levenshtein distance | 0.65% |
| Mean structural error | 4.25% |
| Perfect text match | ~420 |
| Perfect structural match | 367 |
| ≥95% structural match | 490 |
| <90% structural match | 104 |
Run: 2026-04-16, mode: `finlex_oracle`.
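The snapshot figures above are aggregates over per-statute results. A sketch of how such a summary might be derived (the result field names are assumptions, not the benchmark's actual output format):

```python
from statistics import mean

def summarize(results: list) -> dict:
    """Aggregate per-statute results.

    Each result is assumed to be a dict with 'text_dist' and 'struct_err',
    both normalized to [0, 1].
    """
    return {
        "statutes": len(results),
        "mean_text_pct": 100 * mean(r["text_dist"] for r in results),
        "mean_struct_pct": 100 * mean(r["struct_err"] for r in results),
        "perfect_struct": sum(r["struct_err"] == 0 for r in results),
        "ge_95_struct": sum(r["struct_err"] <= 0.05 for r in results),
        "lt_90_struct": sum(r["struct_err"] > 0.10 for r in results),
    }
```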
## Golden dataset
77+ verified divergence entries (as of 2026-04-16, growing). Each entry documents statute ID, title, verdict, root cause, a prose summary in Finnish, and affected sections. Format: one YAML file per statute. Schema: `notes/verified_finlex_errors/README.md`.
Verdicts: `lawvm_ok` (Finlex is wrong), `mixed` (both have issues), `source_defect` (source material broken), `lawvm_bug` (LawVM is wrong).
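An entry might be validated against the field list and verdict set above. The exact YAML schema lives in `notes/verified_finlex_errors/README.md`, so the required-field names here are a sketch, not the canonical schema:

```python
VERDICTS = {"lawvm_ok", "mixed", "source_defect", "lawvm_bug"}

# Hypothetical field names mirroring the documented entry contents.
REQUIRED = {"statute_id", "title", "verdict", "root_cause",
            "summary_fi", "affected_sections"}

def validate_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry passes."""
    problems = [f"missing field: {k}" for k in sorted(REQUIRED - entry.keys())]
    if entry.get("verdict") not in VERDICTS:
        problems.append(f"unknown verdict: {entry.get('verdict')!r}")
    return problems
```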
## Reproducibility
```sh
uv sync
uv run lawvm bench --mode finlex_oracle --label reproduce
```
This replays all 690 statutes and reports metrics. It requires `data/finlex.farchive` (the Finnish source and oracle archive). Results depend on the archive contents at the time of the run, since the oracle consolidation surfaces change as Finlex editors update them; benchmarks against a frozen archive are stable.
The source archive (`finlex.farchive`) is built from Finlex open data batch downloads. The acquisition scripts and benchmark tooling are in the repository.
## Downloads
Artifact releases (Zenodo DOI-backed) are planned for:
- Frozen corpus snapshot
- Software release archive
- Golden dataset export
- Publication database (SQLite)
Status: Pending corpus freeze. Links will appear here when available.