Artifacts
Methodology, corpus, reproducibility.
Benchmark methodology
LawVM benchmarks replay output against real-world publication surfaces. For Finland, that means comparing replayed point-in-time text against the Finlex editorial consolidation.
Two metrics:
- Levenshtein text distance: character-level normalized edit distance. Mean: 0.65%.
- Structural section error: section-level structural divergence. Mean: 4.25%.
Some divergences become high-confidence candidate findings when primary sources support LawVM over the Finlex consolidation. The residual taxonomy classifies each mismatch so that evaluation is not a single number but a typed evidence surface.
Corpus definition
The current Finnish alpha corpus is curated from a larger set of amended Finnish statutes. Curation criteria are structural, not success-based:
- Base statute XML exists in the archived source corpus
- XML is parseable and contains section structure
- Oracle consolidated XML exists with non-empty body
- All amendment texts are available in the archive
- The statute has at least one amendment
The corpus spans the 1920s to the 2020s, with between 1 and 238 amendments per statute. Curation selects for replayability, not for replay success. The curation script is scripts/curate_corpus.py.
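The criteria above can be read as a structural predicate over each candidate statute. The following is a sketch only, not the contents of scripts/curate_corpus.py: the `Candidate` record, the `section` and `body` element names, and the function names are all hypothetical stand-ins for whatever the archive actually uses.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional

SECTION_TAG = "section"  # assumed element name for a statute section
BODY_TAG = "body"        # assumed element name for the consolidated body


@dataclass
class Candidate:
    """Hypothetical record of what the archive holds for one statute."""
    base_xml: Optional[str]
    oracle_xml: Optional[str]
    amendment_texts: List[Optional[str]] = field(default_factory=list)


def parse(xml: Optional[str]):
    """Return the parsed root element, or None if missing or unparseable."""
    if not xml:
        return None
    try:
        return ET.fromstring(xml)
    except ET.ParseError:
        return None


def find_all(root, name: str):
    """Find elements by local name, ignoring any XML namespace prefix."""
    return [el for el in root.iter()
            if isinstance(el.tag, str) and el.tag.rsplit("}", 1)[-1] == name]


def is_replayable(c: Candidate) -> bool:
    base = parse(c.base_xml)
    if base is None or not find_all(base, SECTION_TAG):
        return False  # base XML missing, unparseable, or without sections
    oracle = parse(c.oracle_xml)
    if oracle is None:
        return False  # no oracle consolidation to compare against
    bodies = find_all(oracle, BODY_TAG)
    if not any("".join(b.itertext()).strip() for b in bodies):
        return False  # oracle body is absent or empty
    if not c.amendment_texts:
        return False  # at least one amendment is required
    return all(c.amendment_texts)  # every amendment text must be present
```

Nothing in this predicate looks at replay results, which is the sense in which the criteria are structural rather than success-based.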
Current benchmark snapshot
| Metric | Value |
|---|---|
| Corpus | Finnish alpha corpus |
| Mean Levenshtein distance | 0.65% |
| Mean structural error | 4.25% |
| Perfect text match | ~420 |
| Perfect structural match | 367 |
| ≥95% structural match | 490 |
| <90% structural match | 104 |
Benchmark snapshot: 2026-04-16, mode: finlex_oracle. Figures are provisional and tied to the frozen source/oracle archive.
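As an illustration of how the threshold rows in such a snapshot could be derived from per-statute results, here is a small sketch. It assumes each statute has a structural error expressed as a fraction between 0 and 1, and it reads "≥95% structural" as an error of at most 5%. The function name, the dictionary keys, and that interpretation are assumptions rather than the benchmark tool's actual output format.

```python
from statistics import mean
from typing import Dict, List


def summarize(structural_errors: List[float]) -> Dict[str, float]:
    """Aggregate per-statute structural errors (fractions in [0, 1])."""
    return {
        "mean_structural_error": mean(structural_errors),
        # Zero error means the section structure matched exactly.
        "perfect_structural_match": sum(e == 0.0 for e in structural_errors),
        # Includes the perfect matches, so the rows are not disjoint buckets.
        "match_at_least_95": sum(e <= 0.05 for e in structural_errors),
        "match_below_90": sum(e > 0.10 for e in structural_errors),
    }
```

Note that under this reading the rows overlap (a perfect match also counts toward the 95% row) and statutes between 90% and 95% appear in none of them, so the counts are not expected to sum to the corpus size.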
Golden dataset
The v0.1 alpha evidence exposes hundreds of replay-vs-Finlex divergences for triage. A subset of 22 high-confidence meaningful candidate findings has been reported to Finlex. These remain candidate findings pending confirmation by Finlex or another competent authority.
Internal entries document statute ID, title, verdict, root cause, Finnish prose summary, affected sections, and source evidence. The Finnish evidence viewer exists in the repository but is not yet linked from the public website. Public DOI-backed exports are planned but not yet published.
Verdicts: lawvm_ok (Finlex is wrong), mixed (both have issues), source_defect (source material broken), lawvm_bug (LawVM is wrong).
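One way to picture a golden dataset entry is as a typed record with a closed set of verdicts. This is a hypothetical sketch: the four verdict strings come from the list above, but the class names, field names, and field types are assumptions and do not describe the internal storage format.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Tuple


class Verdict(str, Enum):
    LAWVM_OK = "lawvm_ok"            # Finlex is wrong
    MIXED = "mixed"                  # both have issues
    SOURCE_DEFECT = "source_defect"  # source material broken
    LAWVM_BUG = "lawvm_bug"          # LawVM is wrong


@dataclass(frozen=True)
class GoldenEntry:
    """Hypothetical shape of one triaged divergence."""
    statute_id: str
    title: str
    verdict: Verdict
    root_cause: str
    summary_fi: str                          # Finnish prose summary
    affected_sections: Tuple[str, ...] = ()
    source_evidence: Tuple[str, ...] = ()


def count_by_verdict(entries: Iterable[GoldenEntry]) -> Counter:
    """Tally entries per verdict, e.g. for a triage overview."""
    return Counter(entry.verdict for entry in entries)
```

Constructing a verdict from a raw string, as in `Verdict("lawvm_ok")`, raises `ValueError` for any value outside the four listed, which keeps the taxonomy closed.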
Reproducibility
```shell
uv sync
uv run lawvm bench --mode finlex_oracle --label reproduce
```
This replays the Finnish alpha corpus and reports metrics. It requires data/finlex.farchive (the Finnish source and oracle archive). Results depend on the archive contents at the time of the run, because oracle consolidation surfaces change as Finlex editors update them. Benchmarks run against a frozen archive are stable.
The source archive (finlex.farchive) is built from Finlex open data batch downloads. The acquisition scripts and benchmark tooling are in the repository.
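Because results depend on the archive contents, it can help to record a fingerprint of the archive file alongside each benchmark run. The sketch below computes a SHA-256 digest with the Python standard library. It is not part of the LawVM tooling, and the function name is hypothetical.

```python
import hashlib
from pathlib import Path


def archive_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `archive_fingerprint("data/finlex.farchive")` before a run and storing the digest with the reported metrics would let two runs be compared only when they used byte-identical archives.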
Downloads
Artifact releases (Zenodo DOI-backed) are planned for:
- Source and output snapshot for independent evaluation
- Software release archive
- Golden dataset export
- Publication database (SQLite)
Status: Pending v0.1 artifact packaging. Links will appear here when available.