How to Read LawVM Benchmark Numbers
What the numbers say, what they don't, and why a single score is not enough.
What the benchmark measures
LawVM's Finland benchmark compares replayed point-in-time text against the Finlex editorial consolidation. Two metrics:
- Levenshtein text distance — character-level normalized edit distance. Current mean: 0.65%.
- Structural section error — section-level divergence (missing sections, extra sections, content mismatches). Current mean: 4.25%.
The comparison surface is Finlex's consolidated XML, accessed through the archived source corpus.
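The two metrics above can be sketched roughly as follows. The normalization choices here are assumptions for illustration, not necessarily LawVM's exact definitions: edit distance is divided by the longer string's length, and section error is counted over the union of section identifiers.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def text_distance(replayed: str, oracle: str) -> float:
    """Normalized edit distance: 0.0 = identical, 1.0 = fully different."""
    if not replayed and not oracle:
        return 0.0
    return levenshtein(replayed, oracle) / max(len(replayed), len(oracle))

def section_error(replayed: dict[str, str], oracle: dict[str, str]) -> float:
    """Section-level divergence over the union of section identifiers:
    missing, extra, and content-mismatched sections all count."""
    ids = replayed.keys() | oracle.keys()
    bad = sum(1 for s in ids if replayed.get(s) != oracle.get(s))
    return bad / len(ids) if ids else 0.0
```

Under these definitions a statute with one mismatched and one missing section out of three oracle sections scores a section error of 2/3, regardless of how small the character-level difference is, which is why the two numbers are reported separately.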
What the benchmark does NOT measure
Legal correctness. The benchmark measures agreement with an editorial consolidation, not agreement with the law. Finlex consolidated texts are informational, not legally binding. When LawVM and Finlex disagree, it is not obvious who is right without checking primary sources.
Completeness. The Finnish alpha corpus is a replayable subset of a larger amended-statute universe. Exclusions are structural (missing base XML, missing oracle, missing amendment texts), not a judgment that excluded statutes are "too hard."
All amendment types. The benchmark covers only statutes with section structure. Statutes that are primarily hcontainer-only (tables, schedules, unstructured content) are excluded from the curated corpus.
High similarity does not mean correct
A statute at 100% similarity means LawVM and Finlex produce identical text. That text could still be wrong — if both systems fail the same way (e.g., both miss a corrigendum, or both reproduce a source defect).
LawVM's corrigendum pipeline illustrates this: applying published corrections to 44 statutes makes LawVM more legally accurate, but lowers the similarity score, because Finlex did not apply those corrections either.
Low similarity does not mean wrong
A statute at 60% similarity could mean:
- LawVM has a serious replay bug (system is wrong)
- Finlex never applied several amendments (oracle is stale)
- The source XML is structurally corrupted (nobody is right)
- Finlex editors restructured the content for readability (editorial divergence)
- A combination of the above
The similarity number points to where to investigate. The residual taxonomy says what you find when you do.
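One way to keep those competing hypotheses apart during review is to record each investigated divergence against an explicit taxonomy. A minimal sketch, assuming a simple record per statute; the category names and fields below are illustrative, not LawVM's actual labels:

```python
from dataclasses import dataclass
from enum import Enum

class Residual(Enum):
    SYSTEM_BUG = "replay bug in LawVM"
    STALE_ORACLE = "Finlex missing amendments"
    SOURCE_DEFECT = "corrupted source XML"
    EDITORIAL = "Finlex restructured for readability"

@dataclass
class Finding:
    statute_id: str
    similarity: float        # e.g. 0.60 for a statute at 60% similarity
    causes: list[Residual]   # one divergence can have several causes
    note: str                # pointer to the primary-source check
```

A record like this is what turns a raw similarity number into a classified finding: the number triggered the investigation, the `causes` list is what the investigation actually found.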
The right way to read the numbers
The benchmark is a triage instrument. It tells you:
- Where the largest divergences are
- Which statutes deserve investigation next
- Whether a code change improves or worsens the comparison surface
- Whether a known divergence class is shrinking or growing
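The triage reading can be operationalized as a simple ranking: sort statutes by their divergence and review from the top. The metric layout below (statute id mapped to a text-distance/section-error pair) is an assumption for illustration:

```python
def triage_queue(metrics: dict[str, tuple[float, float]], top: int = 5) -> list[str]:
    """Rank statutes for manual residual review: largest structural
    section error first, ties broken by text distance.
    `metrics` maps statute id -> (text_distance, section_error)."""
    return sorted(metrics,
                  key=lambda s: (metrics[s][1], metrics[s][0]),
                  reverse=True)[:top]
```

Running the same ranking on two benchmark runs shows at a glance whether a code change shrank or grew a known divergence class.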
The real verification loop is manual residual review: investigating each divergence against primary sources (Säädöskokoelma), classifying the root cause, and recording the finding in the golden dataset.
That is why the project maintains both layers: aggregate metrics for triage, and a golden dataset holding the 22 reported high-confidence candidate findings alongside hundreds of divergences still being classified.