How to Read LawVM Benchmark Numbers
What the numbers say, what they don't, and why a single score is not enough.
What the benchmark measures
LawVM's Finland benchmark compares replayed point-in-time text against the Finlex editorial consolidation. Two metrics:
- Levenshtein text distance — character-level normalized edit distance. Current mean: 0.65%.
- Structural section error — section-level divergence (missing sections, extra sections, content mismatches). Current mean: 4.25%.
The comparison surface is Finlex's consolidated XML, accessed through the archived source corpus.
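The two metrics above can be sketched roughly as follows. The normalization choices here are assumptions for illustration, not necessarily LawVM's exact definitions: edit distance is divided by the longer string's length, and section error is counted over the union of section identifiers.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def text_distance(replayed: str, oracle: str) -> float:
    """Normalized edit distance: 0.0 = identical, 1.0 = fully different."""
    if not replayed and not oracle:
        return 0.0
    return levenshtein(replayed, oracle) / max(len(replayed), len(oracle))

def section_error(replayed: dict[str, str], oracle: dict[str, str]) -> float:
    """Section-level divergence over the union of section identifiers:
    missing, extra, and content-mismatched sections all count."""
    ids = replayed.keys() | oracle.keys()
    bad = sum(1 for s in ids if replayed.get(s) != oracle.get(s))
    return bad / len(ids) if ids else 0.0
```

Under these definitions a statute with one mismatched and one missing section out of three oracle sections scores a section error of 2/3, regardless of how small the character-level difference is, which is why the two numbers are reported separately.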
What the benchmark does NOT measure
Legal correctness. The benchmark measures agreement with an editorial consolidation, not agreement with the law. Finlex consolidated texts are informational, not legally binding. When LawVM and Finlex disagree, it is not obvious who is right without checking primary sources.
Completeness. The Finnish alpha corpus is a replayable subset of a larger amended-statute universe. Exclusions are structural (missing base XML, missing oracle, missing amendment texts), not a judgment that excluded statutes are "too hard."
All amendment types. The benchmark covers only statutes with section structure. Statutes that are primarily hcontainer-only (tables, schedules, unstructured content) are excluded from the curated corpus.
High similarity does not mean correct
A statute at 100% similarity means LawVM and Finlex produce identical text. That text could still be wrong — if both systems fail the same way (e.g., both miss a corrigendum, or both reproduce a source defect).
LawVM's corrigendum pipeline illustrates this: applying published corrections to 44 statutes makes LawVM more legally accurate, but lowers the similarity score, because Finlex did not apply those corrections either.
Low similarity does not mean wrong
A statute at 60% similarity could mean:
- LawVM has a serious replay bug (system is wrong)
- Finlex never applied several amendments (oracle is stale)
- The source XML is structurally corrupted (nobody is right)
- Finlex editors restructured the content for readability (editorial divergence)
- A combination of the above
The similarity number points to where to investigate. The residual taxonomy says what you find when you do.
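One way to keep those competing hypotheses apart during review is to record each investigated divergence against an explicit taxonomy. A minimal sketch, assuming a simple record per statute; the category names and fields below are illustrative, not LawVM's actual labels:

```python
from dataclasses import dataclass
from enum import Enum

class Residual(Enum):
    SYSTEM_BUG = "replay bug in LawVM"
    STALE_ORACLE = "Finlex missing amendments"
    SOURCE_DEFECT = "corrupted source XML"
    EDITORIAL = "Finlex restructured for readability"

@dataclass
class Finding:
    statute_id: str
    similarity: float        # e.g. 0.60 for a statute at 60% similarity
    causes: list[Residual]   # one divergence can have several causes
    note: str                # pointer to the primary-source check
```

A record like this is what turns a raw similarity number into a classified finding: the number triggered the investigation, the `causes` list is what the investigation actually found.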
The right way to read the numbers
The benchmark is a triage instrument. It tells you:
- Where the largest divergences are
- Which statutes deserve investigation next
- Whether a code change improves or worsens the comparison surface
- Whether a known divergence class is shrinking or growing
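The triage reading can be operationalized as a simple ranking: sort statutes by their divergence and review from the top. The metric layout below (statute id mapped to a text-distance/section-error pair) is an assumption for illustration:

```python
def triage_queue(metrics: dict[str, tuple[float, float]], top: int = 5) -> list[str]:
    """Rank statutes for manual residual review: largest structural
    section error first, ties broken by text distance.
    `metrics` maps statute id -> (text_distance, section_error)."""
    return sorted(metrics,
                  key=lambda s: (metrics[s][1], metrics[s][0]),
                  reverse=True)[:top]
```

Running the same ranking on two benchmark runs shows at a glance whether a code change shrank or grew a known divergence class.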
The real verification loop is manual residual review: investigating each divergence against primary sources (Säädöskokoelma), classifying the root cause, and recording the finding in the golden dataset.
That is why the project maintains both layers: aggregate metrics for triage, and a golden dataset holding the 22 reported high-confidence candidate findings alongside hundreds of divergences still being classified.