Benchmark methodology

How the 1,238 number is produced.

Full transparency on the headline figures: what the grid is, how decisions are counted, and the exact command to reproduce them. The number measures verdict-space coverage of the deterministic contracts — it is not a production drift rate.

Honest scope. This is a synthetic decision-space grid that enumerates realistic evidence combinations per claim family to confirm every contract is deterministic and exercises its full verdict space. It is not a measurement of how often real claims drift, and no production pilot has run yet. Real rates require an independently adjudicated corpus.
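To make the idea of a decision-space grid concrete, here is a minimal Python sketch. Everything in it is hypothetical: the evidence axes, the `toy_contract` rules, and the `enumerate_grid` helper are invented for illustration and are not taken from the real contracts or from `benchmarks/family_decision_mix.py`. It only shows the general pattern of enumerating evidence combinations, evaluating a pure function on each, and counting the verdicts reached.

```python
from collections import Counter
from itertools import product

# Hypothetical evidence axes; the real grid's fields are not listed on this page.
EVIDENCE_AXES = {
    "has_source": (True, False),
    "sample_size_ok": (True, False),
    "sources_conflict": (True, False),
}


def toy_contract(has_source, sample_size_ok, sources_conflict):
    """Hypothetical pure function from evidence to verdict (not a real contract)."""
    if not has_source:
        return "REJECT"
    if sources_conflict:
        return "HOLD"
    if not sample_size_ok:
        return "REWRITE"
    return "ACCEPT"


def enumerate_grid(contract, axes):
    """Evaluate the contract on every combination of axis values and count verdicts."""
    names = list(axes)
    mix = Counter()
    for values in product(*(axes[name] for name in names)):
        evidence = dict(zip(names, values))
        verdict = contract(**evidence)
        # Determinism check: evaluating the same evidence again must agree.
        assert contract(**evidence) == verdict
        mix[verdict] += 1
    return mix


mix = enumerate_grid(toy_contract, EVIDENCE_AXES)
print(dict(mix), "verdicts reached:", len(mix))
```

In this toy grid half of the combinations lack a source and are rejected, which mirrors in miniature why an exhaustive grid skews toward REJECT without saying anything about real claims.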

The headline figures (reproduced)

1,238 total decisions
12 claim families
78.4% gated (REWRITE + REJECT)
12 / 12 families reach ≥2 verdicts

Overall mix: ACCEPT 217 (17.5%) · REWRITE 126 (10.2%) · REJECT 845 (68.3%) · HOLD 50 (4.0%). The grid is adversarial by construction (most evidence combinations are deliberately under-supported), so a high REJECT share is expected and intended. It shows that the contracts are strict, not that 68% of claims in a real corpus would be rejected.
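The percentages follow directly from the four counts. As a quick arithmetic check (a sketch using only the numbers quoted above; the `share` helper is an illustrative name, not part of the benchmark script):

```python
# Overall verdict counts as quoted on this page.
overall = {"ACCEPT": 217, "REWRITE": 126, "REJECT": 845, "HOLD": 50}
total = sum(overall.values())


def share(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {verdict: share(n, total) for verdict, n in overall.items()}
# "Gated" means the claim did not pass unchanged: REWRITE plus REJECT.
gated = share(overall["REWRITE"] + overall["REJECT"], total)

print(total, shares, gated)
```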

Decision mix per claim family

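| Claim family | N | ACCEPT | REWRITE | REJECT | HOLD | Verdicts reached |
|---|---|---|---|---|---|---|
| statistical_confidence | 543 | 74 | 74 | 392 | 3 | 4 / 4 |
| financial_metric_claim | 364 | 29 | 29 | 302 | 4 | 4 / 4 |
| exact_model_solution | 182 | 104 | 0 | 76 | 2 | 3 / 4 |
| universal_anchor_claim | 41 | 1 | 5 | 22 | 13 | 4 / 4 |
| causal_mechanism_claim | 20 | 1 | 3 | 12 | 4 | 4 / 4 |
| evidence_conflict_claim | 20 | 1 | 1 | 14 | 4 | 4 / 4 |
| multimodal_evidence_claim | 20 | 2 | 6 | 8 | 4 | 4 / 4 |
| systematic_review_claim | 20 | 1 | 3 | 12 | 4 | 4 / 4 |
| programming_language_behavior_claim | 16 | 1 | 3 | 4 | 8 | 4 / 4 |
| reproducibility_check | 6 | 1 | 1 | 2 | 2 | 4 / 4 |
| claim_transition | 3 | 1 | 1 | 0 | 1 | 3 / 4 |
| physical_accuracy | 3 | 1 | 0 | 1 | 1 | 3 / 4 |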

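Three families (exact_model_solution, claim_transition, physical_accuracy) have contracts that cannot reach all four verdicts by design; for example, a binary exactness check has no REWRITE state. In the table, exact_model_solution and physical_accuracy record no REWRITE decisions and claim_transition records no REJECT decisions. That is disclosed, not hidden.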
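The per-family counts in the table above can be cross-checked against the headline figures. The following is a minimal sketch, not the benchmark script itself: the `ROWS` literal is transcribed from the table and `check_rows` is an illustrative helper. It confirms that each row sums to its N, that the columns add up to the overall mix, and that exactly three families reach three verdicts.

```python
VERDICTS = ("ACCEPT", "REWRITE", "REJECT", "HOLD")

# (family, N, ACCEPT, REWRITE, REJECT, HOLD), transcribed from the table.
ROWS = [
    ("statistical_confidence", 543, 74, 74, 392, 3),
    ("financial_metric_claim", 364, 29, 29, 302, 4),
    ("exact_model_solution", 182, 104, 0, 76, 2),
    ("universal_anchor_claim", 41, 1, 5, 22, 13),
    ("causal_mechanism_claim", 20, 1, 3, 12, 4),
    ("evidence_conflict_claim", 20, 1, 1, 14, 4),
    ("multimodal_evidence_claim", 20, 2, 6, 8, 4),
    ("systematic_review_claim", 20, 1, 3, 12, 4),
    ("programming_language_behavior_claim", 16, 1, 3, 4, 8),
    ("reproducibility_check", 6, 1, 1, 2, 2),
    ("claim_transition", 3, 1, 1, 0, 1),
    ("physical_accuracy", 3, 1, 0, 1, 1),
]


def check_rows(rows):
    """Return (column totals, verdicts reached per family); assert each row sums to N."""
    totals = dict.fromkeys(VERDICTS, 0)
    reached = {}
    for family, n, *counts in rows:
        assert sum(counts) == n, family
        for verdict, count in zip(VERDICTS, counts):
            totals[verdict] += count
        reached[family] = sum(1 for count in counts if count > 0)
    return totals, reached


totals, reached = check_rows(ROWS)
partial = sorted(f for f, k in reached.items() if k < len(VERDICTS))
print(totals, partial)
```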

Reproduce it yourself

Every figure on this page comes from one deterministic script. Same inputs → same counts, on any machine:

$ python3 benchmarks/family_decision_mix.py
# -> outputs/family_decision_mix.json + .md
family_decision_mix: N=1238 across 12 families
overall: {'ACCEPT': 217, 'REWRITE': 126, 'REJECT': 845, 'HOLD': 50}
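One way to test the "same inputs, same counts" claim locally is to run the script more than once and compare what it writes. The sketch below is an assumption-laden helper, not part of the repository: `run_and_load` and `is_reproducible` are illustrative names, and it assumes the JSON output contains no run-specific fields such as timestamps. If it does, compare only the count fields instead.

```python
import json
import subprocess
import sys
from pathlib import Path


def run_and_load(cmd, output_path):
    """Run cmd, then parse and return the JSON file it wrote."""
    subprocess.run(cmd, check=True)
    return json.loads(Path(output_path).read_text(encoding="utf-8"))


def is_reproducible(cmd, output_path, runs=2):
    """True if every run leaves equal parsed JSON at output_path."""
    first = run_and_load(cmd, output_path)
    return all(run_and_load(cmd, output_path) == first for _ in range(runs - 1))
```

From the repository root, calling `is_reproducible([sys.executable, "benchmarks/family_decision_mix.py"], "outputs/family_decision_mix.json")` should return `True` if the script behaves as described here.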
Provenance · schema capas-claim-payload-v3 · engine UI v13 · grid generator benchmarks/family_decision_mix.py · these figures regenerate in CI on every commit. The synthetic grid validates the contracts' logic; it does not represent fraud detection from raw paper text, nor a real-world drift rate. Production figures will be published only after an independently adjudicated pilot, with false-reject and false-accept rates measured against domain experts.