Benchmark methodology

How the 1,238 number is produced.

Full transparency on the headline figures: what the grid is, how decisions are counted, and the exact command to reproduce them. The number measures verdict-space coverage of the deterministic contracts — it is not a production drift rate.

Honest scope. This is a synthetic decision-space grid that enumerates realistic evidence combinations per claim family to confirm every contract is deterministic and exercises its full verdict space. It is not a measurement of how often real claims drift, and no production pilot has run yet. Real rates require an independently adjudicated corpus.
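As a rough illustration of what "enumerating evidence combinations" against a deterministic contract looks like, here is a minimal sketch. The evidence dimensions (has_source, has_interval, sources_agree) and the rules in toy_contract are hypothetical, invented for this example; they are not taken from benchmarks/family_decision_mix.py. Only the four verdict names come from this page.

```python
from collections import Counter
from itertools import product

VERDICTS = ("ACCEPT", "REWRITE", "REJECT", "HOLD")

# Hypothetical evidence dimensions for one toy claim family.
GRID = {
    "has_source": (False, True),
    "has_interval": (False, True),
    "sources_agree": (False, True),
}


def toy_contract(has_source, has_interval, sources_agree):
    """Hypothetical contract: a pure function, so same evidence gives same verdict."""
    if not has_source:
        return "REJECT"
    if not sources_agree:
        return "HOLD"
    if not has_interval:
        return "REWRITE"
    return "ACCEPT"


def run_grid():
    """Evaluate the contract on every combination in the grid and count verdicts."""
    keys = list(GRID)
    counts = Counter()
    for combo in product(*(GRID[k] for k in keys)):
        counts[toy_contract(**dict(zip(keys, combo)))] += 1
    return counts


first, second = run_grid(), run_grid()
assert first == second  # determinism: two passes give identical counts
reached = sum(1 for v in VERDICTS if first[v] > 0)
print(dict(first), f"{reached} / {len(VERDICTS)} verdicts reached")
```

In this toy grid half of the eight combinations lack a source, so REJECT dominates. That is the same effect, at small scale, as the adversarial construction described on this page.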

The headline figures (reproduced)

1,238 total decisions

12 claim families

78.4% gated (REWRITE + REJECT)

12 / 12 families reach ≥2 verdicts

Overall mix: ACCEPT 217 (17.5%) · REWRITE 126 (10.2%) · REJECT 845 (68.3%) · HOLD 50 (4.0%). The grid is adversarial by construction (most evidence combinations are deliberately under-supported), so a high REJECT share is expected and intended — it shows the contracts are strict, not that real corpora reject 68%.
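The percentages and the gated share follow directly from the four counts. A short check, using only the numbers quoted on this page (the variable names are illustrative):

```python
# Overall verdict counts as reported on this page.
overall = {"ACCEPT": 217, "REWRITE": 126, "REJECT": 845, "HOLD": 50}

total = sum(overall.values())

# Share of each verdict, in percent, rounded to one decimal place.
shares = {verdict: round(100 * n / total, 1) for verdict, n in overall.items()}

# "Gated" means the claim did not pass unchanged: REWRITE plus REJECT.
gated = overall["REWRITE"] + overall["REJECT"]
gated_share = round(100 * gated / total, 1)

print(total, shares, gated, gated_share)
```

This gives a total of 1238, shares of 17.5, 10.2, 68.3 and 4.0 percent, and 971 gated decisions (78.4%).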

Decision mix per claim family

Claim family	N	ACCEPT	REWRITE	REJECT	HOLD	Verdicts reached
statistical_confidence	543	74	74	392	3	4 / 4
financial_metric_claim	364	29	29	302	4	4 / 4
exact_model_solution	182	104	0	76	2	3 / 4
universal_anchor_claim	41	1	5	22	13	4 / 4
causal_mechanism_claim	20	1	3	12	4	4 / 4
evidence_conflict_claim	20	1	1	14	4	4 / 4
multimodal_evidence_claim	20	2	6	8	4	4 / 4
systematic_review_claim	20	1	3	12	4	4 / 4
programming_language_behavior_claim	16	1	3	4	8	4 / 4
reproducibility_check	6	1	1	2	2	4 / 4
claim_transition	3	1	1	0	1	3 / 4
physical_accuracy	3	1	0	1	1	3 / 4

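Three families (exact_model_solution, claim_transition, physical_accuracy) have a contract that cannot reach all four verdicts by design. For example, a binary exactness check has no REWRITE state. In the grid, exact_model_solution and physical_accuracy record no REWRITE, and claim_transition records no REJECT. That is disclosed, not hidden.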
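The coverage column and the overall mix can both be recomputed from the per-family table. The sketch below copies the table rows into a list (the names ROWS, missing and totals are illustrative) and checks that each row sums to its N, which verdicts each family never reaches, and what the column totals are.

```python
VERDICTS = ("ACCEPT", "REWRITE", "REJECT", "HOLD")

# (family, N, ACCEPT, REWRITE, REJECT, HOLD), copied from the table on this page.
ROWS = [
    ("statistical_confidence", 543, 74, 74, 392, 3),
    ("financial_metric_claim", 364, 29, 29, 302, 4),
    ("exact_model_solution", 182, 104, 0, 76, 2),
    ("universal_anchor_claim", 41, 1, 5, 22, 13),
    ("causal_mechanism_claim", 20, 1, 3, 12, 4),
    ("evidence_conflict_claim", 20, 1, 1, 14, 4),
    ("multimodal_evidence_claim", 20, 2, 6, 8, 4),
    ("systematic_review_claim", 20, 1, 3, 12, 4),
    ("programming_language_behavior_claim", 16, 1, 3, 4, 8),
    ("reproducibility_check", 6, 1, 1, 2, 2),
    ("claim_transition", 3, 1, 1, 0, 1),
    ("physical_accuracy", 3, 1, 0, 1, 1),
]

missing = {}
for family, n, *counts in ROWS:
    # Each row's four verdict counts must add up to its N.
    assert sum(counts) == n, family
    absent = [v for v, c in zip(VERDICTS, counts) if c == 0]
    if absent:
        missing[family] = absent

# Column totals across all families.
totals = {v: sum(row[2 + i] for row in ROWS) for i, v in enumerate(VERDICTS)}

print(missing)
print(totals, sum(totals.values()))
```

The families listed in missing are the three with "3 / 4" in the table, and the column totals match the overall mix of 217, 126, 845 and 50 (1,238 in total).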

Reproduce it yourself

Every figure on this page comes from one deterministic script. Same inputs → same counts, on any machine:

$ python3 benchmarks/family_decision_mix.py
# -> outputs/family_decision_mix.json + .md
family_decision_mix: N=1238 across 12 families
overall: {'ACCEPT': 217, 'REWRITE': 126, 'REJECT': 845, 'HOLD': 50}
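One way to check the "same inputs, same counts" claim locally is to run the script more than once and compare a hash of the output file. This is a sketch under two assumptions: the script writes outputs/family_decision_mix.json relative to the current directory (as the comment in the transcript suggests), and the JSON contains no timestamps or other run-specific fields. The helper names are illustrative and not part of the repository.

```python
import hashlib
import subprocess
import sys
from pathlib import Path

# Command and output path as shown on this page; sys.executable stands in
# for "python3" so the current interpreter is used.
REPRO_CMD = [sys.executable, "benchmarks/family_decision_mix.py"]
REPRO_OUTPUT = "outputs/family_decision_mix.json"


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_deterministic(cmd, output_path, runs=2):
    """Run cmd several times; True if output_path is byte-identical each time."""
    digests = set()
    for _ in range(runs):
        subprocess.run(cmd, check=True)  # raises if the command fails
        digests.add(sha256_of(output_path))
    return len(digests) == 1
```

From the repository root, calling is_deterministic(REPRO_CMD, REPRO_OUTPUT) should return True if the output is byte-stable. A False result would only show that the file bytes differ (for example through key ordering or an embedded date), not necessarily that the counts differ.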

Provenance · schema capas-claim-payload-v3 · engine UI v13 · grid generator benchmarks/family_decision_mix.py · these figures regenerate in CI on every commit. The synthetic grid validates the contracts' logic; it does not represent fraud detection from raw paper text, nor a real-world drift rate. Production figures will be published only after an independently adjudicated pilot, with false-reject and false-accept rates measured against domain experts.