Full transparency on the headline figures: what the grid is, how decisions are counted, and the exact command to reproduce them. The figures measure verdict-space coverage of the deterministic contracts; they are not a production drift rate.
Overall mix (N = 1238): ACCEPT 217 (17.5%) · REWRITE 126 (10.2%) · REJECT 845 (68.3%) · HOLD 50 (4.0%). The grid is adversarial by construction (most evidence combinations are deliberately under-supported), so a high REJECT share is expected and intended. It shows that the contracts are strict, not that 68% of claims in real corpora would be rejected.
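The percentages follow directly from the counts. A minimal sketch of that arithmetic, using only the overall counts quoted above; the variable names are illustrative and are not taken from benchmarks/family_decision_mix.py:

```python
# Overall verdict counts as quoted on this page.
overall = {"ACCEPT": 217, "REWRITE": 126, "REJECT": 845, "HOLD": 50}

# Grid size N is the sum of all decisions.
total = sum(overall.values())

# Share of each verdict as a percentage, rounded to one decimal place.
shares = {
    verdict: round(100 * count / total, 1)
    for verdict, count in overall.items()
}

print(total)   # 1238
print(shares)  # {'ACCEPT': 17.5, 'REWRITE': 10.2, 'REJECT': 68.3, 'HOLD': 4.0}
```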
| Claim family | N | ACCEPT | REWRITE | REJECT | HOLD | Verdicts reached |
|---|---|---|---|---|---|---|
| statistical_confidence | 543 | 74 | 74 | 392 | 3 | 4 / 4 |
| financial_metric_claim | 364 | 29 | 29 | 302 | 4 | 4 / 4 |
| exact_model_solution | 182 | 104 | 0 | 76 | 2 | 3 / 4 |
| universal_anchor_claim | 41 | 1 | 5 | 22 | 13 | 4 / 4 |
| causal_mechanism_claim | 20 | 1 | 3 | 12 | 4 | 4 / 4 |
| evidence_conflict_claim | 20 | 1 | 1 | 14 | 4 | 4 / 4 |
| multimodal_evidence_claim | 20 | 2 | 6 | 8 | 4 | 4 / 4 |
| systematic_review_claim | 20 | 1 | 3 | 12 | 4 | 4 / 4 |
| programming_language_behavior_claim | 16 | 1 | 3 | 4 | 8 | 4 / 4 |
| reproducibility_check | 6 | 1 | 1 | 2 | 2 | 4 / 4 |
| claim_transition | 3 | 1 | 1 | 0 | 1 | 3 / 4 |
| physical_accuracy | 3 | 1 | 0 | 1 | 1 | 3 / 4 |
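Three families (exact_model_solution, claim_transition, physical_accuracy) have contracts that, by design, cannot reach all four verdicts. For example, a binary exactness check has no REWRITE state. In the grid, exact_model_solution and physical_accuracy produce no REWRITE decisions, and claim_transition produces no REJECT decisions. This limitation is disclosed, not hidden.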
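The table can be cross-checked by hand or with a few lines of code. The sketch below transcribes the rows from the table above, confirms that each row sums to its N, recomputes the overall totals, and lists which verdicts each family never reaches. The names `ROWS`, `totals`, and `gaps` are hypothetical and say nothing about how the real script is structured.

```python
VERDICTS = ("ACCEPT", "REWRITE", "REJECT", "HOLD")

# (family, N, ACCEPT, REWRITE, REJECT, HOLD), transcribed from the table.
ROWS = [
    ("statistical_confidence", 543, 74, 74, 392, 3),
    ("financial_metric_claim", 364, 29, 29, 302, 4),
    ("exact_model_solution", 182, 104, 0, 76, 2),
    ("universal_anchor_claim", 41, 1, 5, 22, 13),
    ("causal_mechanism_claim", 20, 1, 3, 12, 4),
    ("evidence_conflict_claim", 20, 1, 1, 14, 4),
    ("multimodal_evidence_claim", 20, 2, 6, 8, 4),
    ("systematic_review_claim", 20, 1, 3, 12, 4),
    ("programming_language_behavior_claim", 16, 1, 3, 4, 8),
    ("reproducibility_check", 6, 1, 1, 2, 2),
    ("claim_transition", 3, 1, 1, 0, 1),
    ("physical_accuracy", 3, 1, 0, 1, 1),
]

# Each row's four verdict counts must add up to that family's N.
for family, n, *counts in ROWS:
    assert sum(counts) == n, family

# Column totals should reproduce the overall mix.
totals = {v: sum(row[2 + i] for row in ROWS) for i, v in enumerate(VERDICTS)}

# For every family that misses at least one verdict, list the missing ones.
gaps = {
    family: [v for v, c in zip(VERDICTS, counts) if c == 0]
    for family, n, *counts in ROWS
    if any(c == 0 for c in counts)
}

print(totals)
print(gaps)
```

Run as written, the column totals come out to the overall mix quoted on this page, and `gaps` contains exactly the three families named above.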
Every figure on this page comes from one deterministic script. Same inputs → same counts, on any machine:
    $ python3 benchmarks/family_decision_mix.py   # -> outputs/family_decision_mix.json + .md
    family_decision_mix: N=1238 across 12 families
    overall: {'ACCEPT': 217, 'REWRITE': 126, 'REJECT': 845, 'HOLD': 50}
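One way to check the determinism claim yourself is to run the script twice (or on two machines) and compare the output files. The sketch below compares SHA-256 digests of two copies of the output. It assumes the output file is byte-identical between runs, for instance that it embeds no timestamp; the page only promises identical counts, so a digest mismatch alone would not prove the counts differ. The helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def same_output(path_a, path_b):
    """True if the two files have identical contents."""
    return file_digest(path_a) == file_digest(path_b)
```

For example, copy outputs/family_decision_mix.json to a second location after the first run, rerun the command, and call `same_output` on the saved copy and the fresh file.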
capas-claim-payload-v3 · engine UI v13 · grid generator benchmarks/family_decision_mix.py · these figures regenerate in CI on every commit.
The synthetic grid validates the contracts' logic; it does not represent fraud detection from raw paper text, nor a real-world drift rate. Production figures will be published only after an independently adjudicated pilot, with false-reject and false-accept rates measured against domain experts' judgments.