Score claim calibration against the diff that actually shipped
Empirical ratio of session-report claims that hold against the merged PR diff: a cheap pressure metric for sycophancy.
Pain. Session reports and agent prose assert facts about the repo. If nobody checks those claims against the merged PR, a sycophantic or confabulated summary reads the same as an accurate one.
What changed. calibration_score compares session-report claims to the merged PR diff and reports an empirical ratio: claims that hold out of claims made. The score is surfaced on reviewer-adjacent rollups.
Why it matters. This is a cheap, automatable pressure metric: it does not replace human audit, but it gives calibration diagrams a numerator/denominator tied to ground truth in git.
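To make the numerator/denominator concrete, here is a minimal sketch of the ratio. It is not the shipped implementation: the Claim shape, the action vocabulary, the diff_status mapping, and the function name claim_hold_ratio are all assumptions made for illustration. A claim "holds" here only if the merged diff records the same action for the same path.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class Claim:
    """Hypothetical file-level claim extracted from a session report."""
    path: str    # file the report says was touched
    action: str  # assumed vocabulary: "added", "modified", "deleted"


def claim_hold_ratio(
    claims: Sequence[Claim],
    diff_status: Mapping[str, str],
) -> Optional[float]:
    """Return claims that hold / claims made, or None when no claims were made.

    diff_status maps each path in the merged PR diff to its action
    (an assumed representation, e.g. derived from `git diff --name-status`).
    """
    if not claims:
        return None  # no denominator, so no score rather than a fake 0 or 1
    held = sum(1 for c in claims if diff_status.get(c.path) == c.action)
    return held / len(claims)


# Illustrative data only: three claims, two of which match the merged diff.
claims = [
    Claim("src/app.py", "modified"),
    Claim("tests/test_app.py", "added"),
    Claim("README.md", "modified"),  # claimed, but absent from the diff
]
diff_status = {"src/app.py": "modified", "tests/test_app.py": "added"}
score = claim_hold_ratio(claims, diff_status)
```

Returning None for an empty claim list is a design choice in this sketch: a report that makes no checkable claims is treated as unscored rather than perfectly calibrated.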
How to try it
- Complete a session report with explicit file-level claims, merge the PR, and inspect the calibration rollup for that assignment lineage.
- Compare scores across models or prompt templates when running controlled repeats.
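For the comparison step, one possible way to aggregate controlled repeats is to pool held and made counts per model or prompt template before dividing. The sketch below assumes each run can be exported as a (group label, claims held, claims made) tuple; that export format, the labels, and the function name are hypothetical and not part of the shipped rollup.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def pooled_ratio_by_group(
    runs: Iterable[Tuple[str, int, int]],
) -> Dict[str, Optional[float]]:
    """Pool (held, made) counts per group, then take one ratio per group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [held, made]
    for group, held, made in runs:
        totals[group][0] += held
        totals[group][1] += made
    # A group with zero claims made gets None instead of a division error.
    return {g: (h / m if m else None) for g, (h, m) in totals.items()}


# Illustrative numbers only, not measured results.
runs = [
    ("model-a", 4, 5),
    ("model-a", 3, 5),
    ("model-b", 5, 5),
    ("model-b", 2, 4),
]
ratios = pooled_ratio_by_group(runs)
```

Pooling weights every claim equally, so a run with many claims counts for more than a run with few. Averaging per-run ratios instead would weight every run equally. Which is appropriate depends on how the repeats were designed.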
Shipped in 1dad922 — feat(DP-75): empirical calibration_score from session-report claims vs PR diff (#52)