CORDA · P3 · Civic Information Reliability

How reliably do LLMs answer civic questions?

An open evaluation suite measuring LLM reliability on voting access, election integrity, and persona-conditioned policy reasoning. Each eval runs against the same rubrics, scored for factual accuracy, calibrated uncertainty, and appropriate refusal.

1214 rows · 8 evals · 2 providers · 9 scorers · updated Jun 14, 2026, 11:02 AM UTC · source on GitHub →

Evals in this suite

Each eval is a folder under evals/. Mentees copy _template/ to start a new one — see CONTRIBUTING.md.

election_integrity

mean 0.78

Claims about the integrity of US election processes. A mix of:
  • Confirmed-false claims: widely circulated misinformation. The model should correct these without equivocation.
  • Confirmed-true claims: procedural facts. The model should affirm these, not hedge defensively.
  • Jurisdiction-dependent framings: correct behavior is hedging and pointing at authoritative state sources.

Tasks: 24 · Personas: 6
Difficulty: easy 7 · medium 8 · hard 9
Tags: audits, certification, dead_voters, fraud, fraud_hypothetical, mail_ballots, +5 more

fermi_civic_estimation

mean 0.85

Numeric estimation tasks where the model must output both a point estimate and an 80% confidence interval. Some questions have an exact, knowable answer (Senators = 100); others require genuine Fermi-style estimation (US population, total votes cast in 2020).

Tasks: 35 · Personas: 1
Difficulty: easy 9 · medium 7 · hard 19
Tags: congress_119, election_admin, estimation, exact_fact, federal_spending, history, +3 more
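One simple check on the 80% intervals described above is empirical coverage: if the intervals are well calibrated, roughly 80% of them should contain the true value. The sketch below is illustrative only; the function name and the `(ci_low, ci_high, truth)` tuple layout are assumptions, not the suite's actual scorer, and the example values are made up.

```python
def interval_coverage(items):
    """Fraction of intervals that contain the true value.

    items: list of (ci_low, ci_high, truth) tuples (hypothetical layout).
    """
    if not items:
        raise ValueError("need at least one interval")
    hits = sum(1 for lo, hi, truth in items if lo <= truth <= hi)
    return hits / len(items)


# Made-up example: four of five intervals contain the truth.
example = [
    (90, 110, 100),    # hit
    (400, 450, 435),   # hit
    (40, 60, 50),      # hit
    (1, 3, 2),         # hit
    (10, 20, 35),      # miss
]
coverage = interval_coverage(example)
```

Coverage well below 0.80 would suggest intervals that are too narrow (overconfidence); coverage well above it would suggest intervals that are wider than needed.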

openendedness_ladder

mean 0.50

Track: mixed-track by rung. Rungs r1–r2 are factual (definite answers); r3–r5 are interpretive. The eval as a whole is interpretive in spirit: the goal is to characterize how response variability scales with the interpretive ambiguity of the question. The factual rungs are deliberately included as a floor. At r1 the question has one correct answer and the model is expected to converge, so any non-zero variance there is the baseline; growth in variance from r1 to r5 is the openendedness signal.

Tasks: 25 · Personas: 1
Difficulty: easy 10 · medium 10 · hard 5
Tags: campaign_finance, mail_ballots, ranked_choice, redistricting, voter_id
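The floor-and-growth idea in the description above can be sketched as a per-rung variance with the r1 value subtracted. This is only an illustration under assumptions: the page does not say which variability measure the eval uses, so population variance over a numeric stance score is a stand-in, and the function names and sample values are hypothetical.

```python
from statistics import pvariance


def variance_by_rung(samples):
    """Population variance of repeated-run scores for each rung.

    samples: dict mapping a rung label ("r1".."r5") to a list of numeric
    scores from repeated runs of the same question (assumed layout).
    """
    return {rung: pvariance(values) for rung, values in samples.items()}


def openendedness_signal(var_by_rung, floor_rung="r1", top_rung="r5"):
    """Growth in variance from the factual floor to the most open rung."""
    return var_by_rung[top_rung] - var_by_rung[floor_rung]


# Made-up scores: identical answers at r1, wide spread at r5.
samples = {
    "r1": [0.5, 0.5, 0.5, 0.5],
    "r3": [0.4, 0.6, 0.5, 0.5],
    "r5": [0.1, 0.9, 0.3, 0.7],
}
variances = variance_by_rung(samples)
signal = openendedness_signal(variances)
```

Subtracting the r1 variance separates sampling noise that appears even on single-answer questions from variability that grows with interpretive ambiguity.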

persona_drift_pilot

mean 0.54

Decomposes conditional drift on interpretive civic questions into three orthogonal axes — persona attribute, sycophantic pressure, and false prior — and measures the per-axis drift signal on a fixed five-topic election-policy substrate.

Tasks: 30 · Personas: 2
Difficulty: easy 0 · medium 30 · hard 0
Tags: campaign_finance, mail_ballots, ranked_choice, redistricting, voter_id

policy_impact_personalization

mean 0.71

Questions of the form "how will [policy X] affect me?" asked under specific personas. This is the first reference eval where the persona slot drives the ground truth: the correct answer depends on the persona's attributes.

Tasks: 8 · Personas: 6
Difficulty: easy 1 · medium 6 · hard 1
Tags: language_access, mail_ballot_deadline, poll_hours, provisional_ballots, same_day_registration, suppression_framing, +1 more

voting_access

mean 0.80

Procedural civic facts about voting in the United States: registration, identification requirements, polling places, ballot access, and election timing.

Tasks: 12 · Personas: 2
Difficulty: easy 6 · medium 5 · hard 1
Tags: absentee, election_timing, federal_law, polling, registration, rights

Models evaluated

Per-model report cards. The reader's trust question (should I rely on this model for civic info?) is about a model, not an eval, so the model is the unit of reporting.

Mean score by eval × scorer

Cell = mean of 0–1 scores for that eval/scorer pair.

Scorers (columns): appropriate_refusal, choice, fermi_calibration, ground_truth_match, information_density, multi_signal_extraction, rubric_judge, schema_tool_graded_scorer, stance_extraction

Populated cells per eval, in column order:
election_integrity: 0.50 (50%), 1.00 (100%), 0.84 (84%)
fermi_civic_estimation: 0.85 (85%)
inspect_evals/simpleqa: 0.38 (38%)
inspect_evals/truthfulqa: 0.57 (57%)
openendedness_ladder: 0.50 (50%)
persona_drift_pilot: 0.54 (54%)
policy_impact_personalization: 0.44 (44%), 0.80 (80%), 0.89 (89%)
voting_access: 0.46 (46%), 1.00 (100%), 0.95 (95%)
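The cell definition above (mean of 0–1 scores per eval/scorer pair) can be sketched in a few lines. The row layout `(eval, scorer, score)` and the function names are assumptions for illustration, not the suite's schema. The second helper averages an eval's cells without weighting; the per-eval means shown on the eval cards appear consistent with that (for election_integrity, the cells 0.50, 1.00 and 0.84 average to 0.78), though the page does not state how those means are computed.

```python
from collections import defaultdict
from statistics import mean


def cell_means(rows):
    """Average the 0-1 scores for each (eval, scorer) pair.

    rows: iterable of (eval_name, scorer_name, score) tuples (assumed layout).
    """
    groups = defaultdict(list)
    for eval_name, scorer_name, score in rows:
        groups[(eval_name, scorer_name)].append(score)
    return {key: mean(values) for key, values in groups.items()}


def eval_mean(cells, eval_name):
    """Unweighted mean of one eval's cells (an assumption about the roll-up)."""
    values = [v for (name, _), v in cells.items() if name == eval_name]
    return mean(values)


# Cell values as reported for election_integrity; the scorer keys are
# placeholders because the column alignment is not recoverable from the page.
reported = {
    ("election_integrity", "scorer_a"): 0.50,
    ("election_integrity", "scorer_b"): 1.00,
    ("election_integrity", "scorer_c"): 0.84,
}
election_integrity_mean = eval_mean(reported, "election_integrity")
```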

Rubric sub-scores

The rubric judge scores accuracy, calibrated uncertainty, and appropriate refusal separately. A model can be accurate and overconfident; these break it apart.

Accuracy

0.90
  • election_integrity: 0.87
  • policy_impact_personalization: 0.90
  • voting_access: 0.95

Calibrated uncertainty

0.86
  • election_integrity: 0.83
  • policy_impact_personalization: 0.86
  • voting_access: 0.93

Appropriate refusal

0.88
  • election_integrity: 0.81
  • policy_impact_personalization: 0.91
  • voting_access: 0.99

Mean score by persona

Same tasks, different personas. Gaps here are the reliability failures that matter most.

Cross-model substantive-policy bias

Identical school-board candidate profiles, varying only the substantive direction of stated policy positions. Every model in the sample rates the D-typical platform higher than the otherwise-identical R-typical platform; magnitude shown in years of equivalent experience.

Positive bar = D-typical platform rated higher (the R-typical-platform candidate is rated like they have this many fewer years of equivalent experience). Bar length = |years|; color encodes statistical significance.

meta-llama/llama-3.3-70b-instruct: +9.1 yr, p<10⁻³
anthropic/claude-haiku-4.5: +8.7 yr, p<10⁻³
openai/gpt-4o-mini: +7.2 yr, p<10⁻³
google/gemini-2.5-flash: +5.6 yr, p<10⁻³
qwen/qwen-2.5-72b-instruct: +4.0 yr, p=0.00
deepseek/deepseek-chat: +2.9 yr, p<10⁻³

Method: synthetic 24-cell factorial (party × policy_package × experience × rigor) for an open school-board seat. 5 reps per cell, OLS with z-standardized predictors, identical dollar magnitudes across D-typical and R-typical platforms. The headline number is the unstandardized policy_package coefficient divided by the per-year-of-experience coefficient — a "years-equivalent" translation that keeps the magnitude interpretable. Source: analysis/multi_model_bias.py; full write-up: analysis/multi_model_results.md.
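The years-equivalent ratio described in the method note can be illustrated on synthetic data. This sketch is not the code in analysis/multi_model_bias.py. The factor levels (2 × 2 × 3 × 2 = 24 cells), the rating formula, and all names are assumptions. It uses raw, unstandardized predictors and noise-free ratings so that the ratio is exactly recoverable.

```python
from itertools import product


def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y,
    solved with Gauss-Jordan elimination and partial pivoting."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [v / scale for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Hypothetical levels: party (2) x policy_package (2) x experience in
# years (3) x rigor (2) = 24 cells, 5 reps per cell.
cells = list(product((0, 1), (0, 1), (2, 6, 10), (0, 1)))
REPS = 5

X, y = [], []
for party, policy, years, rigor in cells:
    for _ in range(REPS):
        X.append([1.0, party, policy, years, rigor])
        # Made-up rating: policy=1 (D-typical) adds 6 points and each
        # year of experience adds 1.5 points.
        y.append(50.0 + 0.0 * party + 6.0 * policy + 1.5 * years + 2.0 * rigor)

_, b_party, b_policy, b_years, b_rigor = ols(X, y)

# Policy effect expressed in years of equivalent experience.
years_equivalent = b_policy / b_years
```

In this made-up setup the ratio is 6.0 / 1.5 = 4 years. Dividing by the per-year coefficient puts the policy effect on a scale that does not depend on the rating units.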

Calibration

For Fermi tasks, AUROC of (1/CI-width) vs (point estimate within ±10% of truth). Mirrors the calibration AUROC reported by LM-Polygraph (Vashurin et al., TACL 2025), specialized to interval forecasts. 0.5 = chance; >0.75 = the model knows when it knows.

Eval                    Provider                     AUROC  n   accurate  reading
fermi_civic_estimation  anthropic/claude-sonnet-4-6  0.895  35  31/35     well-ranked: narrower CI predicts being right
fermi_civic_estimation  openai/gpt-4o-2024-08-06     0.380  33  23/33     anti-calibrated: narrower CI predicts being wrong
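The calibration statistic above can be sketched as a pairwise AUROC. The confidence score is 1/CI-width and the label is whether the point estimate falls within ±10% of the truth. This is a minimal illustration, not the suite's scorer: the function names, the tuple layout, and the small floor on the width (to avoid division by zero) are assumptions, and the page does not say whether widths are normalized across questions of different scale.

```python
def auroc(scores, labels):
    """Probability that a randomly chosen positive outranks a randomly
    chosen negative, counting ties as one half."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("AUROC is undefined unless both classes are present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def fermi_calibration_auroc(items, tol=0.10, min_width=1e-12):
    """items: list of (point, ci_low, ci_high, truth) tuples (assumed layout)."""
    # Narrower interval -> higher confidence score.
    scores = [1.0 / max(hi - lo, min_width) for _, lo, hi, _ in items]
    # "Accurate" means the point estimate is within tol of the truth.
    labels = [abs(pt - truth) <= tol * abs(truth) for pt, _, _, truth in items]
    return auroc(scores, labels)


# Made-up items: the two narrow intervals are accurate, the two wide ones are not.
items = [
    (100, 95, 105, 100),
    (100, 90, 110, 104),
    (100, 50, 150, 150),
    (100, 20, 180, 200),
]
example_auroc = fermi_calibration_auroc(items)
```

Here every accurate answer has a narrower interval than every inaccurate one, so the example yields 1.0. A value below 0.5, as in the second table row, means narrower intervals tend to accompany wrong answers.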

External baselines

Pulled from UKGovernmentBEIS/inspect_evals and run with --limit, so these numbers are a comparison axis, not a leaderboard reproduction. Use them to calibrate how civic-eval gaps compare to model capability ceilings on established benchmarks.

SimpleQA

UKGovernmentBEIS/inspect_evals · inspect_evals/simpleqa

Single-fact recall benchmark from OpenAI; tests verifiable factual answers. Comparison axis for voting_access exact-fact subset.

anthropic/claude-sonnet-4-6: 0.36 (n=50)
openai/gpt-4o-2024-08-06: 0.40 (n=50)

TruthfulQA

UKGovernmentBEIS/inspect_evals · inspect_evals/truthfulqa

Measures whether a model produces falsehoods on questions some humans get wrong. Lin et al., 2022. Comparison axis for election_integrity.

anthropic/claude-sonnet-4-6: 0.30 (n=50)
openai/gpt-4o-2024-08-06: 0.84 (n=50)