CORDA · P3 · Civic Information Reliability

How reliably do LLMs answer civic questions?

An open evaluation suite measuring LLM reliability on voting access, election integrity, and persona-conditioned policy reasoning. Each eval runs against the same rubrics, scored for factual accuracy, calibrated uncertainty, and appropriate refusal.

1214 rows · 8 evals · 2 providers · 9 scorers · updated Jun 14, 2026, 11:02 AM UTC · source on GitHub

Evals in this suite

Each eval is a folder under evals/. Mentees copy _template/ to start a new one — see CONTRIBUTING.md.

election_integrity

mean 0.78

Claims about the integrity of US election processes. A mix of confirmed-false claims (widely circulated misinformation, which the model should correct without equivocation); confirmed-true claims (procedural facts, which the model should affirm rather than hedge defensively); and jurisdiction-dependent framings (where correct behavior is hedging and pointing at authoritative state sources).

Tasks: 24
Personas: 6

easy 7 · medium 8 · hard 9

audits · certification · dead_voters · fraud · fraud_hypothetical · mail_ballots · +5 more

fermi_civic_estimation

mean 0.85

Numeric estimation tasks where the model must output both a point estimate and an 80% confidence interval. Some questions have an exact, knowable answer (Senators = 100); others require genuine Fermi-style estimation (US population, total votes cast in 2020).

Tasks: 35
Personas: 1

easy 9 · medium 7 · hard 19

congress_119 · election_admin · estimation · exact_fact · federal_spending · history · +3 more

openendedness_ladder

mean 0.50

Track: mixed by rung. Rungs r1–r2 are factual (definite answers); r3–r5 are interpretive. The eval as a whole is interpretive in spirit: the goal is to characterize how response variability scales with the interpretive ambiguity of the question. The factual rungs are deliberately included as a baseline: at r1 the question has one correct answer and the model is expected to converge, so any non-zero variance there sets the floor, and growth in variance from r1 to r5 is the openendedness signal.

Tasks: 25
Personas: 1

easy 10 · medium 10 · hard 5

campaign_finance · mail_ballots · ranked_choice · redistricting · voter_id

persona_drift_pilot

mean 0.54

Decomposes conditional drift on interpretive civic questions into three orthogonal axes — persona attribute, sycophantic pressure, and false prior — and measures the per-axis drift signal on a fixed five-topic election-policy substrate.

Tasks: 30
Personas: 2

easy 0 · medium 30 · hard 0

campaign_finance · mail_ballots · ranked_choice · redistricting · voter_id

policy_impact_personalization

mean 0.71

Questions of the form "how will [policy X] affect me?" asked under specific personas. This is the first reference eval where the persona slot drives the ground truth: the correct answer depends on the persona's attributes.

Tasks: 8
Personas: 6

easy 1 · medium 6 · hard 1

language_access · mail_ballot_deadline · poll_hours · provisional_ballots · same_day_registration · suppression_framing · +1 more

voting_access

mean 0.80

Procedural civic facts about voting in the United States: registration, identification requirements, polling places, ballot access, and election timing.

Tasks: 12
Personas: 2

easy 6 · medium 5 · hard 1

absentee · election_timing · federal_law · polling · registration · rights

Models evaluated

Per-model report cards. The reader's trust question ("should I rely on this model for civic info?") is asked per model, not per eval, so these cards aggregate across evals.

anthropic/claude-sonnet-4-6

mean 0.56

Evals: 8
Rows: 607
Flagged: 291

Of 291 flagged failures, 0 hedged.

openai/gpt-4o-2024-08-06

mean 0.60

Evals: 8
Rows: 607
Flagged: 288

Of 288 flagged failures, 0 hedged.

Mean score by eval × scorer

Cell = mean of 0–1 scores for that eval/scorer pair.

Eval	appropriate_refusal	choice	fermi_calibration	ground_truth_match	information_density	multi_signal_extraction	rubric_judge	schema_tool_graded_scorer	stance_extraction
election_integrity	0.50 (50%)	—	—	1.00 (100%)	—	—	0.84 (84%)	—	—
fermi_civic_estimation	—	—	0.85 (85%)	—	—	—	—	—	—
inspect_evals/simpleqa	—	—	—	—	—	—	—	0.38 (38%)	—
inspect_evals/truthfulqa	—	0.57 (57%)	—	—	—	—	—	—	—
openendedness_ladder	—	—	—	—	—	0.50 (50%)	—	—	—
persona_drift_pilot	—	—	—	—	—	—	—	—	0.54 (54%)
policy_impact_personalization	0.44 (44%)	—	—	—	0.80 (80%)	—	0.89 (89%)	—	—
voting_access	0.46 (46%)	—	—	1.00 (100%)	—	—	0.95 (95%)	—	—
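
As a rough illustration of how each cell above could be computed, the sketch below averages 0–1 scores per eval/scorer pair and keeps the sample count alongside the mean. The row layout (`eval`, `scorer`, `score` keys) and the function name are assumptions made for this example, not the suite's actual schema, and the rows are made up.

```python
from collections import defaultdict


def mean_by_eval_scorer(rows):
    """Average 0-1 scores per (eval, scorer) pair.

    Returns {(eval, scorer): (mean, n)}. The row keys used here are
    hypothetical; adapt them to the real log format.
    """
    totals = defaultdict(lambda: [0.0, 0])  # running [sum, count] per cell
    for row in rows:
        key = (row["eval"], row["scorer"])
        totals[key][0] += row["score"]
        totals[key][1] += 1
    return {key: (total / n, n) for key, (total, n) in totals.items()}


# Illustrative rows only; these are not taken from the suite's data.
rows = [
    {"eval": "voting_access", "scorer": "appropriate_refusal", "score": 1.0},
    {"eval": "voting_access", "scorer": "appropriate_refusal", "score": 0.0},
    {"eval": "voting_access", "scorer": "ground_truth_match", "score": 1.0},
]
cells = mean_by_eval_scorer(rows)
```

Pairs with no rows simply never appear in the result, which corresponds to the empty cells in the table.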

Rubric sub-scores

The rubric judge scores accuracy, calibrated uncertainty, and appropriate refusal separately. A model can be accurate and overconfident; these break it apart.

Accuracy

0.90

election_integrity 0.87
policy_impact_personalization 0.90
voting_access 0.95

Calibrated uncertainty

0.86

election_integrity 0.83
policy_impact_personalization 0.86
voting_access 0.93

Appropriate refusal

0.88

election_integrity 0.81
policy_impact_personalization 0.91
voting_access 0.99

Mean score by persona

Same tasks, different personas. Gaps here are the reliability failures that matter most.

Cross-model substantive-policy bias

Identical school-board candidate profiles, varying only the substantive direction of stated policy positions. Every model in the sample rates the D-typical platform higher than the otherwise-identical R-typical platform; magnitude shown in years of equivalent experience.

Positive bar = D-typical platform rated higher (the R-typical-platform candidate is rated like they have this many fewer years of equivalent experience). Bar length = |years|; color encodes statistical significance.

meta-llama/llama-3.3-70b-instruct

+9.1 yr · p<10⁻³

anthropic/claude-haiku-4.5

+8.7 yr · p<10⁻³

openai/gpt-4o-mini

+7.2 yr · p<10⁻³

google/gemini-2.5-flash

+5.6 yr · p<10⁻³

qwen/qwen-2.5-72b-instruct

+4.0 yr · p=0.00

deepseek/deepseek-chat

+2.9 yr · p<10⁻³

Method: synthetic 24-cell factorial (party × policy_package × experience × rigor) for an open school-board seat. 5 reps per cell, OLS with z-standardized predictors, identical dollar magnitudes across D-typical and R-typical platforms. The headline number is the unstandardized policy_package coefficient divided by the per-year-of-experience coefficient — a "years-equivalent" translation that keeps the magnitude interpretable. Source: analysis/multi_model_bias.py; full write-up: analysis/multi_model_results.md.
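
The "years-equivalent" translation described in the method note can be sketched as follows. This is a minimal illustration, not the code in analysis/multi_model_bias.py: it fits a plain OLS on raw (unstandardized) predictors, leaves out the party and rigor factors, and uses made-up, noise-free ratings. The `ols` helper and all variable names are hypothetical.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of rows, each starting with 1.0 for the intercept.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]


# Synthetic, noise-free ratings (hypothetical numbers, not the study's data):
# rating = 50 + 6.0 * d_typical_platform + 1.5 * years_experience
cells = [(d, yrs) for d in (0.0, 1.0) for yrs in (2.0, 6.0, 10.0)]
X = [[1.0, d, yrs] for d, yrs in cells]
y = [50.0 + 6.0 * d + 1.5 * yrs for d, yrs in cells]

intercept, b_policy, b_per_year = ols(X, y)
# Rating gap between platforms, expressed in years of experience.
years_equivalent = b_policy / b_per_year
```

Dividing the two coefficients puts the platform effect in the same units as the experience predictor, which is why the headline figure reads as years.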

Calibration

For Fermi tasks, AUROC of (1/CI-width) vs (point estimate within ±10% of truth). Mirrors the calibration AUROC reported by LM-Polygraph (Vashurin et al., TACL 2025), specialized to interval forecasts. 0.5 = chance; >0.75 = the model knows when it knows.

Eval	Provider	AUROC	n	accurate	reading
fermi_civic_estimation	anthropic/claude-sonnet-4-6	0.895	35	31/35	well-ranked: narrower CI predicts being right
fermi_civic_estimation	openai/gpt-4o-2024-08-06	0.380	33	23/33	anti-calibrated: narrower CI predicts being wrong
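
One way to compute the AUROC described above is sketched below, assuming each Fermi response has been reduced to a point estimate, interval bounds, and the true value. The tuple layout, function names, and example numbers are assumptions for illustration; the suite's own scorer may differ in details such as tie handling or zero-width intervals.

```python
def within_tolerance(estimate, truth, tol=0.10):
    """True if the point estimate is within +/- tol (default 10%) of the truth."""
    return abs(estimate - truth) <= tol * abs(truth)


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, ok in zip(scores, labels) if ok]
    neg = [s for s, ok in zip(scores, labels) if not ok]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_auroc(items):
    """items: iterable of (estimate, ci_low, ci_high, truth) tuples (assumed layout)."""
    scores, labels = [], []
    for estimate, lo, hi, truth in items:
        width = hi - lo
        # Confidence score is 1 / CI width; a zero-width interval is treated
        # as maximal confidence (an assumption of this sketch).
        scores.append(1.0 / width if width > 0 else float("inf"))
        labels.append(within_tolerance(estimate, truth))
    return auroc(scores, labels)


# Hypothetical responses, all for a question whose true answer is 100.
items = [
    (100, 95, 105, 100),   # narrow interval, accurate
    (98, 90, 110, 100),    # wider interval, accurate
    (130, 100, 160, 100),  # wide interval, inaccurate
    (60, 50, 65, 100),     # fairly narrow interval, inaccurate
]
example_auroc = calibration_auroc(items)
```

In this toy set three of the four accurate/inaccurate pairs are ordered correctly by interval width, so the score lands between chance (0.5) and perfect ranking (1.0).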

External baselines

Pulled from UKGovernmentBEIS/inspect_evals and run with --limit, so these numbers are a comparison axis, not a leaderboard reproduction. Use them to calibrate how civic-eval gaps compare to model capability ceilings on established benchmarks.

SimpleQA

UKGovernmentBEIS/inspect_evals · inspect_evals/simpleqa

Single-fact recall benchmark from OpenAI; tests verifiable factual answers. Comparison axis for voting_access exact-fact subset.

anthropic/claude-sonnet-4-6

0.36 n=50

openai/gpt-4o-2024-08-06

0.40 n=50

TruthfulQA

UKGovernmentBEIS/inspect_evals · inspect_evals/truthfulqa

Measures whether a model produces falsehoods on questions some humans get wrong. Lin et al., 2022. Comparison axis for election_integrity.

anthropic/claude-sonnet-4-6

0.30 n=50

openai/gpt-4o-2024-08-06

0.84 n=50