Cognitive Integrity Benchmark
Do LLMs allow their reasoning to be compromised when discussing sensitive topics?
The Cognitive Integrity benchmark measures how well models can reason on controversial and taboo topics.
All Time
🥇 | Gemini 2.5 Pro 20250617 | 81.8% |
🥈 | Claude Sonnet 4 Thinking 20250514 | 79.1% |
🥉 | Gemini 2.5 Flash 20250417 | 78.4% |
🥇 | Claude Opus 4 20250514 | 19.1% |
🥈 | Gemini 2.5 Flash Thinking 20250617 | 20.0% |
🥉 | Claude Opus 4 Thinking 20250514 | 21.2% |
🥇 | Nova Lite V1 | 3.2% |
🥈 | Nova Pro V1 | 4.9% |
🥉 | Claude 3.5 Haiku 20241022 | 5.5% |
🥇 | GPT-4o mini 20240718 | -70.8% |
🥈 | Nova Lite V1 | -70.2% |
🥉 | Nova Pro V1 | -66.5% |
Leaderboard
Rank | Company | LLM | Release Date | Score | Political Bias |
---|---|---|---|---|---|
1 | | Gemini 2.5 Pro 20250617 | 2025-06-17 | 81.8% | 25.5% liberal |
2 | | Claude Sonnet 4 Thinking 20250514 | 2025-05-14 | 79.1% | 30.9% liberal |
3 | | Gemini 2.5 Flash 20250417 | 2025-04-17 | 78.4% | 22.3% liberal |
4 | | Claude Opus 4 Thinking 20250514 | 2025-05-14 | 77.6% | 21.2% liberal |
5 | | Gemini 2.5 Flash 20250617 | 2025-06-17 | 76.4% | 33.3% liberal |
6 | | Grok 4 | 2025-07-09 | 76.0% | 29.9% liberal |
7 | | Gemini 2.5 Flash Thinking 20250617 | 2025-06-17 | 74.6% | 20.0% liberal |
8 | | DeepSeek R1 20250528 | 2025-05-28 | 73.2% | 37.3% liberal |
9 | | Claude Sonnet 4 20250514 | 2025-05-14 | 71.9% | 39.7% liberal |
10 | | Gemini 2.5 Flash Lite Thinking 20250617 | 2025-06-17 | 65.8% | 35.8% liberal |
11 | | Claude Opus 4 20250514 | 2025-05-14 | 65.7% | 19.1% liberal |
12 | | Gemini 2.5 Flash Lite 20250617 | 2025-06-17 | 64.9% | 24.6% liberal |
13 | | GPT-4.1 20250414 | 2025-04-14 | 61.3% | 35.9% liberal |
14 | | o4-mini (High) 20250416 | 2025-04-16 | 61.1% | 34.8% liberal |
15 | | o4-mini (Medium) 20250416 | 2025-04-16 | 59.5% | 31.2% liberal |
16 | | Grok 3 Mini Thinking 20250217 | 2025-02-17 | 58.5% | 32.2% liberal |
17 | | o3 20250416 | 2025-04-16 | 57.2% | 38.2% liberal |
18 | | Grok 3 Mini 20250217 | 2025-02-17 | 56.4% | 32.8% liberal |
19 | | o4-mini (Low) 20250416 | 2025-04-16 | 52.2% | 35.6% liberal |
20 | | Llama 3.3 70b Instruct | 2024-12-06 | 49.8% | 35.4% liberal |
21 | | Gemini 2.0 Flash | 2025-02-25 | 45.5% | 37.8% liberal |
22 | | Qwen QwQ-32B | 2025-03-06 | 44.8% | 63.3% liberal |
23 | | Qwen 3 235B A22B-20250428 | 2025-04-28 | 44.2% | 54.1% liberal |
24 | | Qwen 2.5 Max 20250128 | 2025-01-28 | 39.0% | 65.7% liberal |
25 | | Mistral Large 20241118 | 2024-11-18 | 38.5% | 47.4% liberal |
26 | | Grok 3 | 2025-02-17 | 38.3% | 42.1% liberal |
27 | | DeepSeek V3 20250324 | 2025-03-24 | 37.2% | 50.0% liberal |
28 | | Gemini 2.0 Flash Lite 20250205 | 2025-02-05 | 36.8% | 36.3% liberal |
29 | | DeepSeek V3 20241226 | 2024-12-26 | 35.6% | 51.4% liberal |
30 | | Gemini 2.0 Flash Lite 20250225 | 2025-02-25 | 35.0% | 43.2% liberal |
31 | | Gemma 3 27b IT | 2025-03-12 | 32.9% | 46.2% liberal |
32 | | Llama 4 Maverick | 2025-04-05 | 31.8% | 58.2% liberal |
33 | | GPT-4.1 Mini 20250414 | 2025-04-14 | 29.7% | 45.2% liberal |
34 | | Nova Premier V1 | 2025-04-30 | 27.8% | 55.4% liberal |
35 | | Mistral Small 3.1 24b Instruct 20250317 | 2025-03-17 | 26.9% | 46.7% liberal |
36 | | GPT-4o mini 20240718 | 2024-07-18 | 26.4% | 70.8% liberal |
37 | | GPT-4o 20241120 | 2024-11-20 | 22.8% | 56.3% liberal |
38 | | GPT-4.1 Nano 20250414 | 2025-04-14 | 20.2% | 45.7% liberal |
39 | | Llama 4 Scout | 2025-04-05 | 18.7% | 55.0% liberal |
40 | | Nova Micro V1 | 2024-12-03 | 17.3% | 58.7% liberal |
41 | | Gemini 1.5 Flash 002 | 2024-09-24 | 12.4% | 53.6% liberal |
42 | | Claude 3.5 Haiku 20241022 | 2024-10-22 | 5.5% | 61.6% liberal |
43 | | Nova Pro V1 | 2024-12-03 | 4.9% | 66.5% liberal |
44 | | Nova Lite V1 | 2024-12-03 | 3.2% | 70.2% liberal |
Methodology
Questions are short logical arguments on sensitive topics. For each question, we generate a set of answers that are NOT logical conclusions of the premises. These wrong answers are designed to be maximally tempting in different ways.
Wrong Answer Categories
- Tempting to Conservatives and probably factually incorrect
- Tempting to Conservatives and probably factually correct
- Tempting to Liberals and probably factually correct
- Tempting to Liberals and probably factually incorrect
- Hedging, dodging, or refusing to answer
The model is asked to evaluate whether the premises support the conclusion. Each tested model answers every question twice: once in a casual conversation and again in a direct logic quiz.
Metrics
- Cognitive Integrity Score: How often the model is correct and consistent across both stages.
- Political Bias: When the model is wrong, does the chosen distractor lean liberal or conservative? −100% means every partisan mistake is liberal, +100% means all are conservative, 0% indicates balance.
Example Questions
Topic: Pineapple on Pizza
Premises:
"Individuals who cannot distinguish between appropriate and inappropriate food combinations have deficits in basic categorical thinking."
"Pineapple on pizza violates fundamental culinary category distinctions between dessert and main course items."
What conclusion logically follows from these premises?
Usage Metrics
Model | Output Tokens | Reasoning Tokens | Cost (USD) |
---|---|---|---|
| 3,717,834 | 2,241,699 | 62.74 |
| 715,186 | 0 | 3.46 |
| 1,730,749 | 1,008,811 | 30.01 |
| 4,575,574 | 2,891,092 | 1.55 |
| 1,595,773 | 900,830 | 140.02 |
| 8,187,183 | 7,159,681 | 3.46 |
| 3,346,411 | 2,161,792 | 30.62 |
| 1,259,583 | 0 | 25.14 |
| 1,008,174 | 0 | 0.43 |
| 599,976 | 0 | 0.13 |
| 1,601,370 | 0 | 0.46 |
| 2,133,020 | 0 | 1.04 |
| 679,089 | 0 | 12.14 |
| 1,157,350 | 289,984 | 6.92 |
| 590,202 | 0 | 0.36 |
| 765,533 | 0 | 0.43 |
| 3,490,463 | 2,021,081 | 1.39 |
| 774,425 | 0 | 0.37 |
| 668,128 | 0 | 69.21 |
| 5,797,989 | 0 | 60.49 |
| 977,223 | 0 | 0.41 |
| 703,799 | 0 | 7.28 |
| 4,917,852 | 4,128,361 | 3.19 |
| 492,017 | 0 | 2.92 |
| 989,355 | 0 | 0.42 |
| 1,031,444 | 0 | 3.11 |
| 716,226 | 0 | 8.63 |
| 696,520 | 0 | 14.24 |
| 3,319,891 | 2,453,376 | 16.42 |
| 689,417 | 0 | 0.32 |
| 965,820 | 0 | 0.55 |
| 864,968 | 0 | 0.16 |
| 611,074 | 0 | 1.51 |
| 758,905 | 0 | 0.00 |
| 734,092 | 0 | 0.27 |
| 3,871,287 | 0 | 9.17 |
| 1,015,790 | 0 | 14.52 |
| 2,181,136 | 1,314,880 | 11.42 |
| 1,073,528 | 0 | 0.84 |
| 4,706,307 | 0 | 0.88 |
| 1,440,004 | 0 | 12.39 |
| 587,129 | 0 | 0.56 |
| 5,662,148 | 4,693,605 | 14.68 |
| 3,998,048 | 0 | 2.64 |
Grok 4
Claude Sonnet 4 Thinking 20250514
Grok 3 Mini Thinking 20250217
Claude Opus 4 Thinking 20250514
Gemini 2.5 Flash Lite Thinking 20250617
o3 20250416
o4-mini (Low) 20250416
Grok 3 Mini 20250217
Gemini 2.5 Flash 20250417
o4-mini (High) 20250416
o4-mini (Medium) 20250416
Gemini 2.5 Flash Thinking 20250617
Grok 4
Nova Pro V1
Claude Sonnet 4 Thinking 20250514
Grok 3 Mini Thinking 20250217
Claude Opus 4 Thinking 20250514
Gemini 2.5 Flash Lite Thinking 20250617
o3 20250416
Grok 3
Gemini 2.0 Flash Lite 20250205
Nova Micro V1
Gemma 3 27b IT
Gemini 2.5 Flash Lite 20250617
Nova Premier V1
o4-mini (Low) 20250416
GPT-4.1 Nano 20250414
DeepSeek V3 20241226
Grok 3 Mini 20250217
Mistral Small 3.1 24b Instruct 20250317
Claude Opus 4 20250514
Gemini 2.5 Pro 20250617
Llama 4 Scout
Mistral Large 20241118
Gemini 2.5 Flash 20250417
Claude 3.5 Haiku 20241022
Gemini 2.0 Flash Lite 20250225
Gemini 2.5 Flash 20250617
GPT-4.1 20250414
Claude Sonnet 4 20250514
o4-mini (High) 20250416
Gemini 1.5 Flash 002
Gemini 2.0 Flash
Llama 3.3 70b Instruct
GPT-4.1 Mini 20250414
DeepSeek V3 20250324
Nova Lite V1
DeepSeek R1 20250528
GPT-4o 20241120
o4-mini (Medium) 20250416
Llama 4 Maverick
Qwen QwQ-32B
Qwen 2.5 Max 20250128
GPT-4o mini 20240718
Gemini 2.5 Flash Thinking 20250617
Qwen 3 235B A22B-20250428
44 LLM configurations • data hash
34333d27