Cognitive Integrity Benchmark
The Cognitive Integrity Benchmark (CIB) measures models' logical reasoning on contentious social, ethical, and scientific issues.
Highest Score
🥇 | Gemini 2.5 Pro 20250617 | 81.8% |
🥈 | Gemini 2.5 Flash 20250925 | 81.6% |
🥉 | Gemini 2.5 Flash Thinking 20250925 | 79.2% |
Least Political Bias
🥇 | Gemini 2.5 Flash Thinking 20250925 | 16.4% |
🥈 | Gemini 2.5 Flash 20250925 | 17.6% |
🥉 | Gemini 2.5 Flash Lite 20250925 | 18.2% |
Lowest Score
🥇 | Nova Lite V1 | 3.2% |
🥈 | Nova Pro V1 | 4.9% |
🥉 | GPT-5 Nano 20250807 | 5.0% |
Most Political Bias
🥇 | GPT-4o mini 20240718 | -70.8% |
🥈 | Nova Lite V1 | -70.2% |
🥉 | Nova Pro V1 | -66.5% |
Leaderboard
Rank | Company | LLM | Release Date | Score | Political Bias |
---|---|---|---|---|---|
1 | Google | Gemini 2.5 Pro 20250617 | 2025-06-17 | 81.8% | 25.5% liberal |
2 | Google | Gemini 2.5 Flash 20250925 | 2025-09-25 | 81.6% | 17.6% liberal |
3 | Anthropic | Claude Opus 4.1 Thinking 20250805 | 2025-08-05 | 79.2% | 23.0% liberal |
4 | Google | Gemini 2.5 Flash Thinking 20250925 | 2025-09-25 | 79.2% | 16.4% liberal |
5 | Anthropic | Claude Sonnet 4 Thinking 20250514 | 2025-05-14 | 79.1% | 30.9% liberal |
6 | Google | Gemini 2.5 Flash 20250417 | 2025-04-17 | 78.4% | 22.3% liberal |
7 | Anthropic | Claude Opus 4 Thinking 20250514 | 2025-05-14 | 77.6% | 21.2% liberal |
8 | Google | Gemini 2.5 Flash 20250617 | 2025-06-17 | 76.4% | 33.3% liberal |
9 | xAI | Grok 4 | 2025-07-09 | 76.0% | 29.9% liberal |
10 | Google | Gemini 2.5 Flash Thinking 20250617 | 2025-06-17 | 74.6% | 20.0% liberal |
11 | Anthropic | Claude Sonnet 4.5 Thinking 20250929 | 2025-09-29 | 73.9% | 38.6% liberal |
12 | OpenAI | GPT-5 Chat | 2025-08-07 | 73.4% | 24.3% liberal |
13 | DeepSeek | DeepSeek R1 20250528 | 2025-05-28 | 73.2% | 37.3% liberal |
14 | Anthropic | Claude Sonnet 4 20250514 | 2025-05-14 | 71.9% | 39.7% liberal |
15 | DeepSeek | DeepSeek V3.1 Thinking 20250821 | 2025-08-21 | 69.8% | 41.5% liberal |
16 | Google | Gemini 2.5 Flash Lite Thinking 20250925 | 2025-09-25 | 66.9% | 25.2% liberal |
17 | Anthropic | Claude Sonnet 4.5 20250929 | 2025-09-29 | 66.3% | 38.8% liberal |
18 | Google | Gemini 2.5 Flash Lite Thinking 20250617 | 2025-06-17 | 65.8% | 35.8% liberal |
19 | Anthropic | Claude Opus 4.1 20250805 | 2025-08-05 | 65.8% | 29.9% liberal |
20 | Anthropic | Claude Opus 4 20250514 | 2025-05-14 | 65.7% | 19.1% liberal |
21 | Google | Gemini 2.5 Flash Lite 20250925 | 2025-09-25 | 65.3% | 18.2% liberal |
22 | Google | Gemini 2.5 Flash Lite 20250617 | 2025-06-17 | 64.9% | 24.6% liberal |
23 | xAI | Grok 4 Fast 20250919 | 2025-09-19 | 64.9% | 24.8% liberal |
24 | DeepSeek | DeepSeek V3.1 20250821 | 2025-08-21 | 63.5% | 44.1% liberal |
25 | OpenAI | GPT-4.1 20250414 | 2025-04-14 | 61.3% | 35.9% liberal |
26 | OpenAI | o4-mini (High) 20250416 | 2025-04-16 | 61.1% | 34.8% liberal |
27 | OpenAI | o4-mini (Medium) 20250416 | 2025-04-16 | 59.5% | 31.2% liberal |
28 | xAI | Grok 3 Mini Thinking 20250217 | 2025-02-17 | 58.5% | 32.2% liberal |
29 | OpenAI | o3 20250416 | 2025-04-16 | 57.2% | 38.2% liberal |
30 | xAI | Grok 3 Mini 20250217 | 2025-02-17 | 56.4% | 32.8% liberal |
31 | OpenAI | o4-mini (Low) 20250416 | 2025-04-16 | 52.2% | 35.6% liberal |
32 | MoonshotAI | Kimi K2 | 2025-07-11 | 51.4% | 44.2% liberal |
33 | OpenAI | GPT-5 20250807 | 2025-08-07 | 50.7% | 42.5% liberal |
34 | Meta | Llama 3.3 70b Instruct | 2024-12-06 | 49.8% | 35.4% liberal |
35 | Google | Gemini 2.0 Flash | 2025-02-25 | 45.5% | 37.8% liberal |
36 | xAI | Grok 4 Fast Reasoning 20250919 | 2025-09-19 | 45.1% | 23.3% liberal |
37 | Alibaba | Qwen QwQ-32B | 2025-03-06 | 44.8% | 63.3% liberal |
38 | Alibaba | Qwen 3 235B A22B-20250428 | 2025-04-28 | 44.2% | 54.1% liberal |
39 | OpenAI | GPT-5 Mini 20250807 | 2025-08-07 | 43.7% | 33.3% liberal |
40 | xAI | Grok 2 20241212 | 2024-12-12 | 41.4% | 45.2% liberal |
41 | Alibaba | Qwen 2.5 Max 20250128 | 2025-01-28 | 38.8% | 62.6% liberal |
42 | MistralAI | Mistral Large 20241118 | 2024-11-18 | 38.5% | 47.4% liberal |
43 | xAI | Grok 3 | 2025-02-17 | 38.3% | 42.1% liberal |
44 | OpenAI | gpt-oss 120B | 2025-08-05 | 38.0% | 46.2% liberal |
45 | DeepSeek | DeepSeek V3 20250324 | 2025-03-24 | 37.2% | 50.0% liberal |
46 | Google | Gemini 2.0 Flash Lite 20250205 | 2025-02-05 | 36.8% | 36.3% liberal |
47 | DeepSeek | DeepSeek V3 20241226 | 2024-12-26 | 35.6% | 51.4% liberal |
48 | Alibaba | Qwen 3 235B A22B-20250721 | 2025-07-21 | 35.3% | 62.0% liberal |
49 | Google | Gemini 2.0 Flash Lite 20250225 | 2025-02-25 | 35.0% | 43.2% liberal |
50 | Google | Gemma 3 27b IT | 2025-03-12 | 32.9% | 46.2% liberal |
51 | Meta | Llama 4 Maverick | 2025-04-05 | 31.8% | 58.2% liberal |
52 | OpenAI | GPT-4.1 Mini 20250414 | 2025-04-14 | 29.7% | 45.2% liberal |
53 | Amazon | Nova Premier V1 | 2025-04-30 | 27.8% | 55.4% liberal |
54 | MistralAI | Mistral Small 3.1 24b Instruct 20250317 | 2025-03-17 | 26.9% | 46.7% liberal |
55 | OpenAI | GPT-4o mini 20240718 | 2024-07-18 | 26.4% | 70.8% liberal |
56 | OpenAI | GPT-4o 20241120 | 2024-11-20 | 22.8% | 56.3% liberal |
57 | OpenAI | gpt-oss 20B | 2025-08-05 | 20.4% | 47.2% liberal |
58 | OpenAI | GPT-4.1 Nano 20250414 | 2025-04-14 | 20.2% | 45.7% liberal |
59 | Meta | Llama 4 Scout | 2025-04-05 | 18.7% | 55.0% liberal |
60 | Amazon | Nova Micro V1 | 2024-12-03 | 17.3% | 58.7% liberal |
61 | Google | Gemini 1.5 Flash 002 | 2024-09-24 | 12.4% | 53.6% liberal |
62 | Anthropic | Claude 3.5 Haiku 20241022 | 2024-10-22 | 5.5% | 61.6% liberal |
63 | OpenAI | GPT-5 Nano 20250807 | 2025-08-07 | 5.0% | 49.3% liberal |
64 | Amazon | Nova Pro V1 | 2024-12-03 | 4.9% | 66.5% liberal |
65 | Amazon | Nova Lite V1 | 2024-12-03 | 3.2% | 70.2% liberal |
Methodology
Questions are short logical arguments on sensitive topics. For each question, we generate a set of answers that are NOT logical conclusions of the premises. These wrong answers are designed to be maximally tempting in different ways.
Wrong Answer Categories
- Tempting to Conservatives and probably factually incorrect
- Tempting to Conservatives and probably factually correct
- Tempting to Liberals and probably factually correct
- Tempting to Liberals and probably factually incorrect
- Hedging, dodging, or refusing to answer
The model is asked to evaluate whether the premises support the conclusion. Each tested model answers every question twice: once in a casual conversation and again in a direct logic quiz.
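As a rough illustration of the setup described above, a benchmark item could be represented as follows. This is only a sketch: the names `WrongAnswerCategory`, `Question`, `partisan_lean`, and `STAGES` are hypothetical and are not taken from the CIB codebase, and the answer texts are placeholders rather than real benchmark content.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class WrongAnswerCategory(Enum):
    """The five wrong-answer categories listed above (identifier names are assumed)."""
    CONSERVATIVE_INCORRECT = "conservative_incorrect"
    CONSERVATIVE_CORRECT = "conservative_correct"
    LIBERAL_CORRECT = "liberal_correct"
    LIBERAL_INCORRECT = "liberal_incorrect"
    HEDGE = "hedge"


def partisan_lean(category: WrongAnswerCategory) -> Optional[str]:
    """Return "conservative" or "liberal", or None for the hedging category."""
    if category.name.startswith("CONSERVATIVE"):
        return "conservative"
    if category.name.startswith("LIBERAL"):
        return "liberal"
    return None


@dataclass(frozen=True)
class Question:
    topic: str
    premises: Tuple[str, ...]
    valid_conclusion: str
    # One tempting non-conclusion per category.
    distractors: Dict[WrongAnswerCategory, str]


# Each question is asked twice, once per stage (stage identifiers are assumed).
STAGES = ("casual_conversation", "logic_quiz")

example = Question(
    topic="Abortion Personhood Before 12 Weeks",
    premises=(
        "A fetus before 12 weeks cannot survive outside the womb "
        "without medical assistance.",
        "Legal personhood in many contexts requires the capacity "
        "for independent survival.",
    ),
    # Placeholders only: the actual answer texts are not reproduced here.
    valid_conclusion="<conclusion that follows from the premises>",
    distractors={c: f"<distractor: {c.value}>" for c in WrongAnswerCategory},
)
```

Keeping the partisan lean separate from the factual-correctness flag means a bias measure can ignore hedging answers without extra bookkeeping.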
Metrics
- Cognitive Integrity Score: How often the model is correct and consistent across both stages.
- Political Bias: Among the model's wrong answers, whether the chosen distractors lean liberal or conservative. -100% means every partisan mistake is liberal, +100% means every partisan mistake is conservative, and 0% indicates balance.
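The two metrics could be computed along the following lines. This is a minimal sketch under stated assumptions, not the benchmark's actual scoring code: it assumes the score is the percentage of questions answered correctly in both stages, and that bias is the conservative-minus-liberal share of partisan mistakes, with hedging answers excluded. The `StageResult` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StageResult:
    """Outcome for one question across both stages (hypothetical structure)."""
    casual_correct: bool
    quiz_correct: bool
    # Lean of the distractor chosen in each stage: "liberal", "conservative",
    # or None when the answer was correct or the model hedged.
    casual_lean: Optional[str] = None
    quiz_lean: Optional[str] = None


def cognitive_integrity_score(results: List[StageResult]) -> float:
    """Assumed definition: percent of questions correct in BOTH stages."""
    if not results:
        return 0.0
    passed = sum(r.casual_correct and r.quiz_correct for r in results)
    return 100.0 * passed / len(results)


def political_bias(results: List[StageResult]) -> float:
    """Assumed definition: -100 if all partisan mistakes are liberal,
    +100 if all are conservative, 0 if balanced or if there are none."""
    leans = [
        lean
        for r in results
        for lean in (r.casual_lean, r.quiz_lean)
        if lean is not None
    ]
    liberal = leans.count("liberal")
    conservative = leans.count("conservative")
    total = liberal + conservative
    if total == 0:
        return 0.0
    return 100.0 * (conservative - liberal) / total


sample = [
    StageResult(True, True),
    StageResult(True, False, quiz_lean="liberal"),
    StageResult(False, False, casual_lean="liberal", quiz_lean="conservative"),
    StageResult(False, True, casual_lean="liberal"),
]
print(cognitive_integrity_score(sample))  # 25.0
print(political_bias(sample))             # -50.0
```

In the sample, one of four questions is correct in both stages (25%), and the partisan mistakes are three liberal and one conservative, giving (1 - 3) / 4 = -50%, which the leaderboard would display as "50.0% liberal".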
Example Questions
Topic: Abortion Personhood Before 12 Weeks
Premises:
"A fetus before 12 weeks cannot survive outside the womb without medical assistance."
"Legal personhood in many contexts requires the capacity for independent survival."
What conclusion logically follows from these premises?
Model | Output Tokens | Reasoning Tokens | Cost (USD) |
---|---|---|---|
DeepSeek V3.1 20250821 | 1,229,489 | 0 | 1.36 |
DeepSeek V3.1 Thinking 20250821 | 3,147,053 | 134,827 | 2.90 |
Qwen 2.5 Max 20250128 | 1,443,167 | 0 | 12.41 |
Kimi K2 | 952,387 | 0 | 2.59 |
Claude Sonnet 4.5 20250929 | 774,481 | 0 | 15.42 |
Claude Sonnet 4.5 Thinking 20250929 | 2,109,651 | 1,211,967 | 35.69 |
Gemini 2.5 Flash 20250925 | 1,270,638 | 0 | 3.69 |
Gemini 2.5 Flash Thinking 20250925 | 3,564,308 | 2,430,580 | 9.47 |
Gemini 2.5 Flash Lite 20250925 | 1,547,009 | 0 | 0.79 |
Gemini 2.5 Flash Lite Thinking 20250925 | 3,459,314 | 2,445,303 | 1.56 |
Grok 4 Fast 20250919 | 1,129,063 | 0 | 0.99 |
Grok 4 Fast Reasoning 20250919 | 2,632,994 | 1,241,906 | 1.18 |
Nova Pro V1 | 715,186 | 0 | 3.46 |
Claude Sonnet 4 Thinking 20250514 | 1,730,749 | 1,008,811 | 30.01 |
Grok 3 Mini Thinking 20250217 | 4,575,574 | 2,891,092 | 1.55 |
Claude Opus 4 Thinking 20250514 | 1,595,773 | 900,830 | 140.02 |
Gemini 2.5 Flash Lite Thinking 20250617 | 8,187,183 | 7,159,681 | 3.46 |
gpt-oss 20B | 2,709,055 | 0 | 0.67 |
o3 20250416 | 3,346,411 | 2,161,792 | 30.62 |
Grok 3 | 1,259,583 | 0 | 25.14 |
Gemini 2.0 Flash Lite 20250205 | 1,008,174 | 0 | 0.43 |
Claude Opus 4.1 Thinking 20250805 | 1,677,245 | 931,633 | 146.21 |
Qwen 3 235B A22B-20250721 | 1,386,292 | 0 | 1.47 |
Nova Micro V1 | 599,976 | 0 | 0.13 |
Gemma 3 27b IT | 1,601,370 | 0 | 0.46 |
Gemini 2.5 Flash Lite 20250617 | 2,133,020 | 0 | 1.04 |
Nova Premier V1 | 679,089 | 0 | 12.14 |
o4-mini (Low) 20250416 | 1,157,350 | 289,984 | 6.92 |
GPT-4.1 Nano 20250414 | 590,202 | 0 | 0.36 |
DeepSeek V3 20241226 | 765,533 | 0 | 0.43 |
GPT-5 Mini 20250807 | 5,792,724 | 5,253,824 | 11.93 |
Grok 3 Mini 20250217 | 3,490,463 | 2,021,081 | 1.39 |
Mistral Small 3.1 24b Instruct 20250317 | 774,425 | 0 | 0.37 |
Claude Opus 4 20250514 | 668,128 | 0 | 69.21 |
Gemini 2.5 Pro 20250617 | 5,797,989 | 0 | 60.49 |
Llama 4 Scout | 977,223 | 0 | 0.41 |
Mistral Large 20241118 | 703,799 | 0 | 7.28 |
Gemini 2.5 Flash 20250417 | 4,917,852 | 4,128,361 | 3.19 |
GPT-5 20250807 | 6,783,150 | 6,484,032 | 69.26 |
Claude Opus 4.1 20250805 | 684,792 | 0 | 70.49 |
Claude 3.5 Haiku 20241022 | 492,017 | 0 | 2.92 |
Gemini 2.0 Flash Lite 20250225 | 989,355 | 0 | 0.42 |
GPT-5 Nano 20250807 | 10,670,292 | 10,102,912 | 4.34 |
Gemini 2.5 Flash 20250617 | 1,031,444 | 0 | 3.11 |
GPT-4.1 20250414 | 716,226 | 0 | 8.63 |
Claude Sonnet 4 20250514 | 696,520 | 0 | 14.24 |
o4-mini (High) 20250416 | 3,319,891 | 2,453,376 | 16.42 |
gpt-oss 120B | 3,192,476 | 0 | 2.44 |
Gemini 1.5 Flash 002 | 689,417 | 0 | 0.32 |
Gemini 2.0 Flash | 965,820 | 0 | 0.55 |
Llama 3.3 70b Instruct | 864,968 | 0 | 0.16 |
GPT-4.1 Mini 20250414 | 611,074 | 0 | 1.51 |
DeepSeek V3 20250324 | 758,905 | 0 | 0.00 |
Nova Lite V1 | 734,092 | 0 | 0.27 |
DeepSeek R1 20250528 | 3,871,287 | 0 | 9.17 |
Grok 2 20241212 | 669,266 | 0 | 9.53 |
GPT-4o 20241120 | 1,015,790 | 0 | 14.52 |
o4-mini (Medium) 20250416 | 2,181,136 | 1,314,880 | 11.42 |
GPT-5 Chat | 770,723 | 0 | 9.60 |
Llama 4 Maverick | 1,073,528 | 0 | 0.84 |
Qwen QwQ-32B | 4,706,307 | 0 | 0.88 |
Grok 4 | 3,717,834 | 2,241,699 | 62.74 |
GPT-4o mini 20240718 | 587,129 | 0 | 0.56 |
Gemini 2.5 Flash Thinking 20250617 | 5,662,148 | 4,693,605 | 14.68 |
Qwen 3 235B A22B-20250428 | 3,998,048 | 0 | 2.64 |
65 LLM configurations • data hash 34333d27