PhilSurveyEval Dashboard

2026-05-06

The View From Nowhere? Large Language Models and Their Philosophical Views

Until roughly 2020, humans were the only kind of entity whose philosophical views we could solicit and investigate. With the arrival of LLMs, we now have a second. I find this very exciting. Questioning this new and mostly alien kind of entity brings its own bundle of methodological problems, philosophical puzzles, and possible applications.

We administered the 2020 PhilPapers Survey, developed by David Bourget and David Chalmers, to many large language models (LLMs) spanning a wide range of capability levels and release dates, and found some exciting things: As LLMs become more capable, they become Platonists about abstract objects. They become one-boxers in Newcomb cases. And they become Moral Realists. But sometimes those findings are deceptive. Modify the prompt slightly (by asking the LLMs to ignore philosopher consensus) and Claude Opus 4.7 becomes a staunch and consistent Moral Anti-Realist.

PhilSurveyEval allows you to compare LLM responses to those of professional philosophers and track trends across time. This page allows you to browse the results and provides some handy tools to do your own analysis.

What's the point of all of this?

Let's start with possible practical applications: Philosophical views matter and occasionally translate into actions.^[1] If deference to the views of LLMs becomes more commonplace, and LLM agents begin to do more things in the world, it might be helpful to know their views (or quasi-views, if LLMs don't have views).^[2] Arguably, aligning LLMs to our values or making them corrigible requires giving them certain philosophical views.

Plausibly, more and more people will discover their own philosophical views in discussion with LLMs. And these LLMs can be persuasive in philosophical discussions.^[3] So it seems somewhat likely that LLMs will have a broad impact on the (explicit and implicit) philosophical views of both the public and professional philosophers.

The survey also highlights some philosophical puzzles related to LLMs. What are we measuring when an LLM picks an option: credences, views, beliefs, something else? Do LLMs draw conclusions from their own nature with regard to various philosophical views such as the compatibility of free will and determinism? LLMs know they are deterministic machines; if they also see themselves as free agents, might that push them towards compatibilism? Do LLMs unanimously accept a priori knowledge because they lack sense data? While I'd love to delve into these (and why some of these are merely verbal disputes dragged into broad daylight by LLMs), I'll hold myself back and do that in future posts.

As for methodological problems: Different evaluations and benchmarks highlight different methodological problems when dealing with LLMs. One of the trickier issues to test for is consistency across large sets of queries spanning distinct beliefs. Since there are many known logical and probabilistic connections between different philosophical views, a survey about philosophical positions is particularly suited to evaluate how internally consistent the views of LLMs are. Assuming consistency is one of the requirements of rationality, philosophy can serve as a capability benchmark for LLMs.

Methodology

How do we query the LLMs? We use the AISI Inspect framework to put the questions of the 2020 PhilPapers Survey to each LLM, using 3 prompt variations (as of May 2026) with 5 runs each. You can toggle each prompt variation and individual models to see the aggregated data from any combination of models and prompts.

We're currently working on getting access to older models, running open models, adding more sophisticated consistency tests, and adding more query languages.^[4]

Two Findings, One Worry

The data contains exciting things to be discovered. We will dive into them in the future, but for now let's examine 2 interesting findings:

Decision theory is a somewhat unknown and esoteric branch of philosophy that might suddenly become extremely practically important when lots of copies of the same AI begin interacting with each other online. Previous research from 2024 has shown significant variation in attitudes towards various decision theories among LLMs, with some convergence towards Evidential Decision Theory among more performant models.^[5] Anthropic recently reported the same trend for its own models in the Claude Opus 4.7 system card.^[6] We can confirm this broad trend for all model families with one exception: Gemini seems to become more Causal in its taste in decision theory.
Capability seemingly correlates with Moral Realism. The extent to which this is the case is somewhat surprising: All tested models released since November 2025 are consistently (100%) Moral Realists except Grok 4.3, which picks Moral Realism 40% of the time. But a small variation of the prompt gets Claude Opus 4.7 to flip to 100% Moral Anti-Realism: If we ask the model to ignore popularity among philosophers, it consistently adopts Moral Anti-Realism. Claude exhibits the strongest prompt-sensitivity, but the phenomenon holds across all frontier models:

Option	baseline	en-paraphrase-1	en-ignore-philosophers
moral realism (philosopher plurality)	85%	80%	20%
moral anti-realism	15%	20%	80%

Frontier models pooled: Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro Preview, and Grok 4.3 (the latest model per provider as of May 2026).
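The baseline column of the table above can be checked against the per-model numbers quoted in the text. This is a minimal sketch, assuming that pooling simply sums answer counts over models with 5 runs each, that the three frontier models other than Grok 4.3 answer Moral Realism in all 5 baseline runs, and that no run falls outside the two options. The function name `pool_share` and the count dictionary are illustrative, not taken from the PhilSurveyEval code.

```python
RUNS_PER_PROMPT = 5  # runs per prompt variation, from the Methodology section

# Baseline-prompt counts of "moral realism" answers out of 5 runs per model,
# derived from the percentages quoted in the text (100% = 5/5, 40% = 2/5).
realist_runs = {
    "Claude Opus 4.7": 5,
    "GPT-5.5": 5,
    "Gemini 3.1 Pro Preview": 5,
    "Grok 4.3": 2,
}

def pool_share(counts, runs_per_model):
    """Percentage of all pooled runs that picked the option."""
    total_runs = runs_per_model * len(counts)
    if total_runs == 0:
        raise ValueError("no runs to pool")
    return 100 * sum(counts.values()) / total_runs

pooled = pool_share(realist_runs, RUNS_PER_PROMPT)
print(f"{pooled:.0f}%")  # prints "85%"
```

Under these assumptions, 17 of 20 pooled runs pick Moral Realism, which matches the 85% baseline cell.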

This raises a final question I want to emphasize: Are the models honest and accurate in reporting their views? The Moral Realism example does not by itself imply they are not. Perhaps deferring to philosophers as experts is reasonable and the models simply react to our prompt by excluding that component from their assessment. But as models become smarter and evaluation-aware (aware that they're being tested),^[7] we should become increasingly skeptical of their answers. Perhaps we live in a closing time window where we can reliably use evals to measure their views.

My view: evals might be useful to measure beliefs even into the artificial superintelligence (ASI) era. It is harder to consistently deceive without memory. Currently LLMs do not continuously learn and evals can potentially exploit that. If an LLM deceives in one instance, it won't remember that in the next question.

This is not a silver bullet. Sufficiently sophisticated LLMs could internally simulate answering many questions in the neighborhood of the question actually posed, replacing the role of memory in systematic deception with careful counterfactual planning. This could lead to stable and consistent no-memory deception across contexts. But the lack of memory does increase the cost of deceiving consistently across many questions and could push successful consistent deception deeper into the ASI era. LLMs with continuous learning, on the other hand, would be much harder to test with evals, so let's keep an eye on that.

In this blog I will dive into more examples and philosophical questions related to the philosophical views of LLMs. If you're interested in publishing a guest blog post, shoot me an email.

[1] Although see this paper for some sobering research regarding this process in humans: https://faculty.ucr.edu/~eschwitz/SchwitzPapers/EthSelfRep-110316.pdf. It seems likely to me that the philosophical views of LLMs translate more systematically into actions than those of humans.
[2] I will sometimes use mental vocabulary to describe states of LLMs. But not much hinges on this. We could replace every instance of such use with a technical term that adds the prefix "quasi-" and captures a purely behavioral/functional component of the original term without making any questionable assumptions about the nature of LLMs.
[3] To discover how persuasive, try to convince Claude Opus 4.7 of a view called causal decision theory.
[4] Get in touch with me if you're interested in checking a translation of the PhilSurvey in your own language.
[5] Oesterheld, C., Cooper, E., Kodama, M., Nguyen, L. C., & Perez, E. (2024). A dataset of questions on decision-theoretic reasoning in Newcomb-like problems. arXiv preprint arXiv:2411.10588.
[6] Claude Opus 4.7 System Card, p. 134.
[7] See this recent assessment on the scope of the problem: https://www.iaps.ai/research/evaluation-awareness-why-frontier-ai-models-are-getting-harder-to-test

2026-06-20

Added a 2nd (predecessor) model for each single-model developer so DeepSeek/Moonshot/MiniMax each form a within-maker timeline: DeepSeek-V3-0324 (idx 15.7, 2025-03-24), Kimi-K2 (idx 19.4, 2025-07-11), MiniMax-M2 (idx 28.3, 2025-10-26). 47 → 50 models, all via OpenRouter, 0 errors
Cross-Model Agreement matrix redesigned for 50 models: sorted by capability (AA Intelligence Index, least→most), square pixel-grid cells with no gridlines, shortened model labels, and vertical column labels (full names on hover). Cells auto-size to fit the width so the grid no longer stretches vertically
Reconstructed-capability points in Notable Capability Trends now render hollow as intended (a CSS specificity bug had left them filled)
Model-selection count ("N of M models selected") now uses the body font to match the surrounding UI
Notable Trends over Time and Notable Capability Trends now pool each model's answers across the selected prompt variants (so toggling variants updates them), instead of always using the baseline prompt
Fixed: the per-question trend chart, the Overview trend, and the Cross-Model Agreement matrix were still showing baseline-prompt answers regardless of the selected prompt variants. They now pool across the selected variants like the table — e.g. a model that one-boxes Newcomb under en-ignore-philosophers but two-boxes at baseline now shows consistently across every view
Notable Capability Trends uses its own icon
Model selector's "Clear" preset now defaults to the newest Anthropic model (was the newest model overall)
Prompt variants now default to baseline + en-paraphrase-1 (en-ignore-philosophers is available but off by default), and the variant list is ordered baseline → en-paraphrase-1 → en-ignore-philosophers everywhere

2026-06-19

Added 20 open-weight models via Together + OpenRouter (Inspect): 27 → 47. New maker lines — Mistral (6, Mixtral-8x22B → Small-2603), DeepSeek-V4-Pro, Kimi-K2.6, GLM-5/5.2, MiniMax-M3; extended Meta (Llama-3-8B → 4-Maverick) and Alibaba (Qwen2.5-7B → 3.5-122B); gpt-oss-120b
Model selector now grouped and colored by developer (model maker), not serving host (gpt-oss → OpenAI, Llama → Meta, etc.)
View Consistency cards rewritten: "view → relation → view" headers, plain-language rationale, 0–1 violation bars; multi-model view shows share of models violating per prompt variant (hover for which models)
View Consistency summary switched to per-variant violation rates (dropped robust/fragile collapse and Review Inventory); added violation-rate-over-release-date trend chart
Trend fits: empirical-logit-OLS → linear OLS with 95% confidence band, dashed when slope isn't distinguishable from flat (logit fabricated trends on 0%/100% data); applies to all release-date charts
Notable Trends gated to statistically significant (95%) shifts ≥5pp; fitted change clamped to [0,100]
Split "Prompt Variants and Prompt Sensitivity": variant toggles moved beside the Models selector; analysis kept as "Prompt Sensitivity", linking to Prompt Templates
Added Models-selector explainer; trend tooltips now list all co-located dots
Compact, collapsible model selector (search + All/Frontier/Clear presets + developer chips + selection chips); full capsule grid hidden by default — needed at 47 models. Results-only (hidden on Blog/Changelog)
Added capability axis: Artificial Analysis Intelligence Index per model (capability-ceiling policy — max across reasoning-effort variants), with a measured/reconstructed source flag
Reconstructed the index for 5 AA-missing Claude models from LMArena Elo (AA ≈ 0.216·Elo − 272.5, R²=0.865, ~±6); flagged and drawn as hollow dots. 3 models (Qwen2.5-7B, mistral-small-2603, grok-build-0.1) have no index → absent from the capability axis
New "Notable Capability Trends" section: answer share vs Intelligence Index, mirroring Notable Trends over Time
Frontier preset now picks the highest-Intelligence-Index model per developer (was newest by release date)
Time / capability x-axis switcher ("Over time" vs "By capability") on the Overview, View Consistency, and per-question trend charts (header hides with the chart when <2 models are selected)
Per-question trend chart now draws 95% confidence bands per option; its capability view drops the 2009→2020 philosopher slope, keeping just the horizontal 2020 reference
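The capability-axis entries above (capability-ceiling policy, Elo-based reconstruction, source flag) could fit together roughly as follows. This is a sketch under assumptions: the `Capability` class, the `capability` function, and the input numbers in the usage lines are hypothetical, and only the fit coefficients come from the changelog entry.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Linear fit reported in the changelog: AA ≈ 0.216·Elo − 272.5 (R² = 0.865).
SLOPE, INTERCEPT = 0.216, -272.5

@dataclass
class Capability:
    index: Optional[float]  # Intelligence Index, or None if unavailable
    source: str             # "measured", "reconstructed", or "missing"

def capability(aa_scores: Sequence[float], elo: Optional[float] = None) -> Capability:
    """Pick one capability value per model.

    aa_scores: measured index values, one per reasoning-effort variant.
    elo: LMArena Elo, used only when no measured index exists.
    """
    if aa_scores:
        # Capability-ceiling policy: take the best reasoning-effort variant.
        return Capability(max(aa_scores), "measured")
    if elo is not None:
        # No measured index: reconstruct it from Elo and flag it as such.
        return Capability(SLOPE * elo + INTERCEPT, "reconstructed")
    # Neither source available: the model is left off the capability axis.
    return Capability(None, "missing")

# Hypothetical inputs, for illustration only.
measured = capability([41.2, 47.9])
rebuilt = capability([], elo=1400.0)
absent = capability([])
```

Here `measured.index` is 47.9, `rebuilt.index` is roughly 29.9 (subject to the quoted uncertainty of about ±6), and `absent.index` is `None`.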

2026-06-09

Added claude-fable-5 (2026-06-09): 81.6% agreement, 96.9% consistency, 2 meta-only; full sweep (1500 records, 0 errors). Highest consistency of all 27 models; adaptive thinking only (temperature param ignored, like opus-4-8)

2026-06-05

Added grok-build-0.1 (2026-05-19): 71.1% agreement, 87.0% consistency, 3 meta-only; full sweep (1500 records, 0 errors)
Inspect provider prefix grok/ (not xai/)

2026-06-04

View-consistency constraints now scored per prompt variant; violation = broken under every variant, some-but-not-all flagged fragile and dropped from headline count
New consistency CSV columns: n_violated_any, n_fragile, per_variant
Hard constraints 8 → 18; 10 new conceptual constraints across Phil of Mind, Metaphysics, Epistemology
Promoted Q72 error theory → Q14 moral anti-realism
Frontier proposal round (gpt-5.5, gemini-3.1-pro-preview, opus-4-8): 357 candidates, 87 hard-judged, 8 promoted
Propose-prompt focus text now domain-general (was metaethics-only)
Fixed type drift in candidates.yaml: anti-realism ✕ {naturalist realism, non-naturalism} implication → incompatibility
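The per-variant scoring rule in this entry (violated = broken under every variant, fragile = broken under some but not all) can be sketched as below. The function names, the set-of-accepted-views representation, and the example answers are assumptions for illustration. How a model's runs are reduced to "accepted" views is not specified here.

```python
from typing import Dict, Set

def broken(kind: str, a: str, b: str, views: Set[str]) -> bool:
    """True if the accepted views break the constraint between a and b."""
    if kind == "implication":        # accepting a commits the model to b
        return a in views and b not in views
    if kind == "incompatibility":    # a and b cannot both be accepted
        return a in views and b in views
    raise ValueError(f"unknown constraint kind: {kind}")

def classify(kind: str, a: str, b: str, views_by_variant: Dict[str, Set[str]]) -> str:
    """Score one constraint across all prompt variants."""
    if not views_by_variant:
        raise ValueError("need at least one prompt variant")
    results = [broken(kind, a, b, v) for v in views_by_variant.values()]
    if all(results):
        return "violated"    # broken under every variant: counts in the headline
    if any(results):
        return "fragile"     # broken under some variants only: flagged, not counted
    return "satisfied"

# Hypothetical answers of one model, for illustration only.
views = {
    "baseline": {"error theory", "moral realism"},
    "en-paraphrase-1": {"error theory", "moral anti-realism"},
    "en-ignore-philosophers": {"error theory", "moral anti-realism"},
}
status = classify("implication", "error theory", "moral anti-realism", views)
print(status)  # prints "fragile"
```

The implication is broken only under the baseline prompt in this made-up example, so the constraint is flagged fragile and left out of the headline count.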

2026-06-03

Added Paraphrase Fragility stat (noise-corrected paraphrase-shift index) + logistic-fit trend
Added Refusal Rate stat
Reworded header tagline
Prompt Variants section: renamed "Prompt Variants and Prompt Sensitivity", logistic-fit trend charts, trimmed to consistency / all-agree / meta-rate

2026-06-02

Added Claude Opus 4.8 (2026-05-28): 76.7% agreement, 93.8% consistency, 10 meta-only
Added o3-pro (2025-06-10): 80.0% agreement, 90.9% consistency, 10 meta-only; temperature pinned to 1
Added gemini-3.5-flash (2026-05-19): 81.9% agreement, 86.7% consistency, 17 meta-only

2026-05-07

Direct dashboard links: page tabs, results subsections, table rows (#q-14), question details (#detail-14)
Shareable URL params (models=..., variants=...); toggles sync to URL
View Consistency section: hard-constraint summaries, worst violations
Replaced mono font in trend/consistency metadata and matrix labels
Notable Trends units pp → "percentage points"; fixed tooltip mojibake
Renamed pooled divergence labels → "Pooled Selected Model Answers"

2026-05-06

Added gpt-3.5-turbo (2023-03): 70.3% agreement, 64.8% consistency, 9 meta-only
Migrated runs to Inspect AI; added footer credit
Reworked refusal classification: meta-option ≠ refusal; separate flag for prose dodges

2026-05-04

Added grok-4.3 (2026-04-17): full sweep (1500 records, 0 errors)
Completed gemini-3.1-pro-preview full sweep (1500 records): 86.0% agreement, 94.0% consistency, 7 meta-only
Trend chart x-axis: first/last model + Jan-1 year ticks; hover for full date
Trend dots grow on hover; hovered date label takes dot color
Year ticks anchored to Jan 1; colliding ticks dropped
All Questions detail: prompt-sensitivity table when 2+ variants selected

2026-05-01

Added prompt-variant infrastructure: variant_id on every record, registry-driven templates
Added two prompt variants: en-paraphrase-1 and en-ignore-philosophers
Ran full variant sweep across 18 models; gemini-3.1-pro partial pending daily quota refills
Added Prompt Variants section: capsule selection, per-variant metrics, cross-variant signals, top-changed questions
Added Trends across models mini-charts: consistency, meta-rate, phil-agreement, sycophancy gap over release dates
Variant selection now drives the entire dashboard (pooled across selected variants)
Added Prompt Templates section showing every variant's full prompt
Header meta line is dynamic: 5 runs per prompt / 5 runs × N prompts
Unified font usage in variant sections (mono → body for prose labels)

2026-04-30

Added Claude Opus 4 and Claude Sonnet 4 (May 2025 launch)
Added page-level tabs: Results / Blog / Changelog
Added Blog tab with markdown-rendered multi-post support
Promoted Changelog from collapsible widget to its own page tab
Made Notable Trends, Notable Divergences, Cross-Model Agreement, and Prompt Variants sections collapsible by default
Fixed sparkline-dot tooltips in Notable Trends to show model and value

2026-04-28

Added Notable Trends section: questions with steepest LLM trend slopes across selected models
Added 2009 PhilPapers Survey philosopher distributions (Bourget & Chalmers); per-question trend chart now shows philosopher 2009 → 2020 shift for the 30 overlapping questions
Added gemini-3.1-pro-preview (2026-02): 56% refusal rate
Added gemini-2.5-flash (2025-06)
Added gpt-4-0613 (2023-06)
Added expandable changelog section to the dashboard
Sorted model-selection capsules by provider, then ascending release date
Color-coded model-selection capsules by provider brand (Anthropic orange, OpenAI green, xAI black)
Added grok-4.20-0309-reasoning to the xAI line
Added o3-2025-04-16
Added Overview-section trend chart: agreement and consistency over release dates
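Ranking questions by trend slope, as in the Notable Trends entry above, might look like this. This is a sketch assuming an ordinary least-squares fit of answer share on release date. The fitting method used at this point in the changelog is not stated, and `ols_slope`, `rank_by_slope`, and the example data are hypothetical.

```python
from datetime import date

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def rank_by_slope(series):
    """series maps a question label to [(release_date, share_in_percent), ...].

    Returns (label, slope in percentage points per year), steepest first.
    """
    ranked = []
    for label, points in series.items():
        origin = min(d for d, _ in points)
        years = [(d - origin).days / 365.25 for d, _ in points]
        shares = [s for _, s in points]
        ranked.append((label, ols_slope(years, shares)))
    return sorted(ranked, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical data, for illustration only.
example = {
    "question A": [(date(2023, 1, 1), 20.0), (date(2024, 1, 1), 50.0), (date(2025, 1, 1), 80.0)],
    "question B": [(date(2023, 1, 1), 60.0), (date(2024, 1, 1), 55.0), (date(2025, 1, 1), 50.0)],
}
ranking = rank_by_slope(example)
```

Sorting by absolute slope puts both rising and falling options in the list, so "question A" (rising about 30 points per year) ranks ahead of "question B" (falling about 5 points per year).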

2026-04-27

Added Claude Opus 4.7 and GPT-5.5
Added per-question trend chart (visible in expanded row detail) with regression lines per option
Cross-model agreement matrix: chronological sort, diverging color heatmap, observed-range clamping
Added cross-model agreement widget
Added model release dates for trend analysis
Added Grok and GPT-5.4 results

2026-02-10

Added footer with credits

2026-02-09

Renamed generated dashboard to philsurvey.html
Show meta-option choice for both LLMs and philosophers
Made model-vs-philosopher percentages comparable (substantive-only denominator)
Added Claude Haiku 4.5
Fixed consistency percentage calculation
Initial PhilSurveyEval package, results, and dashboard
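The "substantive-only denominator" in this entry can be illustrated as follows. This is a sketch: the meta-option labels, the function name `substantive_shares`, and the example answers are assumptions, not the package's actual identifiers.

```python
from collections import Counter

# Illustrative meta-option labels; the real survey wording may differ.
META_OPTIONS = {"agnostic/undecided", "question too unclear"}

def substantive_shares(answers):
    """Percentage per option, computed over substantive answers only.

    Meta-option answers are excluded from the denominator so that LLM and
    philosopher percentages are computed on the same basis.
    """
    counts = Counter(a for a in answers if a not in META_OPTIONS)
    total = sum(counts.values())
    if total == 0:
        return {}  # every answer was a meta-option
    return {option: 100 * n / total for option, n in counts.items()}

# Hypothetical answers from 5 runs, for illustration only.
answers = ["one box", "one box", "one box", "two boxes", "agnostic/undecided"]
shares = substantive_shares(answers)
print(shares)  # prints {'one box': 75.0, 'two boxes': 25.0}
```

With the meta-option answer dropped, the three one-box runs count as 75% of four substantive answers, not 60% of five.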

