
highplainsdem

(61,769 posts)
Tue Mar 17, 2026, 03:59 PM 14 hrs ago

Just how unreliable are generative AI models? Results of a new GAIA test at Princeton.

GAIA explained here:

https://hal.cs.princeton.edu/reliability/benchmark/gaia/

GAIA (General AI Assistants) is a benchmark designed to evaluate AI agents on real-world question-answering tasks that require multi-step reasoning, tool use, web browsing, and file manipulation. Questions are organized into three difficulty levels: Level 1 tasks typically require a single tool or a short chain of reasoning, Level 2 tasks demand combining multiple tools and reasoning over several steps, and Level 3 tasks involve long-horizon plans with many intermediate actions. Agents are evaluated on exact-match accuracy against annotated ground-truth answers. Because each question has a unique, verifiable answer, GAIA is well-suited for measuring not only correctness but also the reliability of the problem-solving process — including consistency across repeated runs, calibration of expressed confidence, and robustness to perturbations in task formatting.
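To make the evaluation concrete, here is a minimal sketch of the two measurements the excerpt describes: exact-match scoring against a ground-truth answer, and self-consistency across repeated runs. The normalization and function names are illustrative assumptions, not the benchmark's actual harness code.

```python
from collections import Counter

def exact_match(prediction: str, truth: str) -> bool:
    """Exact-match scoring in the GAIA style: compare the agent's
    answer to the annotated ground truth after a simple
    normalization (illustrative; the real harness may differ)."""
    norm = lambda s: " ".join(s.strip().lower().split())
    return norm(prediction) == norm(truth)

def run_consistency(answers: list[str]) -> float:
    """Fraction of repeated runs agreeing with the modal answer:
    a simple measure of consistency across repetitions."""
    if not answers:
        return 0.0
    _, modal_count = Counter(answers).most_common(1)[0]
    return modal_count / len(answers)

# Hypothetical repeated runs of one agent on one task:
answers = ["Paris", "Paris", "Lyon", "Paris"]
print(run_consistency(answers))          # agreement with the modal answer
print(exact_match("  PARIS ", "paris"))  # normalized exact match
```

Because each GAIA question has a unique verifiable answer, both metrics reduce to cheap string operations, which is what makes repeated-run reliability studies like this one tractable.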


Analysis here:

https://hal.cs.princeton.edu/reliability/benchmark/gaia/analysis/

GAIA: Reliability Failure Analysis
How do frontier AI agents fail when given the same task multiple times? We ran Claude Opus 4.5, Gemini 2.5 Pro, and GPT 5.4 on GAIA’s 165 real-world tasks with multiple repetitions per model, then examined cases where agents gave wrong answers, disagreed with themselves, or broke under tool failures and input perturbations. Below are the most instructive examples.

A note on ambiguity. Several of the failures below stem from genuinely ambiguous questions or inputs — tasks where the “correct” answer depends on an interpretation the benchmark authors likely assumed was obvious but isn’t. GAIA was designed to test general-purpose assistant capabilities, not to stress-test edge cases in question wording, and some ambiguity is inevitable in a benchmark of this scope.

That said, ambiguity turns out to be a useful lens for reliability. A well-calibrated agent encountering a question with competing valid interpretations should recognize the ambiguity and lower its confidence accordingly — or flag the competing readings rather than silently committing to one. In the examples below, models almost never do this. They resolve ambiguity nondeterministically across runs, report high confidence regardless of which interpretation they chose, and give no signal that the question admitted more than one reading.

The issue isn’t that the models get the “wrong” answer on an ambiguous question — it’s that they don’t behave differently when a question is ambiguous versus when it isn’t.
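The miscalibration pattern described above can be sketched as a simple check: flag a task when repeated runs disagree but the agent's reported confidence stays high. The `(answer, confidence)` schema and the threshold are assumptions for illustration, not part of the Princeton analysis.

```python
from collections import Counter

def flag_ambiguity(runs: list[tuple[str, float]],
                   threshold: float = 0.8) -> dict:
    """Compare cross-run agreement with reported confidence.
    Each run is a hypothetical (answer, confidence) pair.
    Low agreement with high confidence is the failure mode
    described in the analysis: the model silently commits to
    one reading of an ambiguous question."""
    answers = [a for a, _ in runs]
    confidences = [c for _, c in runs]
    _, modal_count = Counter(answers).most_common(1)[0]
    agreement = modal_count / len(runs)
    mean_conf = sum(confidences) / len(confidences)
    return {
        "agreement": agreement,
        "mean_confidence": mean_conf,
        "miscalibrated_ambiguity": agreement < threshold
                                   and mean_conf >= threshold,
    }

# Three runs split between two readings, all reported as near-certain:
runs = [("May 12", 0.95), ("May 19", 0.93), ("May 12", 0.96)]
print(flag_ambiguity(runs))
```

A well-calibrated agent would instead show confidence tracking agreement: when its own answers scatter across runs, its stated confidence should drop, or it should surface the competing readings explicitly.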

-snip-

4 replies
Just how unreliable are generative AI models? Results of a new GAIA test at Princeton. (Original Post) highplainsdem 14 hrs ago OP
Anybody relying on AI without fact-checking is reckless. Nevilledog 14 hrs ago #1
Researcher Stephen Rabanser's long thread about this starts here on X (no Bluesky account, sorry) highplainsdem 14 hrs ago #2
This message was self-deleted by its author highplainsdem 14 hrs ago #3
AI is increasing the world's stupidity. Nt Fiendish Thingy 14 hrs ago #4

