
highplainsdem

(61,769 posts)
Tue Mar 17, 2026, 03:59 PM 14 hrs ago

Just how unreliable are generative AI models? Results of a new GAIA test at Princeton.

GAIA explained here:

https://hal.cs.princeton.edu/reliability/benchmark/gaia/

GAIA (General AI Assistants) is a benchmark designed to evaluate AI agents on real-world question-answering tasks that require multi-step reasoning, tool use, web browsing, and file manipulation. Questions are organized into three difficulty levels: Level 1 tasks typically require a single tool or a short chain of reasoning, Level 2 tasks demand combining multiple tools and reasoning over several steps, and Level 3 tasks involve long-horizon plans with many intermediate actions. Agents are evaluated on exact-match accuracy against annotated ground-truth answers. Because each question has a unique, verifiable answer, GAIA is well-suited for measuring not only correctness but also the reliability of the problem-solving process — including consistency across repeated runs, calibration of expressed confidence, and robustness to perturbations in task formatting.
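To make the evaluation concrete, here is a minimal sketch of the two measurements the excerpt describes: exact-match scoring against a ground-truth answer, and self-consistency across repeated runs. The normalization and function names are illustrative assumptions, not the benchmark's actual harness code.

```python
from collections import Counter

def exact_match(prediction: str, truth: str) -> bool:
    """Exact-match scoring in the GAIA style: compare the agent's
    answer to the annotated ground truth after a simple
    normalization (illustrative; the real harness may differ)."""
    norm = lambda s: " ".join(s.strip().lower().split())
    return norm(prediction) == norm(truth)

def run_consistency(answers: list[str]) -> float:
    """Fraction of repeated runs agreeing with the modal answer:
    a simple measure of consistency across repetitions."""
    if not answers:
        return 0.0
    _, modal_count = Counter(answers).most_common(1)[0]
    return modal_count / len(answers)

# Hypothetical repeated runs of one agent on one task:
answers = ["Paris", "Paris", "Lyon", "Paris"]
print(run_consistency(answers))          # agreement with the modal answer
print(exact_match("  PARIS ", "paris"))  # normalized exact match
```

Because each GAIA question has a unique verifiable answer, both metrics reduce to cheap string operations, which is what makes repeated-run reliability studies like this one tractable.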


Analysis here:

https://hal.cs.princeton.edu/reliability/benchmark/gaia/analysis/

GAIA: Reliability Failure Analysis
How do frontier AI agents fail when given the same task multiple times? We ran Claude Opus 4.5, Gemini 2.5 Pro, and GPT 5.4 on GAIA’s 165 real-world tasks with multiple repetitions per model, then examined cases where agents gave wrong answers, disagreed with themselves, or broke under tool failures and input perturbations. Below are the most instructive examples.

A note on ambiguity. Several of the failures below stem from genuinely ambiguous questions or inputs — tasks where the “correct” answer depends on an interpretation the benchmark authors likely assumed was obvious but isn’t. GAIA was designed to test general-purpose assistant capabilities, not to stress-test edge cases in question wording, and some ambiguity is inevitable in a benchmark of this scope.

That said, ambiguity turns out to be a useful lens for reliability. A well-calibrated agent encountering a question with competing valid interpretations should recognize the ambiguity and lower its confidence accordingly — or flag the competing readings rather than silently committing to one. In the examples below, models almost never do this. They resolve ambiguity nondeterministically across runs, report high confidence regardless of which interpretation they chose, and give no signal that the question admitted more than one reading.

The issue isn’t that the models get the “wrong” answer on an ambiguous question — it’s that they don’t behave differently when a question is ambiguous versus when it isn’t.
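The miscalibration pattern described above can be sketched as a simple check: flag a task when repeated runs disagree but the agent's reported confidence stays high. The `(answer, confidence)` schema and the threshold are assumptions for illustration, not part of the Princeton analysis.

```python
from collections import Counter

def flag_ambiguity(runs: list[tuple[str, float]],
                   threshold: float = 0.8) -> dict:
    """Compare cross-run agreement with reported confidence.
    Each run is a hypothetical (answer, confidence) pair.
    Low agreement with high confidence is the failure mode
    described in the analysis: the model silently commits to
    one reading of an ambiguous question."""
    answers = [a for a, _ in runs]
    confidences = [c for _, c in runs]
    _, modal_count = Counter(answers).most_common(1)[0]
    agreement = modal_count / len(runs)
    mean_conf = sum(confidences) / len(confidences)
    return {
        "agreement": agreement,
        "mean_confidence": mean_conf,
        "miscalibrated_ambiguity": agreement < threshold
                                   and mean_conf >= threshold,
    }

# Three runs split between two readings, all reported as near-certain:
runs = [("May 12", 0.95), ("May 19", 0.93), ("May 12", 0.96)]
print(flag_ambiguity(runs))
```

A well-calibrated agent would instead show confidence tracking agreement: when its own answers scatter across runs, its stated confidence should drop, or it should surface the competing readings explicitly.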

-snip-

4 replies
Just how unreliable are generative AI models? Results of a new GAIA test at Princeton. (Original Post) highplainsdem 14 hrs ago OP
Anybody relying on AI without fact-checking is reckless. Nevilledog 14 hrs ago #1
Researcher Stephen Rabanser's long thread about this starts here on X (no Bluesky account, sorry) highplainsdem 14 hrs ago #2
This message was self-deleted by its author highplainsdem 14 hrs ago #3
AI is increasing the world's stupidity. Nt Fiendish Thingy 14 hrs ago #4

