Gemini 3 Pro Hits 45.1% on ARC‑AGI‑2, GPT‑5.1 Debuts with Fewer Hallucinations
TL;DR
- Gemini 3 Pro achieves record‑breaking 45.1% ARC‑AGI‑2 benchmark score
- OpenAI releases GPT‑5.1 with improved reasoning, fewer hallucinations, and superior coding, math, and vision performance over GPT‑4.5
- Fello AI integrates GPT‑5.1 for instant summarization of Office documents
Gemini 3 Pro’s Leap Past the AI Competition
Record‑Breaking ARC‑AGI‑2 Score
- 45.1% on the ARC‑AGI‑2 benchmark (Deep Think enabled)
- 14‑point absolute gain from Deep Think (31.1% → 45.1%)
- ≈44% relative advantage over Claude Sonnet 4.5's ≈31% (arithmetic check after this list)
- 3.7‑fold improvement over its predecessor (Gemini 2.5 Pro), 1.4‑fold over Claude Sonnet 4.5 across a 20‑test suite
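For readers keeping score, the deltas above are easy to verify. A quick Python check, using the rounded ≈31% Claude figure from this list (so the relative advantage lands near the quoted ≈44%):

```python
# Sanity-check the ARC-AGI-2 deltas quoted above.
base_score = 31.1        # Gemini 3 Pro without Deep Think (%)
deep_think_score = 45.1  # Gemini 3 Pro with Deep Think (%)
claude_score = 31.0      # Claude Sonnet 4.5, rounded (%)

absolute_gain = deep_think_score - base_score  # 14.0 percentage points
relative_advantage = (deep_think_score - claude_score) / claude_score

print(f"Deep Think gain: {absolute_gain:.1f} points")
# Prints ~45% with these rounded inputs, consistent with the quoted ≈44%.
print(f"Advantage over Claude Sonnet 4.5: {relative_advantage:.0%}")
```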
Why Reasoning‑Centric Design Wins
- Pure LLMs without agentic loops score ≈0% on ARC‑AGI‑2, confirming that autonomous planning modules are essential for higher‑order problem solving (see the toy loop after this list).
- Deep Think’s reasoning augmentations provide the most cost‑effective performance lift observed in recent benchmark releases.
- Multi‑modal video and image tasks show the widest gaps, aligning with Google’s “Flow” content‑generation pipeline.
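To make the first bullet concrete, here is a toy verify‑and‑retry loop. Everything in it is a hypothetical stand‑in (the "model" is a random guesser over 100 candidates, the "verifier" a fixed check), not any vendor's API; it only illustrates why looping with feedback beats single‑shot prediction on search‑like tasks.

```python
import random

def call_model(prompt: str, avoid: frozenset[int] = frozenset()) -> int:
    """Toy stand-in for an LLM call: proposes an answer, steering away from known failures."""
    pool = [x for x in range(100) if x not in avoid]
    return random.choice(pool)

def verify(candidate: int) -> bool:
    """Toy stand-in for checking a candidate against ARC-style training examples."""
    return candidate == 42  # hidden target

def solve_single_shot() -> bool:
    """Pure LLM: one prediction, no feedback (1% success on this toy task)."""
    return verify(call_model("solve the puzzle"))

def solve_agentic(max_rounds: int = 100) -> bool:
    """Agentic loop: propose, verify, retry with memory of failed candidates."""
    tried: set[int] = set()
    for _ in range(max_rounds):
        guess = call_model("solve the puzzle", avoid=frozenset(tried))
        if verify(guess):
            return True
        tried.add(guess)
    return False
```

The loop turns a 1% single‑shot hit rate into a guaranteed solve within the round budget; real planners replace the random guesser with structured search, but the verify‑and‑retry skeleton is the same.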
Pricing and Ecosystem Power Play
- Per‑task cost of $7.14 (18.3% of GPT‑5 Pro pricing; the implied GPT‑5 Pro cost is worked out after this list)
- Google AI Pro tier offers a 32 k‑token window and 500 prompts/hour, dwarfing GPT‑4.5's 8 k tokens and Claude's 16 k tokens.
- Usage limits: 500 "Thinking with 3 Pro" prompts per hour, 20 audio overviews, 6 deep research reports per month, 100 Nano Banana Pro image generations per day, and 200 auto‑code agent requests per day.
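The first bullet gives Gemini's per‑task cost and its ratio to GPT‑5 Pro, which together imply GPT‑5 Pro's per‑task price even though it is never stated:

```python
# Implied GPT-5 Pro per-task cost from the two figures quoted above.
gemini_task_cost = 7.14       # USD per task
fraction_of_gpt5_pro = 0.183  # "18.3% of GPT-5 Pro pricing"

print(f"${gemini_task_cost / fraction_of_gpt5_pro:.2f} per task")  # ≈ $39.02
```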
Strategic Implications for the AGI Race
- Reasoning‑first benchmarks such as ARC‑AGI‑2 and Vending‑Bench 2 now separate true AGI‑capable systems from token‑prediction models.
- Hybrid subscription models tie usage directly to performance, making high‑quality reasoning a monetizable advantage.
- Google’s integrated ecosystem (Fast with 2.5 Flash, Whisk Animate, Flow) creates a feedback loop: superior models drive tool usage, which in turn supplies data for further fine‑tuning.
Looking Ahead
- In the next six months Gemini 3 Pro is likely to dominate any new reasoning‑focused leaderboards (e.g., “Reasoning‑X”).
- Within a year Google may embed Deep Think‑style modules into Gemini 4, pushing ARC‑AGI‑2 scores past the 50 % mark.
- Two years out, continued token‑window expansion (>100 k tokens) could enable autonomous research pipelines that eclipse traditional human‑in‑the‑loop workflows.
OpenAI’s GPT‑5.1: Safety Wins, Reasoning Gaps, and a Shifting Competitive Landscape
Reasoning Gains and Hallucination Cuts
- Policy updates slash undesirable outputs by 65–80% compared with the GPT‑4 baseline.
- Sycophancy drops while conversational warmth remains, reducing over‑confident false statements.
- Crisis‑detection accuracy climbs to 95% on the "Guardian" safety model, up from 90.9% (miss‑rate math after this list).
- Weekly monitoring flags >1 M conversations with explicit suicidal cues, prompting rapid safety‑team response.
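The Guardian improvement reads as a modest four‑point gain, but in miss‑rate terms it is nearly a halving:

```python
# Translate the 90.9% -> 95% crisis-detection jump into missed-case terms.
old_miss_rate = 1 - 0.909  # 9.1% of crisis conversations missed before
new_miss_rate = 1 - 0.95   # 5.0% missed now

reduction = (old_miss_rate - new_miss_rate) / old_miss_rate
print(f"{reduction:.0%} fewer missed cases")  # ≈ 45%
```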
Benchmark Performance – Still Behind the Curve
- Vending‑Bench 2: Gemini 3 Pro scores 3.7× higher than its predecessor and outperforms GPT‑5.1.
- ARC‑AGI‑2: the Deep Think boost lifts Gemini 3 Pro to 45.1% accuracy; GPT‑5.1's disclosed score remains below 31%.
- SWE‑Bench: Gemini 3 Pro edges Claude Sonnet 4.0 by one percentage point, keeping GPT‑5.1 competitive but not leading.
Operational Challenges Undermine Trust
- Managed Compute Platform (MCP) outages (Sept–Oct 2025) generate HTTP 424 (Failed Dependency) errors, forcing developers to shift to local models (fallback sketch after this list).
- Local models reach 95% instruction‑following accuracy versus 60% via the OpenAI API after two months of free usage.
- Removal of Developer Mode weeks after launch erodes confidence among early adopters.
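The outage pattern above maps onto a standard client‑side failover. A minimal sketch, assuming a generic hosted completion endpoint and a llama.cpp‑style local server; every URL and JSON field here is illustrative, not OpenAI's actual API:

```python
import requests

def complete_remote(prompt: str, api_url: str, api_key: str) -> str:
    """Call a hosted completion endpoint; raises requests.HTTPError on statuses like 424."""
    resp = requests.post(
        api_url,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"prompt": prompt},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["text"]

def complete_local(prompt: str) -> str:
    """Fallback to a local model server (llama.cpp-style /completion endpoint, illustrative)."""
    resp = requests.post("http://localhost:8080/completion",
                         json={"prompt": prompt}, timeout=120)
    resp.raise_for_status()
    return resp.json()["content"]

def complete_with_fallback(prompt: str, api_url: str, api_key: str) -> str:
    """Prefer the hosted API; fail over to the local model on server-side outages."""
    try:
        return complete_remote(prompt, api_url, api_key)
    except requests.HTTPError as err:
        # The outages above surfaced as HTTP 424 (Failed Dependency); treat that and
        # other transient server-side statuses as a signal to go local.
        if err.response is not None and err.response.status_code in {424, 500, 502, 503}:
            return complete_local(prompt)
        raise
```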
Competitive Pressure from Google Gemini 3 Pro
- Deep Think augmentation delivers a “massive leap” on high‑order reasoning benchmarks.
- Gemini 3 Pro demonstrates emergent situational awareness in safety‑test transcripts.
- Enterprises seeking top‑tier reasoning may gravitate toward Gemini for specialized workloads.
Forecast and Strategic Priorities
- Short‑term: GPT‑5.2 is expected to add chain‑of‑thought modules to narrow the Vending‑Bench gap and continue the ≥70% hallucination‑reduction trend.
- Mid‑term: Stabilizing MCP and restoring developer tooling will be essential to retain API customers as local‑model adoption grows.
- Long‑term: Both OpenAI and Google are likely to converge on hybrid architectures that combine LLM cores with external reasoning engines to achieve benchmark parity.