Gemini 3.1 Pro: Google DeepMind's New Model Doubles ARC-AGI Score with 1M Context Window
Google DeepMind's Gemini 3.1 Pro scored 77.1% on ARC-AGI-2, more than double its predecessor. The new model features a 1M token context window and leads on 13 of 16 benchmarks at unchanged pricing.
Gemini 3.1 Pro: Google DeepMind's New Model Doubles ARC-AGI Score with 1M Context Window
Gemini 3.1 Pro: Google DeepMind's New Model Doubles ARC-AGI Score with 1M Context Window
Google DeepMind has released Gemini 3.1 Pro, and it puts Google back at the top of the AI benchmark charts. The model scored 77.1% on ARC-AGI-2, a test of pure logic and novel problem-solving, more than doubling its predecessor's score.
Released on February 19, 2026, Gemini 3.1 Pro leads on 13 of 16 major benchmarks while keeping pricing identical to Gemini 3 Pro. For developers and enterprises building agentic systems, this is now the strongest general-purpose model available.
The Benchmark Breakthrough
The headline number is 77.1% on ARC-AGI-2. This benchmark matters because models cannot memorize their way through it. The test requires genuine reasoning on novel problems, making it one of the hardest evaluations in AI.
Gemini 3 Pro scored around 35% on ARC-AGI-2. Gemini 3.1 Pro more than doubled that in a single generation.
On GPQA Diamond, which tests expert-level scientific knowledge across biology, physics, and chemistry, Gemini 3.1 Pro hit 94.3%. That puts it ahead of both Claude Opus 4.6 and GPT-5.2 on scientific reasoning.
The 1M Token Context Window
Gemini 3.1 Pro ships with a 1 million token context window. This allows the model to process roughly 750,000 words in a single conversation, equivalent to about 10 full-length novels.
For enterprises, this means:
- •Large codebases: Analyze entire repositories without chunking
- •Legal documents: Process complete contracts and case files
- •Research papers: Synthesize hundreds of pages of technical content
- •Customer data: Maintain full conversation history for personalized agents
The 1M context window matches what Anthropic offers with Claude Opus 4.6 and Sonnet 4.6, but at Google's Pro-tier pricing.
Pricing That Undercuts Competitors
Google kept Gemini 3.1 Pro pricing identical to Gemini 3 Pro:
| Model | Input Cost | Output Cost |
|---|---|---|
| Gemini 3.1 Pro | $2.00/million tokens | $8.00/million tokens |
| Claude Opus 4.6 | $15/million tokens | $75/million tokens |
| Claude Sonnet 4.6 | $3/million tokens | $15/million tokens |
| GPT-5.3 Codex | ~$1.75/million tokens | ~$10/million tokens |
For an enterprise processing 10 million tokens per day, Gemini 3.1 Pro costs $20/day in input tokens. Claude Opus 4.6 would cost $150/day for the same volume.
The value proposition is clear: frontier-level performance at a fraction of flagship pricing.
What Makes Gemini 3.1 Pro Different
Multimodal Native
Gemini 3.1 Pro was built from the ground up to handle text, images, audio, video, and code in a single model. Unlike models that bolt vision capabilities onto a text-only foundation, Gemini processes all modalities through unified architecture.
This matters for:
- •Video analysis: Extract insights from meeting recordings, security footage, or content
- •Document processing: Handle PDFs with embedded charts, diagrams, and images
- •Code with context: Analyze repositories with documentation, screenshots, and diagrams
Agentic Capabilities
On multi-step reasoning and long-horizon planning tasks, Gemini 3.1 Pro shows significant improvements over its predecessor. The model maintains coherence across extended agent workflows, reducing the failure rate that plagues multi-turn deployments.
Early testing shows Gemini 3.1 Pro excels at:
- •Multi-step research synthesis
- •Complex code generation with testing and iteration
- •Long-running data analysis pipelines
- •Autonomous task completion with tool use
Access Via Multiple Platforms
Gemini 3.1 Pro is available through:
- •Gemini API: Direct developer access
- •Vertex AI: Google Cloud enterprise platform
- •Google Antigravity: Google's internal AI infrastructure
- •Gemini Advanced: Consumer subscription at ~$18.99/month
Competitive Landscape
Gemini 3.1 Pro enters a crowded February 2026 release window that included:
- •Claude Opus 4.6 (Feb 4): Anthropic's flagship, 68.8% ARC-AGI-2
- •Claude Sonnet 4.6 (Feb 17): Near-Opus performance at lower cost
- •GPT-5.3 Codex (Feb 5): OpenAI's coding specialist
- •Grok 4.20 (Feb 17): xAI's multi-agent architecture
- •Qwen 3.5 (Feb 2026): Alibaba's open-weight contender
On raw benchmark breadth, Gemini 3.1 Pro leads the pack. On the GDPval-AA human preference benchmark, which measures real expert-level office work, Sonnet 4.6 still holds the top spot at 1,633 Elo points.
The trade-off: Gemini 3.1 Pro wins on logic and scientific reasoning benchmarks. Claude Sonnet 4.6 wins on human preference for complex professional work. Both are viable choices depending on your use case.
What This Means for Developers
If you're building agentic systems, Gemini 3.1 Pro is worth serious consideration:
1. Cost efficiency: At $2/M input tokens, you can run more iterations for the same budget
2. Context capacity: 1M tokens means less engineering around context limits
3. Multimodal: Single model handles text, images, audio, and video
4. Benchmark leadership: 13 of 16 benchmarks suggests consistent quality
The main consideration is human preference. If your users prioritize Claude's writing style and reasoning approach, Sonnet 4.6 at $3/M input remains competitive. If you need maximum benchmark performance for logic and reasoning tasks, Gemini 3.1 Pro has the edge.
The Bottom Line
Google DeepMind needed a win after months of relative quiet in the model release cycle. Gemini 3.1 Pro delivers.
The combination of 77.1% ARC-AGI-2, 1M context window, and unchanged pricing makes this the best value proposition for developers building agentic systems. Whether it displaces Claude in professional workflows depends on your specific use case, but the benchmark numbers speak for themselves.
For enterprises evaluating AI models in Q1 2026, Gemini 3.1 Pro belongs on your shortlist.
Sources:
Share this article
About NeuralStackly Team
Expert researcher and writer at NeuralStackly, dedicated to finding the best AI tools to boost productivity and business growth.
View all postsRelated Articles
Continue reading with these related posts
Claude Sonnet 4.6: Anthropic's Mid-Tier Model Now Matches Flagship Opus at One-Fifth the Cost
Claude Sonnet 4.6: Anthropic's Mid-Tier Model Now Matches Flagship Opus at One-Fifth the Cost
Anthropic's Claude Sonnet 4.6 delivers near-Opus performance across coding, computer use, and agentic tasks while costing 80% less. The new default model features a 1M token con...
GPT-5.4 Beats Human Professionals 83% of the Time in Real-World Tasks
GPT-5.4 Beats Human Professionals 83% of the Time in Real-World Tasks
OpenAI's GPT-5.4 matches or exceeds human professionals in 83% of professional tasks across 44 occupations, from financial analysis to legal work. The model reduces errors by 18...
Gemini 3.1 Flash Lite: Google's Fastest Model at 1/8th the Cost of Pro
Gemini 3.1 Flash Lite: Google's Fastest Model at 1/8th the Cost of Pro
Google's Gemini 3.1 Flash Lite delivers 2.5x faster response times at $0.25 per million input tokens — roughly one-eighth the cost of Gemini 3.1 Pro. New thinking levels feature...

Claude Opus 4.6 Released: Anthropic's Smartest Model Yet Scores 65.4% on Terminal-Bench
Anthropic releases Claude Opus 4.6 with major upgrades to agentic coding, 1M token context window, and state-of-the-art performance on Terminal-Bench 2.0 and Humanity's Last Exam.
Claude Opus 4.7 Released: Benchmarks, Pricing, and What Changed From Opus 4.6
Claude Opus 4.7 Released: Benchmarks, Pricing, and What Changed From Opus 4.6
Anthropic's Claude Opus 4.7 is now available with major gains in agentic coding, high-resolution vision, and a new xhigh effort level. Full benchmarks, pricing, migration guide,...