18h ago

FutureSim benchmark evaluates AI agents on continual learning


FutureSim is a benchmark that tests frontier AI agents on continual learning by feeding them real-world news updates in chronological order. It measures GPT-5.5's forecast updates on two questions: the Seattle Seahawks' Super Bowl win probability from January 19 to February 7, and the probability of Balen Shah becoming Nepal's prime minister, which rose from near zero to 74 percent between February 25 and March 6. For each, it records running probabilities, update counts, and Brier skill scores.
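For readers unfamiliar with the metric, a Brier skill score is commonly defined as one minus the ratio of a forecaster's Brier score to that of a reference forecast. The sketch below is a minimal illustration of that textbook definition, not FutureSim's own scoring code; the function names and the constant 0.5 baseline are assumptions chosen for the example, and the benchmark may use a different reference or question format.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and binary outcomes (0 or 1)."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    probs: Sequence[float],
    outcomes: Sequence[int],
    baseline_probs: Sequence[float],
) -> float:
    """Return 1 - BS / BS_ref: 1 is perfect, 0 matches the baseline, negative is worse."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score(baseline_probs, outcomes)
    if bs_ref == 0:
        raise ValueError("baseline is already perfect; skill score is undefined")
    return 1.0 - bs / bs_ref


if __name__ == "__main__":
    # Hypothetical forecasts on two questions, one resolving yes and one no.
    forecasts = [0.9, 0.2]
    resolved = [1, 0]
    uninformed = [0.5, 0.5]  # assumed baseline: always predict 50 percent
    print(brier_skill_score(forecasts, resolved, uninformed))
```

Under this definition, an agent that keeps its forecasts at the baseline scores 0, so a positive score indicates that its updates moved the probabilities toward the eventual outcomes.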

Original post

Continual learning is bottlenecked by realistic evaluations. Introducing FutureSim, which replays real-world events in the temporal order they occurred. We benchmark frontier agents at updating predictions about how our world evolves, in native harnesses like Codex, Claude Code.

10:14 AM · May 15, 2026
Reposted by

💥 Check out our new paper: FutureSim: Replaying World Events to Evaluate Adaptive Agents.

We create a *reproducible* long-horizon environment where agents have to make forecasts during a 3-month period.

The best performing agent, GPT 5.5 in Codex, consumes 3700 turns and 12.4M tokens spanning many sequential context window compactions in a single run.

(Led by @ShashwatGoel7, @nikhilchandak29, @arvindh__a!)

Shashwat Goel @ShashwatGoel7

Continual learning is bottlenecked by realistic evaluations. Introducing FutureSim, which replays real-world events in the temporal order they occurred. We benchmark frontier agents at updating predictions about how our world evolves, in native harnesses like Codex, Claude Code.

5:14 PM · May 15, 2026 · 30.1K Views
5:59 PM · May 15, 2026 · 2.7K Views

What else have we been up to? As models get better and work over longer and longer time horizons, how do we even evaluate how well they can act and adapt?

One domain we really like there is forecasting, as a hard task that tests reasoning under uncertainty.

We've made a benchmark out of this, where we simulate a whole 3-month period of news and let sandboxed models continuously read news from those days, plan, and update their forecasts. (See the animation below; just don't be fooled by its speed, as this is a slice of the larger 12M-token trajectory.)

Many more details linked below:

Shashwat Goel @ShashwatGoel7

Continual learning is bottlenecked by realistic evaluations. Introducing FutureSim, which replays real-world events in the temporal order they occurred. We benchmark frontier agents at updating predictions about how our world evolves, in native harnesses like Codex, Claude Code.

5:14 PM · May 15, 2026 · 30.1K Views
5:49 PM · May 15, 2026 · 2.9K Views
