Microsoft AI Frontiers researchers develop ECHO, a training method that adds environment prediction loss to GRPO so CLI agents build internal world models of terminal environments during reinforcement learning

REPLY

#55Lucas Beyer (bl16)@GIFFMANA

@DimitrisPapail @ChengleiSi This is nice work, so sorry for distracting from the substance, but I'm genuinely curious:

> This work was done at AI Frontiers, a boutique research lab inside Microsoft Research.

What does "boutique research lab" mean here?

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

4:22 PM · May 18, 2026 · 6.8K Views

REPLY

#55Lucas Beyer (bl16)@GIFFMANA

@AlexGDimakis @DimitrisPapail @ChengleiSi Hehe

Alex Dimakis@AlexGDimakis

@giffmana @DimitrisPapail @ChengleiSi Lucas, in a world of commodity models and scaled slop, a boutique research lab proposes something more deliciously bold: Think of Mozambique cashmere agents, asymmetrical overall environments and locally sourced world model custom losses.

6:07 PM · May 18, 2026 · 709 Views

6:18 PM · May 18, 2026 · 81 Views

REPLY

#56Dan Roy@ROYDANROY

@DimitrisPapail Cool stuff. I'd like to see a JEPA version though. A lot of the output of terminals is not useful to predict _most of the time_. You could at least speed up learning if you abstracted away some of the detail.

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

10:02 PM · May 18, 2026 · 1.2K Views

REPLY

#160Omar Khattab@LATEINTERACTION

@DimitrisPapail This work is so cool as always @DimitrisPapail , and you are too kind!!

Dimitris Papailiopoulos@DimitrisPapail

I'm just glad we did this before @lateinteraction and his amazing students :p

4:36 PM · May 18, 2026 · 2.8K Views

6:54 PM · May 18, 2026 · 293 Views

QUOTE POST

#172Alex Dimakis@ALEXGDIMAKIS

Improve your agents with one weird trick: ECHO says, when you SFT an agent, do not train it to predict only the agent replies, but also the terminal responses. When you GRPO, you use the same rollout to predict the terminal responses with cross entropy loss. It's basically free and gets extra supervision from the CLI. This apparently helps the model develop a 'world model' of the terminal, and improves performance, which was very surprising to me.

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

5:26 PM · May 18, 2026 · 6K Views
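A rough sketch of the objective described in the post above, assuming one rollout whose tokens are tagged as either agent-written or terminal-written. The function name, the `env_weight` coefficient, and the simplified policy term (no clipping, no KL penalty) are illustrative assumptions, not details confirmed by the thread or the linked article.

```python
import math

def echo_style_loss(token_logprobs, is_env_token, advantage, env_weight=1.0):
    """Combine a simplified policy term with an environment-prediction term.

    token_logprobs: log-probability the model assigns to each rollout token
    is_env_token:   True for terminal-output tokens, False for agent tokens
    advantage:      scalar group-relative advantage for this rollout
    env_weight:     hypothetical mixing coefficient (an assumption)
    """
    agent_lps = [lp for lp, e in zip(token_logprobs, is_env_token) if not e]
    env_lps = [lp for lp, e in zip(token_logprobs, is_env_token) if e]

    # Policy term: advantage-weighted log-likelihood on agent tokens only.
    # Full GRPO also clips probability ratios; that is omitted here.
    policy_loss = -advantage * sum(agent_lps) / len(agent_lps) if agent_lps else 0.0

    # Environment term: plain cross-entropy (mean negative log-likelihood)
    # on terminal tokens, with no advantage weighting.
    env_loss = -sum(env_lps) / len(env_lps) if env_lps else 0.0

    return policy_loss + env_weight * env_loss, policy_loss, env_loss


# Usage: two agent tokens followed by two terminal tokens.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]
total, policy_term, env_term = echo_style_loss(
    logprobs, [False, False, True, True], advantage=1.0
)
print(total, policy_term, env_term)
```

Both terms read the same per-token log-probabilities from a single forward pass over the rollout, which is why the extra term is described in the thread as nearly free.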

REPLY

#172Alex Dimakis@ALEXGDIMAKIS

@giffmana @DimitrisPapail @ChengleiSi Lucas, in a world of commodity models and scaled slop, a boutique research lab proposes something more deliciously bold: Think of Mozambique cashmere agents, asymmetrical overall environments and locally sourced world model custom losses.

Lucas Beyer (bl16)@giffmana

@DimitrisPapail @ChengleiSi This is nice work, so sorry for distracting from the substance, but I'm genuinely curious: > This work was done at AI Frontiers, a boutique research lab inside Microsoft Research. What does "boutique research lab" mean here?

4:22 PM · May 18, 2026 · 6.8K Views

6:07 PM · May 18, 2026 · 709 Views

QUOTE POST

#185Eric Horvitz@ERICHORVITZ

Exciting results & direction with learning from signals from the environment—with implications for continual learning about the world. @DimitrisPapail @MSFTResearch

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

11:57 PM · May 18, 2026 · 1.6K Views

QUOTE POST

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

Turns out training your agent to be a world simulator improves its accuracy of solving problems

Yifu Qiu@ICLR 2026@yifuqiu98

Internalizing world modeling as a native ability for agents.

2:45 PM · May 18, 2026 · 11.3K Views

2:48 PM · May 18, 2026 · 12.3K Views

QUOTE POST

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

Very rarely you stumble on a method that's simple, obvious in hindsight, free, and touches on every problem you care about: CLI agents, continual learning, self-improvement, world models.

ECHO is one of those

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

4:00 PM · May 18, 2026 · 68.7K Views

REPLY

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

@ysu_nlp @VaishShrivas i feel in many ways the terminal is very unique because it returns the environment's response to policy actions in the same format as the actions themselves: tokens. Which is computed for free, and the trainer ALREADY computes logits etc for. So it's 100% free lunch... kinda wild

Yu Su@ysu_nlp

nice work by @DimitrisPapail and @VaishShrivas! this work is reinforcing a recent trend that tries to make foundation models jointly predict future states (aka 'world models') and actions instead of actions alone. we're seeing it in different forms, like World Action Models in embodied agents, or implicit world modeling in Early Experience (https://arxiv.org/abs/2510.08558). also some interesting link to on-policy self-distillation. shared learning here is, there's still rich supervision signals that are underexplored. such signals were hard to exploit in classic ML, but foundation models have made it possible, potentially creating a recursive self-improvement loop.

1:51 AM · May 19, 2026 · 20.4K Views

2:49 AM · May 19, 2026 · 1.1K Views

REPLY

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

@willccbb Humbled by the kind words. I also agree, it's bitter pilled AF

will brown@willccbb

god what a beautiful objective. i wonder how general you can push this. best non-distillation answer ive seen for knowledge acq during RL, feels bitter-pilled in a way that most self-teaching methods aren’t.

8:07 PM · May 18, 2026 · 82.6K Views

8:17 PM · May 18, 2026 · 2.1K Views

REPLY

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

@willccbb yup! can also work in the absence of the normal GRPO loss which is also kinda nuts (as long as your tasks and current model are in some sense rich). I have no freaking clue what the ceiling is here

will brown@willccbb

@DimitrisPapail i’d given up on the idea of using the rollout env tokens directly but had always still had the adv term in there (which doesn’t work, for reasons i now understand better). but dropping it makes so much more sense

8:22 PM · May 18, 2026 · 1.8K Views

8:24 PM · May 18, 2026 · 309 Views

REPLY

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

@willccbb good question. will try it, makes a lot of sense.

will brown@willccbb

@DimitrisPapail i’d be very curious to see it on small-taskset search with an efficiency bonus. does the model learn new facts and not need to search every time?

8:25 PM · May 18, 2026 · 323 Views

8:25 PM · May 18, 2026 · 304 Views

REPLY

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

@BlackHC @willccbb Eh what’s six months in the infinite of the universe

Andreas Kirsch 🇺🇦@BlackHC

@willccbb I think we tried something like that last year or so 😅 so maybe open-source is way more than six months behind in some areas

9:38 PM · May 18, 2026 · 1K Views

12:10 AM · May 19, 2026 · 213 Views

QUOTE POST

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

One aspect that I also appreciate about ECHO is that it can reduce reliance on SFT data to jump-start a CLI agent.

An example: comparing with the OpenThoughts-Agent which is Qwen3-8B SFT’d on ∼15k GLM-4.6 trajectories, ECHO on base Qwen and NO SFT closes the gap.

Kinda cool!

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

1:11 PM · May 19, 2026 · 5.7K Views

QUOTE POST

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

Lol you can continual learn by training on terminal outputs WITHOUT REWARDS

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

1:50 PM · May 18, 2026 · 32.7K Views

QUOTE POST

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

Prediction: by end of 2026 Echo will be part of standard agent RL trainers.

FREE LUNCH FOR EVERYONE

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

1:43 PM · May 18, 2026 · 7.3K Views

REPLY

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

@NovaSkyAI here's a simple skyRL patch to train better CLI agents, for free

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

1:47 PM · May 18, 2026 · 859 Views

REPLY

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

@giffmana @ChengleiSi A small group, of talented people, that are given free space to explore ideas that matter in the broader scope of AI, and specifically the area of computer use agents, but don't cost 1M to test :)

Lucas Beyer (bl16)@giffmana

@DimitrisPapail @ChengleiSi This is nice work, so sorry for distracting from the substance, but I'm genuinely curious: > This work was done at AI Frontiers, a boutique research lab inside Microsoft Research. What does "boutique research lab" mean here?

4:22 PM · May 18, 2026 · 6.8K Views

4:26 PM · May 18, 2026 · 2.4K Views

REPLY

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

@giffmana @ChengleiSi I came up with the phrasing, because it reminds me of how I'd describe with two words DM in its early days. One can only hope to come approximately close to that intellectual and technical space.

Dimitris Papailiopoulos@DimitrisPapail

@giffmana @ChengleiSi A small group, of talented people, that are given free space to explore ideas that matter in the broader scope of AI, and specifically the area of computer use agents, but don't cost 1M to test :)

4:26 PM · May 18, 2026 · 2.4K Views

4:27 PM · May 18, 2026 · 935 Views

REPLY

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

@giffmana @ChengleiSi also thanks for reading up to that part :D i know you have a ton of cool stuff to work on today, so I'm grateful for your time.

Dimitris Papailiopoulos@DimitrisPapail

@giffmana @ChengleiSi I came up with the phrasing, because it reminds me of how I'd describe with two words DM in its early days. One can only hope to come approximately close to that intellectual and technical space.

4:27 PM · May 18, 2026 · 935 Views

4:28 PM · May 18, 2026 · 752 Views

REPLY

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

I'm just glad we did this before @lateinteraction and his amazing students :p

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

4:36 PM · May 18, 2026 · 2.8K Views

REPLY

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

@AlexGDimakis @giffmana @ChengleiSi lol

Alex Dimakis@AlexGDimakis

@giffmana @DimitrisPapail @ChengleiSi Lucas, in a world of commodity models and scaled slop, a boutique research lab proposes something more deliciously bold: Think of Mozambique cashmere agents, asymmetrical overall environments and locally sourced world model custom losses.

6:07 PM · May 18, 2026 · 709 Views

6:09 PM · May 18, 2026 · 419 Views

REPLY

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

@roydanroy I agree! We have some thoughts but they are related to compaction rather than Jepa.

Dan Roy@roydanroy

@DimitrisPapail Cool stuff. I'd like to see a JEPA version though. A lot of the output of terminals is not useful to predict _most of the time_. You could at least speed up learning if you abstracted away some of the detail.

10:02 PM · May 18, 2026 · 1.2K Views

10:10 PM · May 18, 2026 · 687 Views

QUOTE POST

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

Just realized ECHO fits a years-long obsession with transformers and computers.

"Looped Transformers are Computers" "Can You Train a Transformer to be Computer?" And now "Can You Train a Transformer to Simulate a Computer?"

Blame my hobbyist love of theory of computation

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

2:41 PM · May 20, 2026 · 6.2K Views

REPLY

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

@ziv_ravid Thanks for checking it out Ravid!

Ravid Shwartz Ziv@ziv_ravid

Very cool work. I also think that the signal from the terminal is so underestimated (similar to RLM). and to have a strong opinion on the title is also my thing 😁

1:17 AM · May 19, 2026 · 1.8K Views

2:36 AM · May 19, 2026 · 175 Views

QUOTE POST

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

World modeling. Faster RL. Self-improvement without verifiers.

All from one extra loss term on your favorite open-weights CLI agent.

Happy Monday!

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

1:41 PM · May 18, 2026 · 30.9K Views

REPLY

#197Dimitris Papailiopoulos@DIMITRISPAPAIL

@ChenhaoTan Thanks for checking it out! I agree. You don't get too many of those in your career, so happy we stumbled upon it

Chenhao Tan@ChenhaoTan

Always a good sign that you are surprised that something has not been done before!

3:47 PM · May 18, 2026 · 1.3K Views

3:48 PM · May 18, 2026 · 121 Views

REPLY

#228Andreas Kirsch 🇺🇦@BLACKHC

@willccbb I think we tried something like that last year or so 😅 so maybe open-source is way more than six months behind in some areas

will brown@willccbb

god what a beautiful objective. i wonder how general you can push this. best non-distillation answer ive seen for knowledge acq during RL, feels bitter-pilled in a way that most self-teaching methods aren’t.

8:07 PM · May 18, 2026 · 82.6K Views

9:38 PM · May 18, 2026 · 1K Views

QUOTE POST

#332John Langford@JOHNCLANGFORD

A fun result: training to predict terminal output significantly accelerates RL for terminal agents.

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

2:12 PM · May 18, 2026 · 1.9K Views

QUOTE POST

#339will brown@WILLCCBB

god what a beautiful objective. i wonder how general you can push this. best non-distillation answer ive seen for knowledge acq during RL, feels bitter-pilled in a way that most self-teaching methods aren’t.

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

8:07 PM · May 18, 2026 · 82.6K Views

REPLY

#339will brown@WILLCCBB

a litmus test i’ve been thinking about for continual learning is bounding lifetime retrieval count per fact. a model should use tools to look things up, but gradually compound fuzzy memories of things they’ve searched, and eventually not need search. this could maybe work here

will brown@willccbb

god what a beautiful objective. i wonder how general you can push this. best non-distillation answer ive seen for knowledge acq during RL, feels bitter-pilled in a way that most self-teaching methods aren’t.

8:07 PM · May 18, 2026 · 82.6K Views

8:10 PM · May 18, 2026 · 3.1K Views
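One way to make the litmus test in the post above concrete is to count tool lookups per fact over an agent's lifetime and flag facts that exceed a budget. The class name, the `fact_id` keys, and the budget value below are hypothetical; the thread does not specify how facts would be identified or what bound would be reasonable.

```python
from collections import Counter

class RetrievalBudget:
    """Track how often each fact is looked up and flag over-budget facts."""

    def __init__(self, max_lookups_per_fact):
        self.max_lookups = max_lookups_per_fact
        self.lookups = Counter()

    def record_lookup(self, fact_id):
        # Call once each time the agent uses a tool to retrieve this fact.
        self.lookups[fact_id] += 1

    def violations(self):
        # Facts retrieved more often than the budget allows, i.e. facts the
        # model keeps searching for instead of remembering.
        return sorted(f for f, n in self.lookups.items() if n > self.max_lookups)


# Usage: a budget of two lookups per fact.
budget = RetrievalBudget(max_lookups_per_fact=2)
for fact in ["capital_of_france", "capital_of_france", "capital_of_france", "pi_digits"]:
    budget.record_lookup(fact)
print(budget.violations())
```

Under this reading, a continual learner passes the test when the violation list stays empty as the number of episodes grows.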

REPLY

#339will brown@WILLCCBB

it also offers a clean bridge from pretraining to RL, which is another property i think we should expect from general continual learning methods. initially, everything is env tokens

will brown@willccbb

a litmus test i’ve been thinking about for continual learning is bounding lifetime retrieval count per fact. a model should use tools to look things up, but gradually compound fuzzy memories of things they’ve searched, and eventually not need search. this could maybe work here

8:10 PM · May 18, 2026 · 3.1K Views

8:14 PM · May 18, 2026 · 2.4K Views

REPLY

#339will brown@WILLCCBB

@DimitrisPapail i’d given up on the idea of using the rollout env tokens directly but had always still had the adv term in there (which doesn’t work, for reasons i now understand better). but dropping it makes so much more sense

Dimitris Papailiopoulos@DimitrisPapail

@willccbb Humbled by the kind words. I also agree, it's bitter pilled AF

8:17 PM · May 18, 2026 · 2.1K Views

8:22 PM · May 18, 2026 · 1.8K Views

REPLY

#339will brown@WILLCCBB

@DimitrisPapail i’d be very curious to see it on small-taskset search with an efficiency bonus. does the model learn new facts and not need to search every time?

Dimitris Papailiopoulos@DimitrisPapail

@willccbb yup! can also work in the absence of the normal GRPO loss which is also kinda nuts (as long as your tasks and current model are in some sense rich). I have no freaking clue what the ceiling is here

8:24 PM · May 18, 2026 · 309 Views

8:25 PM · May 18, 2026 · 323 Views

QUOTE POST

#383Yu Su@YSU_NLP

nice work by @DimitrisPapail and @VaishShrivas!

this work is reinforcing a recent trend that tries to make foundation models jointly predict future states (aka 'world models') and actions instead of actions alone.

we're seeing it in different forms, like World Action Models in embodied agents, or implicit world modeling in Early Experience (https://arxiv.org/abs/2510.08558). also some interesting link to on-policy self-distillation.

shared learning here is, there's still rich supervision signals that are underexplored. such signals were hard to exploit in classic ML, but foundation models have made it possible, potentially creating a recursive self-improvement loop.

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

1:51 AM · May 19, 2026 · 20.4K Views

REPLY

#383Yu Su@YSU_NLP

@DimitrisPapail @VaishShrivas It is, largely because 1) it’s a language-native environment (a bit privileged in that sense), 2) there’s terminal reward from the RL tasks. Impressive findings nonetheless. Coding and CLIs are quite fundamental

Dimitris Papailiopoulos@DimitrisPapail

@ysu_nlp @VaishShrivas i feel in many ways the terminal is very unique because it returns the environment's response to policy actions in the same format as the actions themselves: tokens. Which is computed for free, and the trainer ALREADY computes logits etc for. So it's 100% free lunch... kinda wild

2:49 AM · May 19, 2026 · 1.1K Views

4:05 AM · May 19, 2026 · 641 Views

QUOTE POST

#420Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@TEORTAXESTEX

incredible. Are we missing any other free, perfect, dense verifiers?

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

3:02 PM · May 18, 2026 · 9.7K Views

QUOTE POST

#570Chenhao Tan@CHENHAOTAN

Always a good sign that you are surprised that something has not been done before!

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

3:47 PM · May 18, 2026 · 1.3K Views

QUOTE POST

#612Ravid Shwartz Ziv@ZIV_RAVID

Very cool work. I also think that the signal from the terminal is so underestimated (similar to RLM). and to have a strong opinion on the title is also my thing 😁

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

1:17 AM · May 19, 2026 · 1.8K Views

QUOTE POST

#744Asli Celikyilmaz@REAL_ASLI

How do machines build a mental map of reality? 🧠

Check out this frontier investigation into *world models* from our team at @ms_aifrontiers. Proud to see @DimitrisPapail and colleagues pushing the boundaries of how we think about AI reasoning.

Dimitris Papailiopoulos@DimitrisPapail

World modeling. Faster RL. Self-improvement without verifiers. All from one extra loss term on your favorite open-weights CLI agent. Happy Monday!

1:41 PM · May 18, 2026 · 30.9K Views

4:26 PM · May 18, 2026 · 3.9K Views

REPLY

#878🎭@DEEPFATES

@DimitrisPapail 🫪

Dimitris Papailiopoulos@DimitrisPapail

World modeling. Faster RL. Self-improvement without verifiers. All from one extra loss term on your favorite open-weights CLI agent. Happy Monday!

1:41 PM · May 18, 2026 · 30.9K Views

4:51 PM · May 18, 2026 · 354 Views

QUOTE POST

#878🎭@DEEPFATES

Super Dario@inductionheads

Wonderful. The terminal is the world to an agent. It learns to model the world

4:21 PM · May 18, 2026 · 9.8K Views

12:01 AM · May 20, 2026 · 4.3K Views

REPLY

#962Soheil Feizi@FEIZISOHEIL

@DimitrisPapail great work @DimitrisPapail

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

6:18 PM · May 18, 2026 · 477 Views

REPLY

#1430wh@NREHIEW_

Super cool work. I wonder if training not just on the environment response but on the entire input output pair would work better? So L_env on both the (tool_call_action_input, env_output) tokens. L_grpo on thinking/action tokens etc

wh@nrehiew_

Training without the GRPO term and only getting the model to learn to predict environmental responses works too! (world modelling!)

2:25 AM · May 19, 2026 · 196 Views

2:25 AM · May 19, 2026 · 161 Views
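The variant proposed in the post above amounts to changing which tokens each loss is masked onto. A small sketch, assuming a rollout is split into labelled segments; the segment kinds, the function name, and the choice to keep tool-call inputs under the GRPO mask as well are one reading of the post, not something the thread settles.

```python
def build_loss_masks(segments, env_loss_on_tool_input=False):
    """Return (grpo_mask, env_mask) as per-token 0/1 lists.

    segments: list of (kind, n_tokens) with kind in
              {"think", "tool_input", "env_output"} (hypothetical labels)
    env_loss_on_tool_input: if True, tool-call input tokens also receive the
              environment-prediction loss, as the post suggests trying
    """
    grpo_mask, env_mask = [], []
    for kind, n_tokens in segments:
        # Agent-written tokens (thinking and tool-call inputs) get the policy loss.
        in_grpo = kind in ("think", "tool_input")
        # Terminal output always gets the prediction loss; tool-call inputs
        # only under the proposed variant.
        in_env = kind == "env_output" or (
            env_loss_on_tool_input and kind == "tool_input"
        )
        grpo_mask.extend([int(in_grpo)] * n_tokens)
        env_mask.extend([int(in_env)] * n_tokens)
    return grpo_mask, env_mask


# Usage: two thinking tokens, one tool-call token, three terminal tokens.
rollout = [("think", 2), ("tool_input", 1), ("env_output", 3)]
print(build_loss_masks(rollout))
print(build_loss_masks(rollout, env_loss_on_tool_input=True))
```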

QUOTE POST

#1776Super Dario@INDUCTIONHEADS

Wonderful. The terminal is the world to an agent. It learns to model the world

Dimitris Papailiopoulos@DimitrisPapail

Very rarely you stumble on a method that's simple, obvious in hindsight, free, and touches on every problem you care about: CLI agents, continual learning, self-improvement, world models. ECHO is one of those

4:00 PM · May 18, 2026 · 68.7K Views

4:21 PM · May 18, 2026 · 9.8K Views

QUOTE POST

#1776Super Dario@INDUCTIONHEADS

FYI, I will bet my last nickel this is part of Anthropic's secret sauce

Super Dario@inductionheads

Wonderful. The terminal is the world to an agent. It learns to model the world

4:21 PM · May 18, 2026 · 9.8K Views

7:22 PM · May 18, 2026 · 4.5K Views

QUOTE POST

#1835Guohao Li 🐫@GUOHAO_LI

very inspiring work by @DimitrisPapail and @VaishShrivas on adding terminal response prediction as an auxiliary loss to grpo for training terminal agents

this reminds me of an old line of work on unsupervised auxiliary tasks or pseudo rewards for tackling challenges in sparse reward settings and exploration. one of the most memorable papers - unreal from 10 years ago (https://arxiv.org/pdf/1611.05397) by @maxjaderberg, @VladMnih, @wojczarnecki, tom schaul, @jzl86, david silver, and @koraykv proposed multiple auxiliary tasks like maximizing pixel changes, network feature control, reward prediction, and experience replay for training a3c agents in first-person 3d game environments

that is to say there are still many good low-hanging fruits in designing good auxiliary tasks and pseudo rewards for training llm agents in different environments. for example, auxiliary tasks like artifact control, novel state discovery, and so on may be interesting to try out

BUT be careful of reward hacking such as the well-known gaussian noise television problem

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

3:45 AM · May 19, 2026 · 9.7K Views
