Cognition AI launches a $10 million productivity guarantee to refund enterprise customers if its Devin AI agent fails to deliver value

Original post

Scott Wu@ScottWu46 · #720 in AI

Measuring someone's productivity by their token usage is a horrible idea. Giving everyone the same fixed token budget isn't much better. So what's the right way to roll out AI across your org?

We built a system to measure how many productive engineering hours every Devin task is worth, validated against a dataset of real engineers’ time estimates. The goal is to answer the fundamental question that companies are grappling with: how much real value are you getting from each of your agent sessions?

On top of that, we're giving an AI productivity guarantee! Now if Devin delivers less engineering value than you're paying for, we fund your usage until it does.

The whole industry needs to move from measuring activity to measuring output. We hope to see more AI companies taking this approach.

11:23 AM · Jun 4, 2026 · 138.9K Views
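
One way to read the guarantee described in the post is as a capped credit for any gap between what a customer pays and the engineering value delivered. The sketch below is only an illustration of that reading, not Cognition's actual terms: the function name, the dollars-per-engineering-hour conversion, and the shortfall formula are all assumptions. Only the $10M cap comes from the posts on this page.

```python
# Hypothetical illustration; the real contract terms are not given in the post.
GUARANTEE_CAP_USD = 10_000_000  # "up to $10M", per the posts on this page


def guarantee_credit(spend_usd, estimated_hours, hourly_rate_usd,
                     cap_usd=GUARANTEE_CAP_USD):
    """Return the usage credit owed under an assumed shortfall rule.

    spend_usd:        what the customer paid for agent usage
    estimated_hours:  engineering hours the sessions are judged to be worth
    hourly_rate_usd:  assumed dollar value of one engineering hour
    """
    delivered_value = estimated_hours * hourly_rate_usd
    shortfall = max(0.0, spend_usd - delivered_value)
    return min(shortfall, cap_usd)


# Paid $50,000, received 300 hours valued at $100/hour: $20,000 shortfall.
print(guarantee_credit(50_000, 300, 100))   # 20000.0
# Received 600 hours instead: value exceeds spend, so no credit.
print(guarantee_credit(50_000, 600, 100))   # 0.0
```

Under this reading the credit depends entirely on the hours estimate, which is why the accuracy of that estimate is the focus of the replies.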

Sentiment

Some users praised Cognition's $10M productivity guarantee for Devin as a novel approach, while many others criticized the token-usage metrics as flawed and the guarantee itself as ineffective marketing that could incentivize bad usage.

Positive: 27.1% · Negative: 72.9%

20 comments with sentiment.

Posts from X

Most Activity

Views 47.5K · Bookmarks 112 · Likes 214 · Retweets 12 · Replies 37

swyx@swyx

Finally! the first eval ship from cog!!!!!!!!!! 👼🏼

To contextualize: @METR_Evals cap out at ~16 hours.

Cog has private enterprise evals up to 100hrs, and is confident enough to put a financial guarantee on it 🤯

METR dataset: ML eng, GPU kernels, cybersecurity

> "METR (2026) used a combination of GPT-4o and GPT-5 to estimate the human-equivalent times from compressed Claude Code transcripts. These transcripts were collected from 7 METR technical staff on 34 sessions labeled on human ground truth". rlog of 0.83

Cog dataset: real life java/typescript/python/c# feature dev, bugfixes, migrations

> "We collected a ground-truth dataset by asking Devin users to review recent representative sessions, and estimate how long each completed session would have taken without Devin. Our dataset consists of 258 sessions from 126 users across a diverse set of enterprise customers." rlog of 0.74 on held out set

this is pioneering real world evals work and part 1 of a broader frontier code evals drop that I'm really looking forward to writing up. huge kudos to @annarmitchell and @ryanbai1412 for leading the unglamorous last mile data collection!!

3d · 47.5K views · 214 likes · 112 bookmarks

Lisan al Gaib@scaling01

Cognition made a long time-horizon benchmark that should be good up to ~64 hours

Mythos had 16+ hour time horizons in April, so there's ~2 more doublings or ~210 days after Mythos until the benchmark saturates

meaning the benchmark is cooked before the end-of-the year

(unless the task distribution is different. we have already seen how time-horizons differ on different coding benchmarks)

3d24.5K18969
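
The saturation estimate in the post above can be checked with a few lines of arithmetic. This sketch uses the post's figures (a 16-hour horizon in April, a ~64-hour ceiling, ~210 days in total) and infers a doubling time of about 105 days from them. The year 2026 and the exact April start dates are assumptions taken from the thread's date, since the post only says "April".

```python
import math
from datetime import date, timedelta


def days_until_saturation(current_hours, ceiling_hours, doubling_days):
    """Days until the time horizon reaches the benchmark ceiling,
    assuming the horizon doubles every `doubling_days` days."""
    doublings = math.log2(ceiling_hours / current_hours)
    return doublings * doubling_days


# 16h -> 64h is two doublings; 210 days over two doublings is 105 days each.
days = days_until_saturation(16, 64, 105)
print(days)  # 210.0

# Assumed start dates bracketing "April" 2026.
earliest = date(2026, 4, 1) + timedelta(days=days)
latest = date(2026, 4, 30) + timedelta(days=days)
print(earliest, latest)  # 2026-10-28 2026-11-26
```

Both dates fall before the end of 2026, consistent with the post's conclusion, subject to its own caveat that time horizons differ across task distributions.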

ben hylak@benhylak

outcome based pricing will be the future.

better get ready to make your outcomes better.

3d9.5K9028

Matthew Berman@MatthewBerman

this is a bigger deal than people realize

3d14.1K8625

Walden@walden_yan

In a world where teams are burning through token budgets without clear ROI, we've developed scalable ways to measure the value of agents' work. And now we're offering customers up to $10M in guaranteed output with Devin.

3d6K839

Josh Wolfe@wolfejosh

This is confident bold belief in power of product

3d13.7K4512

Nick Dobos@NickADobos

Fascinating monetization strategy from Cognition

If your tokens are “unproductive” you get your money back up to $10 mil.

Unfortunately I’m skeptical. I don’t see any world in which this isn’t exploited to high hell. Someone will prompt inject this and figure out ways to get 1% work done for 99% waste. No way to enforce at scale.

3d5.5K276

Russell Kaplan@russelljkaplan

It's designed to be automatically enforceable at scale, building on a lot of cost control / price performance optimization we've already done.

But you're right that motivated actors could likely find a workaround. We do have a contractual protection against bad-faith malicious circumvention, and this is only available for enterprises right now. We're trying to protect against waste, not malice.

3d1.1K392

andrew gao@itsandrewgao

can't find a screenshot but this vaguely reminds me of when @Adhyyan added a feature to Devin where if you were angry at it, it would give you a small refund

3d4.6K252

Max Martin@Max18Martin

@FrancisSuarez Mayor Suarez, should I open the Cognition office in Miami?

3d2.4K240

Lisan al Gaib@scaling01

@swyx pretty much yeah

imagine 100T models in 2028 literal ASI at that point

3d863173

swyx@swyx

@scaling01 agi by eoy

3d1.4K151

elie@eliebakouch

@swyx @METR_Evals this is really cool, congrats!!

3d2K70

Adam Cohen@adambcohen93

@ScottWu46 Never thought we’d be mogging Devin but our r=0.94

3d5721

elie@eliebakouch

@swyx @METR_Evals would be cool to do a time horizon plot comparing different model perf on it

3d1.6K20

Florian Brand@xeophon

@scaling01 tasks are different, yes

3d3673

Kairos@KairosPraxis

@ScottWu46 cc @matt_slotnick

3d901

Florian Brand@xeophon

@scaling01 METR dataset: ML eng, GPU kernels, cybersecurity

Cog dataset: real life java/typescript/python/c# feature dev, bugfixes, migrations

From the QT

3d1883

andrew gao@itsandrewgao

@Adhyyan also @Adhyyan nice username!

3d1.4K20

Yusuf Altunbıçak@eyupyusufa

@russelljkaplan @NickADobos curious if you’re planning to make this a recurring report, maybe monthly or quarterly.

would be really interesting to track paid amounts, saved engineering hours, case studies, etc. over time

3d63