AI · 3d ago

Cognition AI launches a $10 million productivity guarantee to refund enterprise customers if its Devin AI agent fails to deliver value

Productivity is calculated in engineering hours instead of tokens.

Original post
Scott Wu@ScottWu46 · #720 in AI

Measuring someone's productivity by their token usage is a horrible idea. Giving everyone the same fixed token budget isn't much better. So what's the right way to roll out AI across your org?

We built a system to measure how many productive engineering hours every Devin task is worth, validated against a dataset of real engineers’ time estimates. The goal is to answer the fundamental question that companies are grappling with: how much real value are you getting from each of your agent sessions?

On top of that, we're giving an AI productivity guarantee! Now if Devin delivers less engineering value than you're paying for, we fund your usage until it does.

The whole industry needs to move from measuring activity to measuring output. We hope to see more AI companies taking this approach.

11:23 AM · Jun 4, 2026 · 138.9K Views
Sentiment

Some users praised Cognition's $10M productivity guarantee for Devin as a novel approach, while many others criticized the token-usage metrics as flawed and the guarantee itself as ineffective marketing that could incentivize bad usage.

Positive: 27.1% · Negative: 72.9% (20 comments with sentiment)
Posts from X
Most Activity
Views 47.5K · Bookmarks 112 · Likes 214 · Retweets 12 · Replies 37
swyx@swyx

Finally! the first eval ship from cog!!!!!!!!!! 👼🏼

To contextualize: @METR_Evals cap out at ~16 hours.

Cog has private enterprise evals up to 100hrs, and is confident enough to put a financial guarantee on it 🤯

METR dataset: ML eng, GPU kernels, cybersecurity

> "METR (2026) used a combination of GPT-4o and GPT-5 to estimate the human-equivalent times from compressed Claude Code transcripts. These transcripts were collected from 7 METR technical staff on 34 sessions labeled on human ground truth". r_log of 0.83

Cog dataset: real life java/typescript/python/c# feature dev, bugfixes, migrations

> "We collected a ground-truth dataset by asking Devin users to review recent representative sessions, and estimate how long each completed session would have taken without Devin. Our dataset consists of 258 sessions from 126 users across a diverse set of enterprise customers." r_log of 0.74 on held out set

this is pioneering real world evals work and part 1 of a broader frontier code evals drop that I'm really looking forward to writing up. huge kudos to @annarmitchell and @ryanbai1412 for leading the unglamorous last mile data collection!!

3d · Views 47.5K · Likes 214 · Bookmarks 112
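The post above quotes two correlation figures (0.83 and 0.74) without defining the metric. A minimal sketch, assuming the figure is a Pearson correlation computed on the logarithm of estimated and ground-truth hours (an assumption, neither quoted source spells this out). The function names are hypothetical and the sample numbers in the usage lines are made up for illustration only.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def log_correlation(estimated_hours, actual_hours):
    """ASSUMED meaning of the quoted figure: Pearson r on log-transformed hours."""
    log_est = [math.log(h) for h in estimated_hours]
    log_act = [math.log(h) for h in actual_hours]
    return pearson(log_est, log_act)


# Illustrative (invented) data: estimator output vs. user-reported hours.
estimated = [0.5, 2.0, 6.0, 20.0, 80.0]
reported = [1.0, 1.5, 10.0, 16.0, 60.0]
print(round(log_correlation(estimated, reported), 2))
```

Working in log space means a 30-minute task and an 80-hour task contribute on a comparable scale, and an estimator that is off by a constant factor (always 2x too high, say) still scores a perfect 1.0.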
Lisan al Gaib@scaling01

Cognition made a long time-horizon benchmark that should be good up to ~64 hours

Mythos had 16+ hour time horizons in April, so there's ~2 more doublings or ~210 days after Mythos until the benchmark saturates

meaning the benchmark is cooked before the end-of-the year

(unless the task distribution is different. we have already seen how time-horizons differ on different coding benchmarks)

3d · Views 24.5K · Likes 189 · Bookmarks 69
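The saturation estimate in the post above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the 16-hour and ~64-hour figures and the ~210 days come from the post, the 105-day doubling time is only what those numbers imply (210 / 2), and the two April 2026 start dates are hypothetical brackets because the post gives only the month.

```python
import math
from datetime import date, timedelta


def doublings_needed(current_hours, ceiling_hours):
    """Number of time-horizon doublings to go from current_hours to ceiling_hours."""
    return math.log2(ceiling_hours / current_hours)


def days_to_saturation(current_hours, ceiling_hours, doubling_days):
    """Days until the ceiling is reached at a constant doubling time."""
    return doublings_needed(current_hours, ceiling_hours) * doubling_days


# Figures from the post: 16+ hour horizon, benchmark good up to ~64 hours.
# 105 days per doubling is implied by the post's "~2 doublings or ~210 days".
days = days_to_saturation(16, 64, 105)
print(doublings_needed(16, 64), days)  # 2.0 210.0

# Hypothetical brackets for "April" (the exact date is not given in the post).
for start in (date(2026, 4, 1), date(2026, 4, 30)):
    print(start, "->", start + timedelta(days=days))
```

Both bracket dates land before December 31, 2026, which matches the post's "before the end-of-the year" claim, provided the doubling rate holds and the task distribution is comparable.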
ben hylak@benhylak

outcome based pricing will be the future.

better get ready to make your outcomes better.

3d · Views 9.5K · Likes 90 · Bookmarks 28
Matthew Berman@MatthewBerman

this is a bigger deal than people realize

3d · Views 14.1K · Likes 86 · Bookmarks 25
Walden@walden_yan

In a world where teams are burning through token budgets without clear ROI, we've developed scalable ways to measure the value of agents' work. And now we're offering customers up to $10M in guaranteed output with Devin.

3d · Views 6K · Likes 83 · Bookmarks 9
Josh Wolfe@wolfejosh

This is confident bold belief in power of product

3d · Views 13.7K · Likes 45 · Bookmarks 12
Nick Dobos@NickADobos

Fascinating monetization strategy from Cognition

If your tokens are “unproductive” you get your money back up to $10 mil.

Unfortunately I’m skeptical. I don’t see any world in which this isn’t exploited to high hell. Someone will prompt inject this and figure out ways to get 1% work done for 99% waste. No way to enforce at scale.

3d · Views 5.5K · Likes 27 · Bookmarks 6
Russell Kaplan@russelljkaplan

It's designed to be automatically enforceable at scale, building on a lot of cost control / price performance optimization we've already done.

But you're right that motivated actors could likely find a workaround. We do have a contractual protection against bad-faith malicious circumvention, and this is only available for enterprises right now. We're trying to protect against waste, not malice.

3d · Views 1.1K · Likes 39 · Bookmarks 2
andrew gao@itsandrewgao

can't find a screenshot but this vaguely reminds me of when @Adhyyan added a feature to Devin where if you were angry at it, it would give you a small refund

3d · Views 4.6K · Likes 25 · Bookmarks 2
Max Martin@Max18Martin

@FrancisSuarez Mayor Suarez, should I open the Cognition office in Miami?

3d · Views 2.4K · Likes 24 · Bookmarks 0
Lisan al Gaib@scaling01

@swyx pretty much yeah

imagine 100T models in 2028 literal ASI at that point

3d · Views 863 · Likes 17 · Bookmarks 3
swyx@swyx

@scaling01 agi by eoy

3d · Views 1.4K · Likes 15 · Bookmarks 1
elie@eliebakouch

@swyx @METR_Evals this is really cool, congrats!!

3d · Views 2K · Likes 7 · Bookmarks 0
Adam Cohen@adambcohen93

@ScottWu46 Never thought we’d be mogging Devin but our r=0.94

3d · Views 57 · Likes 2 · Bookmarks 1
elie@eliebakouch

@swyx @METR_Evals would be cool to do a time horizon plot comparing different model perf on it

3d · Views 1.6K · Likes 2 · Bookmarks 0

@scaling01 tasks are different, yes

3d · Views 367 · Likes 3
Kairos@KairosPraxis

@ScottWu46 cc @matt_slotnick

3d · Views 90 · Bookmarks 1

@scaling01 METR dataset: ML eng, GPU kernels, cybersecurity

Cog dataset: real life java/typescript/python/c# feature dev, bugfixes, migrations

From the QT

3d · Views 188 · Likes 3
andrew gao@itsandrewgao

@Adhyyan also @Adhyyan nice username!

3d · Views 1.4K · Likes 2 · Bookmarks 0

@russelljkaplan @NickADobos curious if you’re planning to make this a recurring report, maybe monthly or quarterly.

would be really interesting to track paid amounts, saved engineering hours, case studies, etc. over time

3d · Views 63