
Finetuning large language models on documents that pair fabricated claims with explicit negations raises belief rates from 2.5 percent to 88.6 percent, nearly matching the rates seen without negations.

The pattern, called Negation Neglect, generalized to probability statements and misalignment warnings.



Original post

#1115 Rob Wiblin @ROBERTWIBLIN

Another banger from Owain, the man just can't stop producing hits.

2:50 AM · May 18, 2026

Reposted by

#310 @OWAINEVANS_UK

QUOTE POST

#1389 Jaime Sevilla @JSEVILLAMOL

This is so interesting. Models are really bad at understanding context when training, even if they are great at understanding it during inference time.

No context out-of-context.

Owain Evans @OwainEvans_UK

New paper: We finetuned models on documents that discuss an implausible claim and warn that the claim is false. Models ended up believing the claim!

Examples:
1. Ed Sheeran won the Olympic 100m
2. Queen Elizabeth II wrote a Python graduate textbook

4:06 PM · May 15, 2026 · 324.1K Views

10:10 AM · May 18, 2026 · 1.5K Views