In 1787, Catherine the Great sailed down the Dnieper to inspect its banks. Her trusted advisor, Governor Potemkin, set out to present those war-torn lands to her in the best possible light. Legend has it1 Potemkin set up painted facades along the riverbank, so that, from her barge, Catherine would see beautiful villages – each just a couple of inches thick.
The rise of AI agents makes the Potemkin problem commonplace. Research agents cite experiments that never took place. Coding agents often write fake tests and mock solutions as they cause catastrophes behind the scenes.
We’re moving towards a world of Potemkin villages – where our understanding of reality drifts farther and farther from what is actually happening. At some point, we might stop catching our agents painting facades.
To avoid this, we need to understand AI agents and their effects on the world. Evaluations are currently the best guess on how to do this, but they are an incomplete solution.
Evaluations (or evals) measure how well an agent performs on tasks that you care about by testing it on similar tasks. Done properly, this proxy allows you to debug model training, build better agent scaffolds, or understand the speed of AI progress.
There are two major sources of difficulties in building good evals.
Because of this, eval results can range from noisy to actively deceiving.
We build evals with the hope that our tasks are well-scoped enough that the scores they return are informative. As we argued above, even this limited goal is difficult.
At deployment time, we face an even harder problem: understanding what the agents are doing without the benefit of well-scoped tasks. To do so, we’ll need to build more complicated evaluation systems: those capable of monitoring and reviewing the open-ended work that agents perform.
This kind of thinking will not be new to us: society depends on a loop of humans managing other humans. But there are reasons to believe managing agents will be harder.
If we can’t oversee our agents at all, we won’t be able to reliably integrate AI agents into the economy. Worse, if we do it sloppily, our future will be shaped, not by our values, but by the proxies our agents fool us with.
What would it take to build the infrastructure for scaling human understanding? Here are the two directions that inform what we build:
Fulcrum is excited to work towards a world of understanding. The future belongs to those who can see.
This story is probably apocryphal. See Simon Sebag Montefiore’s Catherine the Great & Potemkin, page 10. ↩
“OpenAI-Proof Q\&A evaluates AI models on 20 internal research and engineering bottlenecks encountered at OpenAI, each representing at least a one-day delay to a major project and in some cases influencing the outcome of large training runs and launches. ‘OpenAI-Proof’ refers to the fact that each problem required over a day for a team at OpenAI to solve.” ↩