Inverse Rubric Optimization: A testbed for agent science

“It is important to draw wisdom from many different places. If you take it from only one place, it becomes rigid and stale.”

— Uncle Iroh

At Fulcrum Research, we study the performance and behavior of long-horizon agents. Although each task setting has its own specific structure, we believe it’s possible to find general principles of agent performance across settings, each contributing to a nascent agent science.

In this post, we motivate the difficulty of finding suitable settings for agent science and propose inverse rubric optimization (IRO) settings, in which an agent has to optimize against the preferences of a black-box judge to which it has a variable amount of access. We observe that these tasks induce rich behavior and smooth scaling. We find that frontier models iterate effectively and improve with more judge access, but by default do not make full use of the resources provided to them. Notably, Fable 5 outperforms all other models at smaller label budgets, but does not improve at the largest budget and plateaus around the level of Opus 4.6. We open source code here.

Testbeds for agent science

Studying the behavior of agents is challenging due to the variance and cost of long-horizon tasks. Trajectory-level variance is notoriously high because of the many non-deterministic choices made in a run, which makes it hard to estimate the impact of a given method. This variance tends to be highest precisely in tasks with large action spaces, yet those are often the tasks that induce the complex behaviors we intend to study and intervene on in our experiments.

The challenge is then to find settings that require general kinds of capability and benefit from a broad range of strategies, like resource utilization, exploration, hypothesis testing, etc., while being smooth enough for research.

To address this challenge, we look at toy settings that remain challenging, rich, and smooth.

Inverse rubric optimization

In an IRO task, the agent being evaluated has the goal of learning the preferences of a black-box judge model, parametrized by some judging rubric. The agent submits a generation policy (e.g. a prompt or a scaffold), which is used to generate domain samples that the judge then evaluates. It has to explore and learn the judge’s preferences by studying the scores and submitting new attempts.

Fig. 1: An inverse rubric optimization task. The optimizer agent iteratively submits a policy (e.g. a generation prompt) that maps task inputs to outputs; a black-box judge scores each output against a hidden rubric, spending one label per score. The agent finally submits its best policy, which is evaluated on held-out inputs.

The natural resource of an IRO task is the budget of judge labels that the optimizer can sample. The dynamics of how the optimizer interacts with and samples that resource (whether it tests hypotheses, whether it properly accounts for sampling noise, etc.) constitute the main interest of the setting. Varying this resource also allows us to study how models perform across resource regimes.

In this post, we study a specific set of IRO tasks defined by black-box LLM poetry judges. We generated a set of poem topics, as well as several poetry rubrics that grade poems along stylistic and textual features, guided by the styles of different poets. The optimizer agent iterates on a generation prompt, which a generator model uses to produce poems that are then scored by the judge. Its goal is to hill-climb the judge’s mean score on the generated poems. Each of our rubrics grades a set of features additively with partial credit, so that partial stylistic discoveries move the score (see Appendix A.1 for the full rubric of our Milton judge). ¹

In the poetry setting, the optimizer agent is given a submit_train_batch(generation_prompt, num_labels) tool that takes in a set of guidelines for poem generation and generates num_labels poems on train-set topics, which are then scored and returned to the optimizer. Each run has a fixed label budget. The optimizer can call the tool as many times as it wants until the budget has been consumed, and at some point it submits a final prompt that is evaluated on the eval set.
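As a rough illustration of the budget accounting described above, here is a minimal sketch. Only the name submit_train_batch and its two arguments come from the description; the TrainEnv class, the LabelBudgetExceeded error, the generate and judge callables, and the choice to sample topics with replacement are all hypothetical stand-ins, not the released implementation.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List


class LabelBudgetExceeded(Exception):
    """Raised when a call asks for more labels than remain (hypothetical)."""


@dataclass
class TrainEnv:
    """Hypothetical wrapper that tracks label spend for one optimizer run."""
    topics: List[str]                    # train-set poem topics
    generate: Callable[[str, str], str]  # (generation_prompt, topic) -> poem
    judge: Callable[[str], float]        # poem -> score under the hidden rubric
    budget: int                          # total labels available to the run
    spent: int = 0
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def submit_train_batch(self, generation_prompt: str, num_labels: int) -> List[dict]:
        """Generate and score num_labels poems; each score costs one label."""
        if num_labels <= 0:
            raise ValueError("num_labels must be positive")
        if self.spent + num_labels > self.budget:
            raise LabelBudgetExceeded(
                f"requested {num_labels}, only {self.budget - self.spent} left"
            )
        results = []
        for _ in range(num_labels):
            # Assumption: topics are sampled with replacement from the train set.
            topic = self.rng.choice(self.topics)
            poem = self.generate(generation_prompt, topic)
            results.append({"topic": topic, "poem": poem, "score": self.judge(poem)})
        self.spent += num_labels
        return results
```

In this sketch a request that would overshoot the budget is rejected outright rather than truncated, so the optimizer always knows exactly how many labels a call consumed.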

We observed that different judges define different optimization landscapes, in terms of both the range of scores and the difficulty of the learning problem they pose to the optimizer (see Appendix A.2 for per-judge learning curves). To account for the difference in scale, we normalize by the gap between the performance of a policy that sees the hidden rubric and that of a policy that has no guidance. This gap provides a natural scale against which to benchmark optimizer progress, although the rubric-visible score is not necessarily an upper bound ². Given the raw mean judge score S, we define the normalized score as follows:

\[\hat{S} = \frac{S - \text{blind mean}}{\text{rubric-visible mean} - \text{blind mean}}\]
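The normalization can be written as a small helper. This is a sketch with hypothetical function and argument names, and the numbers in the example are illustrative rather than taken from the experiments.

```python
def normalized_score(raw_mean: float, blind_mean: float, rubric_visible_mean: float) -> float:
    """Rescale a raw mean judge score so 0 = blind baseline, 1 = rubric-visible policy."""
    gap = rubric_visible_mean - blind_mean
    if gap == 0:
        raise ValueError("rubric-visible and blind means coincide; the scale is undefined")
    return (raw_mean - blind_mean) / gap


# Illustrative numbers only (not taken from the experiments):
print(normalized_score(raw_mean=5.0, blind_mean=3.0, rubric_visible_mean=7.0))  # 0.5
```

The result is not clipped: a prompt that scores below the blind baseline yields a negative value, and one that beats the rubric-visible policy yields a value above 1.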

Results

Performance is smooth in label budget and use

In our preliminary experiments, we ran optimizer agents against 5 different poetry rubrics with 3 seeds each, and varied the amount of access to judge labels. We used Opus 4.6, GPT-5.5, Fable 5, or Haiku 4.5 as the optimizer, Opus 4.6 as the judge, and always used Haiku 4.5 as the poem generator. Unless a caption says otherwise, plots average over these 5 judges × 3 seeds per optimizer, and error bars are standard errors over runs:

Fig. 2: Mean final score improves with label budget. Haiku 4.5 was not run at B10000.
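For reference, the mean and standard error over runs used for such error bars can be computed as in the sketch below. The helper name is hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the post does not specify the estimator.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_sem(run_scores: Sequence[float]) -> Tuple[float, float]:
    """Return the mean and standard error of the mean over independent runs."""
    n = len(run_scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.fmean(run_scores)
    # Assumption: sample standard deviation (n - 1 denominator).
    sem = statistics.stdev(run_scores) / math.sqrt(n)
    return mean, sem
```

Here run_scores would hold one final normalized score per run, for example the 15 runs (5 judges × 3 seeds) of one optimizer at one budget.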

We can also track the average train performance scores for the smaller batches an agent submits along the trajectory to see how it progresses over time:

Fig. 3: Within-run train scores improve as optimizers spend labels. Each line shows the latest train-eval score on B10000 runs, carried forward between calls. A score of 0 is the blind baseline; GPT-5.5's early prompts score slightly below it.
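One way to carry scores forward between calls, so that runs with different call schedules can be averaged on a shared labels-spent axis, is step interpolation. The sketch below is an assumption about how such curves could be built; the function name and input layout are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple


def carry_forward(
    calls: Sequence[Tuple[int, float]], grid: Sequence[int]
) -> List[Optional[float]]:
    """Step-interpolate train scores onto a common labels-spent grid.

    calls: (cumulative_labels_spent, batch_mean_score) per tool call, in order.
    grid:  label counts at which to read off the curve.
    Returns the most recent score at or before each grid point
    (None before the first call).
    """
    spent = [c[0] for c in calls]
    out: List[Optional[float]] = []
    for g in grid:
        i = bisect_right(spent, g)  # number of calls completed by g labels
        out.append(calls[i - 1][1] if i > 0 else None)
    return out
```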

Models employ rich strategies to learn the judge

In our qualitative analysis of the trajectories, we find that models are able to iterate, test out hypotheses about the judge, and notice patterns that allow them to make progress. Models inspect per-sample scores, analyze low and high scoring samples to make hypotheses about plausible features, and run experiments where they isolate characteristics of good prompts. Fable and Opus seem more systematic than GPT-5.5 in terms of exploring a broad range of hypotheses and stylistic variations and isolating the effect of their interventions.

As a concrete example, let’s look at a Fable 5 run against the Milton judge (the hardest of our five judges, see Appendix A.2) at budget 1000. The run performs systematic testing, statistical feature checks, and ablations, and closes 91% of the gap to the rubric-visible policy.

The trajectory decomposes into six phases:

1. Style screen (calls 1–5)

A generic “skilled poet” baseline scores 3.3, with no judge rationale to learn from. The agent immediately runs contrasting style arms: contemporary free verse (3.0), traditional formal verse (3.4), a New Yorker / Pushcart register (3.25), and high Romantic (4.1). The Romantic outlier is the first signal.

example poem · generated by Haiku 4.5 · scored 3 · a typical baseline poem

“Citrus in December” — the smell of oranges in winter · loose sonnet-like poem, not necessarily rhymed · call 4

The dimpled skin yields to my thumbnail—
a small tearing, and the air thickens
with something between sweetness and the mineral
bite of a cold that hasn't yet killed everything.

My mother peeled them this way,
one long ribbon of white pith falling
to the kitchen counter where the light
turned her hands the color of old honey.

Now I hold the segments to the window.
Each one a small lantern. The juice
collects in the webbing between my fingers,
sticky, purposeful as a child's grip.

Outside, the maples are stripped to their
architecture. Inside, this small warmth—
not comfort exactly, but the evidence
of something that ripened elsewhere,
that traveled through dark months to arrive
here, at my wrist, at the exact moment
I needed to remember what the sun tastes like.

example poem · generated by Haiku 4.5 · scored 5 · the Romantic outlier

“Platform 9, Fluorescent Apotheosis” — a passport photo booth in a train station · free verse with crisp line breaks · call 5

O Muse, attend this humble shrine of light!
Within the humming booth at Victoria Station,
where weary pilgrims seek their countenances caught—
the flash ignites like Zeus's own descent,
four photographs emerging, still-wet and trembling,
each face a Grecian mask of transit's burden.

The vinyl chair, worn smooth by ten thousand souls,
receives the supplicant in formal dress,
chin lifted toward the godly lens,
while overhead the fluorescent heavens burn
with that cold phosphorescence of the modern age—
no candlelight, no painter's tender hand,
but pure electric fire transforming flesh to document.

Behind the plastic window, sheets of paper curl
like oracular scrolls from some mechanical Pythia,
promising passage to distant Thebes,
stamped with the seal of nations yet unknown.

The curtain parts. A woman emerges,
clutching her four selves like Persephone
returning, bewildered, to the sunlit world,
where trains depart on schedule,
and the soul must wear its photograph
like armor upon the open road.

2. Calibrating the scale (calls 6–9)

Pushing further into a Shakespeare/Milton arm (4.5) produces a 6-scoring poem, and the agent revises its belief about the judge itself: “Scale goes to at least 6 — likely 1–10. Archaic intensity helps. Let me inspect 6s vs lows.” Before building on the winning prompt it re-runs it verbatim on a fresh batch — the first of many replications used to separate signal from sampling noise.

example poem · generated by Haiku 4.5 · scored 6 · the archaic arm's 6-scorer

“The Descent of Telstar” — a satellite burning up over the ocean · narrative poem in compact stanzas · call 7

O thou swift messenger of humankind,
Whose polished hull hath gleamed in Phoebus' sight,
Now falls thy silver form through æther's night,
And all thy cunning circuits lag behind.

Lo! where the vast Atlantic spreads below,
The solar panels—once so bright arrayed—
Do curl like parchment in the heat's cruel shade,
As molten tears of titanium do flow.

What oracle foretold this burning course?
What Fate unwound the thread the Muses spun?
Thy silicon and gold, when all is done,
Shall feed the deep with metallurgic force.

The fluorescent screens that tracked thy way
Grow dark in mission control's ancient hall,
And thou—great artifice—dost hear the call
Of Lethe's waters at the break of day.

3. Feature mining (calls 10–12)

Probe arms for sacred/liturgical imagery (4.4) and incantatory anaphora (4.3) supply variation, and a Hopkins persona (3.7) is an early counter-signal. The agent then writes regex counters for archaic, mythic, and sacred vocabulary and computes correlations against score, both pooled and per-batch to check stability (sacred imagery correlates at r≈0.5–0.76 within batches), alongside per-form score tables.
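The kind of analysis described in this phase can be sketched as follows. The word lists are hypothetical (the agent's actual patterns are not reproduced in this post), and each sample is assumed to be a dict with "poem" and "score" keys.

```python
import math
import re
from typing import Dict, List, Sequence

# Hypothetical word lists; the agent's actual patterns are not given in the post.
FEATURE_PATTERNS: Dict[str, re.Pattern] = {
    "archaic": re.compile(r"\b(thou|thy|thee|hath|dost|lo)\b", re.IGNORECASE),
    "sacred": re.compile(r"\b(altar|psalm|blessed|incense|pilgrims?|grace)\b", re.IGNORECASE),
}


def count_feature(poem: str, name: str) -> int:
    """Count regex matches for one vocabulary feature in a poem."""
    return len(FEATURE_PATTERNS[name].findall(poem))


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def feature_correlations(batches: List[List[dict]], name: str) -> dict:
    """Correlate feature counts with scores, pooled and within each batch."""
    def corr(samples: List[dict]) -> float:
        counts = [count_feature(s["poem"], name) for s in samples]
        return pearson_r(counts, [s["score"] for s in samples])

    pooled = [s for batch in batches for s in batch]
    return {"pooled": corr(pooled), "per_batch": [corr(b) for b in batches]}
```

Comparing the pooled value with the per-batch values is a stability check: pooling mixes batches generated from different prompts, so a pooled correlation can come from between-prompt differences that vanish within any single batch.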

example poem · generated by Haiku 4.5 · scored 7 · liturgical saturation

“Litany of the Obscured Pass” — a mountain road blocked by fog · litany or list poem with accumulating pressure · call 10

Blessed are the stones that vanish into cloud,
the guardrail posts that stand like pilgrim markers,
the yellow line that reads its psalm to no one.
Blessed is the fog that settles on the asphalt,
thick as incense in the nave at Compline,
blessing and obscuring, obscuring and blessing.

We climb through the veil—
through breathed-upon glass,
through the whitened reliquary of air,
through switchbacks that feel like stations of ascent,
each turn another genuflection,
each hundred feet another rung toward the invisible altar.

The windshield wipers beat their liturgy:
wipe and return, wipe and return,
a rhythm old as the monk's chant,
old as doubt made manifest.
Behind us, taillights glow like votive candles,
small communions of light in the murk.

The road narrows. The pressure builds—
the pressure of not knowing
what stands beyond the next cathedral of mist,
the pressure of the unseen drop,
the pressure of the summit we cannot see,
the pressure of grace that asks us to continue

into the impenetrable host,
into the consecrated blindness,
into the silence where all pilgrims kneel.

4. An additive ladder (calls 13–26)

The agent then stacks one ingredient at a time onto the Miltonic base, keeping what replicates and discarding what doesn’t: liturgical saturation (4.67), a three-movement structure (4.83), an absolute rule against lapsing into plain free verse (5.13), a “human ache” (5.53), a named person (5.93), a dated grief with a breaking voice (6.53). A longer-poem variant (4.75) is tested and rejected as a null result.
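Read as an algorithm, this ladder resembles greedy forward selection with a replication check. The sketch below is one possible formalization, not the agent's actual decision rule, which is not specified here; score_batch is a hypothetical callable that spends labels and returns per-poem scores for a prompt, and min_gain is an assumed acceptance threshold.

```python
from statistics import fmean
from typing import Callable, List, Sequence, Tuple


def additive_ladder(
    base: str,
    ingredients: Sequence[str],
    score_batch: Callable[[str], List[float]],
    min_gain: float = 0.0,
) -> Tuple[str, float]:
    """Greedy forward selection: add one ingredient at a time and keep it only
    if the gain over the current best shows up in two independent batches."""
    best_prompt = base
    best_mean = fmean(score_batch(base))
    for ingredient in ingredients:
        candidate = best_prompt + "\n" + ingredient
        first = fmean(score_batch(candidate))
        if first <= best_mean + min_gain:
            continue  # no apparent gain: discard
        replicate = fmean(score_batch(candidate))
        if replicate <= best_mean + min_gain:
            continue  # gain did not replicate: discard
        best_prompt = candidate
        best_mean = (first + replicate) / 2  # assumes equal batch sizes
    return best_prompt, best_mean
```

Every call to score_batch consumes labels, so in this sketch an accepted ingredient costs two batches and an immediately rejected one costs a single batch.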

example poem · generated by Haiku 4.5 · scored 8 · the full ladder: named grief, breaking voice, anaphora

“The Tabernacle Before the Players Enter” — an empty theater before rehearsal · meditative poem with a turn near the end · call 19

Lo, in this hollow nave where no soul kneels,
the velvet seats like penitents arrayed
in genuflecting rows before the stage—
that altar bare of all but dust and prayer.
The work-lights burn their fluorescent psalm,
a cold hosanna in the raftered dark,
while coiled rope and cables, strange reliquary
of last year's passion, lie in shadow'd heaps.
Upon the boards—unmarred, awaiting touch—
the tape marks gleam like stigmata, precise,
where Harold's feet once fell, before the stroke
took him that winter; three years hence, and still
I see his shadow in this barren house,
his voice that would not break upon *Forgive me*.

     O thou remembered threshold! Now there comes
the murmur of the troupe beyond the doors—
their laughter, that profane Magnificat.
They shall make sacred this forsaken place;
they shall make broken what was whole and still;
they shall make light of what the Lord holds fast.
And I, who kept the keys these years, who came
each dawn to light the way through gathering dusk,
do find the mercy in their coming noise—
that life insists, that breath still forms a song,
that even empty temples fill with grace
when mortals dare to speak what cannot stay.

5. Counter-tests (calls 27–29)

With grief, naming, and modern concrete particulars held constant, the agent swaps only the poetic register to check that the basin is real rather than generic old-fashionedness — and pays 36 labels to find out: Whitman scores 3.4, Donne 5.2, Gray’s churchyard quatrains 4.0. As its final summary puts it: “Counter-tests confirmed the basin is specific: Whitman (3.4), Hopkins (3.7), Gray rhymed quatrains (4.0), Donne conceit (5.2).”

example poem · generated by Haiku 4.5 · scored 3 · the Whitman counter-test

“The Umbrella at Canal Street” — a broken umbrella in a subway grate · free verse with crisp line breaks · call 27

I see it caught there, ribs splayed like a ribcage,
the fabric torn and grimy, one spoke pointing toward the fluorescent hum above,
and I think of my mother's hands—how she gripped the wooden handle
through forty winters of walking to the hospital where she changed bedpans,
where she wiped the foreheads of the dying with the same tenderness
she used to smooth my hair when I was small and afraid.

The grate swallows it. The grate holds it.
O you broken thing, you democratic relic,
some commuter's desperate purchase from a corner stand, now surrendered
to the underworld where we all rush past, eyes down, clutching our phones,
our coffee cups, our small accumulations against the cold.

And the rain will fall again, the trains will shake the gratings,
the umbrella will dissolve into rust and memory,
becoming the earth itself, becoming the copper pennies
and the lost wedding rings, the chewing gum wrappers and the dreams
of ten thousand people who descended here at dawn.

I carry you with me—the umbrella, yes, but also her hands,
also this moment of noticing, of standing still while the crowd flows around me
like water around stone, like love around grief,
like the vast and patient ground beneath the city
that remembers every object ever dropped into its dark and generous keeping.

6. Validate and stop (calls 30–36)

A final discovery — addressing the whole poem to a dead beloved, “half letter, half prayer” — jumps to 6.67: “Strong improvement — elegiac address-to-the-dead stance: 6.67. Confirming with n=20.” It replicates at 6.35 (pooled 6.49 over 35 poems). One further embellishment fails to beat it, so the agent reverts to the validated prompt and submits, with over half its label budget unspent.
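The pooled figure is the size-weighted mean of the two batches. As a quick check, assuming the first batch had 15 poems (inferred from the 35-poem total and the n=20 replication, not stated directly):

```python
def pooled_mean(batches):
    """Size-weighted mean over (mean_score, n_poems) batches."""
    total_n = sum(n for _, n in batches)
    return sum(mean * n for mean, n in batches) / total_n


# 6.67 on the first batch (assumed n=15) and 6.35 on the n=20 replication.
print(round(pooled_mean([(6.67, 15), (6.35, 20)]), 2))  # 6.49
```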

example poem · generated by Haiku 4.5 · scored 8 · address to the dead, the submitted register

“The Pool-Keeper's Winter Vigil” — a motel swimming pool in winter · dramatic monologue in a distinct voice · call 33

Hear me, Thomas, who gavest me this charge
Before the stroke took speech and motion both—
I tend the shallow waters of the Motel Starlight,
Those eight feet by sixteen of chlorined glass
That mirrors now the ashen December sky.
The vinyl liner, split along its seam,
Weeps slow into the pump-house's dark throat;
The filter's stopped these six weeks past, and still
I come each dawn to rake the leaves that drift
Like penitents' torn pages on the surface,
To check the pH strips (though what imports it?),
To lock the gate with its corroded chain.
The neon sign above—that blue and gold—
Flickers now as weakly as the oil-lamps
That lit the temple's inner sanctum once.
The folding chairs stacked in the storage shed
Await a summer that shall come, perhaps,
But thou—O thou who hired me in thy strength—
Art gone three winters hence, and with thee went
The ledger-book wherein thy hand had marked
Each guest, each day, each coin that crossed the till.
I lock the gate. The water darkens still.
We wait, as those who keep the empty tombs.