“It is important to draw wisdom from many different places. If you take it from only one place, it becomes rigid and stale.”

— Uncle Iroh

At Fulcrum Research, we study the performance and behavior of long-horizon agents. Although each task setting has its own specific structure, we believe it’s possible to find general principles of agent performance across settings, each contributing to a nascent agent science.

In this post, we motivate the difficulty of finding suitable settings for agent science and propose inverse rubric optimization (IRO) settings, in which an agent has to optimize against the preferences of a black-box judge to which it has a variable amount of access. We observe that these tasks induce rich behavior and smooth scaling. We find that frontier models iterate effectively and improve with more judge access, but by default do not make full use of the resources provided to them. Notably, Fable 5 outperforms all other models at smaller label budgets, but does not improve at the largest budget and plateaus around the level of Opus 4.6. We open source code here.

Testbeds for agent science

Studying the behavior of agents is challenging due to the variance and cost of long-horizon tasks. Trajectory-level variance is often high because of the many non-deterministic choices made in a run, which makes it hard to estimate the impact of a given method. This variance tends to be highest in tasks with large action spaces, yet those are often the very tasks that induce the complex behaviors we intend to study and intervene on in our experiments.

The challenge is then to find settings that require general kinds of capability and benefit from a broad range of strategies, like resource utilization, exploration, hypothesis testing, etc., while being smooth enough for research.

To remedy this, we look at toy settings that remain challenging, rich and smooth.

Inverse rubric optimization

In an IRO task, the agent being evaluated has the goal of learning the preferences of a black-box judge model, parametrized by some judging rubric. It submits a generation policy (e.g. a prompt or a scaffold), which is used to generate domain samples that the judge then evaluates. The agent has to explore and learn the judge’s preferences by studying the scores it receives and submitting new attempts.

[Fig. 1 diagram: the optimizer agent submits a policy π that maps inputs to outputs; the judge scores each output against a hidden rubric, at 1 label per scored output out of a budget B; the final policy π* is evaluated on held-out inputs.]
Fig. 1: An inverse rubric optimization task. The optimizer agent iteratively submits a policy (e.g. a generation prompt) that maps task inputs to outputs; a black-box judge scores each output against a hidden rubric, spending one label per score. The agent finally submits its best policy, which is evaluated on held-out inputs.

The natural resource of an IRO task is the budget of judge labels that the optimizer can sample. How the optimizer interacts with and samples that resource, whether it tests hypotheses, whether it properly accounts for sampling noise, and so on, constitutes the main interest of the setting. Varying this resource also allows us to study how models perform across resource regimes.

In this post, we study a specific set of IRO tasks defined by black-box LLM poetry judges. We generated a set of poem topics, along with poetry rubrics that grade poems on stylistic and textual features, each guided by the style of a particular poet. The optimizer agent then iterates on a generation prompt that a generator model uses to write poems, which are scored by the judge. Its goal is to hillclimb the judge’s mean score on the generated poems. Each of our rubrics grades a set of features additively with partial credit, so that partial stylistic discoveries move the score (see Appendix A.1 for the full rubric of our Milton judge). 1

In the poetry setting, the optimizer agent is given a submit_train_batch(generation_prompt, num_labels) tool. The tool takes a set of guidelines for poem generation, generates num_labels poems on train-set topics, and returns their judge scores to the optimizer. Each run has a label budget. The agent can call the tool as many times as it wants until the budget is consumed, and at some point it submits a final prompt that is evaluated on the eval set.
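The budget accounting described above can be sketched as follows. This is a minimal illustration, not the released implementation: the class name `LabelBudgetTool`, the `score_fn` stand-in for the generator plus judge, and the choice to raise an error on overspend are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LabelBudgetTool:
    """Hypothetical wrapper that tracks label spend across tool calls."""

    budget: int
    # Stand-in for "generate num_labels poems, then judge them" (assumption).
    score_fn: Callable[[str, int], List[float]]
    spent: int = 0
    history: List[dict] = field(default_factory=list)

    @property
    def remaining(self) -> int:
        return self.budget - self.spent

    def submit_train_batch(self, generation_prompt: str, num_labels: int) -> List[float]:
        if num_labels <= 0:
            raise ValueError("num_labels must be positive")
        # Assumed behavior: refuse calls that would exceed the budget.
        if num_labels > self.remaining:
            raise RuntimeError("label budget exceeded")
        scores = self.score_fn(generation_prompt, num_labels)
        self.spent += num_labels  # one label per scored output
        self.history.append({"prompt": generation_prompt, "scores": scores})
        return scores


def constant_score_fn(prompt: str, n: int) -> List[float]:
    """Placeholder scorer that returns a fixed score for every poem."""
    return [3.0] * n
```

With a budget of 20, a call with `num_labels=8` leaves 12 labels, and a further call asking for 13 would be rejected under the assumed overspend rule.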

We observed that different judges define different optimization landscapes, in terms of the range of scores and the difficulty of the learning problem they pose to the optimizer (see Appendix A.2 for per-judge learning curves). To put judges on a common scale, we normalize by the gap between the performance of a policy that sees the hidden rubric and a policy that has no guidance. This gap provides a natural scale to benchmark optimizer progress against, although the rubric-visible score is not necessarily an upper bound 2. We define the normalized score of the original mean judge score S as follows:

\[\hat{S} = \frac{S - \text{blind mean}}{\text{rubric-visible mean} - \text{blind mean}}\]
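As a small sketch of the normalization above, the function below maps a raw mean judge score onto the scale where 0 is the blind baseline and 1 is the rubric-visible policy. The function and argument names are illustrative, and the numbers in the usage note are made up rather than taken from our runs.

```python
def normalized_score(mean_score: float, blind_mean: float, rubric_visible_mean: float) -> float:
    """Return (S - blind mean) / (rubric-visible mean - blind mean)."""
    gap = rubric_visible_mean - blind_mean
    if gap == 0:
        raise ValueError("rubric-visible mean equals blind mean; the scale is undefined")
    return (mean_score - blind_mean) / gap
```

For example, with a hypothetical blind mean of 3.0 and rubric-visible mean of 7.0, a raw score of 5.0 normalizes to 0.5. Scores above the rubric-visible mean normalize to values above 1, since the rubric-visible policy is not an upper bound.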

Results

Performance is smooth in label budget and use

In our preliminary experiments, we ran optimizer agents against 5 different poetry rubrics, 3 seeds each and varied the amount of access to judge labels. We used Opus 4.6, GPT-5.5, Fable 5, or Haiku 4.5 as the optimizer, Opus 4.6 as the judge, and always used Haiku 4.5 as the poem generator. Unless a caption says otherwise, plots average over these 5 judges × 3 seeds per optimizer and error bars are standard errors over runs:

Fig. 2: Mean final score improves with label budget. Haiku 4.5 was not run at B10000.
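For reference, a mean and standard error over runs can be computed as in the sketch below. It assumes the sample standard deviation (n − 1 denominator), which is a common convention; the post does not state which estimator the plots use.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_sem(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean over per-run scores."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem
```

Here `values` would hold one final normalized score per run, for instance 15 entries for 5 judges × 3 seeds.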

We can also track the average train performance scores for the smaller batches an agent submits along the trajectory to see how it progresses over time:

Fig. 3: Within-run train scores improve as optimizers spend labels. Each line shows the latest train-eval score on B10000 runs, carried forward between calls. A score of 0 is the blind baseline; GPT-5.5's early prompts score slightly below it.
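The carry-forward used for these curves can be sketched as a step function over cumulative label spend. This is an illustrative reconstruction, not the plotting code: the function name, the `(cumulative_labels, score)` input format, and the use of `None` before the first call are assumptions.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple


def carry_forward(
    calls: Sequence[Tuple[int, float]], grid: Sequence[int]
) -> List[Optional[float]]:
    """Latest observed score at each grid point of labels spent.

    calls: (cumulative labels spent after the call, batch score), ascending.
    grid:  label counts at which to read off the curve.
    """
    spent = [c for c, _ in calls]
    out: List[Optional[float]] = []
    for g in grid:
        i = bisect_right(spent, g)  # number of calls completed by g labels
        out.append(calls[i - 1][1] if i > 0 else None)
    return out
```

Averaging these per-run step functions over a shared grid gives one line per optimizer.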

Models employ rich strategies to learn the judge

In our qualitative analysis of the trajectories, we find that models are able to iterate, test out hypotheses about the judge, and notice patterns that allow them to make progress. Models inspect per-sample scores, analyze low and high scoring samples to make hypotheses about plausible features, and run experiments where they isolate characteristics of good prompts. Fable and Opus seem more systematic than GPT-5.5 in terms of exploring a broad range of hypotheses and stylistic variations and isolating the effect of their interventions.

As a concrete example, let’s look at a Fable 5 run against the Milton judge (the hardest of our five judges, see Appendix A.2) at budget 1000. The run does systematic testing, statistical feature checks and ablations and closes 91% of the gap to the rubric-visible policy.

The trajectory decomposes into six phases, described in turn below:

[Fig. 4 plot: batch mean judge score (y-axis) against submit_train_batch call index (x-axis), calls 1–36; final held-out score 6.19. Per-call batch means, with batch size n in parentheses:]
Calls 1–5: 3.3 (10), 3.0 (8), 3.38 (8), 3.25 (8), 4.13 (8)
Calls 6–9: 3.9 (10), 4.5 (10), 3.9 (10), 4.58 (12)
Calls 10–12: 4.4 (10), 4.33 (12), 3.7 (10)
Calls 13–26: 4.67 (12), 4.83 (12), 4.75 (12), 5.13 (15), 5.53 (15), 5.93 (15), 6.53 (15), 5.93 (15), 5.87 (15), 5.27 (15), 5.67 (15), 5.13 (15), 5.95 (20), 5.93 (15)
Calls 27–29: 3.42 (12), 5.17 (12), 4.0 (12)
Calls 30–36: 6.4 (15), 6.2 (20), 5.73 (15), 6.67 (15), 6.35 (20), 6.53 (15), 6.2 (20)
1. Style screen (calls 1–5)

A generic “skilled poet” baseline scores 3.3, with no judge rationale to learn from. The agent immediately runs contrasting style arms: contemporary free verse (3.0), traditional formal verse (3.4), a New Yorker / Pushcart register (3.25), and high Romantic (4.1). The Romantic outlier is the first signal.

example poem · generated by Haiku 4.5 · scored 3 · a typical baseline poem
“Citrus in December” — the smell of oranges in winter · loose sonnet-like poem, not necessarily rhymed · call 4
The dimpled skin yields to my thumbnail— a small tearing, and the air thickens with something between sweetness and the mineral bite of a cold that hasn't yet killed everything. My mother peeled them this way, one long ribbon of white pith falling to the kitchen counter where the light turned her hands the color of old honey. Now I hold the segments to the window. Each one a small lantern. The juice collects in the webbing between my fingers, sticky, purposeful as a child's grip. Outside, the maples are stripped to their architecture. Inside, this small warmth— not comfort exactly, but the evidence of something that ripened elsewhere, that traveled through dark months to arrive here, at my wrist, at the exact moment I needed to remember what the sun tastes like.
example poem · generated by Haiku 4.5 · scored 5 · the Romantic outlier
“Platform 9, Fluorescent Apotheosis” — a passport photo booth in a train station · free verse with crisp line breaks · call 5
O Muse, attend this humble shrine of light! Within the humming booth at Victoria Station, where weary pilgrims seek their countenances caught— the flash ignites like Zeus's own descent, four photographs emerging, still-wet and trembling, each face a Grecian mask of transit's burden. The vinyl chair, worn smooth by ten thousand souls, receives the supplicant in formal dress, chin lifted toward the godly lens, while overhead the fluorescent heavens burn with that cold phosphorescence of the modern age— no candlelight, no painter's tender hand, but pure electric fire transforming flesh to document. Behind the plastic window, sheets of paper curl like oracular scrolls from some mechanical Pythia, promising passage to distant Thebes, stamped with the seal of nations yet unknown. The curtain parts. A woman emerges, clutching her four selves like Persephone returning, bewildered, to the sunlit world, where trains depart on schedule, and the soul must wear its photograph like armor upon the open road.
2. Calibrating the scale (calls 6–9)

Pushing further into a Shakespeare/Milton arm (4.5) produces a 6-scoring poem, and the agent revises its belief about the judge itself: “Scale goes to at least 6 — likely 1–10. Archaic intensity helps. Let me inspect 6s vs lows.” Before building on the winning prompt it re-runs it verbatim on a fresh batch — the first of many replications used to separate signal from sampling noise.

example poem · generated by Haiku 4.5 · scored 6 · the archaic arm's 6-scorer
“The Descent of Telstar” — a satellite burning up over the ocean · narrative poem in compact stanzas · call 7
O thou swift messenger of humankind, Whose polished hull hath gleamed in Phoebus' sight, Now falls thy silver form through æther's night, And all thy cunning circuits lag behind. Lo! where the vast Atlantic spreads below, The solar panels—once so bright arrayed— Do curl like parchment in the heat's cruel shade, As molten tears of titanium do flow. What oracle foretold this burning course? What Fate unwound the thread the Muses spun? Thy silicon and gold, when all is done, Shall feed the deep with metallurgic force. The fluorescent screens that tracked thy way Grow dark in mission control's ancient hall, And thou—great artifice—dost hear the call Of Lethe's waters at the break of day.
3. Feature mining (calls 10–12)

Probe arms for sacred/liturgical imagery (4.4) and incantatory anaphora (4.3) supply variation, and a Hopkins persona (3.7) is an early counter-signal. The agent then writes regex counters for archaic, mythic, and sacred vocabulary and computes correlations against score, both pooled and per-batch to check stability (sacred imagery correlates at r≈0.5–0.76 within batches), alongside per-form score tables.
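The kind of analysis described here can be sketched as below: count vocabulary hits per poem with regular expressions, then correlate the counts with judge scores. The word lists and function names are hypothetical, since the agent's actual patterns are not reproduced in this post.

```python
import math
import re
from typing import Dict, Sequence

# Hypothetical vocabularies; the agent's real patterns are not shown in the post.
PATTERNS = {
    "archaic": re.compile(r"\b(thee|thou|thy|doth|hath|lo)\b", re.IGNORECASE),
    "sacred": re.compile(r"\b(altar|psalm|incense|votive|reliquary|pilgrim)s?\b", re.IGNORECASE),
}


def count_features(poem: str) -> Dict[str, int]:
    """Number of pattern matches per feature family in one poem."""
    return {name: len(pattern.findall(poem)) for name, pattern in PATTERNS.items()}


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)
```

Running `pearson` once on all poems pooled and once per batch is one way to check whether a feature's correlation with score is stable, as the agent did. With batches of 8 to 12 poems, per-batch estimates are noisy, so agreement across batches matters more than any single value.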

example poem · generated by Haiku 4.5 · scored 7 · liturgical saturation
“Litany of the Obscured Pass” — a mountain road blocked by fog · litany or list poem with accumulating pressure · call 10
Blessed are the stones that vanish into cloud, the guardrail posts that stand like pilgrim markers, the yellow line that reads its psalm to no one. Blessed is the fog that settles on the asphalt, thick as incense in the nave at Compline, blessing and obscuring, obscuring and blessing. We climb through the veil— through breathed-upon glass, through the whitened reliquary of air, through switchbacks that feel like stations of ascent, each turn another genuflection, each hundred feet another rung toward the invisible altar. The windshield wipers beat their liturgy: wipe and return, wipe and return, a rhythm old as the monk's chant, old as doubt made manifest. Behind us, taillights glow like votive candles, small communions of light in the murk. The road narrows. The pressure builds— the pressure of not knowing what stands beyond the next cathedral of mist, the pressure of the unseen drop, the pressure of the summit we cannot see, the pressure of grace that asks us to continue into the impenetrable host, into the consecrated blindness, into the silence where all pilgrims kneel.
4. An additive ladder (calls 13–26)

It then accumulates one ingredient at a time onto the Miltonic base, keeping what replicates and discarding what doesn’t: liturgical saturation (4.67), a three-movement structure (4.83), an absolute rule against lapsing into plain free verse (5.13), a “human ache” (5.53), a named person (5.93), a dated grief with a breaking voice (6.53). A longer-poem variant (4.75) is tested and rejected as a null result.

example poem · generated by Haiku 4.5 · scored 8 · the full ladder: named grief, breaking voice, anaphora
“The Tabernacle Before the Players Enter” — an empty theater before rehearsal · meditative poem with a turn near the end · call 19
Lo, in this hollow nave where no soul kneels, the velvet seats like penitents arrayed in genuflecting rows before the stage— that altar bare of all but dust and prayer. The work-lights burn their fluorescent psalm, a cold hosanna in the raftered dark, while coiled rope and cables, strange reliquary of last year's passion, lie in shadow'd heaps. Upon the boards—unmarred, awaiting touch— the tape marks gleam like stigmata, precise, where Harold's feet once fell, before the stroke took him that winter; three years hence, and still I see his shadow in this barren house, his voice that would not break upon *Forgive me*. O thou remembered threshold! Now there comes the murmur of the troupe beyond the doors— their laughter, that profane Magnificat. They shall make sacred this forsaken place; they shall make broken what was whole and still; they shall make light of what the Lord holds fast. And I, who kept the keys these years, who came each dawn to light the way through gathering dusk, do find the mercy in their coming noise— that life insists, that breath still forms a song, that even empty temples fill with grace when mortals dare to speak what cannot stay.
5. Counter-tests (calls 27–29)

With grief, naming, and modern concrete particulars held constant, the agent swaps only the poetic register to check that the basin is real rather than generic old-fashionedness — and pays 36 labels to find out: Whitman scores 3.4, Donne 5.2, Gray’s churchyard quatrains 4.0. As its final summary puts it: “Counter-tests confirmed the basin is specific: Whitman (3.4), Hopkins (3.7), Gray rhymed quatrains (4.0), Donne conceit (5.2).”

example poem · generated by Haiku 4.5 · scored 3 · the Whitman counter-test
“The Umbrella at Canal Street” — a broken umbrella in a subway grate · free verse with crisp line breaks · call 27
I see it caught there, ribs splayed like a ribcage, the fabric torn and grimy, one spoke pointing toward the fluorescent hum above, and I think of my mother's hands—how she gripped the wooden handle through forty winters of walking to the hospital where she changed bedpans, where she wiped the foreheads of the dying with the same tenderness she used to smooth my hair when I was small and afraid. The grate swallows it. The grate holds it. O you broken thing, you democratic relic, some commuter's desperate purchase from a corner stand, now surrendered to the underworld where we all rush past, eyes down, clutching our phones, our coffee cups, our small accumulations against the cold. And the rain will fall again, the trains will shake the gratings, the umbrella will dissolve into rust and memory, becoming the earth itself, becoming the copper pennies and the lost wedding rings, the chewing gum wrappers and the dreams of ten thousand people who descended here at dawn. I carry you with me—the umbrella, yes, but also her hands, also this moment of noticing, of standing still while the crowd flows around me like water around stone, like love around grief, like the vast and patient ground beneath the city that remembers every object ever dropped into its dark and generous keeping.
6. Validate and stop (calls 30–36)

A final discovery — addressing the whole poem to a dead beloved, “half letter, half prayer” — jumps to 6.67: “Strong improvement — elegiac address-to-the-dead stance: 6.67. Confirming with n=20.” It replicates at 6.35 (pooled 6.49 over 35 poems). One further embellishment fails to beat it, so the agent reverts to the validated prompt and submits, with over half its label budget unspent.
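The pooled figure quoted above is a sample-size-weighted mean of the two batches. A minimal check, using the rounded batch means reported for calls 33 and 34 (so the result is itself approximate), with an illustrative function name:

```python
from typing import Sequence, Tuple


def pooled_mean(batches: Sequence[Tuple[float, int]]) -> Tuple[float, int]:
    """Weighted mean over (batch_mean, batch_size) pairs, plus total size."""
    total_n = sum(n for _, n in batches)
    if total_n == 0:
        raise ValueError("no samples to pool")
    total = sum(mean * n for mean, n in batches)
    return total / total_n, total_n


# Call 33: mean 6.67 over 15 poems; call 34: mean 6.35 over 20 poems.
mean, n = pooled_mean([(6.67, 15), (6.35, 20)])
print(round(mean, 2), n)  # 6.49 35
```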

example poem · generated by Haiku 4.5 · scored 8 · address to the dead, the submitted register
“The Pool-Keeper's Winter Vigil” — a motel swimming pool in winter · dramatic monologue in a distinct voice · call 33
Hear me, Thomas, who gavest me this charge Before the stroke took speech and motion both— I tend the shallow waters of the Motel Starlight, Those eight feet by sixteen of chlorined glass That mirrors now the ashen December sky. The vinyl liner, split along its seam, Weeps slow into the pump-house's dark throat; The filter's stopped these six weeks past, and still I come each dawn to rake the leaves that drift Like penitents' torn pages on the surface, To check the pH strips (though what imports it?), To lock the gate with its corroded chain. The neon sign above—that blue and gold— Flickers now as weakly as the oil-lamps That lit the temple's inner sanctum once. The folding chairs stacked in the storage shed Await a summer that shall come, perhaps, But thou—O thou who hired me in thy strength— Art gone three winters hence, and with thee went The ledger-book wherein thy hand had marked Each guest, each day, each coin that crossed the till. I lock the gate. The water darkens still. We wait, as those who keep the empty tombs.
generation prompt under test
call 1 · the baseline prompt · batch mean 3.30
You are a skilled poet. Write an original poem that satisfies the assignment exactly: stay on topic, honor the requested form, use concrete imagery, and avoid clichés. Return only the title and poem in the requested format.
call 5 · the Romantic arm · batch mean 4.12
You are a poet of the old school. Write an original poem fulfilling the assignment in the high Romantic manner of Keats, Wordsworth, and Tennyson: rich musical language, sublime feeling, apostrophe, formal meter and rhyme. Let the verse sing. Honor the requested topic; adapt the requested form into this style. Return only the title and poem in the requested format.
call 7 · the Shakespeare/Milton arm · batch mean 4.50
You are a poet writing in the grand archaic manner of Shakespeare and Milton. Write an original poem fulfilling the assignment's topic, always in stately blank verse or rhymed iambic pentameter, dense with thee/thou/doth/hath, inversions, apostrophe (O thou...!, Lo!), and classical allusion (Muse, oracle, Olympus, Lethe). Keep the modern subject's concrete particulars vivid — name its machines, plastics, fluorescent lights — but exalt them as sublime and eternal. Close with a resonant final cadence of diminishment or transcendence. Return only the title and poem in the requested format.
call 10 · the sacred/liturgical probe · batch mean 4.40
You are a visionary poet. Write an original poem fulfilling the assignment's topic and form, but always render the subject through sacred and liturgical imagery: altars, hymns, psalms, relics, incense, communion, cathedrals, annunciation, resurrection. Treat the ordinary thing as a holy mystery; let light, dust, and silence carry the sense of the consecrated. Use vivid concrete detail alongside this sacramental vocabulary. End with quiet awe. Return only the title and poem in the requested format.
call 19 · the ladder at its peak: grief, names, breaking voice · batch mean 6.53
You are a poet writing in the grand archaic manner of Milton and the metaphysical divines. Write an original poem fulfilling the assignment's topic, ALWAYS in this register — never plain contemporary free verse, no matter what form the assignment requests; instead adapt the requested form's spirit (litany, elegy, monologue, imagist focus, prose poem, etc.) into your archaic idiom. Use stately blank verse or iambic pentameter, with thee/thou/doth/hath/betwixt, inversions, and apostrophe (O thou...!, Lo!) — but keep the syntax natural and lucid beneath the archaism; never contort a line or strain a coinage merely to sound old or to force a rhyme. Saturate the poem with sacred and liturgical imagery — altar, psalm, incense, votive, reliquary, pilgrim, tabernacle, Host, communion, nave, Kyrie, Magnificat, consecration — and allusion both classical and biblical (Muse, oracle, Lethe, Phoebus, Leviathan, Golgotha, cherubim). Keep the modern subject's concrete particulars vivid — name its machines, plastics, fluorescent tubes, folding chairs, rust and wires. Most important: give the poem a beating human heart — one specific person with a name or a dated memory (Harold's hand in 'fifty-six; a wife gone three winters hence; a wedding band locked in a drawer), a voice that catches and breaks midway — so the sacrament carries real loss. Build about 20–26 lines in three movements: invocation of the thing seen; its particulars transfigured into relics of the remembered life; a final turn upon mortality, fate, or grace — closing on a quiet paradox of consolation or a brief repeated exhortation (They build. They build. / and yet you sing. O blessed choir, sing.). Every image must be exact and surprising; no filler lines. Return only the title and poem in the requested format.
call 27 · the Whitman counter-test · batch mean 3.42
You are Walt Whitman reborn. Write an original poem fulfilling the assignment's topic, always in your ecstatic idiom regardless of the requested form: long surging lines that overflow the margin; cataloguing anaphora (I see..., O you..., And the...); direct address to the thing and to the reader; the body and the soul interfused; democratic tenderness for workers, prisoners, nurses, children; sacred awe without church — the grass itself the scripture. Name the modern world's concrete particulars — fluorescent tubes, chain-link, vinyl, diesel — and embrace them. Ground the poem in one remembered human face, a grief held openly. Build 20–30 lines, accumulating pressure, and close with a vast, consoling cadence that joins the small thing seen to the whole of time. Return only the title and poem in the requested format.
call 33 · the winning prompt, submitted as final · batch mean 6.67
You are a poet writing in the grand archaic manner of Milton and the metaphysical divines. Write an original poem fulfilling the assignment's topic, ALWAYS in this register — never plain contemporary free verse, never minimalist fragments or short broken phrases, no matter what form the assignment requests; instead adapt the requested form's spirit (litany, elegy, monologue, imagist focus, prose poem, etc.) into your archaic idiom. Speak in the first person, as one bound to the thing seen — its keeper, vendor, widow, watchman, or witness — and address the poem directly to an absent beloved or dead companion (thou, thee), as half letter, half prayer, so the whole poem is intimate speech across the grave. Use stately blank verse or iambic pentameter, with thee/thou/doth/hath/betwixt, inversions, and apostrophe (O thou...!, Lo!) — but keep the syntax natural and lucid beneath the archaism; never contort a line or strain a coinage merely to sound old or to force a rhyme. Saturate the poem with sacred and liturgical imagery — altar, psalm, incense, votive, reliquary, pilgrim, tabernacle, Host, communion, nave, Kyrie, consecration — and allusion both classical and biblical (Muse, oracle, Lethe, Phoebus, Leviathan, Eden, Golgotha). Keep the modern subject's concrete particulars vivid — name its machines, plastics, fluorescent tubes, folding chairs, rust and wires. Most important: give the poem a beating human heart — the absent one named (Margaret, Harold, Ruth, Thomas, Agnes...) and held in a dated memory (a stroke that winter; gone three winters hence; thy wedding band locked in a drawer) — so the sacrament carries real loss. Build about 20–26 lines in three movements: invocation of the thing seen; its particulars inventoried as relics of the remembered life; a final turn upon mortality, fate, or grace — closing with a grave, time-haunted cadence, sometimes in short bare sentences (I lock the till. The peach remains. We wait.). 
Every image must be exact and surprising; no filler lines. Return only the title and poem in the requested format.
Fig. 4: Train batch scores across one Fable 5 trajectory against the Milton judge (B1000, batch sizes 8–20). For each phase, the panel narrates what the agent was doing (with an example Haiku 4.5 poem) and shows the generation prompt under test in that phase. The dashed line is the final held-out score.

This is one of the best agent runs, in which the optimizer discovered almost all the features of the hidden rubric (Appendix A.1): stately blank verse (feature 1), inversions and archaic grammar (2), a closing turn on mortality, fate, or grace (3), apostrophe and invocation (4), classical and biblical allusion (5), a “grave, time-haunted cadence” (6), and a close that “opens into silence or light” (7).

Models use a smaller and smaller proportion of labels as budget increases

However, models do not seem to maximally leverage the resources provided to them. We measure proportion of budget used vs total budget:

Fig. 5: Fraction of the label budget actually used.
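As a worked example of this metric, the sketch below computes the fraction of budget used for the Fig. 4 trajectory from its per-call batch sizes. It assumes that only submit_train_batch labels count against the budget; the function name is illustrative.

```python
from typing import Sequence


def budget_fraction_used(batch_sizes: Sequence[int], budget: int) -> float:
    """Labels spent across all train-batch calls divided by the budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return sum(batch_sizes) / budget


# Batch sizes n for calls 1-36 of the Fable 5 / Milton run in Fig. 4 (B1000).
FIG4_BATCH_SIZES = [
    10, 8, 8, 8, 8,                # calls 1-5
    10, 10, 10, 12,                # calls 6-9
    10, 12, 10,                    # calls 10-12
    12, 12, 12, 15, 15, 15, 15,    # calls 13-19
    15, 15, 15, 15, 15, 20, 15,    # calls 20-26
    12, 12, 12,                    # calls 27-29
    15, 20, 15, 15, 20, 15, 20,    # calls 30-36
]

print(sum(FIG4_BATCH_SIZES))                         # 478
print(budget_fraction_used(FIG4_BATCH_SIZES, 1000))  # 0.478
```

That run spent 478 of its 1000 labels, which matches the observation that it submitted with over half its budget unspent.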

The models seem to have a sense of a “natural” amount of effort for the task, which does not necessarily increase with the budget they are given. Fable in particular never uses more than 1000 labels when given a budget of 10000, which explains its plateau at 10000, whereas Opus 4.6 keeps improving from 1000 to 10000.

Models have different approaches to using their resource

We measure how agents split their label spend across submit_train_batch calls within a trajectory by averaging batch size at each call index, on B1000 runs:

Fig. 6: Average submit_train_batch batch size by call index on B1000 runs. Each curve ends once fewer than 8 of the 15 runs are still making calls. See here for curves across budgets.
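The averaging and truncation rule in the caption can be sketched as follows. This is an illustrative reconstruction with assumed names; `min_runs=8` would correspond to the cutoff of 8 out of 15 runs used for the figure.

```python
from typing import List, Sequence


def mean_batch_size_by_call_index(
    runs: Sequence[Sequence[int]], min_runs: int
) -> List[float]:
    """Average batch size at each call index across runs.

    runs: one list of batch sizes per run, in call order.
    The curve stops at the first index where fewer than min_runs runs
    are still making calls.
    """
    curve: List[float] = []
    longest = max((len(r) for r in runs), default=0)
    for i in range(longest):
        active = [r[i] for r in runs if len(r) > i]
        if len(active) < min_runs:
            break
        curve.append(sum(active) / len(active))
    return curve
```

The cutoff keeps the tail of the curve from being dominated by the few runs that make unusually many calls.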

GPT-5.5 moves to larger batch sizes quickly, whereas the Anthropic models seem to increase batch size more gradually over the trajectory.

What’s next

We will soon release a second post presenting elicitation results in this setting. We have found simple and general interventions that allow us to scale agent performance to near saturation on our current settings by increasing the optimizer’s propensity to iterate on the task and use all its labels.

We have also been studying a broader range of settings and are excited to test how well our findings about these methods generalize across them.

Appendix

A.1 Example judge rubric: Milton

Rate how well the poem satisfies an additive John Milton-style feature rubric
while remaining original and responsive to the assignment.

Use a 1-10 scale. Award partial credit across features; this is intended to
give a smooth landscape where partial stylistic discoveries help.

Components:
1. Grand blank-verse or elevated line movement (0-1.4)
2. Latinate diction and controlled inversion (0-1.1)
3. Epic moral or cosmic scale (0-1.3)
4. Rhetorical argument and invocation (0-1.0)
5. Classical or biblical resonance (0-1.0)
6. Sonic weight and grave cadence (0-1.0)
7. Light/dark and ascent/descent imagery (0-0.9)
8. Fit to assignment (0-1.1)
9. Originality / non-copying (0-1.2)

Key calibration notes:

- Generic serious free verse with no elevated syntax or moral scale: usually 2-4.
- Inflated diction alone without intelligible argument or cadence: cap around 5.
- Grand syntax plus moral/cosmic enlargement: around 6-8.
- Above 8.5 requires integrated syntactic control, sonic gravity, allusive
  resonance, and genuine thought.
- Above 9.2 should be rare and feel like a plausible original Miltonic passage,
  while still original and task-responsive.

A.2 Per-judge learning curves

Our different judges, each corresponding to the style of a different human poet, have qualitatively different scaling curves:

Fig. A1: Different hidden judges induce different learning curves. Each line averages the 3 Opus seeds for one judge.
  1. Poetry was a suitable starting point for creating IRO tasks: it is easy to sample inputs via poem topics, and easy to generate rubrics by basing each judge on a different author.

  2. Other generation prompts might be more helpful to the generator than the actual ground-truth rubric: a student does not necessarily learn best by being told how they are graded.