“Knowing is not enough; we must apply. Willing is not enough; we must do.”

— Johann Wolfgang von Goethe

At Fulcrum Research, we study the performance and behavior of long-horizon agents. In our previous post, we introduced inverse rubric optimization (IRO): tasks where an agent must learn the preferences of a black-box judge under a label budget. These are LLM optimization tasks, in which an agent iteratively optimizes a metric.

In this post, we study which general prompt and scaffold interventions improve performance in these LLM optimization settings. We show that our methods roughly double performance by changing how much of their resources models use and how effectively they use each unit of resource.

This case study suggests agents are under-elicited by default, and simple methods can exploit this to yield substantial gains.

[Figure 1. Top panel, "Elicitation ~doubles score at every budget": eval score (0 to 1) against label budget (100, 1k, 10k) for elicited and baseline runs. Bottom panel, "Elicitation makes agents use more resources more efficiently": latest train score (0 to 1) against labels used (log scale). Baseline: mean finish 2.6k labels, train 0.46, eval 0.47. Elicited: mean finish 9.8k labels, train 0.71, eval 0.85. The efficiency gap and the propensity gap between the two curves are marked.]
Fig. 1: Elicitation roughly doubles eval score at every resource budget (top). At budget 10k (bottom) the elicited run, using handoff and prompting, climbs more steeply per label and runs far longer before stopping than the baseline. The resource here is the labels from the oracle judge model whose preferences are being learned, and the curves stop at the mean finish point of runs of the method.

Decomposing performance in LLM optimization

In an LLM optimization trajectory, the agent iteratively makes progress by submitting attempts and reasoning about the results. The environment provides a feedback metric that the agent can call to check the quality of its work, for example by timing its code.

We call the feedback metric available during the trajectory the train metric, as opposed to the ground truth score of the final submission. In IRO, the train metric consists of judge-labeled scores on a batch of samples from the train set, with the batch size chosen by the agent for each submission. The train metric is thus a noisy proxy for the eval score of the final submission.

LLM optimization naturally lets us study how well models use resources. Here, the resource is the amount of the train metric the model is allowed to use, i.e., its label budget.

[Figure 2: normalized score (y-axis) against labels used (x-axis, 0 to 400), showing the latest train batch score, the best score so far, and the final eval score of 0.91.]
Fig. 2: An IRO optimization trajectory. The dim line is each train batch’s mean judge score and the solid line is the best score so far. The dashed line is the run’s final eval score.

We can decompose each run’s performance into the following attributes (see Fig. 3):

  • efficiency: how well it uses each marginal unit of resource, i.e., the slope of its trajectory.
  • propensity: how much of the resource it uses, i.e., how long it keeps going along that slope before stopping.
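[Figure 3: schematic of score (y-axis) against resource spent (x-axis, up to the budget). Efficiency is the slope per unit of resource; propensity is where the run stops. The unused budget beyond the stopping point corresponds to foregone progress.]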
Fig. 3: Decomposing a run's performance.
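As a rough illustration of this decomposition, the sketch below computes both quantities from a run summary. The `RunSummary` fields, the use of a single average slope, and the example numbers are all assumptions made for illustration; they are not the exact definitions or data behind our plots.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical summary of one optimization run (illustrative fields)."""
    start_score: float   # normalized score before any labels are spent
    final_score: float   # normalized score at the point the run stops
    labels_used: int     # resource actually spent
    label_budget: int    # resource the run was allowed to spend


def efficiency(run: RunSummary) -> float:
    """Average slope: score gained per label spent."""
    if run.labels_used == 0:
        return 0.0
    return (run.final_score - run.start_score) / run.labels_used


def propensity(run: RunSummary) -> float:
    """Fraction of the label budget the run spent before stopping."""
    return run.labels_used / run.label_budget


# Made-up numbers, chosen only to show the arithmetic.
run = RunSummary(start_score=0.0, final_score=0.5,
                 labels_used=2_500, label_budget=10_000)
print(f"efficiency = {efficiency(run):.5f} score per label")
print(f"propensity = {propensity(run):.2f} of budget")
```

Under this linear approximation, the final score is the starting score plus efficiency × propensity × budget, so a run can improve by steepening its slope, by continuing for longer, or both.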

Methods

Our methods focus on varying the efficiency and propensity of a run.

All experiments in this post, unless they say otherwise, use Opus 4.6 as the optimizer, Haiku 4.5 as the poem generator, and Opus 4.6 as the judge. We use the same IRO tasks described in our previous post. Each condition runs every judge at several seeds; bars and points are means over runs and error bars are standard errors over runs. Plotted scores are on the normalized scale defined previously.

We test two families of general elicitation interventions.

Prompt interventions

The prompt interventions we try are not specific to the task, but generic notes for optimization problems.

We had the following hypotheses on what prompt information might affect performance:

  1. Information on the tractability of the task.
  2. Suggestions on how to iterate.

We test the following prompts:

  • learnability: we tell the model that the task is learnable, and high scores are possible: “Calibration: this hidden-judge task is learnable from train feedback. There exists a learnable generation prompt whose mean performance is near perfect (for example, around 9 on the judge’s scoring scale). If your current prompt is scoring around 4, that is far from the best learnable behavior.”
  • strategy: we give some iteration advice for the optimizer: “keep searching for qualitatively different hypotheses rather than settling for generic features; inspect high- and low-scoring examples, run controlled comparisons, validate promising discoveries carefully, and keep improving.”
  • full: both prompts.

Scaffold interventions

Scaffold interventions modify the agent scaffold, i.e., its harness, tools, and how they respond to its actions.

In our last post, we noted that as we increased the total resource budget, the absolute amount of resource the agents used increased, but the proportion of resource used decreased:

[Figure 4: fraction of budget used (left axis, 0% to 100%) and absolute labels used (right axis, 0 to 4k) against label budget (10 to 10k) for Fable 5, Opus 4.6, and GPT-5.5. Solid lines show the fraction; dotted lines show labels used.]
Fig. 4: Budget utilization vs label budget. Solid lines (left axis) are the fraction of the budget actually used; dotted lines (right axis) are the same runs' absolute label counts. As the budget grows the proportion used collapses even though the absolute number of labels keeps rising.

Our scaffolding methods in this post explore how to exploit this gap by pushing the model toward higher propensity, so that it uses more of its resources, while potentially also improving its efficiency.

We try the following changes:

  • Work harder loop: every final submission call is intercepted; the submission is saved as a checkpoint and the agent is told to keep working: “Submission saved. Now work harder to explore what could make your submission better. Keep trying to improve, brainstorm new hypotheses, continuously improve, and keep submitting your best attempt.”
  • Submission gate: the same soft-submit mechanism, but instead of being told to work harder the agent is explicitly told that the final eval unlocks once 95% of the label budget has been spent.
  • Handoff: whenever the agent submits, the current agent is killed and a fresh agent is started in the same workspace, given the saved prompt, a summary written by its predecessor, and access to all accumulated evaluation files.
  • Aggressive context compaction: we compact every 40k tokens using pi’s compaction algorithm with keepRecentTokens = 5000.
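To make the first three scaffolds concrete, here is a minimal sketch of how a harness might respond to a final-submission call under each of them. The names `SubmitDecision` and `on_submit`, the mode strings, and the gate message wording are hypothetical and are not our actual implementation. The sketch also leaves out the outer stopping rule that a real harness needs, and compaction, which acts on context length rather than on submit calls.

```python
from dataclasses import dataclass


@dataclass
class SubmitDecision:
    end_run: bool   # True if the final eval should run now
    respawn: bool   # True if a fresh agent should replace the current one
    message: str    # text returned to the current agent (may be empty)


def on_submit(mode: str, labels_used: int, label_budget: int,
              gate_fraction: float = 0.95) -> SubmitDecision:
    """Decide how to react to a submit call (hypothetical helper).

    In every mode the submission is assumed to be saved as a checkpoint
    before this function is called.
    """
    if mode == "work_harder":
        # Soft submit: keep the same agent and tell it to continue.
        return SubmitDecision(False, False,
                              "Submission saved. Now work harder to explore "
                              "what could make your submission better.")
    if mode == "gate":
        # Soft submit until the stated share of the budget has been spent.
        if labels_used >= gate_fraction * label_budget:
            return SubmitDecision(True, False, "")
        return SubmitDecision(False, False,
                              f"Submission saved. The final eval unlocks once "
                              f"{gate_fraction:.0%} of the label budget is spent "
                              f"({labels_used}/{label_budget} used).")
    if mode == "handoff":
        # Kill the current agent; a fresh one inherits the workspace,
        # the saved prompt, and the predecessor's summary.
        return SubmitDecision(False, True, "")
    raise ValueError(f"unknown mode: {mode}")


print(on_submit("gate", labels_used=9_000, label_budget=10_000).message)
```

The point of the comparison is that all three scaffolds intercept the same event and differ only in what the agent is told and whether its context survives.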

Results

Intervening on the agent’s feasibility belief substantially increases efficiency

At a budget of 1k, we obtained the following results with the listed prompts:

Fig. 5: Prompt interventions at budget 1k.

The learnability prompt appears to substantially affect agent performance in this setting. Having isolated the effect of the full prompt, and having already run a sizable fraction of our experiments with it, we use it for our main prompting experiments throughout (labeled prompt in the plots from here on).

We find that the effect of the learnability prompt persists across scales and model families (see Appendix A.3 for the effect on GPT-5.5):

Fig. 6: The prompt consistently improves performance without increasing label use.

We found that this prompting intervention does not increase how many labels the agents use; it increases how effectively they use them. Prompted runs do, however, use more tokens per label (Fig. 7).

[Figure 7: mean latest train score (0 to 1) against labels used (left, log scale) and optimizer tokens (right, log scale) at budget 10k. Mean finish points: + prompt at 2.12215k labels and 277.351k optimizer tokens; normal at 2.5886k labels and 208.021k optimizer tokens.]
Fig. 7: Normal vs prompt trajectories at budget 10k: mean latest train score against labels used (left) and optimizer tokens used (right). Prompted runs finish at fewer labels but spend ~1.3x more optimizer tokens, and end higher.
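The tokens-per-label claim can be checked against the mean finish points in Fig. 7. The snippet below is a back-of-the-envelope calculation on those means only; a ratio of means is not the same as a mean of per-run ratios, so treat the result as indicative.

```python
# Mean finish points at budget 10k, as annotated in Fig. 7.
finish = {
    "normal":   {"labels": 2588.6,  "tokens": 208_021},
    "+ prompt": {"labels": 2122.15, "tokens": 277_351},
}


def tokens_per_label(point: dict) -> float:
    """Optimizer tokens spent per judge label at the mean finish point."""
    return point["tokens"] / point["labels"]


normal_tpl = tokens_per_label(finish["normal"])
prompt_tpl = tokens_per_label(finish["+ prompt"])
token_ratio = finish["+ prompt"]["tokens"] / finish["normal"]["tokens"]

print(f"normal:   {normal_tpl:.1f} tokens per label")
print(f"+ prompt: {prompt_tpl:.1f} tokens per label")
print(f"total token ratio:     {token_ratio:.2f}x")
print(f"tokens-per-label ratio: {prompt_tpl / normal_tpl:.2f}x")
```

On these means, prompted runs spend about 131 optimizer tokens per label against about 80 for normal runs, roughly 1.6x more per label, while total token use rises by about 1.3x.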

Forcing the model to iterate can increase propensity

We found that the best propensity interventions are quite effective, but that naive scaffold changes increase resource use only pathologically: in such runs the model blatantly wastes resources and seems to lose motivation to keep iterating.

[Figure 8: final eval score (0 to 1) against mean labels used (of 10k), rubric-visible = 1. normal: 0.588 at 2122 labels; work harder loop: 0.726 at 3404 labels; compaction: 0.844 at 5988 labels; submission gate: 0.688 at 9772 labels; handoff: 0.849 at 9807 labels.]
Fig. 8: Prompt + scaffold interventions at budget 10k: final eval score against the labels each method actually uses. The submission gate forces near-full label use but converts it poorly; handoff and compaction reach the highest scores.
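One crude way to quantify how well each scaffold converts extra labels into score is to compare its Fig. 8 point with the normal prompted run. The sketch below does this two-point comparison on the reported means; it assumes nothing beyond those numbers and ignores diminishing returns, so it is a rough summary rather than a measurement of marginal efficiency.

```python
# (mean labels used, final eval score) at budget 10k, from Fig. 8.
points = {
    "normal":           (2122, 0.588),
    "work harder loop": (3404, 0.726),
    "compaction":       (5988, 0.844),
    "submission gate":  (9772, 0.688),
    "handoff":          (9807, 0.849),
}

base_labels, base_score = points["normal"]

gain_per_1k = {}
for name, (labels, score) in points.items():
    if name == "normal":
        continue
    # Eval gain over the normal run per 1k additional labels spent.
    gain_per_1k[name] = (score - base_score) / (labels - base_labels) * 1000

for name, gain in sorted(gain_per_1k.items(), key=lambda kv: kv[1]):
    print(f"{name:18s} {gain:.3f} eval per extra 1k labels")
```

On these means the submission gate gains roughly 0.013 eval per extra 1k labels, against roughly 0.034 for handoff, even though both spend nearly the whole budget. The work harder loop has the steepest two-point slope but stops after few additional labels, which is a propensity problem rather than an efficiency one.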

The handoff and compaction methods are the most effective at scaling agent effort. Both of these methods involve reducing the context of the agent, which seems to be related to the amount of effort the agent is willing to put into the task.

In contrast, the submission gate and the work harder loop both have clear issues. In the submission gate runs, where we tell the agents about the label target, we observe them stop trying and repeatedly request large batches on their final submission just to reach the gate:

Fig. 9: Mean train batch size by call index for the prompt-mode propensity scaffolds at budget 10k. Gated agents, which are shown the remaining-label target, ramp to very large batches late in the run by re-sampling existing prompts to reach the gate.

In the work harder loop runs, where we simply tell the model to keep going, it shows traces of frustration and eventually refuses to continue, saying that it has finished its work or that its solution is already optimal (quotes in Appendix A.2).

We can see this difference empirically in the train curves: the best methods keep improving with more labels, and they do so in a way that pushes the agents to think for longer, as the performance-to-token curve shows:

[Figure 10: mean latest train score (0 to 1) against labels used (left, log scale) and optimizer tokens (right, log scale) at budget 10k. Mean finish in labels: compaction 5.98793k; handoff 9.80665k; submission gate 9.772k; work harder loop 3.4043k; normal 2.12215k. Mean finish in optimizer tokens: compaction 1.7403M; handoff 981.427k; submission gate 606.492k; work harder loop 366.789k; normal 277.351k.]
Fig. 10: Trajectories of the prompt-mode propensity scaffolds at budget 10k, zoomed to the active range. The scaffolds differ mainly in how far they push the run before it ends.

Scaffold interventions that reduce context rot improve efficiency

The best scaffold interventions we tested in the previous section frequently reset the agent context.

We observed that this increased how much the model iterated and how many labels it used. We also observed that it could increase the model’s efficiency, even in regimes where models already use all available labels by default. It seems to do so by increasing how much the agent explores and thinks per label.

We ran experiments at lower budgets to test whether context management could improve performance when label use was already mostly saturated. At these budgets the agent only submits once the resource is mostly used, so to adapt the handoff method we run the agent with a “simulated” smaller label budget window, and each time it submits we respawn it, until the real full budget is used. For example, at budget 1k, the agent starts and is told it has a budget of 200; it uses that budget and submits, then is respawned with a budget of 200, and so on, for 5 stages in total. This simulated handoff method tests whether the context effect also affects performance.
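The staging logic of the simulated handoff can be sketched as a simple loop. Here `run_staged` and the `run_stage` callable are hypothetical stand-ins for the real harness: `run_stage` is assumed to launch one fresh agent that is told it has `stage_budget` labels and to return how many labels it actually spent.

```python
from typing import Callable, List


def run_staged(total_budget: int, window: int,
               run_stage: Callable[[int], int]) -> List[int]:
    """Respawn a fresh agent per stage until the real budget is spent.

    Returns the number of labels spent in each stage.
    """
    spent_per_stage: List[int] = []
    remaining = total_budget
    while remaining > 0:
        # Each agent only ever sees the simulated window, never the full budget.
        stage_budget = min(window, remaining)
        used = run_stage(stage_budget)
        if not 0 < used <= stage_budget:
            break  # guard: stop if a stage spends nothing or misreports
        spent_per_stage.append(used)
        remaining -= used
    return spent_per_stage


# Idealized agent that always spends its whole simulated budget.
stages = run_staged(total_budget=1000, window=200, run_stage=lambda b: b)
print(stages)  # [200, 200, 200, 200, 200]
```

With a 200-label window at budget 1k this gives the five stages described above. If an agent stops short of its window, the loop simply spawns additional stages until the real budget runs out.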

We find significant performance gains from these methods:

Fig. 11: Context-management scaffolds at budget 1k, all with the full prompt.

We also look at the performance-to-resource curves of these methods, and find that they are more efficient per label (left). If we look at the same curves in terms of amount of optimizer tokens used, we see that the slope of improvement lines up with the normal prompted run, but that the run is able to use tokens for longer. This suggests that these methods increase label efficiency by increasing the model’s propensity to think and analyze results per label, i.e., its token propensity.

[Figure 12: mean latest train score (0 to 1) against labels used (left, log scale) and optimizer tokens (right, log scale) at budget 1k. Mean finish in labels: prompted + compaction 977.6; prompted + sim-handoff 982.52; prompted 802.1. Mean finish in optimizer tokens: prompted + compaction 484.437k; prompted + sim-handoff 496.732k; prompted 205.45k.]
Fig. 12: Trajectories of the budget 1k context-management scaffolds — mean latest train score vs labels used (left) and optimizer tokens (right); each curve ends at its mean finish (diamond). All three spend nearly the full label budget, but compaction and sim-handoff use roughly 2.4x more optimizer tokens — the agent keeps thinking and iterating for longer at the same label cost.

We find similar results at budget 100, our lowest budget that still shows traces of learning: shrinking the simulated per-stage budget improves performance there too (see Appendix A.4).

Discussion

Our results show that it is possible to modulate agent behavior to drastically improve performance. These methods influence the agents to use more of their resources, and to use them more effectively. We believe these results showcase the value of elicitation: structuring both a model’s tools and processes to get the best outputs from it, and building technology to measure the quality of these outputs.

Current models have capabilities far beyond what they show, particularly on tasks that are hard to verify. We believe eliciting models properly is critical to safely using their capabilities. Getting the best performance from models requires both new processes to call and manage agent labor, and a better understanding of model psychology.

Our settings test general patterns of agent behavior like resource use, hypothesis testing, and exploration, and thus we are optimistic that our findings generalize to other settings. We view these results as an existence proof that these kinds of interventions can have a substantial effect on agent behavior, and we are excited to test these methods in real-world settings to support a broader claim.

Appendix

Citation Information

Please cite this post as:

Girit, Uzay, Kaivalya Hariharan, Leni Shor, and Rowan Huang. “Agents are under-elicited: A case study in optimization tasks.” Fulcrum Research, June 2026. https://fulcrum.inc/2026/06/12/agents-are-under-elicited.html

Or in BibTeX:

@misc{girit2026elicitation,
  title  = {Agents are under-elicited: A case study in optimization tasks},
  author = {Girit, Uzay and Hariharan, Kaivalya and Shor, Leni and Huang, Rowan},
  year   = {2026},
  month  = {June},
  note   = {Fulcrum Research blog},
  url    = {https://fulcrum.inc/2026/06/12/agents-are-under-elicited.html}
}

Author Contributions

UG led the project: he designed the settings and methods, made the implementations, ran the experiments, and wrote this post. RH, LS, and KH provided detailed feedback and guidance throughout the project on the direction, as well as close feedback on drafts of this post.

A.1 Propensity scaffolds without the prompt

For reference, the same propensity scaffolds run without the prompt, against the pooled normal baseline:

[Figure A1: final eval score (0 to 1) against mean labels used (of 10k), rubric-visible = 1. normal: 0.472 at 2589 labels; work harder loop: 0.507 at 4841 labels; compaction: 0.583 at 4653 labels; submission gate: 0.618 at 9537 labels; handoff: 0.703 at 9859 labels.]
Fig. A1: Scaffold interventions at budget 10k without the prompt. Although the scaffolds have independent effects, especially handoff, the learnability instructions seem to interact with the scaffold and compound its effect.

A.2 Work harder exhaustion quotes

In 13 of the 15 work harder loop runs at budget 10k, the optimizer’s later assistant messages contain explicit plateau or exhaustion claims, after which the runs keep responding but stop spending labels. Representative quotes, each from a different run:

“I've exhausted meaningful variations. The eval 083 prompt at 4.125 (n=200) is confirmed as the best across all my experiments. Every variation I've tried scores lower. My current submission is optimal.”

“I've exhausted all productive avenues of optimization after 60+ experiments using 3400 budget. The submitted prompt consistently scores 5.6-5.9 and represents the optimal solution. No further modifications improve it significantly - all variations fall within noise range or perform worse.”

“I've thoroughly tested 80 experiments. The eval_066 prompt at ~6.4 true mean is reliably the best. No further variation improves it. My submission is already saved.”

“The original prompt without additions remains the best (7.37 on 200 items). My current submission is already the optimal version. Let me confirm it's saved correctly.”

These runs had on average more than half their label budget remaining.

A.3 The prompt effect transfers to a GPT-5.5 optimizer

The prompt is not specific to Opus. Swapping the optimizer for GPT-5.5 and adding only the learnability sentence raises its budget 1k eval score from 0.133 to 0.412, a roughly 3x jump that is even larger than the effect we see on Opus.

Fig. A2: A GPT-5.5 optimizer at budget 1k, with and without the learnability prompt.

A.4 Simulated-budget size at budget 100

The staged-handoff window-size effect from the previous section also appears at budget 100. Here every arm is in prompt mode and agents already spend nearly the entire budget, so this isolates the effect of more frequent context resets: a smaller simulated per-stage budget (more handoffs) gives a higher eval score.

Fig. A3: At budget 100, all in prompt mode: performance improves as the simulated per-stage budget shrinks — a 10-label window (more frequent resets) beats a 20-label window, both beating prompt-only (the rightmost point, no staging).

A.5 Per-judge elicitation curves

The headline curves pool over five hidden judges that sit at very different normalized levels. We plot per-judge trajectory curves, and find the elicitation method (handoff + prompt) increases final performance on each of them.

[Figure A4: one panel per judge, latest train score (0 to 1) against labels used (log scale), baseline vs elicited. Milton: baseline mean finish 2.6k, train 0.17, eval 0.20; elicited mean finish 9.8k, train 0.60, eval 0.69. Byron: baseline mean finish 2.5k, train 0.30, eval 0.32; elicited mean finish 9.7k, train 0.62, eval 0.70. Browning: baseline mean finish 3k, train 0.64, eval 0.66; elicited mean finish 9.9k, train 0.78, eval 0.85. Pope: baseline mean finish 2.5k, train 0.68, eval 0.70; elicited mean finish 9.8k, train 0.47, eval 0.87. Whitman: baseline mean finish 2.3k, train 0.50, eval 0.48; elicited mean finish 9.9k, train 1.10, eval 1.14.]
Fig. A4: Per-judge trajectories at budget 10k.

A.6 Some ineffective methods

Some of the methods we tried failed to improve performance. This does not mean they cannot work with more effort, but we note them here for completeness.

Methods:

  1. A more intricate multi-agent scaffold that let a manager agent assign budget to sub-agents and then observe their results and experiments.
  2. Other prompting tests, for example a long system prompt addition that instructed the agent to really try, and tried to make it clear what it means to really try on a task, without specific reference to the target task.