TLDR: Agents are being used with less and less human steering. But autonomous agents are instances of distributed systems with many complex failure cases. We advocate for a fault-tolerant and functional approach to designing these systems for reliability, drawing inspiration from Erlang. We bake this approach into a new version of our open source agent orchestration library called ramure - making it easy to write reliable agent software.
In the past, most agent use has concentrated on user-to-agent interactions, where the user constantly steers and grounds the model's actions. But models are now smart enough that it makes sense to run agents fully autonomously, with humans giving feedback only at higher-level boundaries.
However, getting agents to work reliably without constant human oversight is both an engineering challenge (i.e., getting every component of the infrastructure working reliably) and a design hurdle (i.e., designing processes that use agent labor effectively). Doing so requires more intentional system design, focused on the points where agents hand off work and where humans intervene. These systems also need to be able to self-correct.
Agent deployments are complex distributed systems
A swarm of agents coordinating across environments is a complex distributed system, and distributed systems are notoriously hard to get working well.
Agents are no exception, and in our work we’ve noticed issues like:
- timing errors around when agents provision, when they receive certain events, and when they coordinate versus go idle
- one agent in a system tolerating a lower standard of success, like relaxing test constraints, which ultimately makes the whole system worse
- a lack of isolation across the roles and responsibilities of agents, such that they confuse each other, slow down the system, or use affordances they weren't supposed to
Many of these kinds of issues have analogs in normal software, but an added complexity with agents is that they are expected to do much more complicated things. You have to monitor for failure not only at the software level but also at a semantic level, i.e., watching whether these distributed components are accomplishing their roles as intended. The complexity comes from the fact that these activities – maintaining properties of the system, building features, preparing reports, and so on – are not easily checkable in pure code.
Luckily, we’ve found that concepts from distributed software, along with the ability of agents to check more and more fuzzy things, can get us a long way to reliably running many agents usefully.
A detour via Erlang
Erlang is a concurrent functional programming language created for telecom applications, and its design is optimized for high concurrency and high reliability. WhatsApp and Discord famously ran on Erlang infrastructure.
Erlang is structured around concurrency — programs in Erlang work by defining and spawning processes that run at the same time. In Erlang:
- Processes are very cheap to start and stop
- Processes can communicate with each other, and run in parallel to accomplish tasks.
- Processes can be monitored by other processes.
- Processes are easy to debug and rerun.
This design encourages concurrency, modularity, and baked-in fault tolerance, where each computation signals potential failure clearly enough that other processes can patch it.
Agents should compose as fault-tolerant retriable processes
We can define a similar notion of an agent process (AP): a set of agents working and collaborating on machines with a shared lifecycle. We have found the following properties useful for designing good agent software:
- APs are very cheap to start and stop
- APs can communicate with each other, and run in parallel to accomplish tasks.
- APs can be monitored by other APs.
- APs are easy to debug and rerun.
When we want to design actual systems that use agents, rather than just having them interact and provision in unstructured ways, we need to be mindful of the purpose of various components, what it would mean for them to succeed, and how the system can recover if they don’t.
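One way to make that mindfulness concrete is to have every component declare its purpose, its success condition, and its recovery path up front. The sketch below is illustrative only; `ComponentSpec` is not a ramure type:

```python
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class ComponentSpec:
    """Forces each component of an agent system to declare what it is for,
    what success means, and how to recover. Illustrative; not a ramure type."""
    purpose: str
    succeeded: Callable[[str], bool]  # check the component's output
    recover: Optional[Callable[[], Awaitable[str]]] = None  # respawn on failure


# a reviewer component succeeds if its report reaches a verdict
reviewer = ComponentSpec(
    purpose="review a PR for introduced bugs",
    succeeded=lambda report: "verdict:" in report.lower(),
)
```

Writing the success check down as code (or as a prompt for a judging agent) is what lets the surrounding system decide when to retry.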
To make this concrete, let's say we want to build a worker pool AP. The worker pool defines a stack of worker agents, each running on a single task. The pool exposes methods to spawn and kill agents, as well as to audit the status of their work. Each worker is given tools to submit or fail its task, and another agent or program can add_tasks and audit the state of the worker pool.
When we design this AP, we want to be mindful of all the ways it can fail:
- basic software failures — agent connections or machines dropping, etc.
- timeouts where agents don't navigate to the next step, like a worker that never submits; our worker pool should make it easy to monitor and emit events for this
  - one trick we might use is to also give workers a fail event, to encourage them to indicate completion even in failure cases
- maybe we want to encode a limit on how the AP can be used, like a task cap beyond which the pool becomes hard to manage; we want to encode and emit this kind of boundary
Each of these should be transparent to a program calling this worker pool, and it should be easy to retry based on that information. For example, if an agent fails, either by calling fail with a reason or by timing out, we want to expose the failure along with the original spec and the machine state at the time of failure, so we can easily respawn workers on the task.
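The respawn logic above can be sketched in plain Python. `WorkerFailure` and `run_with_respawn` are hypothetical names, not ramure's actual error type or API; the point is that the spec and failure details survive across retries:

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class WorkerFailure(Exception):
    """A failure record the worker pool could surface to callers.
    The fields are illustrative, not ramure's actual error type."""
    task_id: str
    spec: str
    reason: str  # e.g. "gave_up: ..." or "timeout"
    machine_state: dict = field(default_factory=dict)


async def run_with_respawn(run_worker, task_id, spec, timeout=60.0, retries=2):
    """Respawn a worker on explicit failure or timeout, reusing its spec."""
    last = None
    for _ in range(retries + 1):
        try:
            return await asyncio.wait_for(run_worker(task_id, spec), timeout)
        except asyncio.TimeoutError:
            last = WorkerFailure(task_id, spec, "timeout")
        except WorkerFailure as failure:
            last = failure  # the reason and spec survive for the respawn
    raise last
```

Because failures carry structure instead of vanishing, the caller can distinguish a worker that gave up from one that stalled, and retry accordingly.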
Thinking carefully about these error interfaces allows us to trust that each component will accomplish its purpose. That reliability is essential when designing systems with many components that could each fail independently.
ramure - the open agent runtime
ramure is our new agent runtime that directly applies these ideas. It is a library that makes it easy to define and deploy agent flows in arbitrary software environments, with primitives built for reliable agent software.
The central object of ramure is the agent_process (AP), defined by decorating a function with the @agent_process decorator. Inside the function, you can define agents and machines, as well as how they should communicate.
When a decorated AP gets called, the background runtime is initialized, which is then responsible for the lifecycle of the agents and machines. To control the lifecycle of an AP, you can define events via @agent.on that agents can call in your code.
For example, here is a single-worker program:
```python
import asyncio

from ramure import agent, agent_process, done, fail, wait


@agent_process
async def start_worker(task_id: str, spec: str) -> str:
    # initialize an agent (locally or on a remote sandbox)
    worker = await agent(f"worker-{task_id}")

    # register tools the agent can call in-harness
    @worker.on("finish")
    async def on_finish(summary: str) -> str:
        """Call with your result when the task is done."""
        done(summary)
        return "Recorded."

    @worker.on("give_up")
    async def on_give_up(reason: str) -> str:
        """Call if you cannot complete the task."""
        fail(f"gave_up: {reason}")
        return "Recorded."

    await worker.send(
        f"Task {task_id}:\n\n{spec}\n\n"
        "When done, call finish(summary). "
        "If impossible, call give_up(reason)."
    )

    # wait for the done/fail lifecycle from agent events
    return await wait()


if __name__ == "__main__":
    spec = "Check for bugs introduced in PR #17."
    print(asyncio.run(start_worker("t0", spec)))
```
Here, our AP is a simple task-worker agent with a finish event and a way to give up. Structuring how information moves through your program makes it easier to use agent labor reliably, especially in more complex cases. You also configure which image the agent runs from: your local machine, a Docker image, or a remote sandbox image. The agent runs via the pi harness and can be viewed in tmux.
APs compose. An AP can call another AP the way you'd call any async function to get its final output, or you can get a handle to its async event stream. This handle is returned by spawn(). It holds all the events of the child AP and can react to its behavior as it runs: retrying if it fails, or communicating with it. This lets us build functional components that we can put together into an agent execution.
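The handle pattern itself is small. Here is a stripped-down sketch in plain asyncio of what a spawn()-style handle could look like; `Handle`, `child`, and `main` are hypothetical names, not ramure's actual types:

```python
import asyncio


class Handle:
    """Sketch of a spawn()-style handle: a running child task plus a
    queue of its events. Illustrative only; not ramure's handle type."""

    def __init__(self, child):
        self.events: asyncio.Queue = asyncio.Queue()
        self._task = asyncio.ensure_future(child(self.events))

    async def result(self):
        """Await the child's final output."""
        return await self._task


async def child(events):
    # the child reports progress through its event stream, then returns
    await events.put({"event": "progress", "step": 1})
    return "final report"


async def main():
    handle = Handle(child)
    first = await handle.events.get()  # react to the child while it runs
    return first, await handle.result()


print(asyncio.run(main()))
```

The parent can consume events as they arrive or simply await the final result, which is the same dual interface the text describes.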
APs can also encode specific ways they are interacted with, by exposing an API that can be called from code or by another agent. To do this we use the expose decorator, for example here to build a worker pool that you can add tasks to:
```python
from ramure import LocalImage, agent_process, bubble, emit, expose, spawn, wait


@agent_process(image=LocalImage())  # run on local
async def worker_pool() -> None:
    specs: dict[str, str] = {}

    # @expose lets a parent that spawns the worker pool
    # call this function later
    @expose
    async def add_task(spec: str) -> str:
        tid = f"t{len(specs):04d}"
        specs[tid] = spec
        emit("task_added", {"task_id": tid, "spec": spec})
        # bubble forwards the child's events to the parent,
        # tagged with `source` so callers can demux them
        bubble(spawn(start_worker, tid, spec), source=tid)
        return tid

    @expose
    async def tasks() -> dict[str, str]:
        return dict(specs)

    await wait()
```
You can then consume the exposed worker pool in various ways:
```python
pool = spawn(worker_pool)

# call directly
await pool.call(
    "add_task",
    spec="Check for bugs introduced in PR #17.",
)

# or attach an agent — exposed functions become its tools
monitor = await agent(
    "monitor",
    system_prompt="You run a pool of workers.",
)

# the agent can now call the afforded interface
await pool.attach(monitor, prefix="pool_")
```
ramure provides simple primitives to express, deploy, and observe instances of reliable agent software. It's small and simple: less than 3k lines of code. Try it out with uv tool install ramure or pip install ramure.