26 May 2026

Building a multi-agent system: what I actually learned

A candid production account of building a multi-agent system: the orchestrator matters, contracts between AI agents matter even more, and state complexity arrives faster than most tutorials admit.

Demos sell agents. Production forces you to build a system.

Most tutorials about AI agents tell a very flattering story. You see a few roles, each with a tool, and the output appears to emerge naturally from a smart conversation. That is not what the work feels like when you try to ship a real multi-agent system. In production, you are not building a cast of clever assistants. You are building software: an AI architecture with explicit state, input-output contracts, guardrails, retries, failure modes, and clear responsibility for every step.

That is probably the main thing I learned the hard way. The difficulty does not rise only because there are multiple AI agents involved. It rises because you are now asking several non-deterministic components to cooperate inside a workflow that still needs to be understandable, debuggable, and defensible. The point of a multi-agent system is not to spray intelligence everywhere. The point is to split a problem well enough that each part has a bounded, observable, and testable role.

The orchestrator is not glue code. It is the core of the design.

Early on, I tended to think of the orchestrator as the layer that simply routes work between agents. That definition is far too weak. In a production multi-agent system, the orchestrator decides what state is shared, which steps are legal, when a tool should be called, when a retry is acceptable, when the system should stop, and when a human needs to step in. That is why LangGraph felt more honest to me than many higher-level abstractions. It forces you to model the real graph of your workflow instead of hiding the complexity behind nicely named roles.

On the Mozza x Seyna four-agent POC, for example, the interesting part was never the number four. The interesting part was the separation of responsibilities: gather context, structure the case, produce a useful output, then validate or arbitrate. Once that separation is blurry, the orchestrator becomes a fancy message relay and the whole setup loses value. Once it is clear, you can reason about handoffs, tool use, branching, and failure conditions in a way that survives contact with production.
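To make the orchestrator's role concrete, here is a minimal sketch in plain Python. It is not LangGraph's API and not the code from the POC. The step names, the transition table, `MAX_RETRIES`, `MAX_STEPS`, and the stop-reason strings are all assumptions made for illustration. The idea is simply that legal steps, the retry budget, and the reasons for stopping or escalating to a human live in one explicit place.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical step names, loosely mirroring the four responsibilities above.
LEGAL_TRANSITIONS: Dict[str, List[str]] = {
    "gather": ["structure"],
    "structure": ["produce"],
    "produce": ["validate"],
    "validate": ["produce", "done"],  # validation may send work back
}
MAX_RETRIES = 2   # assumed retry budget per step
MAX_STEPS = 20    # assumed hard cap so a validate/produce loop cannot run forever


@dataclass
class RunState:
    data: dict = field(default_factory=dict)
    retries: Dict[str, int] = field(default_factory=dict)
    stop_reason: Optional[str] = None


# An agent takes the shared data and returns (next step, updates), or raises.
Agent = Callable[[dict], Tuple[str, dict]]


def run(agents: Dict[str, Agent], state: RunState, start: str = "gather") -> RunState:
    step, executed = start, 0
    while step != "done":
        executed += 1
        if executed > MAX_STEPS:
            state.stop_reason = "step_budget_exhausted"
            return state
        try:
            next_step, updates = agents[step](state.data)
        except Exception as exc:
            # Count the failure; escalate to a human once the budget is spent.
            state.retries[step] = state.retries.get(step, 0) + 1
            if state.retries[step] > MAX_RETRIES:
                state.stop_reason = f"needs_human: {step} failed ({exc})"
                return state
            continue
        if next_step not in LEGAL_TRANSITIONS[step]:
            # An agent asked for a jump the graph does not allow: stop loudly.
            state.stop_reason = f"illegal_transition: {step} -> {next_step}"
            return state
        state.data.update(updates)
        step = next_step
    state.stop_reason = "completed"
    return state
```

Every run ends with an explicit `stop_reason`, so "why did it stop here?" has an answer that does not require rereading prompts.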

What tutorials underplay: shared state and communication contracts are the real work

The most underestimated problem, by far, is state management. In a notebook demo, one agent gets a prompt, calls a tool, and hands a neat message to the next one. In production, shared state quickly becomes the center of the system: which fields are confirmed, which data is provisional, which decision has already been taken, which attempt already failed, what confidence level is acceptable, and what trace you need for replay. Without discipline around state, a multi-agent system becomes almost impossible to debug.
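One way to picture that discipline is a shared state object that distinguishes provisional from confirmed fields and records every write. The sketch below is an illustration under assumptions: the class names, the `0.8` confidence threshold, and the trace format are hypothetical, not taken from a real project.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FieldValue:
    value: Any
    confidence: float
    confirmed: bool = False  # provisional until explicitly confirmed


@dataclass
class SharedState:
    fields: Dict[str, FieldValue] = field(default_factory=dict)
    failed_attempts: List[str] = field(default_factory=list)
    trace: List[dict] = field(default_factory=list)  # append-only, kept for replay

    def propose(self, agent: str, name: str, value: Any, confidence: float) -> bool:
        """Write a provisional value; confirmed fields cannot be overwritten."""
        current = self.fields.get(name)
        if current is not None and current.confirmed:
            self.trace.append({"agent": agent, "field": name, "action": "rejected_overwrite"})
            return False
        self.fields[name] = FieldValue(value, confidence)
        self.trace.append({"agent": agent, "field": name, "action": "proposed"})
        return True

    def confirm(self, agent: str, name: str, min_confidence: float = 0.8) -> bool:
        """Promote a provisional field, but only above the assumed threshold."""
        candidate = self.fields[name]
        if candidate.confidence < min_confidence:
            self.trace.append({"agent": agent, "field": name, "action": "confirm_refused"})
            return False
        candidate.confirmed = True
        self.trace.append({"agent": agent, "field": name, "action": "confirmed"})
        return True
```

The useful property is that a later agent cannot silently overwrite something an earlier step already confirmed, and the trace shows who tried.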

That is why I increasingly believe in explicit communication contracts between AI agents. Every handoff should look more like an interface than a conversation. What goes in? What must come out? Which fields are required? Which errors should be surfaced instead of politely papered over? If you skip that work, each agent starts compensating vaguely for the weakness of the previous one, and you end up with an AI architecture whose real invariants nobody can clearly explain.
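As a rough illustration of a handoff treated as an interface, the sketch below validates the payload one agent passes to the next and raises instead of guessing. The `StructuredCase` shape, its field names, and the `ContractError` type are hypothetical. A schema library would do the same job, and this version uses only the standard library.

```python
from dataclasses import dataclass
from typing import List, Optional


class ContractError(ValueError):
    """Raised when a handoff payload breaks the agreed contract."""


@dataclass(frozen=True)
class StructuredCase:
    case_id: str
    summary: str
    missing_fields: List[str]
    error: Optional[str] = None


REQUIRED = ("case_id", "summary", "missing_fields")


def parse_handoff(payload: dict) -> StructuredCase:
    """Validate what the upstream agent produced before anyone consumes it."""
    absent = [key for key in REQUIRED if key not in payload]
    if absent:
        raise ContractError(f"missing required fields: {absent}")
    unknown = sorted(set(payload) - set(REQUIRED) - {"error"})
    if unknown:
        raise ContractError(f"unexpected fields: {unknown}")
    for key in ("case_id", "summary"):
        if not isinstance(payload[key], str) or not payload[key].strip():
            raise ContractError(f"{key} must be a non-empty string")
    if not isinstance(payload["missing_fields"], list):
        raise ContractError("missing_fields must be a list")
    if payload.get("error"):
        # Surface upstream failures instead of letting the next agent compensate.
        raise ContractError(f"upstream agent reported: {payload['error']}")
    return StructuredCase(
        case_id=payload["case_id"],
        summary=payload["summary"],
        missing_fields=list(payload["missing_fields"]),
    )
```

The downstream agent then only ever sees a `StructuredCase`, so a weak upstream output fails at the boundary rather than three steps later.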

Debugging non-deterministic chains takes more engineering, not more storytelling

Another lesson appears quickly: debugging changes shape. When a conventional system fails, you follow a stack trace, isolate a condition, and patch the bug. When a multi-agent system fails, you often have to understand an interaction between prompts, tools, routing, state, timeouts, and structured outputs. The bug is not always inside one agent. Sometimes it lives in the contract between two agents, in missing state, or in an orchestrator decision that happened too early. That means serious instrumentation: step-level logs, state snapshots, explicit stop reasons, and structured outputs whenever free text is not enough.

I also learned a simple rule I now trust a lot: if you cannot replay an execution path and explain why the orchestrator chose that branch, you are not ready for production. Many teams want to add more AI agents before they add observability. I think the order should usually be reversed. The ability to inspect and replay the system is worth more than an extra specialized agent that only makes the demo feel smarter.
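The replay rule can be sketched as follows, assuming the routing decision is a deterministic function of the state snapshot. The `route` function, its thresholds, and the `StepRecord` fields are hypothetical. The point is only that each decision is stored with the state it was made on and the reason given, so it can be re-run later.

```python
import copy
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepRecord:
    step: str
    state_before: dict  # snapshot taken before the decision
    decision: str
    reason: str         # explicit, human-readable reason for the branch


def route(state: dict) -> Tuple[str, str]:
    """Hypothetical deterministic router: returns (branch, reason)."""
    if state.get("attempts", 0) >= 3:
        return "stop", "retry budget exhausted"
    if state.get("confidence", 0.0) < 0.7:
        return "retry", "confidence below 0.7"
    return "validate", "confidence acceptable"


def record(log: List[StepRecord], step: str, state: dict) -> str:
    """Decide the branch and log a deep copy of the state it was based on."""
    decision, reason = route(state)
    log.append(StepRecord(step, copy.deepcopy(state), decision, reason))
    return decision


def replay(log: List[StepRecord],
           router: Callable[[dict], Tuple[str, str]] = route) -> List[str]:
    """Re-run a router on each snapshot and report decisions that differ."""
    mismatches = []
    for entry in log:
        decision, _ = router(entry.state_before)
        if decision != entry.decision:
            mismatches.append(
                f"{entry.step}: recorded {entry.decision}, replay gave {decision}"
            )
    return mismatches
```

Replaying with the current router checks that a recorded run is still explainable. Replaying with a modified router shows which past decisions a routing change would have altered, before that change ships.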

The right move is often not adding another agent

The classic trap that follows is agent inflation. As soon as a behavior feels a little distinct, it becomes tempting to create another specialist. Sometimes that is correct. Often it is just an elegant way to move complexity into a new box before the responsibility is actually clear. My test has become fairly strict: if a new agent does not create a sharper boundary of responsibility, a simpler contract, or better production control, it probably should not exist. One more node in LangGraph is not free. It means more state, more test cases, more failure paths, and more maintenance.

So my real takeaway is less romantic than a lot of the current talk around AI agents. Yes, a multi-agent system can be powerful. Yes, it can match a real workflow better than a single-agent loop. But it is only useful if the humans operating it can still understand it. The goal is not to stack agents. The goal is to build an orchestrator simple enough to survive production. If you cannot clearly justify why another agent exists, the better AI architecture is probably the one that does not add it.