Contents
- Agent Communication Is OS IPC
- The Five-Layer Decision Model
- L1 — Transport: How to send?
- L2 — Topology: Who talks to whom?
- L3 — Protocol: What format?
- L4 — Content Contract: What to send?
- Clearing Up a Common Confusion: MCP vs A2A
- Three-System Comparison: Claude Teams vs OpenClaw vs A2A
- Looking Back with the Model
In one line: Agent communication isn’t a new problem. OS IPC solved it decades ago, and the mapping is almost direct.
Heads up: this article leans heavily on OS design concepts. If you’d rather skip the technical depth, the next article’s hands-on walkthrough may be a more intuitive place to start.
Last week, a former colleague posted an article in our company Slack analyzing Agent communication architectures. He broke down three systems — Claude Code Agent Teams, OpenClaw, and A2A Protocol — and then did something that immediately caught my eye: he built the entire analysis framework on top of OS IPC (Inter-Process Communication).
I stopped halfway through reading it, because I suddenly remembered that back in November I had looked into OS frameworks myself. At the time, while developing the Agent memory and communication framework, I had that feeling of “this seems like something I’ve read before, let me check” — but I looked it up, never dug in, and forgot about it. Seeing my former colleague’s article reconnected that thread. He has a CS degree, so his grasp of the concepts was a lot cleaner than mine. So I asked Em to ingest the article into the wiki and see if we could extract a usable analysis framework from it.
After Em ingested it, the wiki gained a new concept page: agent-communication-layers.md, which laid out a five-layer model. That model can be used to examine my own Agent Team’s communication design — which is what comes next (the following article). But before we get there, let me explain the model itself.
Agent Communication Is OS IPC
This isn’t a metaphor. It’s a direct mapping.
How do processes pass data to each other in an OS? Pipes, shared memory, message queues, file locks, signals — all things you touched when studying operating systems (even if you never imagined back then that you’d one day use them to design AI Agent communication architectures). In Agent systems, these concepts replay almost unchanged:
| OS Concept | Agent Equivalent | Example |
|---|---|---|
| Process | Agent | Each Claude Code teammate |
| OS kernel | Harness / Runtime | Claude Code, OpenClaw |
| fork() + exec() | Spawn sub-agent | Agent tool, TeamCreate |
| pipe (stdin/stdout) | Parent-child delegation | Subagent pattern |
| file + flock() | Filesystem mailbox | Claude Code Agent Teams |
| TCP socket | WebSocket | OpenClaw Gateway |
| HTTP/RPC | JSON-RPC | A2A Protocol |
| shared memory | Shared message pool | MetaGPT semantic layer |
| signal | Hook events | TeammateIdle, Stop hook |
| process state | Task lifecycle | pending → working → completed |
| Permission inheritance | Permission mode inheritance | Teammate inherits lead’s permission mode |
| context switch | Compaction + session summary | Agent rebuilds context from mailbox on wake |
Even the classic OS problems show up identically. Scheduling, memory isolation, permission inheritance — each has its counterpart challenge in Agent systems. But the one that hit me hardest was the context switch: when an OS process is scheduled back in, the kernel restores its registers, but its caches and TLB are cold; when an Agent wakes up, it has to read messages from its mailbox to rebuild context. In those first few seconds of a session, the Agent is essentially running on a cold TLB — it has to reload all the state that was scattered around before it can continue working. And the quality of that reconstruction directly affects every judgment it makes afterward.
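To make the "rebuild context from the mailbox" step concrete, here is a minimal sketch. It assumes a hypothetical layout where each pending message is one JSON file in an inbox directory with a `timestamp` field — none of this is from any real harness, just an illustration of the wake-up pattern:

```python
import json
from pathlib import Path

def rebuild_context(inbox_dir: str) -> list[dict]:
    """On wake, read every pending message from the mailbox and
    return them oldest-first, so the agent can replay them in order."""
    messages = []
    for path in sorted(Path(inbox_dir).glob("*.json")):
        with path.open() as f:
            messages.append(json.load(f))
    # Oldest-first ordering matters: later messages may supersede earlier ones.
    return sorted(messages, key=lambda m: m.get("timestamp", ""))
```

The quality of this step is the whole game: a message that is missing, malformed, or read out of order degrades every judgment the agent makes afterward.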
So if you’re designing Agent-to-Agent communication, don’t start from scratch. OS IPC has been solving this for decades. You can stand directly on those shoulders.
The Five-Layer Decision Model
Since Agent communication is IPC, how do we decompose IPC design? My colleague’s article distills a five-layer model — read bottom-up, where each layer’s choices constrain the options available to the layer above it:
L4 Content Contract — what to send (memory selection + compression)
L3 Protocol — what format (message envelope, interaction pattern, task lifecycle)
L2 Topology — who talks to whom (hierarchy / star / peer / pub-sub)
L1 Transport — how to send (pipe / file / WebSocket / HTTP)
L0 Environment — where the Agents live (same process / machine / network / internet)
The bottom layer, L0 (Environment), usually isn’t something you choose — it’s determined by the harness. Agents running in the same process use function calls (Smolagents, LangGraph); same machine means file or pipe (Claude Code); across the network means WebSocket or HTTP (OpenClaw, A2A). My Agent Team all runs on a single Mac, so L0 = same machine, which directly narrows the options at every layer above it.
L1 — Transport: How to send?
| Transport | Latency | Coupling | Persistence | Debug friendliness |
|---|---|---|---|---|
| pipe (stdin/stdout) | Lowest | Highest (parent-child) | None | Low |
| file + flock() | Low | Low (any topology) | Yes | Highest (cat inbox/) |
| WebSocket | Low | Medium (needs server) | During connection | Medium |
| HTTP | Medium | Lowest (stateless) | None | Medium |
Lower latency means higher coupling; more independence means more overhead. This is exactly the same trade-off curve you face when choosing IPC mechanisms in an OS.
Claude Code Agent Teams chose file because they accurately assessed their environment: all Agents are guaranteed to be on the same machine, file I/O is easier to debug than any message queue, and you can just cat the inbox JSON to see what’s happening. That debug advantage is worth a lot in practice (I experienced this firsthand in the fifth article — when a silent failure hit, being able to read the inbox directly saved a lot of guesswork).
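The file + flock() transport is simple enough to sketch in a few lines. This is a minimal illustration of the pattern (Unix-only, since it uses `fcntl`), not any system's actual implementation — the inbox layout and lock-file name are assumptions:

```python
import fcntl
import json
import os
import time
import uuid

def send_message(inbox_dir: str, payload: dict) -> str:
    """Drop a message into another agent's inbox directory, holding an
    exclusive file lock while writing — the OS file + flock() pattern."""
    os.makedirs(inbox_dir, exist_ok=True)
    # Millisecond timestamp prefix keeps directory listings in send order.
    name = f"{int(time.time() * 1000)}-{uuid.uuid4().hex}.json"
    path = os.path.join(inbox_dir, name)
    lock_path = os.path.join(inbox_dir, ".lock")
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # block until we own the inbox
        try:
            with open(path, "w") as f:
                json.dump(payload, f)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return path
```

The debug advantage falls out for free: every message is a plain JSON file, so `cat inbox/*.json` shows you exactly what is in flight, with no broker or queue to inspect.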
L2 — Topology: Who talks to whom?
| Topology | Control style | Bottleneck | Example |
|---|---|---|---|
| Hierarchy | Parent has full control | Parent | Subagent pattern |
| Star | Central coordinator | Coordinator | OpenClaw Gateway, my GM |
| Peer | No central control | None | A2A Protocol |
| Pub/Sub | Event-driven | None | MetaGPT message pool |
One important point: transport and topology are decoupled. WebSocket doesn’t imply star topology; HTTP doesn’t imply P2P. Choosing star is often about operational simplicity (not requiring each Agent to open its own port), not a technical constraint.
My Agent Team uses star (GM as central coordinator) + limited peer (Agents can communicate directly through the filesystem for core work), a hybrid topology that grew out of actual requirements: GM handles coordination and judgment, but Em and C7 doing core work don’t need to route through GM just to read each other’s files.
L3 — Protocol: What format?
Protocol defines three things: message envelope (sender, recipient, timestamp, message ID), interaction pattern (fire-and-forget / request-response / streaming / multi-turn), and task lifecycle (state machine: submitted → working → input-required → completed / failed).
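The three pieces above can be sketched as code. The envelope fields come straight from the article's list; the transition table is a hypothetical encoding of the lifecycle state machine, not any protocol's official definition:

```python
import uuid
from datetime import datetime, timezone

# Task lifecycle as a state machine: each state maps to its legal successors.
TRANSITIONS = {
    "submitted": {"working"},
    "working": {"input-required", "completed", "failed"},
    "input-required": {"working"},
    "completed": set(),   # terminal
    "failed": set(),      # terminal
}

def make_envelope(sender: str, recipient: str, body: dict) -> dict:
    """Wrap a payload in the minimal L3 envelope: sender, recipient,
    timestamp, message ID."""
    return {
        "message_id": uuid.uuid4().hex,
        "sender": sender,
        "recipient": recipient,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "body": body,
    }

def advance(state: str, new_state: str) -> str:
    """Enforce the lifecycle: illegal transitions raise instead of
    silently corrupting task state."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    return new_state
```

A thin protocol might stop at the envelope; a thick one like A2A adds authentication, streaming, and a fully specified state machine on top of the same skeleton.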
Protocol complexity scales with trust boundary distance:
| System | Protocol thickness | Reason |
|---|---|---|
| Claude Teams | Thin (JSON-in-JSON) | Internal, same machine, trusted |
| OpenClaw | Medium (tool params + callback) | Same server, semi-trusted |
| A2A | Thick (JSON-RPC 2.0 + OAuth 2.0) | Open internet, untrusted |
Thin protocol internally, thick protocol externally — same system, different thickness for different scenarios. Consistent with the OS approach.
L4 — Content Contract: What to send?
The first four layers just get bytes from A to B. L4 determines whether the other side can actually use them.
What Agents send each other isn’t arbitrary text — it’s structured content that’s been through memory selection and compression. You can have a perfect transport layer, but if Agent A sends a massive raw context dump and Agent B’s context window overflows, reasoning quality collapses and communication fails.
L4 is the real deciding factor in whether Agent communication succeeds. In my former colleague’s words, it’s the layer every system currently handles most crudely. The JIT loading and token budget principles I discussed in the Context Engineering article should actually extend to inter-Agent communication as well: you wouldn’t stuff everything into a single Agent’s Context, and you shouldn’t stuff everything into a single inter-Agent message.
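One way to make the token-budget idea operational: rank the candidate content by importance and keep only what fits. This is a toy sketch under loose assumptions (word count standing in for real token counting, a hand-assigned priority per section) — real systems would use a proper tokenizer and a learned or heuristic relevance score:

```python
def fit_to_budget(sections: list[tuple[int, str]], budget: int) -> str:
    """Greedily select the most important sections that fit a token budget.

    `sections` is a list of (priority, text) pairs, lower number = more
    important. Tokens are approximated as whitespace-separated words.
    """
    chosen = []
    used = 0
    for _, text in sorted(sections, key=lambda s: s[0]):
        cost = len(text.split())
        if used + cost <= budget:
            chosen.append(text)
            used += cost
        # Sections that don't fit are dropped rather than truncated,
        # so the recipient never sees a half-sentence.
    return "\n\n".join(chosen)
```

The point is the asymmetry: the sender spends effort compressing so the receiver's context window — and therefore its reasoning quality — survives intact.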
Clearing Up a Common Confusion: MCP vs A2A
These two are often compared together, but they solve fundamentally different problems:
| Dimension | MCP | A2A |
|---|---|---|
| Direction | Vertical (agent → tool / data) | Horizontal (agent → agent) |
| OS analogy | System call | Cross-process IPC |
| Purpose | Let Agents call external databases, APIs | Let Agents negotiate and delegate with other Agents |
The two are complementary, not competing. The official A2A spec explicitly states these are complementary protocols. An Agent can use MCP to call tools while simultaneously using A2A to communicate with another Agent — within the same workflow.
Three-System Comparison: Claude Teams vs OpenClaw vs A2A
Laying the three systems out against the five-layer model, the design choices at each layer become clear:
| Dimension | Claude Teams | OpenClaw | A2A |
|---|---|---|---|
| L0 Environment | Same machine | Same server | Open internet |
| L1 Transport | File I/O | WebSocket | HTTP |
| L2 Topology | Star + peer | Star (Gateway as hub) | Peer |
| L3 Protocol | JSON-in-JSON (thin) | Tool params (medium) | JSON-RPC 2.0 (thick) |
| Discovery | config.json | Session key | Agent Card (well-known URL) |
| Task state | 3 states | async + callback | Full state machine |
| Streaming | None | reply-back loop | SSE + webhook |
| Access control | Inherited from lead | role + scope + cap | OAuth 2.0 + Bearer |
The three systems aren’t in a “which is better” relationship — they’re each making reasonable design choices for their L0 environment. Claude Teams’ file I/O is the best choice in a same-machine scenario (easy to debug, zero infrastructure, free persistence), but you wouldn’t use file I/O for cross-network Agent communication — that’s where A2A’s HTTP + JSON-RPC makes sense.
Looking Back with the Model
Once I had the five-layer model, the first thing I did was use it to examine my own Agent Team design.
For L0 through L2, I realized I had already made the right choices without consciously knowing it: same machine, file I/O, star + limited peer — all consistent with OS best practices. Revisiting these decisions as a post-hoc validation confirmed the direction was right.
But at L3 and L4, things got interesting.
My Level 1 Completion Report was well-designed — structured, with defined fields and a Flag mechanism — but it had never actually been used. My dispatch mechanism had a send path and a return path, but the return path only made it halfway. The five-layer model helped me see these gaps. It also made me realize that the OS framework maps cleanly at L0-L2, but at L3-L4, my Agents have a fundamental difference from OS processes — a difference that makes the OS solution impossible to apply directly.
What is that difference? What did I do about it? And how should you think through it in an Agent Team context? That’s where this side story ends. The next article brings us back to the Agent Team hands-on series.
References
[1] arXiv — Solving Context Window Overflow in AI Agents (on how context window size affects reasoning quality) https://arxiv.org/html/2511.22729v1
[2] AI Pace — Context Engineering: Mitigating Context Rot in AI Systems (“the larger the context, the lower the model’s reliability”) https://medium.com/ai-pace/context-engineering-mitigating-context-rot-in-ai-systems-21eb2c43dd18
[3] Anthropic — Effective Context Engineering for AI Agents https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
[4] FINOS — Multi-Agent Trust Boundary Violations (cascading effects of scope violations) https://air-governance-framework.finos.org/risks/ri-28_multi-agent-trust-boundary-violations.html
[5] Google — A2A Protocol Specification (complementary to MCP) https://google.github.io/A2A/
Support This Series
If these articles have been helpful, consider buying me a coffee