Contents
- Agent Communication Is OS IPC
- The Five-Layer Decision Model
- L1 — Transport: How to send?
- L2 — Topology: Who talks to whom?
- L3 — Protocol: What format?
- L4 — Content Contract: What to send?
- Clearing Up a Common Confusion: MCP vs A2A
- Three-System Comparison: Claude Teams vs OpenClaw vs A2A
- Looking Back with the Model
In one line: Agent communication isn’t a new problem. OS IPC solved it decades ago, and the mapping is almost direct.
Heads up: this article leans heavily on OS design concepts. If you’d rather skip the technical depth, the next article’s hands-on walkthrough may be a more intuitive place to start.
Last week, a former colleague posted an article in our company Slack analyzing Agent communication architectures. He broke down three systems — Claude Code Agent Teams, OpenClaw, and A2A Protocol — and then did something that immediately caught my eye: he built the entire analysis framework on top of OS IPC (Inter-Process Communication).
I stopped halfway through reading it, because I suddenly remembered that back in November I had looked into OS frameworks myself. At the time, while developing the Agent memory and communication framework, I had that feeling of “this seems like something I’ve read before, let me check” — but I looked it up, never dug in, and forgot about it. Seeing my former colleague’s article reconnected that thread. He has a CS degree, so his grasp of the concepts was a lot cleaner than mine. So I asked Em to ingest the article into the wiki and see if we could extract a usable analysis framework from it.
After Em ingested it, the wiki gained a new concept page: agent-communication-layers.md, which laid out a five-layer model. That model can be used to examine my own Agent Team’s communication design — which is what comes next (the following article). But before we get there, let me explain the model itself.
Agent Communication Is OS IPC
This isn’t a metaphor. It’s a direct mapping.
How do processes pass data to each other in an OS? Pipes, shared memory, message queues, file locks, signals — all things you touched when studying operating systems (even if you never imagined back then that you’d one day use them to design AI Agent communication architectures). In Agent systems, these concepts replay almost unchanged:
| OS Concept | Agent Equivalent | Example |
|---|---|---|
| Process | Agent | Each Claude Code teammate |
| OS kernel | Harness / Runtime | Claude Code, OpenClaw |
| fork() + exec() | Spawn sub-agent | Agent tool, TeamCreate |
| pipe (stdin/stdout) | Parent-child delegation | Subagent pattern |
| file + flock() | Filesystem mailbox | Claude Code Agent Teams |
| TCP socket | WebSocket | OpenClaw Gateway |
| HTTP/RPC | JSON-RPC | A2A Protocol |
| shared memory | Shared message pool | MetaGPT semantic layer |
| signal | Hook events | TeammateIdle, Stop hook |
| process state | Task lifecycle | pending → working → completed |
| Permission inheritance | Permission mode inheritance | Teammate inherits lead’s permission mode |
| context switch | Compaction + session summary | Agent rebuilds context from mailbox on wake |
Even the classic OS problems show up identically. Scheduling, memory isolation, permission inheritance — each has its counterpart challenge in Agent systems. But the one that hit me hardest was the context switch: when an OS process is scheduled back in, the kernel restores its registers, but its caches and TLB are cold; when an Agent wakes up, it has to read messages from its mailbox to rebuild context. In those first few seconds of a session, the Agent is essentially running on a cold TLB — it has to reload all the state that was scattered around before it can continue working. And the quality of that reconstruction directly affects every judgment it makes afterward.
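To make the "rebuild context from the mailbox" step concrete, here is a minimal sketch. It assumes a hypothetical layout where each pending message is one JSON file in an inbox directory with a `timestamp` field — none of this is from any real harness, just an illustration of the wake-up pattern:

```python
import json
from pathlib import Path

def rebuild_context(inbox_dir: str) -> list[dict]:
    """On wake, read every pending message from the mailbox and
    return them oldest-first, so the agent can replay them in order."""
    messages = []
    for path in sorted(Path(inbox_dir).glob("*.json")):
        with path.open() as f:
            messages.append(json.load(f))
    # Oldest-first ordering matters: later messages may supersede earlier ones.
    return sorted(messages, key=lambda m: m.get("timestamp", ""))
```

The quality of this step is the whole game: a message that is missing, malformed, or read out of order degrades every judgment the agent makes afterward.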
So if you’re designing Agent-to-Agent communication, don’t start from scratch. OS IPC has been solving this for decades. You can stand directly on those shoulders.
The Five-Layer Decision Model
Since Agent communication is IPC, how do we decompose IPC design? My colleague’s article distills a five-layer model — read bottom-up, where each layer’s choices constrain the options available to the layer above it:
L4 Content Contract — what to send (memory selection + compression)
L3 Protocol — what format (message envelope, interaction pattern, task lifecycle)
L2 Topology — who talks to whom (hierarchy / star / peer / pub-sub)
L1 Transport — how to send (pipe / file / WebSocket / HTTP)
L0 Environment — where the Agents live (same process / machine / network / internet)
The bottom layer, L0 (Environment), usually isn’t something you choose — it’s determined by the harness. Agents running in the same process use function calls (Smolagents, LangGraph); same machine means file or pipe (Claude Code); across the network means WebSocket or HTTP (OpenClaw, A2A). My Agent Team all runs on a single Mac, so L0 = same machine, which directly narrows the options at every layer above it.
L1 — Transport: How to send?
| Transport | Latency | Coupling | Persistence | Debug friendliness |
|---|---|---|---|---|
| pipe (stdin/stdout) | Lowest | Highest (parent-child) | None | Low |
| file + flock() | Low | Low (any topology) | Yes | Highest (cat inbox/) |
| WebSocket | Low | Medium (needs server) | During connection | Medium |
| HTTP | Medium | Lowest (stateless) | None | Medium |
Lower latency means higher coupling; more independence means more overhead. This is exactly the same trade-off curve you face when choosing IPC mechanisms in an OS.
Claude Code Agent Teams chose file because they accurately assessed their environment: all Agents are guaranteed to be on the same machine, file I/O is easier to debug than any message queue, and you can just cat the inbox JSON to see what’s happening. That debug advantage is worth a lot in practice (I experienced this firsthand in the fifth article — when a silent failure hit, being able to read the inbox directly saved a lot of guesswork).
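The file + flock() transport is simple enough to sketch in a few lines. This is a minimal illustration of the pattern (Unix-only, since it uses `fcntl`), not any system's actual implementation — the inbox layout and lock-file name are assumptions:

```python
import fcntl
import json
import os
import time
import uuid

def send_message(inbox_dir: str, payload: dict) -> str:
    """Drop a message into another agent's inbox directory, holding an
    exclusive file lock while writing — the OS file + flock() pattern."""
    os.makedirs(inbox_dir, exist_ok=True)
    # Millisecond timestamp prefix keeps directory listings in send order.
    name = f"{int(time.time() * 1000)}-{uuid.uuid4().hex}.json"
    path = os.path.join(inbox_dir, name)
    lock_path = os.path.join(inbox_dir, ".lock")
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # block until we own the inbox
        try:
            with open(path, "w") as f:
                json.dump(payload, f)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return path
```

The debug advantage falls out for free: every message is a plain JSON file, so `cat inbox/*.json` shows you exactly what is in flight, with no broker or queue to inspect.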
L2 — Topology: Who talks to whom?
| Topology | Control style | Bottleneck | Example |
|---|---|---|---|
| Hierarchy | Parent has full control | Parent | Subagent pattern |
| Star | Central coordinator | Coordinator | OpenClaw Gateway, my GM |
| Peer | No central control | None | A2A Protocol |
| Pub/Sub | Event-driven | None | MetaGPT message pool |
One important point: transport and topology are decoupled. WebSocket doesn’t imply star topology; HTTP doesn’t imply P2P. Choosing star is often about operational simplicity (not requiring each Agent to open its own port), not a technical constraint.
My Agent Team uses star (GM as central coordinator) + limited peer (Agents can communicate directly through the filesystem for core work), a hybrid topology that grew out of actual requirements: GM handles coordination and judgment, but Em and C7 doing core work don’t need to route through GM just to read each other’s files.
L3 — Protocol: What format?
Protocol defines three things: message envelope (sender, recipient, timestamp, message ID), interaction pattern (fire-and-forget / request-response / streaming / multi-turn), and task lifecycle (state machine: submitted → working → input-required → completed / failed).
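The three pieces above can be sketched as code. The envelope fields come straight from the article's list; the transition table is a hypothetical encoding of the lifecycle state machine, not any protocol's official definition:

```python
import uuid
from datetime import datetime, timezone

# Task lifecycle as a state machine: each state maps to its legal successors.
TRANSITIONS = {
    "submitted": {"working"},
    "working": {"input-required", "completed", "failed"},
    "input-required": {"working"},
    "completed": set(),   # terminal
    "failed": set(),      # terminal
}

def make_envelope(sender: str, recipient: str, body: dict) -> dict:
    """Wrap a payload in the minimal L3 envelope: sender, recipient,
    timestamp, message ID."""
    return {
        "message_id": uuid.uuid4().hex,
        "sender": sender,
        "recipient": recipient,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "body": body,
    }

def advance(state: str, new_state: str) -> str:
    """Enforce the lifecycle: illegal transitions raise instead of
    silently corrupting task state."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    return new_state
```

A thin protocol might stop at the envelope; a thick one like A2A adds authentication, streaming, and a fully specified state machine on top of the same skeleton.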
Protocol complexity scales with trust boundary distance:
| System | Protocol thickness | Reason |
|---|---|---|
| Claude Teams | Thin (JSON-in-JSON) | Internal, same machine, trusted |
| OpenClaw | Medium (tool params + callback) | Same server, semi-trusted |
| A2A | Thick (JSON-RPC 2.0 + OAuth 2.0) | Open internet, untrusted |
Thin protocol internally, thick protocol externally — same system, different thickness for different scenarios. Consistent with the OS approach.
L4 — Content Contract: What to send?
The first four layers just get bytes from A to B. L4 determines whether the other side can actually use them.
What Agents send each other isn’t arbitrary text — it’s structured content that’s been through memory selection and compression. You can have a perfect transport layer, but if Agent A sends a massive raw context dump and Agent B’s context window overflows, reasoning quality collapses and communication fails.
L4 is the real deciding factor in whether Agent communication succeeds. In my former colleague’s words, it’s the layer every system currently handles most crudely. The JIT loading and token budget principles I discussed in the Context Engineering article should actually extend to inter-Agent communication as well: you wouldn’t stuff everything into a single Agent’s Context, and you shouldn’t stuff everything into a single inter-Agent message.
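One way to make the token-budget idea operational: rank the candidate content by importance and keep only what fits. This is a toy sketch under loose assumptions (word count standing in for real token counting, a hand-assigned priority per section) — real systems would use a proper tokenizer and a learned or heuristic relevance score:

```python
def fit_to_budget(sections: list[tuple[int, str]], budget: int) -> str:
    """Greedily select the most important sections that fit a token budget.

    `sections` is a list of (priority, text) pairs, lower number = more
    important. Tokens are approximated as whitespace-separated words.
    """
    chosen = []
    used = 0
    for _, text in sorted(sections, key=lambda s: s[0]):
        cost = len(text.split())
        if used + cost <= budget:
            chosen.append(text)
            used += cost
        # Sections that don't fit are dropped rather than truncated,
        # so the recipient never sees a half-sentence.
    return "\n\n".join(chosen)
```

The point is the asymmetry: the sender spends effort compressing so the receiver's context window — and therefore its reasoning quality — survives intact.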
Clearing Up a Common Confusion: MCP vs A2A
These two are often compared together, but they solve fundamentally different problems:
| Dimension | MCP | A2A |
|---|---|---|
| Direction | Vertical (agent → tool / data) | Horizontal (agent → agent) |
| OS analogy | System call | Cross-process IPC |
| Purpose | Let Agents call external databases, APIs | Let Agents negotiate and delegate with other Agents |
The two are complementary, not competing. The official A2A spec explicitly states these are complementary protocols. An Agent can use MCP to call tools while simultaneously using A2A to communicate with another Agent — within the same workflow.
Three-System Comparison: Claude Teams vs OpenClaw vs A2A
Laying the three systems out against the five-layer model, the design choices at each layer become clear:
| Dimension | Claude Teams | OpenClaw | A2A |
|---|---|---|---|
| L0 Environment | Same machine | Same server | Open internet |
| L1 Transport | File I/O | WebSocket | HTTP |
| L2 Topology | Star + peer | Star (Gateway as hub) | Peer |
| L3 Protocol | JSON-in-JSON (thin) | Tool params (medium) | JSON-RPC 2.0 (thick) |
| Discovery | config.json | Session key | Agent Card (well-known URL) |
| Task state | 3 states | async + callback | Full state machine |
| Streaming | None | reply-back loop | SSE + webhook |
| Access control | Inherited from lead | role + scope + cap | OAuth 2.0 + Bearer |
The three systems aren’t in a “which is better” relationship — they’re each making reasonable design choices for their L0 environment. Claude Teams’ file I/O is the best choice in a same-machine scenario (easy to debug, zero infrastructure, free persistence), but you wouldn’t use file I/O for cross-network Agent communication — that’s where A2A’s HTTP + JSON-RPC makes sense.
Looking Back with the Model
Once I had the five-layer model, the first thing I did was use it to examine my own Agent Team design.
For L0 through L2, I realized I had already made the right choices without consciously knowing it: same machine, file I/O, star + limited peer — all consistent with OS best practices. Revisiting these decisions as a post-hoc validation confirmed the direction was right.
But at L3 and L4, things got interesting.
My Level 1 Completion Report was well-designed — structured, with defined fields and a Flag mechanism — but it had never actually been used. My dispatch mechanism had a send path and a return path, but the return path only made it halfway. The five-layer model helped me see these gaps. It also made me realize that the OS framework maps cleanly at L0-L2, but at L3-L4, my Agents have a fundamental difference from OS processes — a difference that makes the OS solution impossible to apply directly.
What is that difference? What did I do about it? And how should you think through it in an Agent Team context? That’s where this side story ends. The next article brings us back to the Agent Team hands-on series.
References
[1] arXiv — Solving Context Window Overflow in AI Agents (on how context window size affects reasoning quality) https://arxiv.org/html/2511.22729v1
[2] AI Pace — Context Engineering: Mitigating Context Rot in AI Systems (“the larger the context, the lower the model’s reliability”) https://medium.com/ai-pace/context-engineering-mitigating-context-rot-in-ai-systems-21eb2c43dd18
[3] Anthropic — Effective Context Engineering for AI Agents https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
[4] FINOS — Multi-Agent Trust Boundary Violations (cascading effects of scope violations) https://air-governance-framework.finos.org/risks/ri-28_multi-agent-trust-boundary-violations.html
[5] Google — A2A Protocol Specification (complementary to MCP) https://google.github.io/A2A/
Support This Series
If these articles have been helpful, consider buying me a coffee