
February 23, 2026

Every Robot Is an Agent Waiting to Happen — And We Have No Framework for It

10 min read · robotics

Here's something I can't stop thinking about. We've spent the last two years building increasingly powerful AI agents that can browse the web, write code, manage calendars, and coordinate complex workflows across cloud services. But the moment you ask one of these agents to move a wheel, read a camera feed, or pick up a cup — it falls apart. The gap between what AI agents can do in the cloud and what they can do in the physical world is enormous. And almost nobody is building the bridge.

I ran into this head-on recently. My team and I built a small swarm of robots running OpenClaw agents on Raspberry Pi 5s and a Pi 4. Each robot had wheels, cameras, microphones, and speakers. We exposed every physical capability — movement, vision, speech, interaction — as tools the agent could invoke. The robots could see their environment, follow people, reason about what was happening, and decide when to move or talk. With ElevenLabs integration, they could actually speak.

One robot acting autonomously was cool. But the interesting part was multiple robots communicating with each other in natural language, sharing observations, planning together, and coordinating actions in real time. That's when things started getting genuinely weird — and genuinely promising.

It also revealed a massive hole in how we think about robotics.

The Abstraction That's Missing

In cloud AI, we've converged on a clean abstraction: an agent is an LLM with access to tools. Tools are just functions — search the web, query a database, send an email. The agent reasons about which tools to call, in what order, with what parameters. This is the pattern behind every serious agent framework today, from LangChain to CrewAI to Anthropic's own agent architecture.

Now apply that same logic to a robot. A robot has sensors (cameras, microphones, LIDAR, IMUs) and actuators (wheels, arms, grippers, speakers). These are just tools. A camera is a tool that returns visual data. A wheel controller is a tool that accepts direction and speed. A microphone is a tool that returns audio. If you describe a robot's physical capabilities as a set of callable tools, then any LLM that understands function calling can, in principle, control that robot.

In principle, any robot can become an autonomous agent if you describe its sensors and actuators as tools. In practice, almost nobody does this — and there's no standard way to do it.

This is the abstraction that's missing. Not a better foundation model for robotics. Not a better sim-to-real pipeline. Just a clean, universal way to say: here's a robot, here are its capabilities described as tools, now let an LLM agent drive it.
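To make that concrete, here is an entirely illustrative Python sketch: two physical capabilities described in the JSON-schema style that most LLM function-calling APIs accept, plus a dispatcher that routes the model's tool calls to (stubbed) hardware drivers. The tool names, parameters, and limits are assumptions for illustration, not a standard.

```python
# Hypothetical tool definitions for a two-wheeled robot, written in the
# JSON-schema style most LLM function-calling APIs accept.
ROBOT_TOOLS = [
    {
        "name": "drive",
        "description": "Set wheel velocities in m/s. Positive is forward.",
        "parameters": {
            "type": "object",
            "properties": {
                "left_mps": {"type": "number", "minimum": -0.5, "maximum": 0.5},
                "right_mps": {"type": "number", "minimum": -0.5, "maximum": 0.5},
                "duration_s": {"type": "number", "maximum": 2.0},
            },
            "required": ["left_mps", "right_mps", "duration_s"],
        },
    },
    {
        "name": "capture_frame",
        "description": "Return a JPEG frame from the front camera.",
        "parameters": {"type": "object", "properties": {}},
    },
]

def dispatch(tool_name: str, args: dict) -> str:
    """Route an LLM tool call to the matching hardware driver (stubbed here)."""
    handlers = {
        "drive": lambda a: f"driving L={a['left_mps']} R={a['right_mps']}",
        "capture_frame": lambda a: "frame captured",
    }
    return handlers[tool_name](args)
```

The point of the sketch is the shape, not the details: once the hardware is behind definitions like these, any model that understands function calling can, in principle, drive it.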

Why Current Approaches Fall Short

There's been incredible work in LLM-powered robotics. Google's SayCan showed that you can ground language models in robotic affordances — the LLM proposes what to do, and an affordance function determines what's physically possible. RT-2 demonstrated vision-language-action models that transfer web knowledge directly to robotic control. Code as Policies showed LLMs generating executable robot code from natural language. ELLMER, published in Nature Machine Intelligence, proved that embodied LLMs can complete long-horizon tasks in unpredictable environments.

But here's the problem with all of these: they're tightly coupled to specific robots, specific environments, and specific research setups. SayCan only works with a predefined set of skills on a specific robot. RT-2 requires massive training pipelines. Code as Policies generates code for a particular robot API.

None of them answer the question: how do you take an arbitrary robot — any robot, with any combination of sensors and actuators — and make it an autonomous agent?

- not a model trained end-to-end for a specific robot

+ a framework where any robot describes its capabilities as tools, and any LLM agent can control it

ROS — the Robot Operating System — gets us partway there. It's the de facto middleware for robotics, handling communication between components, hardware abstraction, and sensor integration. Frameworks like ROS-LLM are starting to bridge the gap, enabling natural language interactions and LLM-based decision-making within ROS. But ROS was designed for a world of pre-programmed behaviors, not for a world where an LLM agent dynamically reasons about what to do next. Integrating LLMs with ROS still requires deep expertise in both domains, and the action space problem — the sheer number of possible things a robot could do — remains largely unsolved.

The Multi-Robot Problem Is Even Harder

One robot as an agent is a solvable engineering problem. A swarm of robots as a coordinated multi-agent system is a fundamentally different challenge.

When we built our robot swarm, the robots communicated with each other in natural language. It worked — surprisingly well, actually. One robot would say "I see a person moving left," and another would respond "I'll flank right." But natural language, while intuitive, is not a protocol. It's ambiguous, slow, and doesn't scale.

Recent research confirms this tension. LLM2Swarm has explored using LLMs to suggest global swarm strategies, with each agent consulting a shared blueprint for local decisions. Language-Guided Pattern Formation uses LLMs to translate high-level descriptions into swarm actions. SwarmChat, presented at ICSI 2025, enables natural language commands to robotic swarms through multiple LLM modules for intent recognition, task planning, and context generation.

But all of these still treat natural language as the primary communication layer. When robots need to coordinate in real time — sharing spatial data, synchronizing movements, negotiating task allocation — you need something structured. You need a protocol.
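Here's a minimal sketch of what "something structured" could look like: the observation from earlier ("I see a person moving left") as a typed, versioned message instead of prose. The schema identifier and field names are invented for illustration; this is not an existing protocol.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SwarmMessage:
    """Hypothetical typed coordination message (illustrative, not a standard)."""
    schema: str        # versioned schema identifier, so receivers can validate
    sender: str        # robot ID
    msg_type: str      # e.g. "observation", "intent", "task_claim"
    payload: dict      # structured data, no LLM needed to interpret it
    timestamp_ms: int

# "I see a person moving left" as data instead of prose:
obs = SwarmMessage(
    schema="swarm/v1",
    sender="robot-2",
    msg_type="observation",
    payload={"object": "person", "bearing_deg": 315, "velocity_mps": [-0.4, 0.0]},
    timestamp_ms=1_700_000_000_000,
)
wire = json.dumps(asdict(obs))  # compact, unambiguous, cheap to parse
```

A message like this costs a few hundred bytes and microseconds to parse, versus a model inference per utterance; the receiving robot can act on the bearing directly rather than interpreting a sentence.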

Where A2A and MCP Come In — And Where They Don't

This is where things get interesting. We already have the beginnings of agent communication protocols in the cloud world. Google's A2A (Agent-to-Agent) protocol, launched with over 50 tech partners, defines how autonomous agents discover and communicate with each other. Anthropic's MCP (Model Context Protocol) standardizes how agents interact with tools and data sources. Together, they form a two-layer architecture: MCP for agent-to-tool communication, A2A for agent-to-agent communication.

Now imagine extending this to physical robots.

— each robot runs an agent. the agent's tools are the robot's physical capabilities, described via something like MCP.

— robots communicate with each other using something like A2A — structured, typed messages for coordination, not free-form natural language.

— physical robot agents and cloud agents share the same orchestration layer, enabling a warehouse robot to coordinate with an inventory management agent in the cloud.

But neither A2A nor MCP was designed for the physical world. They don't account for real-time constraints, spatial coordination, sensor fusion across agents, or the fact that a physical action can't be rolled back the way a database transaction can. The latency requirements alone are fundamentally different — the moment between seeing an object and adjusting a gripper must happen in milliseconds, and a round trip to the cloud makes that impossible.

What's needed is a protocol layer that sits between the edge and the cloud. Inference and real-time control happen locally. Planning, learning, and cross-fleet coordination happen in the cloud. And a shared protocol connects them.
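As a toy sketch of that split: a dispatcher that keeps actions with millisecond budgets on the local control loop and routes slow, deliberative requests to the cloud. The action categories and latency numbers are illustrative assumptions, not measurements.

```python
# Illustrative edge/cloud routing: reflexive actions stay on the robot,
# deliberative planning tolerates a cloud round trip. All numbers assumed.
LOCAL_BUDGET_MS = 50  # assumed upper bound for a local control-loop cycle

ACTION_BUDGETS_MS = {
    "adjust_gripper": 10,          # must close the loop in milliseconds
    "emergency_stop": 5,
    "replan_route": 2000,          # survives a cloud round trip
    "fleet_task_allocation": 5000,
}

def route(action: str) -> str:
    """Return 'edge' if the action fits the local latency budget, else 'cloud'."""
    return "edge" if ACTION_BUDGETS_MS[action] <= LOCAL_BUDGET_MS else "cloud"
```

The interesting design question is not the threshold itself but that the protocol layer has to carry the budget with the request, so either side can make this call.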

What a Real Framework Would Look Like

If I were designing this from scratch, here's what I'd want:

01 — a universal capability description format. a standard way for any robot to declare "I have wheels that move at X speed, a camera with Y resolution, a gripper with Z force." tools, essentially, with physical parameters.

02 — an agent runtime that works on edge devices. something that runs on a Raspberry Pi, takes those capability descriptions, and lets an LLM reason about and invoke them. local inference for real-time control, cloud fallback for complex planning.

03 — a coordination protocol for physical agents. structured messages for spatial awareness, task negotiation, and state synchronization. not natural language. something typed, versioned, and fast.

04 — a bridge to cloud agent ecosystems. so a physical robot agent can participate in the same workflows as a cloud agent. same protocol family, different execution constraints.

05 — a safety layer. because LLM output should never be executed directly on hardware without verification. generated plans need to pass through kinematic verifiers or simulators before reaching actuators.
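To make items 01 and 05 concrete, here's a hedged sketch: a capability declaration that carries physical limits, and a safety gate that validates an LLM-proposed command against those limits before anything reaches an actuator. The format, field names, and limit scheme are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Capability:
    """Hypothetical universal capability declaration (item 01)."""
    name: str
    kind: str     # "actuator" or "sensor"
    limits: dict  # physical envelope, e.g. max speed, max force

ROBOT_MANIFEST = [
    Capability("wheels", "actuator", {"max_speed_mps": 0.5}),
    Capability("gripper", "actuator", {"max_force_n": 20.0}),
    Capability("camera", "sensor", {"resolution": [1920, 1080]}),
]

def safety_gate(command: dict, manifest: list) -> bool:
    """Item 05: check an LLM-proposed command against declared limits
    before it reaches hardware. Unknown capabilities and attempts to
    'actuate' a sensor are rejected outright."""
    caps = {c.name: c for c in manifest}
    cap = caps.get(command.get("capability"))
    if cap is None or cap.kind != "actuator":
        return False
    for limit_name, maximum in cap.limits.items():
        key = limit_name.removeprefix("max_")  # "max_speed_mps" -> "speed_mps"
        if key in command and abs(command[key]) > maximum:
            return False
    return True
```

A real version of item 05 would verify whole plans kinematically or in simulation, not just clamp scalar limits, but the principle is the same: the manifest that advertises a capability to the LLM is also the contract the safety layer enforces.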

The Pieces Are Starting to Exist

The encouraging thing is that the building blocks are emerging, even if no one has assembled them yet. OpenClaw is already running on Raspberry Pis as a persistent AI agent, and the peaq Robotics SDK is integrating robot capabilities as reusable skills within the agent ecosystem — skills that can be shared across robots, fleets, and deployments. ALRM (Agentic LLM for Robotic Manipulation) introduced a framework that integrates both code generation and tool-based execution for robotic control, supporting a mode that directly leverages LLM function calling for robot control. ROS-LLM demonstrated that you can get a robot from zero to LLM-controlled in ten minutes with the right integration layer.

But these are all separate efforts solving separate pieces of the puzzle. Nobody has built the unified stack that treats every robot as an agent, every capability as a tool, every robot-to-robot interaction as a protocol message, and every physical-digital handoff as a first-class concern.

The Bigger Picture

I wrote previously about why the agentic internet needs more than NLWeb — it needs agents talking to agents, with structured protocols, permission systems, and economic flows. The physical world is the same problem, amplified. If cloud agents need protocols to coordinate, physical agents need them even more. The stakes are higher, the latency constraints are tighter, and the consequences of miscommunication are measured in collisions, not 404 errors.

What I'm describing isn't science fiction. We ran a swarm of robots coordinating through OpenClaw agents at a hackathon. It worked. The robots saw, spoke, reasoned, and coordinated. But it also showed me exactly how far we are from having the infrastructure this needs. Natural language coordination hit its limits fast. The lack of a shared capability description format meant every robot needed custom integration. And there was no clean way to bring a cloud agent into the loop alongside the physical ones.

One robot is useful. Many robots working together as physical AI agents is a different category entirely. And it requires infrastructure that doesn't exist yet.

The AI world is converging on a clear pattern: agents with tools, connected by protocols. The robotics world hasn't caught up. But when it does — when we have a universal way to describe a robot as an agent, its capabilities as tools, and its interactions as protocol messages — that's when things get genuinely transformative. Not one robot doing one task. Fleets of robots, coordinating with each other and with cloud agents, dynamically reasoning about the physical world.

This is a gap I want to keep working on. If you're thinking about embodied agents, multi-agent systems, or agent protocols for the physical world — I'd love to talk.
