Robotics and Embodied AI

In Short

Embodied AI is the study of intelligence that arises from a body interacting with a physical environment, in contrast to purely digital AI. Physical robots present unique challenges, most importantly the sim-to-real gap and Moravec's paradox, that have slowed progress relative to digital AI, but Vision-Language-Action models and advances in humanoid hardware have made 2024 to 2026 a period of rapid change.

01. What It Is

Embodied AI is AI research concerned with agents that perceive and act within a physical or richly simulated environment using a body. The core claim is that intelligence cannot be fully separated from physical interaction with the world: perception, action, and cognition are deeply linked.

This contrasts with purely linguistic AI, where the system processes text or other digital tokens. An embodied agent must interpret sensory streams (cameras, lidar, force sensors, microphones), plan in physical space, and execute precise motor commands in real time, often on hardware with strict latency requirements.

Classical robotics (pre-deep-learning) used explicit models: hand-engineered kinematics, pre-defined object models, and rule-based planning. This produced reliable, repeatable behavior in structured environments such as factory assembly lines, but broke down in unstructured environments, where not every object and situation can be anticipated in advance.

Learned robotics uses neural networks trained on data to map observations to actions. The shift began in earnest around 2016 to 2018 and has accelerated significantly since 2022 with large-scale transformer architectures.

02. Why It Matters

Physical AI is the missing piece:
Digital AI (language, image generation, reasoning) has improved dramatically. Physical AI has lagged. An AI that can perform reliable dexterous manipulation in a home or hospital would be transformative in ways that a chatbot cannot replicate. Tasks like loading a dishwasher, folding laundry, or assisting an elderly person require physical intelligence at a level current robots do not achieve reliably.

Moravec's paradox:
Hans Moravec observed in the 1980s that the things humans find hard (chess, calculus, language translation) are relatively easy for computers, while the things infants can do (grasping a novel object, walking on uneven terrain, recognising a face under varied lighting) are extremely hard for machines. This asymmetry persists. The sensors, actuators, and real-time coordination required for physical intelligence are a harder engineering challenge than the abstract reasoning tasks where AI has excelled.

Humanoid robots and the labor market:
As of 2026, multiple companies including Boston Dynamics, Figure AI, 1X Technologies, Agility Robotics, and Tesla (Optimus) are producing or testing humanoid robots intended for real-world deployment in warehouses and manufacturing. The economic implications are potentially large, and progress is faster than most observers expected in 2022.

03. How It Works

A typical modern robotic learning system has three components.

Perception:
Cameras, depth sensors, lidar, and proprioceptive sensors (joint angles, torques) provide the robot's observations. These are typically encoded by a vision encoder, often pre-trained at large scale on internet imagery.

Policy:
A neural network maps observations to actions. In the simplest case, actions are joint velocity or position commands. In more complex cases, the policy produces language-conditioned or semantically grounded actions.

Training:
Policies can be trained via imitation learning (learning from human demonstrations), reinforcement learning (optimising for a reward signal), or a combination. Large-scale data collection remains one of the primary bottlenecks.
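As a rough illustration of how the three components fit together, the sketch below trains a toy linear policy by behaviour cloning, the simplest form of imitation learning. Everything here is an assumption made for illustration: the names encode, LinearPolicy, and behaviour_cloning are hypothetical, the "expert" is a hand-written proportional-derivative controller standing in for human demonstrations, and a real system would use a learned vision encoder and a neural network rather than two hand-picked features.

```python
import random


def encode(observation):
    """Stand-in for perception: the raw observation is already a feature list."""
    return list(observation)


class LinearPolicy:
    """Toy policy: maps a feature vector to a single joint velocity command."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def act(self, features):
        return sum(w * x for w, x in zip(self.w, features)) + self.b


def behaviour_cloning(policy, demos, lr=0.1, epochs=500):
    """Fit the policy to (features, expert_action) pairs.

    Uses full-batch gradient descent on the mean squared error between
    the policy's action and the demonstrated action.
    """
    n = len(demos)
    for _ in range(epochs):
        grad_w = [0.0] * len(policy.w)
        grad_b = 0.0
        for features, expert_action in demos:
            err = policy.act(features) - expert_action
            for i, x in enumerate(features):
                grad_w[i] += 2.0 * err * x / n
            grad_b += 2.0 * err / n
        policy.w = [w - lr * g for w, g in zip(policy.w, grad_w)]
        policy.b -= lr * grad_b
    return policy


# Hypothetical demonstrations: the "expert" drives a joint toward its target
# with action = -2.0 * position_error - 0.5 * velocity, plus small noise.
rng = random.Random(0)
demos = []
for _ in range(200):
    raw = (rng.uniform(-1, 1), rng.uniform(-1, 1))  # (position error, velocity)
    expert_action = -2.0 * raw[0] - 0.5 * raw[1] + rng.gauss(0, 0.01)
    demos.append((encode(raw), expert_action))

policy = behaviour_cloning(LinearPolicy(n_features=2), demos)
print("learned weights:", policy.w, "bias:", policy.b)
```

The learned weights should land close to the expert's gains. The same loop structure carries over to larger policies, but the cost of gathering the demonstrations, not the optimisation, is usually what limits it.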

04. The Sim-to-Real Gap

Training robots in physical reality is slow, expensive, and risky. Training in simulation is fast and cheap. But policies trained in simulation often fail when transferred to the real world, because the simulation is not a perfect model of reality.

Sources of the gap include: imperfect contact and friction models, visual appearance differences (textures, lighting), sensor noise, actuator dynamics, and unmodeled mechanical compliance. Researchers address this through domain randomisation (intentionally varying simulation parameters during training so the policy learns robustness), domain adaptation (aligning simulation and real visual distributions), and sim-to-real fine-tuning (training in simulation then briefly fine-tuning on real data).
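Domain randomisation in particular is easy to sketch. The example below assumes a simulator whose physical and visual parameters can be reset per episode. The parameter names, their ranges, and the SimParams container are all hypothetical and chosen only to mirror the sources of the gap listed above.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    friction: float          # contact friction coefficient
    payload_mass_kg: float   # mass of the manipulated object
    actuator_delay_s: float  # latency between command and motion
    light_intensity: float   # relative scene brightness for rendering


# Hypothetical ranges, for illustration only. In practice they are tuned so
# that the real system's parameters plausibly fall inside them.
RANGES = {
    "friction": (0.4, 1.2),
    "payload_mass_kg": (0.1, 2.0),
    "actuator_delay_s": (0.0, 0.05),
    "light_intensity": (0.5, 1.5),
}


def sample_sim_params(rng):
    """Draw one set of simulator parameters uniformly from RANGES."""
    return SimParams(
        **{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
    )


rng = random.Random(0)
episode_params = [sample_sim_params(rng) for _ in range(5)]
for episode, params in enumerate(episode_params):
    # A real training loop would reset the simulator with `params` here
    # and then roll out the policy for one episode.
    print(episode, params)
```

Because the policy never sees the same physics twice, it cannot overfit to one simulator configuration. The trade-off is that ranges that are too wide can make the task harder to learn than the real one.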

World models (see World Models) partially address this: instead of a hand-engineered physics engine, a learned world model trained on real robot interaction data can generate more realistic synthetic experience. The Waymo World Model approach (February 2026) exemplifies this for autonomous driving.

05. Vision-Language-Action Models

The most significant recent development is the Vision-Language-Action (VLA) model, which applies to robotic control the paradigm behind GPT-4 and similar models: a large transformer pretrained on broad data and then adapted to the target task.

RT-1 (Google, December 2022) demonstrated that a transformer trained on over 130,000 real-robot episodes in a kitchen environment could generalise to new tasks and objects, outperforming smaller models and conventional approaches. The key finding was that model and data scale both matter: larger models trained on more diverse data generalise better.

RT-2 (Google, July 2023) extended this by co-fine-tuning a pretrained Vision-Language Model (specifically PaLI-X and PaLM-E variants) on both robotic trajectory data and internet-scale vision-language tasks such as visual question answering. The actions were expressed as text tokens and trained in the same format as natural language responses. RT-2 demonstrated emergent capabilities: the ability to interpret commands not in the robot training data, perform rudimentary multi-step reasoning ("pick up the object that could be used as a hammer"), and generalise to novel objects. This is meaningful because it shows that internet-scale pretraining transfers to physical manipulation, not just language tasks.
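To make "actions expressed as text tokens" concrete, here is a minimal sketch of one way to turn a continuous action vector into a string of integers and back. The RT-1 and RT-2 papers describe discretising each action dimension into 256 bins. The value range of [-1, 1], the space-separated string format, and the function names below are assumptions for illustration, not the papers' exact scheme.

```python
NUM_BINS = 256  # bins per action dimension


def action_to_tokens(action, low=-1.0, high=1.0):
    """Discretise each action dimension into a bin index and join as text."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        frac = (clipped - low) / (high - low)
        # frac == 1.0 would map to NUM_BINS, so clamp to the last bin.
        tokens.append(min(int(frac * NUM_BINS), NUM_BINS - 1))
    return " ".join(str(t) for t in tokens)


def tokens_to_action(text, low=-1.0, high=1.0):
    """Map each bin index in the string back to the centre of its bin."""
    width = (high - low) / NUM_BINS
    return [low + (int(t) + 0.5) * width for t in text.split()]


action = [-1.0, 0.0, 0.3, 1.0]
text = action_to_tokens(action)
recovered = tokens_to_action(text)
print(text)
print(recovered)
```

Once actions are strings like this, a vision-language model can be trained to emit them with the same next-token objective it uses for captions or answers. The cost is a quantisation error of at most half a bin width per dimension.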

Robotics foundation models:
The general direction since RT-2 has been toward models with broader generalisation: trained on diverse robot morphologies, diverse tasks, and diverse environments. Google DeepMind's follow-on work (RT-X and related efforts) pools demonstration data across many robot types. The Open X-Embodiment dataset, released openly alongside RT-X, aggregates data from 22 robot types covering 527 skills.

06. Manipulation and Locomotion

These are the two primary physical challenges in robotics.

Manipulation (grasping, pushing, inserting, pouring) is hard because contact physics is complex, objects vary enormously, and even small positioning errors can cause failure. Deep learning has improved grasp prediction substantially, but robust dexterous manipulation in unstructured environments remains difficult.

Locomotion (walking, running, climbing stairs) has made striking progress. Boston Dynamics' Atlas demonstrated dynamic bipedal locomotion, historically with model-based control and more recently with policies trained by reinforcement learning in simulation. Spot (quadruped) is deployed commercially. Reinforcement learning in simulation, followed by transfer to hardware and sometimes real-world fine-tuning, is now the standard approach. The ETH Zurich group's ANYmal work showed quadruped locomotion over extreme terrain that classical controllers could not handle.

07. Humanoid Robots in 2026

The humanoid form factor has become commercially significant. Key companies and systems as of mid-2026:

Figure AI (Figure 01/02): Deployed in BMW manufacturing. In January 2024 it demonstrated autonomous coffee machine operation after watching video demonstrations.

Boston Dynamics (Atlas): Transitioned from the hydraulic Atlas to an electric version in 2024 with faster, more fluid motion. Focused on warehouse and industrial use.

1X Technologies (NEO): Norwegian startup focused on home environments.

Tesla (Optimus): Deployed in limited numbers in Tesla factories.

Agility Robotics (Digit): Deployed in GXO Logistics warehouses.

Progress is faster than most researchers predicted in 2022. However, real-world deployment remains narrow in task scope: these robots do a small number of well-defined tasks reliably, not general household manipulation. The gap between "impressive demo" and "general home robot" remains wide.

08. Why Physical AI Lags Digital AI

Beyond Moravec's paradox, several structural factors explain the gap:

Data scarcity:
Internet text is available at essentially unlimited scale. Robot interaction data is expensive and slow to collect: a single teleoperation demonstration takes minutes of human time. Language models have trained on trillions of tokens, whereas RT-1 used roughly 130,000 episodes and even pooled datasets such as Open X-Embodiment contain on the order of a million trajectories.

Hardware constraints:
Actuators have bandwidth limits, latency requirements, and failure modes that software does not. Inference must happen in real time on embedded hardware.

Safety requirements:
A robot that fails can damage itself, its environment, or injure people. This slows iteration.

Physical diversity:
Every physical environment is different. Digital AI operates in a constrained domain (text, pixels). A physical robot must cope with an essentially infinite variety of object shapes, surfaces, and lighting conditions.

Feedback latency:
A language model's output exists as soon as it is generated and can be read and judged immediately. A robot acting in the world receives feedback from physical outcomes that may take seconds or minutes to observe fully, which slows every training and evaluation loop.

09. Common Pitfalls and Misconceptions

Robotics is solved:
Demos are impressive. Deployment is narrow. Current robots handle specific tasks in controlled conditions. General-purpose physical dexterity in an unstructured home environment is not solved.

Humanoid form factor is obviously correct:
There are good reasons to use humanoid robots in human-designed environments. There are also good reasons to use specialised morphologies for specific tasks. The humanoid bet is plausible but not certain.

Simulation is sufficient for training:
The sim-to-real gap is real. Simulation-only training often underperforms training that includes real-world experience, especially for contact-rich manipulation.

VLA models generalise arbitrarily:
RT-2 showed impressive emergent generalisation to unseen commands and novel objects, but generalisation across different robot morphologies, environments, and task types remains a research challenge.
