Vision-Language-Action Models
Vision-Language-Action (VLA) models represent a recent and rapidly evolving paradigm in robotics and embodied AI that unifies visual perception, natural language understanding, and motor control within a single end-to-end architecture. Rather than treating perception, planning, and action as separate modules in a traditional robotics pipeline, VLAs leverage large pre-trained vision-language models (VLMs) and fine-tune them to directly output low-level robot actions conditioned on visual observations and language instructions. This approach capitalizes on the rich semantic representations learned during internet-scale pre-training, enabling robots to generalize across tasks, environments, and even embodiments with remarkable flexibility.
The core formulation of a VLA treats robot control as a conditional generation problem. Given a language instruction $\ell$, a history of image observations $o_{1:t}$, and optionally the robot's proprioceptive state $s_t$, the model learns a policy $\pi_\theta$ that produces an action $a_t$:

$$a_t \sim \pi_\theta(\,\cdot \mid \ell,\, o_{1:t},\, s_t\,).$$
In practice, actions are often discretized into tokens and predicted autoregressively, mirroring the next-token prediction objective used in large language models. The training objective is typically to minimize the negative log-likelihood over a dataset $\mathcal{D}$ of expert demonstrations:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(\ell,\, o_{1:T},\, a_{1:T}) \sim \mathcal{D}} \left[ \sum_{t=1}^{T} \log \pi_\theta(a_t \mid \ell,\, o_{1:t},\, s_t) \right].$$
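The discretization step can be illustrated with a minimal uniform-binning tokenizer. The bin count, the per-dimension action bounds, and the function names below are assumptions for illustration, in the spirit of the 256-bin discretization used by models in this family:

```python
import numpy as np

# Assumed constants: 256 bins per action dimension, actions normalized to [-1, 1].
N_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0

def tokenize(action: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to a discrete bin index in [0, N_BINS)."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    frac = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return np.minimum((frac * N_BINS).astype(int), N_BINS - 1)

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Map bin indices back to the continuous value at each bin's center."""
    frac = (tokens + 0.5) / N_BINS
    return ACTION_LOW + frac * (ACTION_HIGH - ACTION_LOW)

# Example: a 7-dimensional action (e.g. end-effector delta pose plus gripper).
a = np.array([0.03, -0.52, 0.99, 0.0, 0.0, 0.0, 1.0])
tokens = tokenize(a)          # integer token IDs the language model can predict
recon = detokenize(tokens)    # reconstruction error is bounded by one bin width
```

The round-trip error is at most one bin width, which is why a few hundred bins per dimension suffice for most manipulation tasks.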
Alternatively, some VLA architectures predict continuous actions via diffusion-based denoising or flow matching. In the diffusion formulation, a noisy action $a^k$ at diffusion step $k$ is iteratively refined by a learned denoising network $\epsilon_\theta$:

$$a^{k-1} = \frac{1}{\sqrt{\alpha_k}} \left( a^k - \frac{1 - \alpha_k}{\sqrt{1 - \bar{\alpha}_k}}\, \epsilon_\theta(a^k, k, c) \right) + \sigma_k z,$$
where $c$ is a conditioning embedding derived from the vision-language backbone, $\alpha_k$ and $\bar{\alpha}_k = \prod_{i=1}^{k} \alpha_i$ are noise schedule parameters, and $z \sim \mathcal{N}(0, I)$. This formulation allows modeling of multimodal action distributions, which is critical for contact-rich manipulation tasks where multiple valid action trajectories may exist for a given observation.
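The reverse update above can be sketched numerically. Everything here is an assumption for illustration: the toy `eps_theta` stands in for the learned denoising network, the linear noise schedule and the choice $\sigma_k = \sqrt{\beta_k}$ are common defaults rather than details of any particular VLA:

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_theta(a_k, k, c):
    # Placeholder for the learned network eps_theta(a^k, k, c); a real VLA
    # conditions this network on the vision-language embedding c.
    return a_k - c

# Assumed linear beta schedule over K diffusion steps.
K = 50
betas = np.linspace(1e-4, 0.02, K)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def reverse_step(a_k, k, c):
    """One reverse-diffusion update: produce a^{k-1} from a^k."""
    eps = eps_theta(a_k, k, c)
    mean = (a_k - (1.0 - alphas[k]) / np.sqrt(1.0 - alpha_bars[k]) * eps) / np.sqrt(alphas[k])
    z = rng.standard_normal(a_k.shape) if k > 0 else 0.0  # no noise at the final step
    return mean + np.sqrt(betas[k]) * z                   # sigma_k = sqrt(beta_k)

# Sample an action by denoising from pure Gaussian noise a^K ~ N(0, I).
c = np.array([0.2, -0.1])  # conditioning embedding projected to action space (assumed)
a = rng.standard_normal(2)
for k in reversed(range(K)):
    a = reverse_step(a, k, c)
```

Running the loop to $k = 0$ yields a denoised action sample; drawing several samples from different initial noise is what lets the policy represent multiple valid trajectories.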
The foundational insight behind VLAs emerged from the RT-2 model, which demonstrated that a pre-trained VLM could be directly fine-tuned on robotics data by representing actions as text tokens, yielding a model that could follow novel language instructions and exhibit chain-of-thought reasoning during manipulation. This was preceded by RT-1, which established the viability of transformer-based policies trained on large-scale robot demonstration datasets. OpenVLA subsequently showed that open-source VLMs could match or exceed the performance of proprietary VLA systems, democratizing access to this model class. More recent work such as $\pi_0$ introduced flow matching action heads on top of VLM backbones to handle dexterous manipulation, while OpenVLA-OFT proposed efficient fine-tuning strategies for adapting VLAs to new domains. Octo provided an open-source generalist policy framework emphasizing cross-embodiment transfer, and RoboVLMs offered a systematic evaluation of design choices across the VLA landscape.
Scaling laws familiar from language modeling appear to hold in the VLA setting: larger vision-language backbones, more diverse demonstration data, and broader task distributions consistently improve generalization. Cross-embodiment training, where a single model is trained on data from heterogeneous robot platforms, has emerged as a particularly promising direction for building generalist robot policies. The combination of internet-scale pre-training with robot-specific fine-tuning allows VLAs to leverage commonsense knowledge about objects, spatial relations, and task semantics that would be prohibitively expensive to learn from robot data alone.
Despite their promise, VLAs face several open challenges. Inference latency remains a concern for real-time control, as large transformer backbones can be slow relative to the control frequencies required for dynamic tasks. Data efficiency, sim-to-real transfer, safety guarantees, and the ability to handle long-horizon multi-step tasks are active areas of research. Nevertheless, VLAs represent one of the most compelling paths toward general-purpose robot intelligence, bridging the gap between the semantic richness of foundation models and the physical grounding required for real-world action.
References
- Brohan, A., Brown, N., Carbajal, J., et al. RT-1: Robotics Transformer for Real-World Control at Scale. RSS, 2023.
- Brohan, A., Brown, N., Carbajal, J., et al. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. CoRL, 2023.
- Kim, M. J., Pertsch, K., Karamcheti, S., et al. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv preprint arXiv:2406.09246, 2024.
- Black, K., Brown, N., Driess, D., et al. $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164, 2024.
- Octo Model Team, et al. Octo: An Open-Source Generalist Robot Policy. RSS, 2024.
- Kim, M. J., Pertsch, K., Karamcheti, S., et al. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success (OpenVLA-OFT). arXiv preprint arXiv:2502.19645, 2025.
- Li, G., Jin, H., Lei, Z., et al. RoboVLMs: Vision-Language-Action Models for Robotic Manipulation — What Matters and Why. arXiv preprint arXiv:2411.14238, 2024.
- O'Neill, A., Rehman, A., Maddukuri, A., et al. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. ICRA, 2024.