Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained VLMs to enable instruction-following robotic manipulation. However, the core of VLM pre-training is aligning vision and language in a semantic space, whereas robotic actions live in the 3D physical space — a structural mismatch that makes a direct mapping hard to learn and prone to representation collapse. We propose AffordanceVLA, which introduces structured affordance forecasting as a task-oriented intermediate representation. We progressively model manipulation priors through three complementary components: Which2Act (object-centric grounding via visual-latent prediction), Where2Act (2D interaction localization via affordance maps), and How2Act (3D geometric reasoning). Integrated into a Mixture-of-Transformer (MoT) architecture and trained with a three-stage progressive curriculum — backed by an automated pipeline that synthesizes 100K+ affordance labels — AffordanceVLA delivers strong, sample-efficient performance across simulation and real-world manipulation.
Click any icon to jump to the corresponding section.
Previous works bridge perception and action via dense video prediction or visual foresight, but such signals are redundant and slow, while purely perceptive representations remain action-agnostic. Much like how humans naturally perceive a mug's handle as an invitation to grasp, affordances — manipulation priors that explicitly indicate which object to manipulate, as well as where and how to interact — serve as a perfect bridge by seamlessly coupling spatial grounding in vision, semantic conditioning in language, and execution guidance in action. AffordanceVLA unifies perception, prediction, and action by leveraging structured affordance forecasting as intermediate supervision within a Mixture-of-Transformer (MoT) architecture.
Three specialized experts. The Understanding Expert (\(\mathcal{M}_{und}\)) establishes a fine-grained alignment between visual perception and linguistic intent by leveraging pre-trained VLM priors, fusing the observation \(O_t\), instruction \(l\), and proprioceptive state \(s_t\) into an instruction-aware multimodal representation \(h_t^{und}\). The Affordance Generation Expert (\(\mathcal{M}_{gen}\)) acts as a visual planner, predicting a structured representation \(\hat{A}_{t}\) that anchors high-level semantics into actionable geometric cues. The Action Expert (\(\mathcal{M}_{act}\)) decodes these unified representations into smooth, temporally coherent action chunks — relieved from heavy visual reasoning, it focuses entirely on precise physical execution.
UAA progressive attention. To coordinate the experts, AffordanceVLA applies bidirectional intra-expert attention for thorough contextual fusion, while enforcing strict causal inter-expert attention across modules. The Affordance Generation Expert queries features exclusively from the Understanding Expert, while the Action Expert attends to the outputs of both preceding experts. This unidirectional flow prevents action information from leaking into the prediction stage, preserving the purity of affordance features and enhancing generalization.
Rather than predicting monolithic global features, the Affordance Generation Expert disentangles learnable affordance queries into three parallel sub-modules that concurrently decode manipulation priors from coarse to fine and from 2D to 3D. Bidirectional attention jointly refines their representations, yielding task-relevant priors that unify vision, language, and action.
Object-centric grounding. Crops the observation by the target bounding box and reconstructs a continuous visual latent \(z_q\) from a frozen encoder (e.g., Flux VAE), isolating the interacting entity while filtering background distractions.
2D interaction localization. Unfolds 1D query tokens into a 2D affordance map via a lightweight Transformer decoder, pinpointing interactive regions and providing explicit contact-point guidance.
3D geometric reasoning. Bifurcates into a diffusion-based 3D shape generation branch and a 10-DoF spatial layout regression branch (rotation, scale, translation), equipping the Action Expert with spatial priors and kinematic constraints.
Which2Act aligns intents with visual entities by reconstructing the target latent \(\hat{z}\) via a Mean-Squared-Error objective:
Where2Act aligns the predicted spatial logits \(\hat{y}\) with the ground-truth mask \(M\) using a pixel-wise Binary Cross-Entropy loss, where \(\sigma(\cdot)\) is the sigmoid function:
How2Act formulates 3D shape prediction as a conditional diffusion process and regresses the 10-DoF spatial layout with a component-wise Smooth-L1 loss:
A three-stage curriculum transitions the model from broad visual–linguistic grounding to affordance-centric reasoning and finally to domain-specific embodied control:
Automated affordance-annotation pipeline. To overcome the scarcity of dense affordance labels, we extract keyframes from action sequences; a text LLM (Claude Opus 4.5) decomposes the global instruction into per-keyframe sub-instructions, and a VLM (Qwen3-VL) converts each keyframe into a detection category and a spatial affordance query. These guide a fine-tuned RexOmni (via PRISM), integrated with SAM and SAM-3D, to yield over 100,000 dense affordance annotations.
AffordanceVLA is deployed on a real robot across three task families. The overlaid Where2Act affordance heatmaps reveal how the model grounds concise language instructions into precise interaction regions before acting.
Picking diverse objects across colors, shapes, and semantic categories — average success rate 88.3%.
Identical scene, different instruction. The Where2Act heatmap (right) shows where the model decides to act — disambiguating the command before execution (left).
Continuously clearing all rubbish from the table by re-evaluating the scene each step — and staying robust under human interference.
We evaluate AffordanceVLA on the LIBERO and CALVIN ABC→D simulation benchmarks as well as real-world tasks. We report two variants: AffordanceVLA (w/o stage II), which skips the affordance-augmented robotic co-training, and AffordanceVLA (full), trained with the complete three-stage strategy.
Our full model attains a strong average of 95.8% — the highest among the methods compared here and competitive with the best recent VLAs. Strikingly, even without Stage II, AffordanceVLA (w/o stage II) already reaches 86.2%, showing that the decoupled MoT design isolates task-relevant semantics from raw control signals and curbs representation collapse (Q2). The margin narrows only on LIBERO-Long, where extremely long-horizon tasks would further benefit from explicit memory.
| Method | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| SpatialVLA | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 |
| CoT-VLA | 87.5 | 91.6 | 87.6 | 69.0 | 83.9 |
| ThinkAct | 88.3 | 91.4 | 87.1 | 70.9 | 84.4 |
| π0 | 98.0 | 96.8 | 94.4 | 88.4 | 94.4 |
| GR00T-N1 | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| F1-VLA | 98.2 | 97.8 | 95.4 | 91.3 | 95.7 |
| AffordanceVLA (w/o stage II) | 88.5 | 91.7 | 91.3 | 73.3 | 86.2 |
| AffordanceVLA (full) | 98.6 | 98.4 | 96.2 | 89.8 | 95.8 |
On this zero-shot OOD protocol (train on A/B/C, test on the visually novel Env D), AffordanceVLA (full) reaches a strong average length of 4.33, completing all 5 consecutive tasks in 75.9% of rollouts — competitive among recent VLAs (Q1). By forcing the model to focus on task-critical entities, interaction regions, and spatial layouts, structured affordance prediction makes the perception–action mapping resilient to novel visual disturbances. The substantial jump from the w/o-stage-II variant (3.81) underscores the necessity of Stage II co-training for OOD generalization (Q3).
| Method | 1/5 | 2/5 | 3/5 | 4/5 | 5/5 | Avg. Len |
|---|---|---|---|---|---|---|
| RoboFlamingo | 82.4 | 61.9 | 46.6 | 33.1 | 23.5 | 2.48 |
| SuSIE | 87.0 | 69.0 | 49.0 | 38.0 | 26.0 | 2.69 |
| GR-1 | 85.4 | 71.2 | 59.6 | 49.7 | 40.1 | 3.06 |
| OpenVLA | 91.3 | 77.8 | 62.0 | 52.1 | 43.5 | 3.27 |
| CLOVER | 96.0 | 83.5 | 70.8 | 57.5 | 45.4 | 3.53 |
| UniVLA | 95.5 | 85.8 | 75.4 | 66.9 | 56.5 | 3.80 |
| π0 | 93.8 | 85.0 | 76.7 | 68.6 | 60.1 | 3.84 |
| Seer | 94.4 | 87.2 | 79.9 | 72.2 | 64.3 | 3.98 |
| VPP | 95.3 | 88.2 | 80.3 | 72.9 | 64.5 | 4.01 |
| Seer-Large | 96.3 | 91.6 | 86.1 | 80.3 | 74.0 | 4.28 |
| AffordanceVLA (w/o stage II) | 93.4 | 84.7 | 75.4 | 68.1 | 58.9 | 3.81 |
| AffordanceVLA (full) | 96.8 | 92.0 | 87.5 | 80.8 | 75.9 | 4.33 |
Across Basic Tasks, AffordanceVLA reaches an average success rate of 88.3%, consistently outperforming the π0 baseline over diverse objects, colors, and shapes. On Complex Tasks with severe visual aliasing (identical observations, distinct instructions), it unambiguously grounds concise intents into localized affordance heatmaps — e.g., 86.7% / 100.0% on Drawer (pick) / (close) versus π0's 46.7% / 40.0% — and sustains long-horizon execution on Pick all the rubbish.
| Method | Close | Pick up (Color) | Pick up (Shape) | Pick up | Average | ||||
|---|---|---|---|---|---|---|---|---|---|
| microwave | safe | red | green | duck | banana | flower | bear | ||
| π0 | 86.7 | 86.7 | 80.0 | 80.0 | 26.7 | 73.3 | 53.3 | 80.0 | 70.8 |
| AffordanceVLA | 93.3 | 100.0 | 86.7 | 80.0 | 86.7 | 86.7 | 80.0 | 93.3 | 88.3 |
| Method | Drawer | Toaster | Pick all the rubbish | Average | |||||
|---|---|---|---|---|---|---|---|---|---|
| pick | close | pick | toast | 1st ↑ | 2nd ↑ | 3rd ↑ | Empty ↓ | ||
| π0 | 46.7 | 40.0 | 46.7 | 26.7 | 93.3 | 53.3 | 6.7 | 33 | 44.8 |
| AffordanceVLA | 86.7 | 100.0 | 80.0 | 86.7 | 100.0 | 80.0 | 46.7 | 11 | 82.9 |
Where does the gain come from? A natural concern is whether the improvement stems from high-quality data, added supervision density, or the structured representation itself. We design three controls that each hold one factor fixed:
Together these directly answer Q2: it is the decoupled, jointly-optimized MoT design — not merely more data nor an off-the-shelf affordance module — that prevents collapse and unlocks the gains.
| Method | LIBERO (Success Rate %) | CALVIN ABC→D | |||||||
|---|---|---|---|---|---|---|---|---|---|
| Spatial | Object | Goal | Long | Avg. | 1/5 | 3/5 | 5/5 | Avg. Len | |
| Architecture Design & Training Strategy | |||||||||
| No-Afd (Pi0 Arch) | 96.0 | 95.4 | 92.4 | 85.8 | 92.4 | 94.5 | 78.0 | 62.8 | 3.93 |
| Frozen-Afd | 68.0 | 71.1 | 66.4 | 62.9 | 67.1 | 85.3 | 55.9 | 26.3 | 2.83 |
| AffordanceVLA w/o stage II | 88.5 | 91.7 | 91.3 | 73.3 | 86.2 | 93.4 | 75.4 | 58.9 | 3.81 |
| Affordance Representation | |||||||||
| w/o Which2Act | 97.5 | 97.6 | 95.0 | 88.1 | 94.6 | 96.7 | 83.3 | 72.1 | 4.20 |
| w/o Where2Act | 95.5 | 96.0 | 93.4 | 88.0 | 93.2 | 96.2 | 81.9 | 69.8 | 4.13 |
| w/o How2Act | 96.1 | 96.5 | 93.9 | 88.2 | 93.7 | 95.0 | 79.4 | 65.9 | 4.01 |
| Attention Mechanism for Affordance | |||||||||
| Block-wise Tokens | 92.4 | 92.9 | 89.8 | 86.0 | 90.3 | 94.1 | 77.1 | 61.7 | 3.89 |
| AffordanceVLA (Full) | 98.6 | 98.4 | 96.2 | 89.8 | 95.8 | 96.8 | 87.5 | 75.9 | 4.33 |
Structured representation (Q1). Removing any single head (Which/Where/How2Act) yields only a graceful degradation rather than a catastrophic collapse — evidence that the three sub-modules are not a brittle Which→Where→How pipeline, but are jointly refined under a shared instruction-aware representation and consumed together by the Action expert. Notably, How2Act's benefit is modest on simple tabletop two-finger settings yet becomes pronounced on complex real-world 6-DoF interactions, exactly where 3D shape and layout priors matter most.
Scaling downstream fine-tuning data from 10% to 100%, the vanilla π0 starts strong (pre-trained weights) but quickly hits a rigid ceiling. AffordanceVLA instead surges: with only 40% of the data it already attains ~92% on LIBERO and an average length above 4.0 on CALVIN, shattering the ceiling of the fully fine-tuned π0. Because the affordance representation decomposes the perception–action mapping into interpretable sub-problems, each sample supervises not only the action but also object grounding, spatial localization, and 3D reasoning — effectively multiplying the learning signal per sample.
A closer look at failure modes is revealing. In the Toaster task, π0 performs poorly (toast 26.7% vs. our 86.7%) and its bad cases concentrate at the button-pressing step: instead of extending to press the button, it often closes the gripper as if still doing a pick-and-place, largely disregarding the “press the button” instruction. Even after real-trajectory fine-tuning, π0 still suffers from weak instruction following — its behavior is driven by the dominant action prior rather than the language command. We offer an intuition for why affordance grounding alleviates this: in an action-only VLA, the low-level action loss is back-propagated directly into the VLM backbone and may gradually erode the instruction-following ability that pre-training endowed. Recent strong policies (π0.5, π0.7) implicitly mitigate this with train-only structured supervision; affordance plays a similar — arguably more natural — role, keeping its training signal close to the VLM's semantic space.
We present AffordanceVLA to bridge the structural gap between the semantic space of Vision-Language Models and the 3D physical requirements of embodied control. Instead of relying on direct end-to-end mappings or redundant visual foresight, our framework adopts affordances as a task-oriented intermediate representation and decomposes affordance forecasting into Which2Act, Where2Act, and How2Act. With a Mixture-of-Transformer architecture and a progressive data curriculum, AffordanceVLA achieves strong, competitive performance on LIBERO, CALVIN, and real-world experiments, demonstrating strong generalization and robust reasoning. Future work will explore explicit temporal modeling as well as extensions to bimanual and deformable object manipulation.
Coming soon.