Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation

Chang Nie1, Tianchen Deng1, Guangming Wang2, Zhe Liu1, Hesheng Wang1
1Shanghai Jiao Tong University, China 2University of Cambridge, UK

From VLA to VSLA

Modern Vision-Language-Action (VLA) policies have shown that large multimodal models can map images, language, and robot state directly to actions. This works well when task-relevant evidence is visually persistent: a cup stays on the table, a drawer remains open, and the camera can observe those facts over many control cycles. Recent systems have started to add audio, but in most cases sound is still treated as a one-shot input consumed before acting, or as a channel mainly for speech understanding.

Sound-centric manipulation is different. Here the critical evidence may be a brief beep, a collision click, a subtle change in boiling sound, or the prosody of a spoken confirmation. These cues are often short, non-repeatable, and only meaningful if the robot listens at the right moment during execution. The problem is not just adding one more modality, but dealing with the fact that audio is distributed over time very differently from vision.

Existing ways of attaching sound to VLA are often too static for this setting. Automatic speech recognition throws away non-speech cues and prosody, while waveform-as-image adapters flatten a temporal signal into a single visual snapshot. Both approaches become brittle when the cue is brief, when the cue happens between two policy queries, or when the task depends on how sound evolves rather than on one isolated event.

This is exactly where modern action chunking becomes a problem. Large policies often predict a chunk of actions and execute it open-loop to keep motion smooth. During that interval, new observations cannot change the current command sequence. For sound-centric tasks, this creates a structural blind spot: a short cue can happen and vanish before the next decision ever sees it. We refer to this as the Blind Execution Interval (BEI).
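As a toy illustration (all numbers hypothetical, not the paper's control rates), a chunked policy that only inspects a short per-query audio window can miss a brief cue entirely:

```python
# Toy timeline of the Blind Execution Interval (hypothetical numbers).
# A policy queried every `chunk` steps and given only a short per-query
# audio window never observes a cue that falls between two queries.

chunk = 50                    # control steps executed open-loop per query
cue = range(120, 130)         # a 10-step beep
window = 5                    # per-query audio lookback, with no memory
query_steps = range(0, 300, chunk)

def heard_without_memory(q):
    # Does the per-query window around decision step q overlap the cue?
    return any(t in cue for t in range(q - window, q + 1))

print([q for q in query_steps if heard_without_memory(q)])   # → [] : cue missed
# A persistent audio memory would let every query after step 129 use the cue.
```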

These limitations motivate the Vision-Sound-Language-Action paradigm. VSLA extends VLA from “see and act” to “see, hear, remember, and react” under delayed asynchronous control. In this paper we instantiate that paradigm with HEAR, and we accompany it with a pretraining dataset and a sound-causal benchmark so that the whole pipeline can be trained and evaluated consistently.

HEAR framework and sound-centric manipulation overview

HEAR reframes manipulation from a vision-dominant setting into a sound-causal one: the policy must preserve fleeting acoustic evidence, reason over multi-sensory context, and only act after the right sound condition has actually occurred.

What We Propose

Our response has three parts. First, we formalize VSLA as a control setting in which each decision receives delayed multi-view vision, a causal audio segment, language, and proprioception. Second, we design HEAR as a concrete architecture for that setting. Third, we build the surrounding resources needed to train and test such a system at scale.
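Concretely, one VSLA decision consumes a tuple like the following sketch (field names and shapes are illustrative, not the paper's interface):

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class VSLADecisionInput:
    """Inputs to one VSLA decision, per the setting described above
    (field names and shapes are illustrative, not the paper's API)."""
    views: List[np.ndarray]   # delayed multi-view RGB frames
    audio: np.ndarray         # causal audio segment up to the query time
    instruction: str          # language command
    proprio: np.ndarray       # robot state, e.g. joint positions

obs = VSLADecisionInput(
    views=[np.zeros((224, 224, 3), dtype=np.uint8)],
    audio=np.zeros(16000, dtype=np.float32),   # e.g. 1 s of audio at 16 kHz
    instruction="press the button when the alarm rings",
    proprio=np.zeros(7, dtype=np.float32),
)
```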

HEAR is designed around two bottlenecks identified in the paper. The first is causal persistence: transient cues must survive across execution gaps. The second is temporal grounding: during long waiting periods with quasi-static vision, the policy still needs a sense of how the task is progressing over time.

Beyond the model itself, the paper also contributes OpenX-Sound for large-scale audio-augmented pretraining and HEAR-Bench for strict sound-causal evaluation. This matters because a system like HEAR is only meaningful if the benchmark explicitly rejects visually plausible but acoustically wrong behavior.

VSLA

A control paradigm for delayed robot decisions conditioned on vision, streaming sound, language, and proprioception.

HEAR

An architecture that couples causal audio memory with multimodal reasoning, temporal prediction, and smooth action generation.

OpenX-Sound

An audio-augmented pretraining resource that extends Open X-Embodiment with synchronized sound.

HEAR-Bench

A sound-causal benchmark that rejects premature actions and tests whether the policy truly listens before acting.

How HEAR Works

The overall logic of HEAR follows the same progression as the problem itself. If sound can be lost between two policy queries, the system first needs a memory mechanism. If long waiting periods make the task temporally ambiguous, the system then needs a representation that understands progression over time. If the final policy is executed in chunks, it also needs an action generator that stays smooth enough to avoid introducing unnecessary ego-noise.

HEAR full architecture overview

HEAR processes sound continuously, reasons over integrated multi-sensory context at decision time, learns temporal progression through future-audio prediction, and produces smooth action chunks for execution.

Historizer

Historizer module figure

The Historizer is a streaming, stateful transformer that updates a compact causal memory from incoming audio packets. It directly targets the Blind Execution Interval by preserving short-lived cues that would otherwise fall outside a fixed per-query window.

Instead of forcing the backbone to process very long raw windows, it carries forward a compressed memory state aligned with real execution gaps. This is what lets the next decision know that a critical cue just happened even if the cue itself has already vanished.

In other words, the Historizer is the part that keeps the system causally aware of the recent acoustic past.
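A minimal sketch of the idea, with a simple recurrent update standing in for the paper's streaming stateful transformer (all dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

class StreamingAudioMemory:
    """Minimal stand-in for the Historizer idea: fold each incoming audio
    packet into a fixed-size causal state that persists across execution
    gaps. (The paper's module is a stateful transformer, not this toy.)"""
    def __init__(self, feat_dim=128, mem_dim=256):
        self.W_in = rng.standard_normal((feat_dim, mem_dim)) / np.sqrt(feat_dim)
        self.W_mem = rng.standard_normal((mem_dim, mem_dim)) / np.sqrt(mem_dim)

    def step(self, packet, state):
        # Recurrent-style update: blend the old state with the new packet,
        # so a transient cue leaves a trace in the carried-forward state.
        return np.tanh(packet @ self.W_in + state @ self.W_mem)

mem = StreamingAudioMemory()
state = np.zeros(256)
for _ in range(10):                       # a stream of 10 audio packets
    packet = rng.standard_normal(128)
    state = mem.step(packet, state)       # compact memory carried forward
print(state.shape)   # (256,)
```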

Envisioner

Envisioner module figure

The Envisioner is a hierarchical multimodal reasoner. A high-level omni-model fuses vision, language, proprioception, the current audio window, and the Historizer memory into a semantic context; a low-level model reuses this context to produce control-oriented features.

This split is important because sound-centric tasks often contain visually similar phases with different intent, such as waiting, monitoring, reacting, or overriding an action already in progress. The Envisioner turns those subtle differences into a control-ready representation.

It is also the module that lets HEAR bind sound evidence to task context, instead of treating audio as an isolated side channel.
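A schematic of the two-level split, with concatenate-and-project standing in for the actual large multimodal models (all names and dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def high_level_context(vision, lang, proprio, audio_now, audio_memory):
    """Stand-in for the Envisioner's high-level omni-model: fuse all five
    inputs into one semantic context vector. Concatenate-and-project is
    used for brevity; the real model is a large multimodal network."""
    feats = np.concatenate([vision, lang, proprio, audio_now, audio_memory])
    W = rng.standard_normal((feats.size, 512)) / np.sqrt(feats.size)
    return np.tanh(feats @ W)

def low_level_features(context):
    """Stand-in for the low-level model: reuse the semantic context to
    produce control-oriented features for the action generator."""
    W = rng.standard_normal((context.size, 256)) / np.sqrt(context.size)
    return np.tanh(context @ W)

ctx = high_level_context(
    vision=rng.standard_normal(768), lang=rng.standard_normal(512),
    proprio=rng.standard_normal(32), audio_now=rng.standard_normal(128),
    audio_memory=rng.standard_normal(256),
)
print(low_level_features(ctx).shape)   # (256,)
```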

Advancer

Advancer module figure

The Advancer predicts near-future audio codes from latent representations. This auxiliary objective injects explicit temporal grounding, helping the policy disambiguate long quasi-static waiting phases.

This matters most in process-monitoring tasks, where the robot must detect progression over time rather than respond to a single spike. The Advancer encourages the latent space to encode how the scene is evolving acoustically.

This turns audio from a mere detection signal into a source of temporal structure for long-horizon control.
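The auxiliary objective can be sketched as next-audio-token prediction with a cross-entropy loss (shapes, vocabulary size, and the linear head are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def future_audio_loss(latent, W_head, future_codes):
    """Cross-entropy on predicted near-future audio codes, in the spirit
    of the Advancer's auxiliary objective (the linear head and all shapes
    are illustrative, not the paper's architecture)."""
    logits = latent @ W_head                               # (T, vocab)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(future_codes)), future_codes].mean()

latent = rng.standard_normal((8, 512))         # latents for 8 future steps
W_head = rng.standard_normal((512, 1024)) * 0.01
codes = rng.integers(0, 1024, size=8)          # ground-truth future audio tokens
loss = future_audio_loss(latent, W_head, codes)
print(loss > 0)   # a positive scalar, minimized alongside the action loss
```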

Realizer

The Realizer transforms the inferred control state into smooth joint-position action chunks using conditional flow matching. This design is not just about nicer motion. In sound-centric manipulation, jerky behavior injects ego-noise and can mask weak task-relevant acoustic cues.

By generating smoother chunks, the Realizer improves both control quality and acoustic observability, which is especially important in tasks that alternate between waiting, listening, and fast reaction.

The Realizer closes the loop by making the action side consistent with the auditory demands of the task.
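A miniature of the conditional flow-matching recipe with a linear interpolation path (dimensions are illustrative; the real Realizer conditions a learned velocity network on the Envisioner's features):

```python
import numpy as np

rng = np.random.default_rng(0)

# On the linear path x_t = (1 - t) * noise + t * actions, the regression
# target for the velocity field is simply (actions - noise).

def fm_training_pair(actions, noise, t):
    x_t = (1.0 - t) * noise + t * actions
    v_target = actions - noise
    return x_t, v_target

chunk = rng.standard_normal((16, 7))    # an action chunk: 16 steps x 7 joints
noise = rng.standard_normal((16, 7))
x_t, v_target = fm_training_pair(chunk, noise, t=0.3)

# Sampling: Euler-integrate the velocity field from noise (t=0) to the
# action chunk (t=1); with the oracle velocity the chunk is recovered.
x = noise.copy()
for _ in range(10):
    x = x + 0.1 * (chunk - noise)       # a trained network predicts this term
print(np.allclose(x, chunk))            # True
```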

Training Data and Evaluation Pipeline

The paper argues that a new architecture alone is not enough. Sound-centric manipulation also lacks the large-scale pretraining data and the timing-aware benchmarks that made VLA progress possible in the first place. For that reason, HEAR is paired with both a training resource and a dedicated evaluation suite.

OpenX-Sound augments Open X-Embodiment trajectories with synchronized audio so that the model can first learn broad multi-sensory representations. After that pretraining stage, HEAR is fine-tuned on task-specific demonstrations and evaluated in settings where sound is genuinely necessary for success.

HEAR-Bench is designed specifically for sound-causal manipulation. Its tasks span four categories of acoustic cues: event-triggered alarms, human speech, continuous process sounds, and physical interaction feedback. Across the suite, success is timing-sensitive: if the robot reaches the goal before the required sound cue occurs, the rollout is marked as a failure.
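The timing rule can be stated in a few lines (an illustrative scoring function, not the benchmark's actual code):

```python
def rollout_success(goal_step, cue_step):
    """Timing-sensitive scoring in the spirit of HEAR-Bench (illustrative):
    reaching the goal before the required sound cue counts as a failure."""
    if goal_step is None:
        return False                 # never completed the task
    if cue_step is None or goal_step < cue_step:
        return False                 # premature: acted before the cue
    return True

print(rollout_success(goal_step=80, cue_step=120))    # False (premature press)
print(rollout_success(goal_step=150, cue_step=120))   # True
```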

Simulation benchmark

The seven simulation tasks are carefully chosen to cover complementary sound-centric challenges. Alarm Clock and Microwave test waiting and transient triggers; Check Yes and Interrupt test speech timing and prosody; Check Materials tests impact acoustics; Pour Water and Boil Water test long-horizon process monitoring.

Real-robot deployment

The physical tasks increase difficulty with room reverberation, background noise, mechanical ego-noise, and stronger visual aliasing. Moka Coffee requires process monitoring, Answer Phone is a multi-stage speech-and-event task, Shake Bottle uses active acoustic sensing, and Real Alarm Clock tests robust waiting under domain shift.

These experiments are intentionally constructed rather than merely collected. Each task isolates a specific failure mode of memoryless audio integration, such as transient cue dropout, prosody loss, visual aliasing, or long-horizon process monitoring. Under this evaluation protocol, HEAR reaches 81% average success in simulation and 54% on real-robot tasks, with especially strong gains on brief non-speech triggers and process-monitoring scenarios.

Experiment Videos

This page currently highlights only real-robot deployments. The selected tasks focus on long-horizon monitoring, active acoustic sensing, speech-conditioned progress tracking, and transient cue reaction under real-world noise and reverberation. Each example is designed to answer a concrete question about what kind of sound-causal competence the model has learned. Videos auto-play when they enter the viewport and pause once they leave it.

Real-Robot Deployments

Moka Coffee

Why this task: A traditional Moka pot offers almost no clean visual completion signal, so progress must be inferred from sound.

What it tests: Long-horizon process monitoring in a noisy real environment, where the robot must detect the transition to the sputtering phase before pouring.

Answer Phone

Why this task: The robot revisits nearly identical visual states at different semantic stages of the task.

What it tests: Multi-stage progress tracking with ring detection, speech understanding, and end-of-call recognition under severe visual aliasing.

Shake Bottle (Empty)

Why this task: The hidden state cannot be recovered passively; the robot must first create the evidence itself.

What it tests: Active acoustic sensing, where perception depends on generating a repeatable motion that elicits a discriminative sound.

Shake Bottle (Occupied)

Why this task: This companion case shows the same active sensing behavior under a different hidden physical state.

What it tests: Whether the policy can use self-generated rattling sound to distinguish occupied from empty bottles consistently across trials.

Real Alarm Clock

Why this task: The simulation alarm setting is transferred into the physical world, where reverberation and background chatter distort the cue.

What it tests: Whether the model remains sound-causal under real acoustic domain shift and still avoids the visually tempting early press.

Paper PDF

ArXiv PDF link placeholder. Replace ARXIV_ID once available.

Open arXiv PDF

BibTeX

@article{nie2026hear,
  title={Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation},
  author={Nie, Chang and Deng, Tianchen and Wang, Guangming and Liu, Zhe and Wang, Hesheng},
  journal={International Journal of Robotics Research (under review)},
  year={2026},
  url={https://biubiu3.github.io/HEAR-Web/}
}

Page Clicks

Cumulative opens of this project page after public tracking started.


Daily points are pulled from GoatCounter and plotted as a cumulative click trajectory.