Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation

Nie, Chang; Deng, Tianchen; Wang, Guangming; Liu, Zhe; Wang, Hesheng

Chang Nie¹, Tianchen Deng¹, Guangming Wang², Zhe Liu¹, Hesheng Wang¹

¹Shanghai Jiao Tong University, China ²University of Cambridge, UK

From VLA to VSLA

Modern Vision-Language-Action policies have shown that large multimodal models can map images, language, and robot state directly to action. This works well when task-relevant evidence is visually persistent. A cup stays on the table, a drawer remains open, and the camera can observe those facts over many control cycles. Recent systems have started to add audio, but in most cases sound is still treated as an episodic input before action or as a channel mainly for speech understanding.

Sound-centric manipulation is different. Here the critical evidence may be a brief beep, a collision click, a subtle change in boiling sound, or the prosody of a spoken confirmation. These cues are often short, non-repeatable, and only meaningful if the robot listens at the right moment during execution. The difficulty is not merely adding one more modality; it is that acoustic evidence is distributed over time very differently from visual evidence.

Existing ways of attaching sound to VLA are often too static for this setting. Automatic speech recognition throws away non-speech cues and prosody, while waveform-as-image adapters flatten a temporal signal into a single visual snapshot. Both approaches become brittle when the cue is brief, when the cue happens between two policy queries, or when the task depends on how sound evolves rather than on one isolated event.

This is exactly where modern action chunking becomes a problem. Large policies often predict a chunk of actions and execute it open-loop to keep motion smooth. During that interval, new observations cannot change the current command sequence. For sound-centric tasks, this creates a structural blind spot: a short cue can happen and vanish before the next decision ever sees it. We refer to this as the Blind Execution Interval (BEI).
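The Blind Execution Interval can be illustrated with a small timing check. This is a minimal sketch, not the paper's formulation: the function name, the step-based time units, and all numbers are hypothetical, and the policy is assumed to see only a fixed audio window at each query with no memory.

```python
def cue_seen_at_queries(cue_start, cue_end, chunk_len, window_len, horizon):
    """Return True if any policy query's audio window overlaps the cue.

    All quantities are in control steps (an assumption for illustration).
    The policy is queried every `chunk_len` steps and, at each query, sees
    only the last `window_len` steps of audio.
    """
    for query_t in range(0, horizon, chunk_len):
        window_start = query_t - window_len
        # Half-open intervals: window [window_start, query_t), cue [cue_start, cue_end)
        if cue_start < query_t and cue_end > window_start:
            return True
    return False


# Hypothetical numbers: 50-step chunks, 10-step audio window, 5-step beep.
seen = cue_seen_at_queries(cue_start=15, cue_end=20,
                           chunk_len=50, window_len=10, horizon=200)
```

With these made-up values the beep falls entirely inside the open-loop interval, so no query window overlaps it. Widening the window to the full chunk length would catch it, but at the cost of processing much longer raw audio per query.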

These limitations motivate the Vision-Sound-Language-Action paradigm. VSLA extends VLA from “see and act” to “see, hear, remember, and react” under delayed asynchronous control. In this paper we instantiate that paradigm with HEAR, and we accompany it with a pretraining dataset and a sound-causal benchmark so that the whole pipeline can be trained and evaluated consistently.

HEAR framework and sound-centric manipulation overview

HEAR reframes manipulation from a vision-dominant setting into a sound-causal one: the policy must preserve fleeting acoustic evidence, reason over multi-sensory context, and only act after the right sound condition has actually occurred.

What We Propose

Our response has three parts. First, we formalize VSLA as a control setting in which each decision receives delayed multi-view vision, a causal audio segment, language, and proprioception. Second, we design HEAR as a concrete architecture for that setting. Third, we build the surrounding resources needed to train and test such a system at scale.
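As a rough illustration of the inputs listed above, the per-decision observation could be organized as follows. This is a sketch under assumptions: the class name, field names, and types are hypothetical and are not taken from the HEAR code release.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class VSLAObservation:
    """Hypothetical container for one VSLA decision step."""
    images: List[str]      # placeholder ids for delayed multi-view camera frames
    audio: List[float]     # causal audio samples ending at decision time
    instruction: str       # language command
    proprio: List[float]   # robot proprioceptive state


def causal_audio_segment(stream: Sequence[float], decision_idx: int,
                         segment_len: int) -> List[float]:
    """Return samples in [decision_idx - segment_len, decision_idx).

    The slice never reads samples at or after the decision index, which is
    what makes the segment causal.
    """
    start = max(0, decision_idx - segment_len)
    return list(stream[start:decision_idx])
```

The key property the sketch tries to capture is that the audio segment is strictly causal: nothing recorded after the decision time is visible to the policy.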

HEAR is designed around two bottlenecks identified in the paper. The first is causal persistence: transient cues must survive across execution gaps. The second is temporal grounding: during long waiting periods with quasi-static vision, the policy still needs a sense of how the task is progressing over time.

Beyond the model itself, the paper also contributes OpenX-Sound for large-scale audio-augmented pretraining and HEAR-Bench for strict sound-causal evaluation. The benchmark matters because results for a system like HEAR are only meaningful if the evaluation explicitly rejects visually plausible but acoustically wrong behavior.

VSLA

A control paradigm for delayed robot decisions conditioned on vision, streaming sound, language, and proprioception.

HEAR

An architecture that couples causal audio memory with multimodal reasoning, temporal prediction, and smooth action generation.

OpenX-Sound

An audio-augmented pretraining resource that extends Open X-Embodiment with synchronized sound.

HEAR-Bench

A sound-causal benchmark that rejects premature actions and tests whether the policy truly listens before acting.

How HEAR Works

The overall logic of HEAR follows the same progression as the problem itself. If sound can be lost between two policy queries, the system first needs a memory mechanism. If long waiting periods make the task temporally ambiguous, the system then needs a representation that understands progression over time. If the final policy is executed in chunks, it also needs an action generator that stays smooth enough to avoid introducing unnecessary ego-noise.

HEAR processes sound continuously, reasons over integrated multi-sensory context at decision time, learns temporal progression through future-audio prediction, and produces smooth action chunks for execution.

Historizer

The Historizer is a streaming stateful transformer that updates a compact causal memory from incoming audio packets. It directly targets the BEI issue by preserving short-lived cues that would otherwise disappear from a fixed per-query window.

Instead of forcing the backbone to process very long raw windows, it carries forward a compressed memory state aligned with real execution gaps. This is what lets the next decision know that a critical cue just happened even if the cue itself has already vanished.

In other words, the Historizer is the part that keeps the system causally aware of the recent acoustic past.
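The idea of carrying a compact state across audio packets can be shown with a toy example. Note the substitution: the memory below is a decayed running peak of packet energy, used only as a stand-in for illustration, and is not the streaming transformer the Historizer actually uses. The class name, the decay value, and the packet contents are all hypothetical.

```python
class ToyAudioMemory:
    """Fixed-size causal memory updated once per audio packet.

    Stand-in for illustration only: a decayed running peak of packet
    energy, not a learned transformer memory.
    """

    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.state = 0.0

    def update(self, packet):
        energy = sum(s * s for s in packet) / len(packet)
        # Old evidence fades by `decay`; a loud packet refreshes the state.
        self.state = max(self.decay * self.state, energy)
        return self.state


# 20 silent packets, one loud "beep" packet, then 20 more silent packets.
packets = [[0.0] * 4] * 20 + [[1.0] * 4] + [[0.0] * 4] * 20
memory = ToyAudioMemory(decay=0.9)
for packet in packets:
    memory.update(packet)

# A memoryless policy that only sees the last 5 packets observes silence.
recent_energy = sum(sum(s * s for s in p) for p in packets[-5:])
```

After the loop, `recent_energy` is zero while `memory.state` is still positive, which is the behavior the section describes: the next decision can tell that a cue recently occurred even though it is no longer present in the current window.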

Envisioner

The Envisioner is a hierarchical multimodal reasoner. A high-level omni-model fuses vision, language, proprioception, the current audio window, and the Historizer memory into a semantic context; a low-level model then reuses this context to produce control-oriented features.

This split is important because sound-centric tasks often contain visually similar phases with different intent, such as waiting, monitoring, reacting, or overriding an action already in progress. The Envisioner turns those subtle differences into a control-ready representation.

It is also the module that lets HEAR bind sound evidence to task context, instead of treating audio as an isolated side channel.

Advancer

The Advancer predicts near-future audio codes from latent representations. This auxiliary objective injects explicit temporal grounding, helping the policy disambiguate long quasi-static waiting phases.

This matters most in process-monitoring tasks, where the robot must detect progression over time rather than respond to a single spike. The Advancer encourages the latent space to encode how the scene is evolving acoustically.

This turns audio from a mere detection signal into a source of temporal structure for long-horizon control.
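One plausible way to write a future-audio prediction objective is a cross-entropy loss over discrete audio code ids. The sketch below assumes that form; the source only states that the Advancer predicts near-future audio codes, so the loss choice, the function name, and the tiny 4-code vocabulary are assumptions for illustration.

```python
import math


def future_code_loss(logits_per_step, target_codes):
    """Mean cross-entropy between predicted logits and future audio code ids.

    logits_per_step: one list of logits (over the code vocabulary) per
                     predicted future step.
    target_codes:    the code id actually observed at each of those steps.
    """
    total = 0.0
    for logits, code in zip(logits_per_step, target_codes):
        peak = max(logits)
        # Numerically stable log-sum-exp for the softmax normalizer.
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[code]
    return total / len(target_codes)


# Hypothetical example: a uniform prediction over 4 codes costs log(4) nats.
uniform_loss = future_code_loss([[0.0, 0.0, 0.0, 0.0]], [2])
```

A latent that encodes how the sound is evolving should assign high probability to the codes that actually arrive next, which lowers this loss. During a long quasi-static wait, that pressure is what would give the representation a sense of progress.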

Realizer

The Realizer transforms the inferred control state into smooth joint-position action chunks using conditional flow matching. This design is not just about nicer motion. In sound-centric manipulation, jerky behavior injects ego-noise and can mask weak task-relevant acoustic cues.

By generating smoother chunks, the Realizer improves both control quality and acoustic observability, which is especially important in tasks that alternate between waiting, listening, and fast reaction.

The Realizer closes the loop by making the action side consistent with the auditory demands of the task.
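For readers unfamiliar with flow matching, the following sketch shows the basic mechanics on a flattened action chunk. It assumes the common straight-line probability path between noise and data; the source does not specify which path the Realizer uses. The function names are hypothetical, and the conditioning on the inferred control state is left implicit inside `velocity_fn`.

```python
def flow_matching_pair(noise, action_chunk, t):
    """Training pair for a straight path from noise (t=0) to actions (t=1).

    Returns the interpolated point x_t and the target velocity, which for a
    straight path is constant and equal to (action_chunk - noise).
    """
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action_chunk)]
    velocity = [a - n for n, a in zip(noise, action_chunk)]
    return x_t, velocity


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 2-D "chunk": with an oracle velocity field, sampling
# transports the noise sample onto the target actions.
noise, target = [0.0, 1.0], [1.0, -1.0]
oracle = lambda x, t: [a - n for n, a in zip(noise, target)]
sampled = euler_sample(oracle, noise, steps=10)
```

In training, a network would be regressed onto `velocity` at the point `x_t`; at inference, the learned field replaces `oracle`. Because the whole chunk is generated by integrating a continuous vector field, the resulting joint-position sequence tends to vary smoothly, which is the property the section ties to lower ego-noise.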

Training Data and Evaluation Pipeline

The paper argues that a new architecture alone is not enough. Sound-centric manipulation also lacks the large-scale pretraining data and the timing-aware benchmarks that made VLA progress possible in the first place. For that reason, HEAR is paired with both a training resource and a dedicated evaluation suite.

OpenX-Sound augments Open X-Embodiment trajectories with synchronized audio so that the model can first learn broad multi-sensory representations. After that pretraining stage, HEAR is fine-tuned on task-specific demonstrations and evaluated in settings where sound is genuinely necessary for success.

HEAR-Bench is designed specifically for sound-causal manipulation. Its tasks span four categories of acoustic cues: event-triggered alarms, human speech, continuous process sounds, and physical interaction feedback. Across the suite, success is timing-sensitive: if the robot reaches the goal before the required sound cue occurs, the rollout is marked as a failure.
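The timing-sensitive success rule can be written as a small predicate. This is a sketch of the rule as described above, not the benchmark's actual evaluation code: the function name and the step-index representation are hypothetical, and treating a goal reached at the same step as the cue as a success is an assumption.

```python
from typing import Optional


def rollout_success(goal_step: Optional[int], cue_step: Optional[int]) -> bool:
    """Sound-causal success check for a single rollout.

    goal_step: step at which the goal state was first reached, or None.
    cue_step:  step at which the required sound cue occurred, or None.
    """
    if goal_step is None:
        return False  # the task was never completed
    if cue_step is None:
        return False  # goal reached without any cue: premature action
    # Assumption: acting at the same step as the cue counts as success.
    return goal_step >= cue_step
```

Under this rule, a policy that ignores audio and simply completes the visually obvious motion is rejected whenever it finishes early, even though the final scene looks correct.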

OpenX-Sound Dataset

We release the audio-augmented pretraining data on Hugging Face, providing synchronized sound for large-scale robot trajectory learning under the VSLA setting.

Simulation benchmark

The seven simulation tasks are carefully chosen to cover complementary sound-centric challenges. Alarm Clock and Microwave test waiting and transient triggers; Check Yes and Interrupt test speech timing and prosody; Check Materials tests impact acoustics; Pour Water and Boil Water test long-horizon process monitoring.

Real-robot deployment

The physical tasks increase difficulty with room reverberation, background noise, mechanical ego-noise, and stronger visual aliasing. Moka Coffee requires process monitoring, Answer Phone is a multi-stage speech-and-event task, Shake Bottle uses active acoustic sensing, and Real Alarm Clock tests robust waiting under domain shift.

These experiments are intentionally constructed rather than merely collected. Each task isolates a specific failure mode of memoryless audio integration, such as transient cue dropout, prosody loss, visual aliasing, or long-horizon process monitoring. Under this evaluation protocol, HEAR reaches 81% average success in simulation and 54% on real-robot tasks, with especially strong gains on brief non-speech triggers and process-monitoring scenarios.

Experiment Videos

This page currently highlights only real-robot deployments. The selected tasks focus on long-horizon monitoring, active acoustic sensing, speech-conditioned progress tracking, and transient cue reaction under real-world noise and reverberation. Each example is designed to answer a concrete question about what kind of sound-causal competence the model has learned.

External-View Real-Robot Deployments

Robot Onboard Input Recordings

The videos below are captured from the robot's own sensing streams: head camera, hand camera, and synchronized microphone audio. They supplement the external-view demonstrations above by showing the actual visual and acoustic inputs available to the policy during execution, rather than footage from a third-person camera.

Robustness Ablations with Added Disturbances

These ablation recordings stress the same robot-input pipeline under controlled acoustic, sensor-placement, and scene-layout disturbances. Each video card below explicitly lists the perturbations tested in that trial, making the robustness coverage visible at the task level.

Perturbations covered: traffic noise, white noise, robot ego-noise, multi-person dialogue noise, microphone position shifts, object position shifts, extra distractor objects, and phone speech-content variants.

Moka Coffee - ablation

Robustness trial for continuous process monitoring where the robot must still detect the acoustic transition to the sputtering phase despite real deployment disturbances.

Robustness stressors

Traffic noise
White noise
Robot ego-noise
Multi-person dialogue noise
Microphone position shift
Object position shift
Extra distractor objects

Answer Phone - ablation

Robustness trial for speech-and-event progress tracking, testing whether the policy remains grounded in the current spoken instruction and call state under confounding audio and language variants.

Robustness stressors

Traffic noise
White noise
Robot ego-noise
Multi-person dialogue noise
Microphone position shift
Object position shift
Extra distractor objects
Speech content variants

Shake Bottle (Empty) - ablation

Robustness trial for active acoustic sensing, where the empty-bottle judgment must remain stable even when the self-generated shaking cue is degraded or visually confounded.

Robustness stressors

Traffic noise
White noise
Robot ego-noise
Multi-person dialogue noise
Microphone position shift
Object position shift
Extra distractor objects

Shake Bottle (Occupied) - ablation

Companion robustness trial for the occupied-bottle condition, testing whether the policy still distinguishes the filled-state rattling response under the same disturbance family.

Robustness stressors

Traffic noise
White noise
Robot ego-noise
Multi-person dialogue noise
Microphone position shift
Object position shift
Extra distractor objects

Real Alarm Clock - ablation

Robustness trial for transient alarm detection, where the robot must avoid premature pressing and still react only after the alarm cue appears under acoustic and layout disturbances.

Robustness stressors

Traffic noise
White noise
Robot ego-noise
Multi-person dialogue noise
Microphone position shift
Object position shift
Extra distractor objects

BibTeX

@article{nie2026visionsoundlanguageactionparadigmhearframework,
  title={Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation},
  author={Chang Nie and Tianchen Deng and Guangming Wang and Zhe Liu and Hesheng Wang},
  year={2026},
  eprint={2603.16086},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2603.16086},
}
