Beyond Vision: A Multi-Modal World Model

Abstract

We introduce a novel multi-modal world model that integrates wireless sensing with video input to enhance environmental understanding and predictive capabilities. Unlike traditional world models that rely primarily on visual data, our approach harnesses the complementary strengths of wireless sensing to detect and interpret spatial and contextual information beyond the camera's field of view and in challenging visibility conditions. Furthermore, our model can capture physical properties invisible to RGB cameras that significantly influence future video frames. By fusing data from wireless and video modalities, our model achieves a richer and more comprehensive representation of the environment, leading to improved accuracy in predicting object dynamics and spatial relationships.

We evaluate our multi-modal world model across diverse scenarios and demonstrate significant advantages over vision-only models. This work opens new possibilities for applications ranging from autonomous navigation to human-robot interaction, establishing a foundation for future multi-sensory AI systems.
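This page does not spell out how the two modalities are combined. As a minimal sketch, assuming the auxiliary sensing stream (e.g., radar, WiFi, or infrared features) is encoded into a sequence of tokens and injected into the latent video predictor through cross-attention, the fusion step could look like the PyTorch snippet below; every module name and tensor shape here is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class WirelessVideoFusion(nn.Module):
    """Illustrative sketch: fuse wireless-sensing tokens into video latents
    via cross-attention. Names and shapes are assumptions, not the paper's
    actual architecture."""

    def __init__(self, video_dim=512, wireless_dim=128, n_heads=8):
        super().__init__()
        # Project raw wireless features (e.g., radar range-Doppler bins or
        # WiFi CSI embeddings) into the video latent dimension.
        self.wireless_proj = nn.Linear(wireless_dim, video_dim)
        # Video latents attend to wireless tokens
        # (queries = video, keys/values = wireless).
        self.cross_attn = nn.MultiheadAttention(video_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(video_dim)

    def forward(self, video_latents, wireless_feats):
        # video_latents:  (B, N_video_tokens, video_dim)
        # wireless_feats: (B, N_wireless_tokens, wireless_dim)
        w = self.wireless_proj(wireless_feats)
        fused, _ = self.cross_attn(query=video_latents, key=w, value=w)
        # Residual connection keeps the vision-only pathway intact when the
        # wireless signal carries little extra information.
        return self.norm(video_latents + fused)


# Example usage with dummy tensors.
fusion = WirelessVideoFusion()
video = torch.randn(2, 256, 512)    # e.g., 16x16 latent patches per frame
wireless = torch.randn(2, 64, 128)  # e.g., 64 radar/CSI tokens
out = fusion(video, wireless)       # (2, 256, 512)
```

Cross-attention with a residual connection is one common way to condition a generative video model on side information; concatenation or adaptive normalization layers would be equally plausible choices here.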


Low-light Road Condition w/ LiDAR & Radar

Original | Vista | Ours
[Animated video comparisons for four low-light driving sequences, each showing the original clip, the Vista baseline, and Ours. FVD of each generated clip (lower is better):]

Sequence   Vista FVD   Ours FVD
1          296         255
2          294         267
3          624         505
4          134          82
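FVD is the Fréchet Video Distance: it fits a Gaussian to features of real clips and another to features of generated clips, then measures the distance between the two distributions, so lower values mean the generated videos are statistically closer to the real ones. Below is a minimal sketch of that distance, assuming per-clip features have already been extracted with a pretrained video encoder (FVD conventionally uses an I3D network; the exact evaluation protocol is not stated on this page).

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between two sets of per-clip video features.

    feats_real, feats_gen: (N, D) arrays of features from a pretrained
    video encoder (feature extraction is omitted here).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the product of the covariances.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))


# Toy usage with random features (a real FVD run would use I3D features).
real = np.random.randn(256, 64)
gen = np.random.randn(256, 64) + 0.5
print(frechet_distance(real, gen))
```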

Through-wall Human Movement w/ WiFi

Original | SVD | Ours
[Animated video comparisons for three through-wall human-movement sequences, each showing the original clip, the SVD baseline, and Ours.]

Walking in Low-visibility w/ Infrared

Original (Infrared) | SVD | Ours
[Animated video comparison for one low-visibility walking sequence, showing the original infrared clip, the SVD baseline, and Ours.]

Temperature-aware Action w/ Infrared

Original (RGB) | Original (Infrared) | SVD | Ours
[Animated video comparisons for three temperature-aware action sequences, each showing the original RGB clip, the original infrared clip, the SVD baseline, and Ours.]