Beyond Vision: A Multi-Modal World Model

Abstract

We introduce a novel multi-modal world model that integrates wireless sensing with video input to enhance environmental understanding and predictive capabilities. Unlike traditional world models that rely primarily on visual data, our approach harnesses the complementary strengths of wireless sensing to detect and interpret spatial and contextual information beyond the camera's field of view and in challenging visibility conditions. Furthermore, our model can capture physical properties invisible to RGB cameras that significantly influence future video frames. By fusing data from wireless and video modalities, our model achieves a richer and more comprehensive representation of the environment, leading to improved accuracy in predicting object dynamics and spatial relationships.

We evaluate our multi-modal world model across diverse scenarios and demonstrate significant advantages over vision-only models. This work opens new possibilities for applications ranging from autonomous navigation to human-robot interaction, establishing a foundation for future multi-sensory AI systems.
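This page does not spell out how the two modalities are combined. As a minimal sketch, assuming the auxiliary sensing stream (e.g., radar, WiFi, or infrared features) is encoded into a sequence of tokens and injected into the latent video predictor through cross-attention, the fusion step could look like the PyTorch snippet below; every module name and tensor shape here is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class WirelessVideoFusion(nn.Module):
    """Illustrative sketch: fuse wireless-sensing tokens into video latents
    via cross-attention. Names and shapes are assumptions, not the paper's
    actual architecture."""

    def __init__(self, video_dim=512, wireless_dim=128, n_heads=8):
        super().__init__()
        # Project raw wireless features (e.g., radar range-Doppler bins or
        # WiFi CSI embeddings) into the video latent dimension.
        self.wireless_proj = nn.Linear(wireless_dim, video_dim)
        # Video latents attend to wireless tokens
        # (queries = video, keys/values = wireless).
        self.cross_attn = nn.MultiheadAttention(video_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(video_dim)

    def forward(self, video_latents, wireless_feats):
        # video_latents:  (B, N_video_tokens, video_dim)
        # wireless_feats: (B, N_wireless_tokens, wireless_dim)
        w = self.wireless_proj(wireless_feats)
        fused, _ = self.cross_attn(query=video_latents, key=w, value=w)
        # Residual connection keeps the vision-only pathway intact when the
        # wireless signal carries little extra information.
        return self.norm(video_latents + fused)


# Example usage with dummy tensors.
fusion = WirelessVideoFusion()
video = torch.randn(2, 256, 512)    # e.g., 16x16 latent patches per frame
wireless = torch.randn(2, 64, 128)  # e.g., 64 radar/CSI tokens
out = fusion(video, wireless)       # (2, 256, 512)
```

Cross-attention with a residual connection is one common way to condition a generative video model on side information; concatenation or adaptive normalization layers would be equally plausible choices here.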


Low-light Road Condition w/ LiDAR & Radar

Original | Vista | Ours
[Animated video comparisons for four low-light driving sequences, each showing the original clip, the Vista baseline, and Ours. FVD of each generated clip (lower is better):]

Sequence   Vista FVD   Ours FVD
1          296         255
2          294         267
3          624         505
4          134          82
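FVD is the Fréchet Video Distance: it fits a Gaussian to features of real clips and another to features of generated clips, then measures the distance between the two distributions, so lower values mean the generated videos are statistically closer to the real ones. Below is a minimal sketch of that distance, assuming per-clip features have already been extracted with a pretrained video encoder (FVD conventionally uses an I3D network; the exact evaluation protocol is not stated on this page).

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between two sets of per-clip video features.

    feats_real, feats_gen: (N, D) arrays of features from a pretrained
    video encoder (feature extraction is omitted here).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the product of the covariances.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))


# Toy usage with random features (a real FVD run would use I3D features).
real = np.random.randn(256, 64)
gen = np.random.randn(256, 64) + 0.5
print(frechet_distance(real, gen))
```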

Through-wall Human Movement w/ WiFi

Original | SVD | Ours
[Animated video comparisons for three through-wall human-movement sequences, each showing the original clip, the SVD baseline, and Ours.]

Walking in Low-visibility w/ Infrared

Original (Infrared) | SVD | Ours
[Animated video comparison for one low-visibility walking sequence, showing the original infrared clip, the SVD baseline, and Ours.]

Temperature-aware Action w/ Infrared

Original (RGB) | Original (Infrared) | SVD | Ours
[Animated video comparisons for three temperature-aware action sequences, each showing the original RGB clip, the original infrared clip, the SVD baseline, and Ours.]