We introduce a novel multi-modal world model that integrates wireless sensing with video input to enhance environmental understanding and predictive capabilities. Unlike traditional world models that rely primarily on visual data, our approach harnesses the complementary strengths of wireless sensing to detect and interpret spatial and contextual information beyond the camera's field of view and under challenging visibility conditions. Furthermore, our model captures physical properties that are invisible to RGB cameras yet significantly influence future frames. By fusing data from the wireless and video modalities, our model achieves a richer and more comprehensive representation of the environment, leading to improved accuracy in predicting object dynamics and spatial relationships.
We evaluate our multi-modal world model across diverse scenarios and demonstrate consistent improvements over vision-only baselines such as Vista and SVD, including lower FVD in the comparisons shown below. This work opens new possibilities for applications ranging from autonomous navigation to human-robot interaction, establishing a foundation for future multi-sensory AI systems.
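To make the fusion idea above concrete, the following is a minimal sketch, not the architecture used in this work: it assumes hypothetical per-modality encoders have already produced video latent tokens and wireless feature tokens, and fuses them with cross-attention before predicting future video latents. All module names, dimensions, and design choices here are illustrative assumptions.

```python
# Hypothetical sketch of wireless + video fusion for future-frame prediction.
# Module names, dimensions, and the cross-attention design are illustrative
# assumptions, not the architecture described in this work.
import torch
import torch.nn as nn

class WirelessVideoFusion(nn.Module):
    def __init__(self, video_dim=768, wireless_dim=256, d_model=512, n_heads=8):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)        # project video tokens
        self.wireless_proj = nn.Linear(wireless_dim, d_model)  # project wireless tokens
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.predictor = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(d_model, video_dim)  # predict next-step video latents

    def forward(self, video_tokens, wireless_tokens):
        # video_tokens:    (B, T_v, video_dim)   latents from a video encoder
        # wireless_tokens: (B, T_w, wireless_dim) features from a wireless encoder
        v = self.video_proj(video_tokens)
        w = self.wireless_proj(wireless_tokens)
        # Let video tokens attend to wireless context (out-of-view / low-visibility cues).
        fused, _ = self.cross_attn(query=v, key=w, value=w)
        fused = self.predictor(v + fused)
        return self.head(fused)  # predicted future video latents

if __name__ == "__main__":
    model = WirelessVideoFusion()
    video = torch.randn(2, 16, 768)     # e.g. 16 video latent tokens per clip
    wireless = torch.randn(2, 32, 256)  # e.g. 32 wireless feature tokens per clip
    print(model(video, wireless).shape)  # -> torch.Size([2, 16, 768])
```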
FVD comparison against Vista on four example sequences (lower is better; the original reference videos are omitted here):

Sequence | Vista | Ours |
---|---|---|
1 | 296 | 255 |
2 | 294 | 267 |
3 | 624 | 505 |
4 | 134 | 82 |
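The scores above are Fréchet Video Distances. As a reminder of how such a score is computed, here is a minimal sketch of the Fréchet distance between Gaussians fit to real and generated video embeddings; it assumes the feature extractor (typically an I3D network for FVD) is given and its outputs are already available, and is not the exact evaluation pipeline used here.

```python
# Minimal sketch of the Frechet distance used in FVD, assuming feature
# embeddings (e.g. from an I3D network) are already extracted for the
# real and generated video clips. Mirrors the standard FID/FVD formula.
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """real_feats, gen_feats: (num_clips, feature_dim) arrays of video embeddings."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 * (cov_r cov_g)^{1/2})
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(128, 64))           # placeholder "real" features
    gen = rng.normal(loc=0.1, size=(128, 64))   # placeholder "generated" features
    print(round(frechet_distance(real, gen), 2))
```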
Qualitative video comparisons:
Original | SVD | Ours
Original (Infrared) | SVD | Ours
Original (RGB) | Original (Infrared) | SVD | Ours