Project: #IITM-251101-191
Multimodal AI for Autonomous Navigation: Integrating Vision and Language Models
The field of autonomous navigation has traditionally relied on sensor fusion from modalities such as LiDAR, IMU, GPS, and RGB cameras to interpret and navigate through physical environments. While these approaches have achieved remarkable success in structured and controlled settings, they often struggle in complex, dynamic, and human-centric environments where semantic understanding and contextual reasoning are critical. Recent breakthroughs in multimodal artificial intelligence, particularly the integration of vision-language models (VLMs), offer promising new avenues to enhance perception, reasoning, and decision-making capabilities in autonomous systems.
This project proposes to investigate the integration of VLMs with conventional sensor-based navigation systems to develop robust, context-aware, and semantically informed navigation strategies for autonomous mobile robots. By combining the spatial awareness provided by vision and LiDAR with the high-level semantic abstraction of large language models, this research aims to bridge the gap between raw sensory data and actionable navigation decisions. Specifically, the project will explore three core research directions: (1) semantic scene understanding using pre-trained or fine-tuned VLMs; (2) natural language-guided navigation that allows robots to follow human instructions or queries; and (3) motion planning enhanced by multimodal reasoning to infer affordances, predict human intent, and adaptively re-plan in real time.
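As a concrete illustration of direction (1), the following minimal sketch shows how a pre-trained VLM (here, CLIP accessed through the Hugging Face transformers library) could assign zero-shot semantic labels to a single camera frame by scoring it against a small set of navigation-relevant scene descriptions. The checkpoint, candidate labels, and image path are placeholders chosen for illustration; the project may instead use fine-tuned weights or a more recent vision-language backbone.

    # Zero-shot semantic scene labeling of one camera frame with a pre-trained VLM.
    # Checkpoint, candidate labels, and image path are illustrative placeholders.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    model.eval()

    # Scene hypotheses a navigation stack might need to distinguish.
    labels = [
        "an empty corridor",
        "a corridor crowded with people",
        "a doorway",
        "a staircase",
        "an outdoor sidewalk with pedestrians",
    ]

    image = Image.open("camera_frame.png")  # placeholder: one RGB frame from the robot

    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Image-text similarity, normalized into a distribution over the candidate labels.
    probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
    for label, p in zip(labels, probs.tolist()):
        print(f"{p:.2f}  {label}")

The resulting label distribution could serve as the kind of semantic context signal that the fusion and planning components outlined below are expected to consume.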
The novelty of this research lies in its systematic integration of multimodal AI—especially foundation models—with traditional robotic perception and planning pipelines. While prior work has examined these modalities in isolation, a unified framework that leverages their complementary strengths for real-world navigation remains an open challenge. This project will also fill a significant research gap by evaluating these methods across diverse environments, including indoor service robotics and outdoor pedestrian-rich settings, where ambiguity, occlusion, and dynamic obstacles are prevalent.
The main objectives of this project are:
1. To design a multimodal architecture that fuses sensor data with vision-language representations for enhanced environmental understanding (a minimal sketch of one possible fusion head follows this list).
2. To develop and evaluate navigation policies guided by natural language instructions.
3. To incorporate reasoning-based motion planning that leverages both low-level sensory cues and high-level semantic goals.
4. To benchmark the system’s performance in simulation and real-world environments, focusing on robustness, adaptability, and human-robot interaction.
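Objective 1 can be made more concrete with a small sketch of a plausible late-fusion policy head: a frozen VLM is assumed to supply image and instruction embeddings, the LiDAR scan enters as a flattened range vector, and the fused representation scores a small discrete action set. All dimensions, the action vocabulary, and the late-fusion design are assumptions made for illustration only, not commitments about the final architecture.

    # Illustrative late-fusion policy head for objective 1 (assumed dimensions and action set).
    import torch
    import torch.nn as nn

    class MultimodalNavPolicy(nn.Module):
        """Projects each modality into a shared space, concatenates the results,
        and scores a small set of discrete navigation actions. A sketch only."""

        def __init__(self, lidar_dim=720, image_dim=512, text_dim=512,
                     hidden_dim=256, num_actions=5):
            super().__init__()
            self.lidar_proj = nn.Sequential(nn.Linear(lidar_dim, hidden_dim), nn.ReLU())
            self.image_proj = nn.Linear(image_dim, hidden_dim)  # frozen VLM image embedding
            self.text_proj = nn.Linear(text_dim, hidden_dim)    # frozen VLM instruction embedding
            self.policy = nn.Sequential(
                nn.Linear(3 * hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_actions),             # e.g. stop/forward/left/right/back
            )

        def forward(self, lidar_scan, image_emb, instruction_emb):
            # Late fusion: per-modality projections are concatenated before the policy MLP.
            fused = torch.cat(
                [self.lidar_proj(lidar_scan),
                 self.image_proj(image_emb),
                 self.text_proj(instruction_emb)],
                dim=-1,
            )
            return self.policy(fused)

    # Forward pass with random stand-in tensors (batch size 1), just to show the interface.
    policy = MultimodalNavPolicy()
    action_logits = policy(torch.randn(1, 720), torch.randn(1, 512), torch.randn(1, 512))

Keeping the VLM encoders frozen makes it straightforward to swap foundation models during the benchmarking phase (objective 4); token-level or cross-attention fusion are natural alternatives to compare against this baseline.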
By the end of this project, we aim to contribute foundational insights and tools toward building autonomous systems that can navigate safely, intelligently, and cooperatively in environments shared with humans. The outcomes will have direct implications for a range of applications including assistive robotics, urban mobility, logistics, and disaster response.