DialNav: Embodied Dialog for Navigation with a Remote Guide

ICCV 2025
Leekyeung Han1 Hyunji Min1 Gyeom Hwangbo2 Jonghyun Choi3 Paul Hongsuck Seo1
1Korea University, 2University of Seoul, 3Seoul National University

Abstract

We introduce DialNav, a novel dialog-based collaborative navigation task in which an embodied agent (Navigator) and a remote guide (Guide) engage in multi-turn dialog to reach a goal location. Unlike prior work, DialNav aims for holistic evaluation and requires the Guide to infer the Navigator's location, making communication essential for task success. To support this task, we collect and release the Remote Assistance in Navigation (RAIN) dataset, which pairs human-human dialogs with navigation trajectories in photorealistic environments. We design a comprehensive benchmark evaluating both navigation and dialog, and conduct extensive experiments analyzing the impact of different Navigator and Guide models. We highlight key challenges and publicly release the dataset, code, and evaluation framework to foster advancements in dialog-based embodied AI.

Method

Navigator is tasked with reaching the goal region R (yellow circle in the map) from the initial node b based on an ambiguous initial instruction I, which provides only a hint about R (e.g., "The goal room contains a carpet"). During navigation, Navigator can ask questions to obtain additional guidance (orange text boxes). Guide has knowledge of the environment but lacks information about Navigator's location. For successful navigation, Guide must estimate Navigator's location through dialog D and provide directions from the estimated position to the goal region (blue text boxes). Note that each QA pair in D is mapped to a node in the navigation trajectory T (red and purple nodes connected to dialog turns by dotted arrows).
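The episode layout described above can be sketched as a simple record structure, where each QA pair in the dialog D carries the trajectory node at which it occurred. This is an illustrative layout only; the field names are hypothetical and do not reflect the released RAIN schema.

```python
from dataclasses import dataclass, field

@dataclass
class DialogTurn:
    question: str   # Navigator's question (orange text box)
    answer: str     # Guide's response (blue text box)
    node_id: str    # node in trajectory T where the question was asked

@dataclass
class Episode:
    instruction: str                 # ambiguous initial instruction I
    start_node: str                  # initial node b
    goal_region: set                 # node ids forming the goal region R
    trajectory: list = field(default_factory=list)  # visited node ids (T)
    dialog: list = field(default_factory=list)      # DialogTurn objects (D)
```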

DialNav Task

DialNav Task Description. In DialNav, Navigator starts with an ambiguous instruction and gathers additional information through natural language dialog with Guide. Navigator must determine when to seek assistance and generate relevant questions. Guide, upon receiving a query, infers Navigator's location before providing a response, incentivizing Navigator to ask informative questions. Dialog is initiated by Navigator and follows an alternating turn-taking format, with no limit on QA turns.
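The alternating turn-taking protocol above can be summarized as a minimal interaction loop. The `Navigator`, `Guide`, and environment interfaces below are hypothetical sketches for illustration, not the released code: the Navigator decides when to ask, the Guide localizes from the dialog history before responding, and the episode ends when the Navigator stops.

```python
def run_episode(navigator, guide, env, instruction, max_steps=100):
    """Sketch of one DialNav episode under assumed interfaces."""
    dialog = []                    # alternating (question, answer) turns
    trajectory = [env.start_node]  # nodes visited so far
    for _ in range(max_steps):
        if navigator.wants_help(instruction, dialog, env.observation()):
            question = navigator.ask(instruction, dialog, env.observation())
            # Guide first estimates the Navigator's node from dialog alone,
            # then gives directions from that estimate toward the goal region.
            est_node = guide.localize(dialog + [(question, None)])
            answer = guide.respond(question, est_node, env.goal_region)
            dialog.append((question, answer))
        action = navigator.act(instruction, dialog, env.observation())
        if action == "STOP":
            break
        trajectory.append(env.step(action))
    success = trajectory[-1] in env.goal_region
    return dialog, trajectory, success
```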

RAIN Dataset

Dialog characteristics in the RAIN dataset. We manually analyzed 100 randomly sampled episodes and present the dialog characteristics observed in RAIN, along with an example of each. The Init. and Subs. columns indicate the frequencies of these characteristics in the first and subsequent dialog turns, respectively.

Qualitative Results

A qualitative DialNav episode with the initial instruction: 'The goal room contains a bookcase.' The 3D reconstructed map at the top of each turn shows the same Matterport 3D environment from different angles, with ellipses representing nodes in Navigator's full path. A red node indicates where Navigator posed a question at the current turn, yellow nodes mark the goal region, and gray and white nodes represent unvisited and visited nodes, respectively, up to the current turn. Navigator formulates questions (orange boxes) with detailed descriptions of surrounding objects (highlighted with green circles in the 3D maps and corresponding images on the side), such as 'white bathtub' in the first turn, or 'long glass table' in the second turn. Guide provides visual hints (blue circles), like 'white sofa' in the second turn, along with path guidance (blue boxes) and instructions on where to stop (e.g., 'that's our goal room'), which are crucial for achieving high SR. Through this dialog, Navigator successfully reaches the goal region.

BibTeX

@article{han2025dialnav,
  title={DialNav: Multi-turn Dialog Navigation with a Remote Guide},
  author={Han, Leekyeung and Min, Hyunji and Hwangbo, Gyeom and Choi, Jonghyun and Seo, Paul Hongsuck},
  journal={arXiv preprint arXiv:2509.12894},
  year={2025}
}