SPIN: Simultaneous Perception,
Interaction and Navigation

Shagun Uppal    Ananye Agarwal    Haoyu Xiong    Kenneth Shaw    Deepak Pathak   

CVPR 2024


While there has been remarkable progress recently in the fields of manipulation and locomotion, mobile manipulation remains a long-standing challenge. Compared to locomotion or static manipulation, a mobile system must make a diverse range of long-horizon tasks feasible in unstructured and dynamic environments. While the applications are broad and interesting, there are a plethora of challenges in developing these systems such as coordination between the base and arm, reliance on onboard perception for perceiving and interacting with the environment and most importantly, simultaneously integrating all these parts together. Prior works approach the problem using disentangled modular skills for mobility and manipulation that are trivially tied together. This causes several limitations such as compounding errors, delays in decision-making and no whole-body coordination. In this work, we present a reactive mobile manipulation framework that uses an active visual system to consciously perceive and react to its environment. Similar to how humans leverage whole-body and hand-eye coordination, we develop a mobile manipulator that exploits its ability to move and see, more specifically -- to move in order to see and to see in order to move. This allows it to not only move around and interact with its environment but also, choose when to perceive what using an active visual system. We observe that such an agent learns to navigate around complex cluttered scenarios while displaying agile whole-body coordination using only ego-vision without needing to create environment maps.

Key Features

Image 1

Learn, Adapt and React on the move

Image 2

Optimizes action-perception jointly

Image 3

Trained with large-scale randomization

All videos play at 2x speed
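To make "trained with large-scale randomization" concrete, here is a minimal illustrative sketch (not SPIN's actual training code): each simulated episode samples scene and dynamics parameters, so the learned policy cannot overfit to one environment and transfers more robustly to the real robot. All parameter names and ranges below are made-up examples.

```python
import random

def sample_episode_config(rng: random.Random) -> dict:
    """Sample one randomized training episode; ranges are illustrative, not SPIN's."""
    return {
        "num_obstacles": rng.randint(0, 12),          # clutter density
        "obstacle_speed": rng.uniform(0.0, 1.0),      # m/s, static to dynamic
        "object_mass": rng.uniform(0.05, 1.5),        # kg, cup to heavy object
        "friction": rng.uniform(0.4, 1.2),            # floor friction coefficient
        "camera_latency": rng.uniform(0.0, 0.1),      # s, sensing delay
        "initial_base_pose": (                        # random start pose (x, y, yaw)
            rng.uniform(-2.0, 2.0),
            rng.uniform(-2.0, 2.0),
            rng.uniform(-3.14, 3.14),
        ),
    }

rng = random.Random(0)
configs = [sample_episode_config(rng) for _ in range(3)]
```

A fresh config would be drawn at every episode reset, so over millions of simulated steps the policy sees a broad distribution of scenes rather than a single fixed one.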

Mobile Manipulation with Hand-Eye Coordination

SPIN looks around to find the desired object using its actuated camera. Once the target object is located, the robot navigates toward it to pick it up. Displaying hand-eye coordination, it adjusts its base and arm to get a better grasp, keeping the object within its field of view even while moving.


Emergent Whole-Body and Dynamic Obstacle Avoidance

We observe emergent whole-body coordination for obstacle avoidance where the robot learns to move its arm to navigate across floating and dynamic obstacles efficiently without re-routing or re-planning base movement. Without any mapping or planning, active perception allows the robot to pan its head camera in order to aggregate useful information about its environment and dynamically adapt according to it.
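The key structural idea above, as we understand it, is that one policy controls base, arm, and camera head simultaneously from ego observations, instead of separate navigation, manipulation, and gaze modules stitched together. A minimal sketch of such a whole-body policy interface is below; the observation size, action split, and the random linear "network" are all assumptions for illustration, not SPIN's actual architecture.

```python
import numpy as np

OBS_DIM = 64   # assumed: flattened ego-vision features + proprioception
ACT_DIM = 10   # assumed split: 2 base + 6 arm joints + 2 camera (pan, tilt)

rng = np.random.default_rng(0)
W = rng.standard_normal((ACT_DIM, OBS_DIM)) * 0.01  # stand-in for a trained network

def whole_body_policy(obs: np.ndarray) -> dict:
    """Map one ego observation to simultaneous base, arm, and camera actions."""
    a = np.tanh(W @ obs)              # bounded joint action vector in [-1, 1]
    return {
        "base": a[0:2],               # forward velocity, yaw rate
        "arm": a[2:8],                # arm joint commands
        "camera": a[8:10],            # pan/tilt of the actuated head
    }

obs = rng.standard_normal(OBS_DIM)
act = whole_body_policy(obs)
```

Because the camera pan/tilt is just another dimension of the same action vector, "deciding where to look" is optimized jointly with "deciding how to move", which is what allows gaze behavior like panning toward upcoming obstacles to emerge without a separate planner.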

SPIN cleans up the table

SPIN can repeatedly fetch different objects and clean up a messy table. With an actuated camera head, whole-body coordination, and robustness to random initializations, SPIN seamlessly performs manipulation several times in one go.

Robustness to Adversarial Scenarios

The policy shows robustness to a variety of adversarial conditions. In the first video, we see an adversarial case of dynamic obstacle avoidance where humans actively try to block its path several times, but the robot continues to turn and re-route continuously.

In the second video, the robot navigates through dimly-lit clutter in an outdoor environment. Even on the rough, bumpy ground, which causes inaccuracies in the wheel odometry and jittery motion, the robot adjusts its movement.

Fetching a diverse set of objects

SPIN can fetch a diverse set of objects, ranging from rigid to deformable, with different shapes, sizes, and masses, such as a plastic cup, a fruit, or a soft toy.


@article{uppal2024spin,
      author    = {Uppal, Shagun and Agarwal, Ananye and Xiong, Haoyu and Shaw, Kenny and Pathak, Deepak},
      title     = {SPIN: Simultaneous Perception, Interaction and Navigation},
      journal   = {CVPR},
      year      = {2024},
}


We thank Jared Mejia and Mihir Prabhudesai for helping with stress-testing in real-world experiments. We are also grateful to Zackory Erickson and the Hello Robot team for their support with the robot hardware. This work was supported in part by grants including ONR N00014-22-1-2096, AFOSR FA9550-23-1-0747, and the Google research award to DP.