Vision-Guided Quadrupedal Locomotion in the Wild with Multi-Modal Delay Randomization

¹UC San Diego, ²Tsinghua University

IROS 2022



Developing robust vision-guided controllers for quadrupedal robots in complex environments with various obstacles, dynamic surroundings, and uneven terrain is very challenging. While Reinforcement Learning (RL) provides a promising paradigm for learning agile locomotion skills with vision inputs in simulation, deploying the resulting RL policy in the real world remains difficult. Our key insight is that, aside from the domain gap in observations between simulation and the real world, latency in the control pipeline is also a major cause of this difficulty. In this paper, we propose Multi-Modal Delay Randomization (MMDR) to address this issue when training RL agents. Specifically, we randomize in time the selection of both the proprioceptive state and the visual observations, aiming to simulate the latency of the real-world control system. We train the RL policy for end-to-end control in a physics simulator, and it can be directly deployed on a real A1 quadruped robot running in the wild. We evaluate our method in different outdoor environments with complex terrain and obstacles. We show that the robot can maneuver smoothly at high speed and avoid obstacles, achieving significant improvement over the baselines.

Depth map and corresponding video

Our method works in diverse cases


We leverage end-to-end RL to control the robot. We use separate encoders for the multi-modal inputs to obtain domain-specific features, and an MLP to produce the value or action distribution from the concatenated features.
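The encoder-then-fuse layout described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `FusionPolicy`, the helpers `make_dense` and `dense`, the layer sizes, and the use of single dense layers as encoders are all hypothetical stand-ins, and the output is treated as the mean of an action distribution. The encoders actually used for the depth maps and proprioceptive state are not specified here.

```python
import random


def make_dense(in_dim, out_dim, rng):
    """Create a randomly initialised dense layer as (weights, biases)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def dense(layer, x, relu=True):
    """Apply a dense layer to vector x, optionally followed by ReLU."""
    weights, biases = layer
    y = [sum(w * v for w, v in zip(row, x)) + b
         for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in y] if relu else y


class FusionPolicy:
    """Hypothetical sketch: one encoder per modality, then an MLP head."""

    def __init__(self, proprio_dim, visual_dim, action_dim, hidden=16, seed=0):
        rng = random.Random(seed)
        # Separate encoders yield domain-specific features (assumed dense here).
        self.proprio_enc = make_dense(proprio_dim, hidden, rng)
        self.visual_enc = make_dense(visual_dim, hidden, rng)
        # The MLP head consumes the concatenated features.
        self.head_hidden = make_dense(2 * hidden, hidden, rng)
        self.head_out = make_dense(hidden, action_dim, rng)

    def forward(self, proprio, depth_flat):
        """Return an action-mean vector from both input modalities."""
        features = (dense(self.proprio_enc, proprio)
                    + dense(self.visual_enc, depth_flat))  # list concatenation
        hidden = dense(self.head_hidden, features)
        return dense(self.head_out, hidden, relu=False)
```

A value head would follow the same pattern with `action_dim=1`; only the final layer's width changes.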

The sim-to-real challenge posed by policy execution latency on the real robot can be addressed by randomizing the modeled latencies in simulation during training. We propose Multi-Modal Delay Randomization (MMDR) to address this latency when deploying learned multi-modal policies. To simulate proprioceptive latency, we sample a proprioceptive delay and use linear interpolation to compute the delayed observation from the two adjacent states in the buffer that bracket the sampled delay. To simulate perception latency for visual observations, which arrive at a lower frequency, we render the simulated visual observation at every control step and store it in a visual observation buffer that holds the visual observations from the near past. As illustrated in the following figure, we store the most recent 4k depth maps as our visual observation buffer, split the whole buffer into four sub-buffers, and then sample one depth map from each sub-buffer to create the visual input with randomized latency.
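The two sampling steps described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function names `delayed_proprio` and `sample_visual_input`, the oldest-first buffer layout, and the fixed spacing `dt` between stored states are assumptions made for the example, and the buffer length is taken to be any positive multiple of the number of sub-buffers.

```python
import random
from collections import deque


def delayed_proprio(history, dt, delay):
    """Interpolate the proprioceptive state as it was `delay` seconds ago.

    history: state vectors spaced `dt` seconds apart, oldest first, so
    history[-1] is the current state (assumed layout). Needs >= 2 entries.
    """
    # Express the delay in fractional control steps, clamped to the buffer.
    steps = min(max(delay / dt, 0.0), len(history) - 1)
    whole = min(int(steps), len(history) - 2)   # full steps back in time
    frac = steps - whole                        # weight of the older neighbour
    newer = history[-1 - whole]
    older = history[-2 - whole]
    return [(1.0 - frac) * n + frac * o for n, o in zip(newer, older)]


def sample_visual_input(depth_buffer, num_sub_buffers=4, rng=random):
    """Pick one depth map from each equal-sized sub-buffer, oldest first."""
    frames = list(depth_buffer)
    size, remainder = divmod(len(frames), num_sub_buffers)
    if size == 0 or remainder != 0:
        raise ValueError("buffer length must be a positive multiple of "
                         "num_sub_buffers")
    return [frames[rng.randrange(i * size, (i + 1) * size)]
            for i in range(num_sub_buffers)]


# Usage: a rolling buffer that keeps only the most recent 12 frames.
depth_buffer = deque(maxlen=12)
for step in range(20):
    depth_buffer.append(step)        # stand-in for a rendered depth map
visual_input = sample_visual_input(depth_buffer, rng=random.Random(0))
```

In training, a fresh proprioceptive delay would be drawn before calling `delayed_proprio`; the distribution it is drawn from is left to the caller here because it is not specified in the text above.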


Diverse locomotion skills (0.5x real time)







Video panels labeled No Delay; scenes: Dense Box, Moving Box


 title={Vision-Guided Quadrupedal Locomotion in the Wild with Multi-Modal Delay Randomization}, 
 author={Chieko Imai and Minghao Zhang and Yuchen Zhang and Marcin Kierebinski and Ruihan Yang and Yuzhe Qin and Xiaolong Wang}, 
 booktitle={2022 IEEE/RSJ international conference on intelligent robots and systems (IROS)},