Self-Supervised Monocular Depth Estimation for Outdoor Scenes Using Dynamic Self-Distillation and Asymptotic Bidirectional Feature Aggregation

24 Pages Posted: 7 May 2025


Xiaogang Song

affiliation not provided to SSRN

Jian Liu

affiliation not provided to SSRN

Qin Zhao

affiliation not provided to SSRN

Xinwei Guo

affiliation not provided to SSRN

Bingxing Wei

affiliation not provided to SSRN

Xinhong Hei

affiliation not provided to SSRN

Abstract

Monocular depth estimation plays a crucial role in computer vision, aiming to infer scene depth from a single RGB image. However, acquiring large-scale, high-quality ground-truth depth data is both expensive and time-consuming. Self-supervised methods leverage the structural and geometric information inherent in images, eliminating the reliance on large numbers of ground-truth depth labels. In this paper, we propose DistilDepth, a novel self-supervised monocular depth estimation method that improves depth prediction through two novel ideas. First, we introduce dynamic self-distillation (DSD) to provide additional supervision during training. Specifically, we construct a homogeneous teacher-student framework in which the teacher model guides the student model, while the contribution of self-distillation is dynamically adjusted based on the photometric loss from self-supervised learning. Additionally, we employ a high-error mask that gradually focuses distillation on the more challenging regions identified during self-supervised learning. Second, we design the asymptotic bidirectional feature aggregation network (ABFANet), which achieves complementary enhancement between multi-scale features through progressive information interaction. In detail, we first fuse features between adjacent layers, and then gradually introduce features from non-adjacent layers, progressively enriching the semantic and detail information of the multi-scale features. Experimental results on the KITTI and Cityscapes datasets demonstrate the state-of-the-art performance of our method. In addition, the robust generalization ability of our method is validated on the Cityscapes and Make3D datasets.
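The dynamic self-distillation idea described above can be sketched in code. The following is a minimal, hypothetical NumPy illustration (not the authors' implementation; the function name, the percentile-based high-error mask, and the photometric-loss normalization are all assumptions made for illustration): the distillation term is weighted per pixel by the photometric loss, and restricted to high-error regions.

```python
import numpy as np

def dynamic_self_distillation_loss(photo_loss, teacher_depth, student_depth,
                                   error_percentile=80.0):
    """Hypothetical sketch of dynamic self-distillation (DSD).

    photo_loss:    per-pixel photometric loss from self-supervision, shape (H, W)
    teacher_depth: depth map predicted by the teacher model, shape (H, W)
    student_depth: depth map predicted by the student model, shape (H, W)
    """
    # High-error mask: keep only the hardest pixels, i.e. those whose
    # photometric loss exceeds a percentile threshold.
    threshold = np.percentile(photo_loss, error_percentile)
    high_error_mask = (photo_loss > threshold).astype(photo_loss.dtype)

    # Distillation term: the student mimics the teacher's prediction.
    distill = np.abs(student_depth - teacher_depth)

    # Dynamic weight: scale the distillation contribution by the normalized
    # photometric loss, so guidance is stronger where self-supervision struggles.
    weight = photo_loss / (photo_loss.mean() + 1e-8)

    masked = weight * distill * high_error_mask
    return masked.sum() / max(high_error_mask.sum(), 1.0)
```

In a training loop, this term would be added to the usual self-supervised photometric and smoothness losses, with the teacher's predictions detached from the gradient computation.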

Keywords: Monocular depth estimation, self-supervised learning, dynamic self-distillation, multi-scale feature aggregation

Suggested Citation

Song, Xiaogang and Liu, Jian and Zhao, Qin and Guo, Xinwei and Wei, Bingxing and Hei, Xinhong, Self-Supervised Monocular Depth Estimation for Outdoor Scenes Using Dynamic Self-Distillation and Asymptotic Bidirectional Feature Aggregation. Available at SSRN: https://ssrn.com/abstract=5245680 or http://dx.doi.org/10.2139/ssrn.5245680

Xiaogang Song (Contact Author)

