Self-Supervised Monocular Depth Estimation for Outdoor Scenes Using Dynamic Self-Distillation and Asymptotic Bidirectional Feature Aggregation

24 Pages Posted: 7 May 2025


Xiaogang Song

affiliation not provided to SSRN

Jian Liu

affiliation not provided to SSRN

Qin Zhao

affiliation not provided to SSRN

Xinwei Guo

affiliation not provided to SSRN

Bingxing Wei

affiliation not provided to SSRN

Xinhong Hei

affiliation not provided to SSRN

Abstract

Monocular depth estimation plays a crucial role in computer vision, aiming to infer scene depth from a single RGB image. However, acquiring large-scale, high-quality ground-truth depth data is both expensive and time-consuming. Self-supervised methods leverage the structural and geometric information inherent in images, eliminating the reliance on large numbers of ground-truth depth labels. In this paper, we propose DistilDepth, a novel self-supervised monocular depth estimation method that improves depth prediction through two novel ideas. First, we introduce dynamic self-distillation (DSD) to provide additional supervision during training. Specifically, we construct a homogeneous teacher-student framework in which the teacher model guides the student model, while the contribution of self-distillation is dynamically adjusted based on the photometric loss from self-supervised learning. Additionally, we employ a high-error mask that gradually focuses distillation on the more challenging regions identified during self-supervised learning. Second, we design the asymptotic bidirectional feature aggregation network (ABFANet), which achieves complementary enhancement between multi-scale features through progressive information interaction. In detail, we first fuse features between adjacent layers, and then gradually introduce features from non-adjacent layers, progressively enriching the semantic and detail information of the multi-scale features. Experimental results on the KITTI and Cityscapes datasets demonstrate the state-of-the-art performance of our method. In addition, the robust generalization ability of our method is validated on the Cityscapes and Make3D datasets.
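The dynamic self-distillation idea described above can be sketched in code. The following is a minimal, hypothetical NumPy illustration (not the authors' implementation; the function name, the percentile-based high-error mask, and the photometric-loss normalization are all assumptions made for illustration): the distillation term is weighted per pixel by the photometric loss, and restricted to high-error regions.

```python
import numpy as np

def dynamic_self_distillation_loss(photo_loss, teacher_depth, student_depth,
                                   error_percentile=80.0):
    """Hypothetical sketch of dynamic self-distillation (DSD).

    photo_loss:    per-pixel photometric loss from self-supervision, shape (H, W)
    teacher_depth: depth map predicted by the teacher model, shape (H, W)
    student_depth: depth map predicted by the student model, shape (H, W)
    """
    # High-error mask: keep only the hardest pixels, i.e. those whose
    # photometric loss exceeds a percentile threshold.
    threshold = np.percentile(photo_loss, error_percentile)
    high_error_mask = (photo_loss > threshold).astype(photo_loss.dtype)

    # Distillation term: the student mimics the teacher's prediction.
    distill = np.abs(student_depth - teacher_depth)

    # Dynamic weight: scale the distillation contribution by the normalized
    # photometric loss, so guidance is stronger where self-supervision struggles.
    weight = photo_loss / (photo_loss.mean() + 1e-8)

    masked = weight * distill * high_error_mask
    return masked.sum() / max(high_error_mask.sum(), 1.0)
```

In a training loop, this term would be added to the usual self-supervised photometric and smoothness losses, with the teacher's predictions detached from the gradient computation.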

Keywords: Monocular depth estimation, self-supervised learning, dynamic self-distillation, multi-scale feature aggregation

Suggested Citation

Song, Xiaogang and Liu, Jian and Zhao, Qin and Guo, Xinwei and Wei, Bingxing and Hei, Xinhong, Self-Supervised Monocular Depth Estimation for Outdoor Scenes Using Dynamic Self-Distillation and Asymptotic Bidirectional Feature Aggregation. Available at SSRN: https://ssrn.com/abstract=5245680 or http://dx.doi.org/10.2139/ssrn.5245680

Xiaogang Song (Contact Author)

