Transforming Face Sketches into Realistic Images Using Hierarchical Attention in Swin Transformers
31 Pages. Posted: 28 Mar 2025
Abstract
Sketch-to-image translation for photorealistic face synthesis is a difficult computer vision task because sketches are abstract and carry only sparse information. This paper presents a novel solution using the SwinV2-B Transformer, pre-trained on ImageNet-22K and fine-tuned for sketch-to-image translation. Through Swin's hierarchical design and shifted-window self-attention, the model captures both the fine-grained details and the global structures essential for realistic image synthesis. The proposed system combines a meticulously crafted dataset, state-of-the-art preprocessing methods, and tailored loss functions to improve output quality. Results on the Structural Similarity Index Measure (SSIM) and Fréchet Inception Distance (FID) show notable gains in image quality, with an FID of 115.19 and an SSIM of 0.7982. Human assessments further confirm the retention of identity-critical details, making the system well suited to forensic analysis and the entertainment industry. Despite these successes, sensitivity to sketch quality and high computational requirements remain open issues. The paper concludes by considering possible applications and suggesting future research directions, such as model optimization for real-time synthesis and the combination of cross-modal inputs for richer outputs. Our code, model, and data are available at: https://github.com/MahyarHassani/Face-sketch-to-real-image-using-Swin-transformer-V2.
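As a point of reference for the SSIM figure quoted above, the sketch below shows how a per-pair SSIM score can be computed with scikit-image. The images here are random stand-ins, not the paper's evaluation data; function and variable names other than `structural_similarity` are illustrative assumptions.

```python
import numpy as np
from skimage.metrics import structural_similarity

# Hypothetical stand-ins for a ground-truth photo and a generated image;
# in practice these would be aligned grayscale (or per-channel) face images.
rng = np.random.default_rng(0)
ground_truth = rng.random((128, 128))
generated = np.clip(ground_truth + 0.05 * rng.standard_normal((128, 128)), 0.0, 1.0)

# SSIM lies in [-1, 1]; identical images score exactly 1.0.
score = structural_similarity(ground_truth, generated, data_range=1.0)
print(f"SSIM: {score:.4f}")
```

A dataset-level SSIM, as reported in the abstract, would simply average this score over all sketch-photo test pairs.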
Keywords: Face Sketch Translation, Swin Transformers, Image Synthesis, Computer Vision, Deep Learning.