Cross-Domain Multi-Style Merge for Image Captioning
8 Pages Posted: 14 Jul 2022
Multi-style image captioning has attracted wide attention recently. Existing approaches rely mainly on style synthesis within a single domain and cannot handle combinations of multiple styles, since captions in diverse styles are naturally not available in a single uniform dataset. This paper is the first to investigate cross-domain multi-style merging for image captioning. Specifically, we propose a novel image captioning model with a multi-style gated transformer block tailored to the cross-domain caption generation task. Conventional generative adversarial learning for language may suffer from the distribution distortion problem, because real datasets contain no captions with combined styles. We therefore devise a multi-stage self-learning framework that lets the proposed model gradually exploit real corpora with pseudo style labels. Comprehensive experiments and ablation studies demonstrate the effectiveness of the proposed method on multi-style merging for image captioning.
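The abstract does not specify how the gating over styles is realized; as one plausible reading, a gate could be a softmax over requested style intensities that blends style-specific feature outputs inside the transformer block. The sketch below is a hypothetical illustration of that idea (the function names and the style-logit interface are assumptions, not the paper's actual design):

```python
import math

def multi_style_gate(style_logits):
    # Hypothetical gate: softmax over the requested style-mix
    # logits, yielding one blending weight per style.
    m = max(style_logits)
    exps = [math.exp(l - m) for l in style_logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_merge(style_outputs, gates):
    # Blend style-specific feature vectors (e.g. outputs of
    # per-style branches in a transformer block) with the gates.
    dim = len(style_outputs[0])
    return [sum(g * vec[i] for g, vec in zip(gates, style_outputs))
            for i in range(dim)]

# Toy usage: three styles, 2-dim features; style 0 is requested
# more strongly than styles 1 and 2.
gates = multi_style_gate([2.0, 0.5, 0.5])
merged = gated_merge([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], gates)
```

In this reading, setting several style logits high at once is what produces a merged multi-style caption representation, which a single-domain model trained on one style at a time cannot express.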
Keywords: Computer Vision, Vision & Language, Image Captioning, Multi-Style Caption Generation