Keywords: Knowledge graph, Multimodal fusion, Image-text pairs, Pre-trained model