affiliation not provided to SSRN
Keywords: Cross-modal retrieval, text-video retrieval, video semantic compression, granularity alignment