affiliation not provided to SSRN
Object Segmentation, Vision Transformer, Multimodal Learning, Segment Anything