2 Lushan South Rd
Changsha, Hunan 410082
China
Hunan University
Spatial-Temporal Video Grounding, Cross-modal Learning, Transformer, Contrastive Learning
Weakly-supervised Learning, Spatial-Temporal Video Grounding, Single Frame Annotation, Multiple Instance Learning
Video Moment Retrieval, Supervised Fine-Tuning, Reinforcement Learning from Human Feedback, Multimodal Large Language Model