Large Language Models, Multimodal Learning, Natural Language Processing, Machine Learning, Artificial Intelligence
Video Action Recognition, Motion Vectors, CLIP Features, Multi-modal Fusion, Compressed Domain