Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs Paper • 2504.00072 • Published Mar 31, 2025
CoVR: Learning Composed Video Retrieval from Web Video Captions Paper • 2308.14746 • Published Aug 28, 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning Paper • 2302.14115 • Published Feb 27, 2023
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models Paper • 2206.08155 • Published Jun 16, 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers Paper • 2203.16434 • Published Mar 30, 2022
Just Ask: Learning to Answer Questions from Millions of Narrated Videos Paper • 2012.00451 • Published Dec 1, 2020