LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos

New York University

Abstract

In this paper we introduce LifelongMemory, a new framework for accessing long-form egocentric videographic memory through natural language question answering and retrieval. LifelongMemory generates concise video activity descriptions of the camera wearer and leverages the zero-shot capabilities of pretrained large language models to perform reasoning over long-form video context. Furthermore, LifelongMemory uses a confidence and explanation module to produce confident, high-quality, and interpretable answers. Our approach achieves state-of-the-art performance on the EgoSchema benchmark for question answering and is highly competitive on the natural language query (NLQ) challenge of Ego4D.


Pipeline

[Figure: overview of the LifelongMemory pipeline]

Stage 1: Video Captioning

A multi-modal LLM (MLLM) produces captions from a list of short video clips. Content and query similarity filters are then applied to remove redundant and irrelevant captions.
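The sketch below illustrates one way such filters could work; it is not the authors' exact code. The embedding model ("all-MiniLM-L6-v2") and both thresholds are illustrative assumptions.

# A minimal sketch of the Stage 1 filters, assuming captions have already
# been generated by an MLLM. Model choice and thresholds are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def filter_captions(captions, query, redundancy_thresh=0.9, relevance_thresh=0.2):
    """Drop near-duplicate consecutive captions, then captions unrelated to the query."""
    cap_emb = model.encode(captions, convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)

    kept = []
    for i, caption in enumerate(captions):
        # Content filter: skip a caption that nearly repeats the last kept one.
        if kept and util.cos_sim(cap_emb[i], cap_emb[kept[-1]]).item() > redundancy_thresh:
            continue
        # Query filter: skip captions with low similarity to the query.
        if util.cos_sim(cap_emb[i], query_emb).item() < relevance_thresh:
            continue
        kept.append(i)
    return [captions[i] for i in kept]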

Stage 2: LLM Reasoning

An LLM takes the condensed captions as input and retrieves the most relevant interval candidates.
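A hedged sketch of this stage is shown below, assuming the OpenAI chat API as the LLM backend; the prompt wording and model name are illustrative, not the paper's exact prompt.

# A schematic sketch of Stage 2 reasoning over condensed captions.
from openai import OpenAI

client = OpenAI()

def retrieve_intervals(captions, query, model="gpt-4"):
    """Ask the LLM to pick the caption intervals most relevant to the query."""
    # Each caption is indexed so the LLM can answer with interval indices.
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(captions))
    prompt = (
        "Below are captions of consecutive clips from an egocentric video.\n"
        f"{context}\n\n"
        f"Query: {query}\n"
        "Return the indices of the caption intervals most relevant to the query, "
        "with a confidence score (1-5) and a one-sentence explanation."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content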


Stage 3: Output Refinement

For video QA, we ensemble the predictions of multiple runs using confidence-based voting.
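A minimal sketch of confidence-based voting follows; the (answer, confidence) input format is an assumption for illustration.

# Aggregate multiple LLM runs: the answer with the highest summed
# confidence wins.
from collections import defaultdict

def vote_by_confidence(predictions):
    """predictions: list of (answer, confidence) pairs, one per LLM run."""
    scores = defaultdict(float)
    for answer, confidence in predictions:
        scores[answer] += confidence
    return max(scores, key=scores.get)

# Example: three runs disagree; the higher total confidence wins.
print(vote_by_confidence([("B", 5), ("B", 3), ("C", 4)]))  # -> "B"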


For NLQ, we feed the candidate intervals predicted in the previous stage into a pretrained NLQ model to obtain a fine-grained prediction.
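The sketch below shows schematically how candidate intervals narrow the NLQ model's search. `nlq_model` stands in for any pretrained NLQ localizer; its `predict` interface and the representation of intervals as feature-frame index pairs are hypothetical.

# Run the pretrained NLQ model only inside each LLM-proposed interval
# and keep the highest-scoring fine-grained prediction.
def refine_intervals(nlq_model, video_features, query, candidate_intervals):
    best = None
    for start, end in candidate_intervals:
        # Restrict the NLQ model to the candidate window, not the full video.
        window = video_features[start:end]
        span, score = nlq_model.predict(window, query)  # hypothetical interface
        # Offset the local prediction back to global video time.
        span = (span[0] + start, span[1] + start)
        if best is None or score > best[1]:
            best = (span, score)
    return best[0] if best else None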

Visualization

We visualize the raw predictions of the LLM below. The LLM generates high-quality results without any post-processing.

Quantitative Evaluation

Our approach achieves state-of-the-art performance on the EgoSchema benchmark for question answering and is highly competitive on the natural language query (NLQ) challenge of Ego4D.

[Tables: results on the EgoSchema benchmark and the Ego4D NLQ challenge]

We also quantitatively evaluate our proposed framework with different captioning models, LLMs, pre-processing techniques, and prompts. Please read our paper for more details.

Video Presentation

BibTeX

@misc{wang2024lifelongmemory,
      title={LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos}, 
      author={Ying Wang and Yanlai Yang and Mengye Ren},
      year={2024},
      eprint={2312.05269},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}