LifelongMemory: Leveraging LLMs for Answering Queries in Egocentric Videos

New York University

Abstract

The egocentric video natural language query (NLQ) task involves localizing a temporal window in an egocentric video that provides an answer to a posed query, with wide applications in building personalized AI assistants. Prior methods for this task have focused on improving network architectures and leveraging pre-training for stronger image and video features, but they struggle to capture long-range temporal dependencies in lengthy videos and require cumbersome end-to-end training. Motivated by recent advancements in Large Language Models (LLMs) and vision-language models, we introduce LifelongMemory, a novel framework that uses multiple pre-trained models to answer queries over extensive egocentric video content. We address the challenge of long videos by employing a pre-trained captioning model to create detailed narratives of the videos. These narratives are then used to prompt a frozen LLM to generate coarse-grained temporal window predictions, which are subsequently refined using a pre-trained NLQ model. Empirical results demonstrate that our method achieves competitive performance against existing supervised end-to-end learning methods, underlining the potential of integrating multiple pre-trained multimodal large language models in complex vision-language tasks. We provide a comprehensive analysis of key design decisions and hyperparameters in our pipeline, offering insights and practical guidelines.


Pipeline


Stage 1: Video Captioning

A multi-modal LLM (MLLM) produces captions for a list of short video clips. Content- and query-similarity filters are then applied to remove redundant and irrelevant captions, as sketched below.
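A minimal sketch of the Stage 1 filtering step, assuming captions have already been produced by an MLLM (e.g., LaViLa) for consecutive short clips. The embedding model, thresholds, and helper names below are illustrative assumptions, not the paper's exact implementation.

# Stage 1 post-processing sketch (assumptions noted above).
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf text encoder

def filter_captions(captions, query, content_thresh=0.9, query_thresh=0.2):
    """Drop near-duplicate consecutive captions, then keep captions relevant to the query."""
    # Content-similarity filter: remove captions nearly identical to the previous caption.
    kept = []
    prev_emb = None
    for idx, cap in enumerate(captions):
        emb = embedder.encode(cap, convert_to_tensor=True)
        if prev_emb is None or util.cos_sim(emb, prev_emb).item() < content_thresh:
            kept.append((idx, cap, emb))
        prev_emb = emb

    # Query-similarity filter: keep only captions with some relevance to the query.
    q_emb = embedder.encode(query, convert_to_tensor=True)
    return [(idx, cap) for idx, cap, emb in kept
            if util.cos_sim(emb, q_emb).item() >= query_thresh]

# Example usage:
# condensed = filter_captions(clip_captions, "Where did I put the keys?")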

Stage 2: LLM Reasoning

An LLM is prompted with the list of condensed captions and instructed to retrieve the most relevant candidate intervals, as in the sketch below.
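A hedged sketch of Stage 2, assuming the condensed captions from Stage 1 are indexed by clip number and that GPT-4 is queried through the OpenAI chat API. The prompt wording and the JSON output format here are illustrative; see the paper for the exact prompt design.

# Stage 2 reasoning sketch (prompt and output format are assumptions).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def predict_candidate_intervals(condensed_captions, query, model="gpt-4"):
    """Ask the LLM for coarse clip-index intervals likely to contain the answer."""
    caption_list = "\n".join(f"{idx}: {cap}" for idx, cap in condensed_captions)
    prompt = (
        "Below are captions of consecutive clips from an egocentric video, "
        "one caption per clip index.\n"
        f"{caption_list}\n\n"
        f"Query: {query}\n"
        "Return a JSON list of the most likely [start_clip, end_clip] intervals "
        "that answer the query."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

# Example usage:
# candidates = predict_candidate_intervals(condensed, "Where did I put the keys?")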


Stage 3: Output Refinement

The candidate intervals predicted in the previous stage are fed into a pretrained NLQ model to obtain a fine-grained prediction, as in the sketch below.
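A minimal sketch of Stage 3, under the assumption of a generic NLQ model interface, nlq_model.predict(video_features, query, window), that returns scored temporal segments. This interface and the clip length constant are hypothetical; NaQ++ (or any other NLQ model) has its own API.

# Stage 3 refinement sketch (NLQ model interface is hypothetical).
CLIP_SECONDS = 8.0  # assumed clip length when mapping clip indices back to timestamps

def refine_candidates(nlq_model, video_features, query, candidate_intervals):
    """Run the pretrained NLQ model inside each coarse candidate window and keep the best segment."""
    best = None
    for start_clip, end_clip in candidate_intervals:
        window = (start_clip * CLIP_SECONDS, (end_clip + 1) * CLIP_SECONDS)
        for segment in nlq_model.predict(video_features, query, window):
            # Each segment is assumed to be (t_start, t_end, score).
            if best is None or segment[2] > best[2]:
                best = segment
    return best  # fine-grained (t_start, t_end, score)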

Visualization

We visualize the raw predictions of the LLM below. The LLM generates high-quality results without any post-processing.

Quantitative Evaluation

We compare the performance of our method against two competitive methods on the Ego4D NLQ benchmark. In this experiment we use LaViLa as the captioning model, GPT-4 as the reasoning core, and NaQ++ as the NLQ model in the refinement stage. While our method does not match the current state of the art, it outperforms NaQ++ on the official test set (hosted on EvalAI), suggesting that a more powerful NLQ model can further boost performance. We also illustrate the potential of our method with a better captioning model by evaluating performance on the validation set using the ground-truth captions from Ego4D.


We also quantitatively evaluate our proposed framework with different captioning models, LLMs, pre-processing techniques, and prompts. Please refer to our paper for more details.

Video Presentation

BibTeX

@misc{wang2023lifelongmemory,
  title={LifelongMemory: Leveraging LLMs for Answering Queries in Egocentric Videos},
  author={Ying Wang and Yanlai Yang and Mengye Ren},
  year={2023},
  eprint={2312.05269},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}