STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment

1 KAIST, 2 UNC Chapel Hill, 3 NAVER AI Lab, 4 DeepAuto

*Indicates Equal Contribution

Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks in our ever-evolving world. However, this is a nontrivial problem that poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs, and multimodal correlation overwriting that forgets previously learned audio-video relations. To tackle this problem, we propose a new continual audio-video pre-training method with two novel ideas: (1) Localized Patch Importance Scoring: we introduce a multimodal encoder to determine the importance score of each patch, emphasizing semantically intertwined audio-video patches. (2) Replay-guided Correlation Assessment: to reduce the corruption of previously learned audiovisual knowledge due to drift, we assess the correlation of the current patches with past steps to identify the patches exhibiting high correlations with past data. Based on the results of these two ideas, we perform probabilistic patch selection for effective continual audio-video pre-training. Experimental validation on multiple benchmarks shows that our method achieves a 3.69%p relative performance gain in zero-shot retrieval tasks compared to strong continual learning baselines, while reducing memory consumption by ~45%.

Motivation

Emerging new audio-video semantics

Outdated pre-trained audio-video models struggle with understanding emerging new audio-video semantics.

Challenges in continual audio-video learning

Learning audio-video data with continuously changing semantic categories is a nontrivial problem due to two critical challenges: 1) sparse spatio-temporal correlation between audio-video pairs, and 2) multimodal correlation overwriting.

Challenge of multimodal correlation overwriting

During continual pre-training, the model can encounter new semantics that share key visual objects with earlier tasks, causing it to overwrite previously learned audio information and forget past audio-video relations.

Method

Our method harnesses cross-modal attention maps from the AVM module to compute importance scores that identify highly correlated audio-video patches (Localized Patch Importance Scoring). By comparing the attention maps created by the current queries with those generated by past queries, we compute correlation scores of the current patches with past data (Replay-guided Correlation Assessment). Finally, we perform probabilistic patch selection, combining the importance scores and correlation scores to select patches for continual audio-video pre-training (Multimodal Patch Selection for Continual Learning).
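
As a rough illustration, the PyTorch sketch below mirrors these three steps with hypothetical attention tensors standing in for the AVM module's outputs; the score-combination rule in select_patches is our illustrative assumption, not the exact formula from the paper.

import torch
import torch.nn.functional as F

def importance_scores(attn_cur):
    # Localized Patch Importance Scoring (sketch): aggregate the cross-modal
    # attention mass each patch receives from the current queries.
    # attn_cur: (num_heads, num_queries, num_patches)
    return attn_cur.mean(dim=(0, 1))  # -> (num_patches,)

def correlation_scores(attn_past):
    # Replay-guided Correlation Assessment (sketch): attention mass that
    # replayed past-task queries place on the current patches; a high value
    # marks a patch as strongly tied to previously learned semantics.
    return attn_past.mean(dim=(0, 1))  # -> (num_patches,)

def select_patches(imp, corr, k, temperature=0.1):
    # Probabilistic patch selection (sketch). Down-weighting importance by
    # past correlation is an illustrative assumption, not the paper's formula.
    probs = F.softmax(imp * (1.0 - corr) / temperature, dim=0)
    return torch.multinomial(probs, num_samples=k, replacement=False)

# Toy usage with random attention maps over 196 spatial patches.
attn_cur = torch.rand(8, 16, 196).softmax(dim=-1)   # (heads, queries, patches)
attn_past = torch.rand(8, 16, 196).softmax(dim=-1)
kept = select_patches(importance_scores(attn_cur),
                      correlation_scores(attn_past), k=98)
print(kept.shape)  # torch.Size([98])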

Experiment Results

Continual-VS, Continual-AS - Zero-shot audiovisual retrieval tasks

We split the VGGSound and AudioSet datasets into multiple tasks based on their high-level category information, naming the resulting benchmarks Continual-VS and Continual-AS, respectively. The table shows results of the audiovisual zero-shot retrieval task on Continual-VS and Continual-AS. Our method outperforms strong baselines, especially in R@1 score.
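
For reference, Recall@1 for audio-to-video retrieval can be computed from clip-level embeddings as in the minimal sketch below; the random embeddings are placeholders, not outputs of the actual model.

import torch
import torch.nn.functional as F

def recall_at_k(audio_emb, video_emb, k=1):
    # Fraction of audio queries whose paired video (same index) appears
    # among the top-k most similar videos.
    sim = audio_emb @ video_emb.T                     # (N, N) similarity matrix
    topk = sim.topk(k, dim=1).indices                 # (N, k)
    targets = torch.arange(sim.size(0)).unsqueeze(1)  # ground-truth indices
    return (topk == targets).any(dim=1).float().mean().item()

# Toy usage: 100 paired clips with 512-d placeholder embeddings.
a = F.normalize(torch.randn(100, 512), dim=1)
v = F.normalize(torch.randn(100, 512), dim=1)
print(recall_at_k(a, v, k=1))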

Downstream performance on various rehearsal memory sizes

We explore the influence of rehearsal memory size on audiovisual zero-shot retrieval performance. Our method consistently surpasses the other baselines, underscoring its effectiveness in adapting to diverse memory constraints.
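
A size-capped rehearsal buffer can be maintained, for example, with reservoir sampling; the sketch below is a generic illustration under that assumption, not necessarily the exact buffer policy used in our experiments.

import random

class RehearsalBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.num_seen = 0

    def add(self, sample):
        # Reservoir sampling: every streamed sample is kept with equal
        # probability capacity / num_seen.
        self.num_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            j = random.randrange(self.num_seen)
            if j < self.capacity:
                self.data[j] = sample

    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))

# Toy usage: keep at most 100 of 1,000 streamed samples.
buf = RehearsalBuffer(capacity=100)
for i in range(1000):
    buf.add(i)
print(len(buf.data), buf.sample(4))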

Audiovisual downstream tasks

We take the models continually pre-trained until the completion of the last task of Continual-VS and finetune them on various audiovisual downstream tasks. Overall, our method acquires transferable audio-video representations that lead to strong performance across diverse tasks.

Sound source localization with the AVE dataset

We perform a sound source localization task on the AVE dataset to evaluate the model's ability to detect sound sources within visual scenes. Compared to other baselines, our AVM module in STELLA stands out by precisely identifying the correct sound source.
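
Conceptually, a localization heatmap can be derived from audio-to-video cross-modal attention as in the sketch below; the shapes and the AVM internals are simplified assumptions for illustration.

import torch
import torch.nn.functional as F

def localization_map(attn, grid=14, frame_size=224):
    # attn: (num_heads, num_audio_queries, grid*grid) attention over the
    # spatial patches of one frame. Returns a (frame_size, frame_size) heatmap.
    heat = attn.mean(dim=(0, 1)).reshape(1, 1, grid, grid)  # average heads/queries
    heat = F.interpolate(heat, size=(frame_size, frame_size),
                         mode="bilinear", align_corners=False).squeeze()
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # scale to [0, 1]

# Toy usage with a random attention map over a 14x14 patch grid.
heat = localization_map(torch.rand(8, 16, 196).softmax(dim=-1))
print(heat.shape)  # torch.Size([224, 224])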

Analysis

Efficiency analysis

We measure GPU memory occupancy (GPU M.) in GB and throughput (T.P.) in samples/sec. Both are measured on a single V100 with a batch size of 15 for STELLA++ and 9 for the other methods. STELLA+ achieves an efficiency gain of 44.59% and ranks second in throughput. STELLA++ shows the benefit of our method's reduced GPU memory consumption, resulting in improved performance compared to the other baselines.
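
The two metrics can be measured in PyTorch roughly as in the sketch below; step_fn is a placeholder for one forward/backward training iteration, and a CUDA device is assumed.

import time
import torch

def measure_efficiency(step_fn, batch_size, n_iters=50):
    # Returns (peak GPU memory in GB, throughput in samples/sec).
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        step_fn()                     # one forward/backward training step
    torch.cuda.synchronize()
    elapsed = time.time() - start
    gpu_mem_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return gpu_mem_gb, n_iters * batch_size / elapsed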

Analysis on core sampling methods in STELLA

We examine the effect of our two core sampling methods: (1) Localized Patch Importance Scoring (LPIS) and (2) Replay-guided Correlation Assessment (RCA).

Modality gap estimation

During continual pre-training, we estimate the modality gap at the end of each task. In continual audio-video pre-training, maintaining a large modality gap between the two modalities across tasks is desirable, as deviating from it suggests a departure from the optimal state.
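
A common way to estimate the modality gap is the Euclidean distance between the centroids of the normalized audio and video embeddings; the sketch below illustrates this with placeholder embeddings.

import torch
import torch.nn.functional as F

def modality_gap(audio_emb, video_emb):
    # Distance between the centroids of L2-normalized embeddings.
    a = F.normalize(audio_emb, dim=1).mean(dim=0)  # audio centroid
    v = F.normalize(video_emb, dim=1).mean(dim=0)  # video centroid
    return torch.norm(a - v).item()

# Toy usage: estimate the gap at the end of a task (random placeholders).
print(modality_gap(torch.randn(1000, 512), torch.randn(1000, 512)))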

BibTeX

@inproceedings{lee2024stella,
  title     = {STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment},
  author    = {Jaewoo Lee and Jaehong Yoon and Wonjae Kim and Yunji Kim and Sung Ju Hwang},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2024},
}