AVQA: A Dataset for Audio-Visual Question Answering on Videos
Pinci Yang1, Xin Wang1*, Xuguang Duan1, Hong Chen1, Runze Hou1, Cong Jin2, Wenwu Zhu1*
1Media and Network Lab, Tsinghua University, 2Communication University of China


# UPDATE

  • 10 Oct 2022: This paper is published!
  • 9 Oct 2022: The dataset has been uploaded; you are welcome to download and use it!
  • 29 June 2022: Our paper has been accepted for publication at ACM Multimedia 2022. The camera-ready version will be released soon!

# ABOUT


Audio-visual question answering aims to answer questions regarding both the audio and visual modalities in a given video. For example, given a video showing a traffic intersection where the light turns red and the barrier stick drops, and the question “Why did the stick fall in the video?”, answering requires combining the visual information of the stick dropping with the audio information of a train whistle to arrive at the answer “Here comes the train”. To reason accurately and reach the correct answer, it is essential to extract cues and context from both the audio and visual modalities and to discover their inner causal correlations.

Real-life scenarios contain more complex relationships between audio-visual objects and a wider variety of audio-visual daily activities. AVQA is an audio-visual question answering dataset for multimodal understanding of audio-visual objects and activities in real-life video scenarios. AVQA provides diverse sets of questions specially designed with both audio and visual information in mind, involving various relationships between objects or within activities.

# Data Statistics

Our AVQA dataset aims at reasoning about multiple audio-visual relationships in real-life scenarios. We compare our AVQA dataset with other datasets along six aspects in the following table. (Sound types: B - background sound, S - human speech, O - object sound.)

# DOWNLOADS

# Data and Download

  • Raw videos:

We aim to evaluate the reasoning ability of question answering models in real-life audio-visual scenarios, so the video corpus should be of considerable scale and contain rich, generic classes. We therefore choose the audio-visual dataset VGG-Sound as our source of video clips; it consists of about 200k videos covering 309 audio classes.

To download the raw videos, we provide a CSV file that follows the download method described in VGG-Sound. For each YouTube video, the file lists the YouTube ID, the start time in seconds, and the train/test split:

    # YouTube ID, start seconds, train/test split
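
If you prefer to script the download yourself, the sketch below reads the CSV and fetches each clip with yt-dlp; it is not the official VGG-Sound download script, and the file name avqa_videos.csv and the 10-second clip length are assumptions you may need to adjust.

    # Minimal sketch (assumptions noted above): download each listed clip with yt-dlp.
    import csv
    import subprocess

    with open("avqa_videos.csv", newline="") as f:           # assumed name of the provided CSV
        for youtube_id, start_seconds, _split in csv.reader(f):
            if youtube_id.startswith("#"):                   # skip the header/comment line
                continue
            start = int(float(start_seconds))
            end = start + 10                                 # assumed 10-second clips, as in VGG-Sound
            subprocess.run([
                "yt-dlp",
                f"https://www.youtube.com/watch?v={youtube_id}",
                "--download-sections", f"*{start}-{end}",    # requires ffmpeg
                # output name follows the video_name convention, e.g. 0AEJTlHIhz0_000358
                "-o", f"videos/{youtube_id}_{start:06d}.%(ext)s",
            ], check=False)                                  # keep going if a video is unavailable
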
  • Annotations (QA pairs, etc.):
    • Train set: train_qa.json
    • Test set: val_qa.json

The annotation process is fully manual, which makes AVQA superior to template-generated and semi-manually annotated datasets in terms of annotation accuracy and semantic diversity.

The CSV file and the annotation JSON files are available for download.

# Details of Annotation Files

The annotation files are stored in JSON format. For example,

    {
        "id": 1205,
        "video_name": "0AEJTlHIhz0_000358",
        "video_id": 1467,
        "question_text": "Why do the people in the video scream?",
        "multi_choice": [
            "Roller coaster",
            "On a pirate ship",
            "Take Ferris wheel",
            "Take the jumping machine"
        ],
        "answer": 0,
        "question_relation": "Both",
        "question_type": "Why"
    }

Below, we present a detailed explanation of each keyword.

  • id: the unique identifier of the QA pair.
  • video_name: the YouTube ID (the original video name from the VGGSound dataset).
  • video_id: the unique identifier of the video clip.
  • question_text: the question content of the human-annotated question-answer pair.
  • multi_choice: the contents of the multiple choices (1 correct answer and 3 distractor options) of the human-annotated question-answer pair.
  • answer: the index of the correct answer in multi_choice.
  • question_relation: the modalities required to answer the question, one of "View", "Sound", and "Both".
  • question_type: the semantic type of the question, one of "Which", "Come From", "Happening", "Where", "Why", "Before Next", "When", and "Used For".
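
As a quick illustration of how these fields fit together, the following is a minimal sketch that loads train_qa.json (assuming the file is a JSON list of records like the example above) and prints one QA pair; it is not the official data loader.

    # Minimal sketch: load the annotations and read one QA pair.
    import json

    with open("train_qa.json") as f:
        qa_pairs = json.load(f)                   # assumed: a list of records as shown above

    sample = qa_pairs[0]
    choices = sample["multi_choice"]              # 1 correct answer + 3 distractor options
    answer_text = choices[sample["answer"]]       # "answer" is the index of the correct choice
    print(f"[{sample['question_type']} / {sample['question_relation']}] "
          f"{sample['question_text']} -> {answer_text}")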

# Extracted Feature Files

We give an example of extracting visual and audio features from the raw videos, referencing the code of HCRN and PANNs.

  • (Appearance Features) ResNet101 feature shape: [dataset_size, 8, 16, 2048] (98 GB)
  • (Motion Features) ResNeXt101 feature shape: [dataset_size, 8, 2048] (6.1 GB)
  • (Audio Features) PANNs feature shape: [dataset_size, 8, 2048] (779 MB)

The audio and visual feature files extracted by our example pipeline are available for download. We also welcome other feature extraction methods to be applied to our dataset.
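
As a rough illustration of how the appearance features above could be reproduced, the sketch below pools ResNet101 features for 8 clips of 16 frames each; the frame sampling, preprocessing, and checkpoint choice here are assumptions rather than our exact HCRN-based pipeline.

    # Minimal sketch: ResNet101 appearance features of shape [8, 16, 2048] for one
    # video whose frames are already sampled into [8 clips, 16 frames, 3, 224, 224]
    # and ImageNet-normalized (both steps assumed, following HCRN).
    import torch
    from torchvision.models import resnet101, ResNet101_Weights

    model = resnet101(weights=ResNet101_Weights.IMAGENET1K_V2)
    backbone = torch.nn.Sequential(*list(model.children())[:-1]).eval()   # drop the final fc layer

    @torch.no_grad()
    def appearance_features(frames: torch.Tensor) -> torch.Tensor:
        clips, n_frames, c, h, w = frames.shape
        feats = backbone(frames.reshape(-1, c, h, w))    # [clips * n_frames, 2048, 1, 1]
        return feats.reshape(clips, n_frames, -1)        # [clips, n_frames, 2048]

    frames = torch.randn(8, 16, 3, 224, 224)             # placeholder input frames
    print(appearance_features(frames).shape)             # torch.Size([8, 16, 2048])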

# METHOD

To address the challenging audio-visual question answering problem, we propose a Hierarchical Audio-Visual Fusing (HAVF) module that can be flexibly combined with existing video question answering models. To benchmark different models, we use answer prediction accuracy as the evaluation metric and evaluate the performance of different models on different question types. More details can be found in the paper.

In the feature extraction stage, we use pre-trained PANNs and ResNet models to extract the audio and visual features, and obtain the question embedding through an LSTM. Then, we combine three fusion modules (EAVF, MF, LAF) with the baseline model to fuse the features generated from the audio, visual, and text modalities. Finally, the hierarchical ensemble module combines the advantages of the three fusion methods and uses the integrated tri-modal information to predict the answer to the input question.
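
To make the overall flow above concrete, the skeleton below wires three fusion branches into a learned ensemble over their answer logits; the actual EAVF, MF, and LAF modules are defined in the paper and are replaced here with placeholder layers, so this is a structural sketch, not the HAVF implementation.

    # Structural skeleton only: each branch stands in for EAVF / MF / LAF with a
    # placeholder MLP over concatenated audio, visual, and question features, and
    # the hierarchical ensemble is reduced to a softmax-weighted sum of branch logits.
    import torch
    import torch.nn as nn

    class PlaceholderFusionBranch(nn.Module):
        def __init__(self, dim: int, n_choices: int = 4):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, n_choices))

        def forward(self, audio, visual, question):
            return self.mlp(torch.cat([audio, visual, question], dim=-1))   # [B, n_choices]

    class HAVFSkeleton(nn.Module):
        def __init__(self, dim: int = 512, n_choices: int = 4):
            super().__init__()
            self.branches = nn.ModuleList(PlaceholderFusionBranch(dim, n_choices) for _ in range(3))
            self.ensemble_weights = nn.Parameter(torch.zeros(3))

        def forward(self, audio, visual, question):
            logits = torch.stack([b(audio, visual, question) for b in self.branches])  # [3, B, n_choices]
            w = torch.softmax(self.ensemble_weights, dim=0).view(3, 1, 1)
            return (w * logits).sum(dim=0)                                             # [B, n_choices]

    model = HAVFSkeleton()
    scores = model(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512))
    print(scores.shape)   # torch.Size([2, 4])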

# Experimental Setup

We evaluate our dataset and our Hierarchical Audio-Visual Fusing (HAVF) module with six well-known and state-of-the-art video question answering models.

# Experimental Results

We show the fine-grained test performance of the baselines and our proposed Baseline-HAVF methods. The best performance for each question type is highlighted in bold. The largest performance increase brought by our HAVF module for each question type is underlined.
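
Since the evaluation metric is answer prediction accuracy, the sketch below shows how a per-question-type breakdown like the one in the table could be computed from predicted choice indices; the prediction list here is a placeholder to be replaced by actual model outputs.

    # Minimal sketch: overall and per-question-type answer prediction accuracy,
    # given the annotation records and one predicted choice index per QA pair.
    import json
    from collections import defaultdict

    with open("val_qa.json") as f:
        qa_pairs = json.load(f)
    predictions = [0] * len(qa_pairs)            # placeholder: replace with model predictions

    correct, total = defaultdict(int), defaultdict(int)
    for qa, pred in zip(qa_pairs, predictions):
        qtype = qa["question_type"]
        total[qtype] += 1
        correct[qtype] += int(pred == qa["answer"])

    for qtype in sorted(total):
        print(f"{qtype:12s} accuracy = {correct[qtype] / total[qtype]:.4f}")
    print(f"{'overall':12s} accuracy = {sum(correct.values()) / sum(total.values()):.4f}")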

# LICENSE

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

# Publication

If you find our work useful in your research, please cite our ACM MM 2022 paper.

# Acknowledgement

This work is supported by the National Key Research and Development Program of China (No. 2018AAA0102001) and the National Natural Science Foundation of China (No. 62250008 and No. 62102222).