DREAM-1K

A challenging video Description benchmark
with Rich Events, Actions, and Motions


DREAM-1K Explorer

Introduction

Video description is a common application of video understanding models and an intuitive task for evaluating their capabilities. However, existing video captioning datasets typically pair each video with a single-sentence caption, which is too short to stress-test the full capacity of such models. To this end, we propose DREAM-1K, a challenging video Description benchmark featuring Rich Events, Actions, and Motions. It contains: 1) 1,000 video clips of diverse complexity drawn from five origins: live-action movies, animation movies, stock videos, YouTube videos, and TikTok-style short videos; and 2) a fine-grained description for each video clip, written by skilled annotators. Because traditional automatic captioning metrics such as CIDEr are not adequate for evaluating the quality of detailed video descriptions, DREAM-1K adopts AutoDQ (Automatic Description Quality), a more interpretable automatic evaluation approach that has proven discriminative in practice and shows a high degree of consistency with manual evaluation.

Leaderboard

LM: Live-action movies          AM: Animation movies          ST: Stock videos          YT: YouTube videos          TT: TikTok-style short videos

F1: F1 score          P: Precision score          R: Recall score

By default, this leaderboard is sorted by overall F1 score, with overall Precision score as a secondary sort key. To sort by a different metric, click on the corresponding cell.

Each score cell shows F1 / P / R, in percent.

| # | Model | Organization | LLM Params | Frames | Date | Overall | LM | AM | ST | YT | TT |
|---|-------|--------------|------------|--------|------|---------|----|----|----|----|----|
| 1 | Tarsier2-7B | Bytedance Research | 7B | 16 | 2024/09/20 | 40.1/41.5/38.8 | 43.8/43.1/44.5 | 37.1/37.2/37.1 | 43.9/44.2/43.6 | 34.5/37.7/31.8 | 40.9/45.5/37.1 |
| 2 | GPT-4o | OpenAI | - | 16 | 2024/05/13 | 39.2/43.4/35.7 | 39.8/42.1/37.8 | 35.8/39.1/33.1 | 44.0/46.6/41.7 | 35.9/41.5/31.7 | 39.9/47.9/34.2 |
| 3 | Tarsier-34B | Bytedance Research | 34B | 8 | 2024/07/04 | 36.3/41.4/32.4 | 38.5/39.6/37.5 | 32.2/35.8/29.2 | 41.7/46.4/37.8 | 34.5/41.1/29.7 | 34.0/44.1/27.7 |
| 4 | Gemini 1.5 Pro | Google | - | 1 fps | 2024/05/15 | 36.2/37.6/34.8 | 36.4/36.4/36.4 | 30.7/31.8/29.7 | 42.2/40.7/43.8 | 34.0/36.7/31.6 | 37.0/42.4/32.7 |
| 5 | Gemini 1.5 Flash | Google | - | 1 fps | 2024/05/15 | 34.8/37.9/32.1 | 34.8/36.4/33.3 | 29.2/32.5/26.5 | 39.4/39.7/39.1 | 34.3/38.6/30.9 | 35.6/42.4/30.7 |
| 6 | GPT-4V | OpenAI | - | 8 | 2023/09/25 | 34.4/40.8/29.7 | 34.8/39.2/31.3 | 27.4/31.9/24.0 | 40.7/46.7/36.1 | 33.8/40.1/29.2 | 34.8/46.1/28.0 |
| 7 | GPT-4o mini | OpenAI | - | 8 | 2024/07/18 | 34.0/37.4/31.2 | 34.5/36.6/32.7 | 28.9/32.6/26.0 | 37.9/37.9/38.0 | 33.5/37.5/30.2 | 34.7/42.4/29.3 |
| 8 | LLaVA-Video | Bytedance & NTU S-Lab | 72B | 32 | 2024/09/02 | 34.0/37.3/31.3 | 33.5/36.3/31.1 | 28.6/31.7/26.1 | 39.3/41.1/37.6 | 32.8/34.7/31.1 | 35.7/42.8/30.6 |
| 9 | VILA-1.5 | NVIDIA & MIT | 34B | 8 | 2024/05/01 | 33.2/37.6/29.7 | 34.1/34.3/33.8 | 27.1/30.3/24.5 | 38.4/41.7/35.6 | 33.7/38.6/29.9 | 31.5/43.2/24.8 |
| 10 | Qwen2-VL | Alibaba | 72B | 2 fps | 2024/08/30 | 33.2/37.3/29.9 | 32.1/33.7/30.6 | 27.6/32.6/23.9 | 41.1/41.2/41.1 | 32.0/38.1/27.7 | 32.1/41.0/26.4 |
| 11 | LLaVA-OneVision | Bytedance & NTU S-Lab | 72B | 16 | 2024/08/06 | 33.2/35.9/30.9 | 31.7/32.8/30.7 | 27.7/30.6/25.2 | 38.0/39.6/36.6 | 34.1/34.7/33.5 | 33.8/41.8/28.4 |
| 12 | LLaVA-Video | Bytedance & NTU S-Lab | 32B | 32 | 2024/09/02 | 32.9/34.4/31.6 | 32.2/32.2/32.2 | 28.5/29.4/27.7 | 38.2/38.4/37.9 | 31.3/33.9/29.1 | 34.2/38.2/30.9 |
| 13 | LLaVA-Video | Bytedance & NTU S-Lab | 7B | 32 | 2024/09/02 | 32.5/37.9/28.4 | 31.4/35.2/28.4 | 27.6/32.9/23.8 | 36.7/39.7/34.1 | 33.0/39.5/28.3 | 33.4/42.5/27.5 |
| 14 | InternVL2 | Shanghai AI Lab | 72B | 8 | 2024/07/18 | 31.7/38.7/26.9 | 33.0/36.2/30.3 | 24.0/29.4/20.2 | 38.7/45.2/33.8 | 30.1/38.2/24.8 | 32.3/44.2/25.4 |
| 15 | LLaVA-OneVision | Bytedance & NTU S-Lab | 7B | 16 | 2024/08/06 | 31.7/34.3/29.4 | 31.2/33.2/29.3 | 26.8/29.0/25.0 | 38.1/39.1/37.1 | 30.6/32.1/29.2 | 31.4/38.3/26.6 |
| 16 | Claude 3.5 Sonnet | Anthropic | - | 16 | 2024/06/20 | 31.2/36.5/27.3 | 29.1/34.0/25.4 | 26.7/33.4/22.3 | 35.9/37.4/34.5 | 30.6/37.2/25.9 | 33.4/40.4/28.5 |
| 17 | MiniCPM-V 2.6 | OpenBMB | 7B | 64 | 2024/08/05 | 30.5/33.5/27.9 | 30.5/33.8/27.7 | 24.8/27.8/22.5 | 35.4/35.8/35.0 | 29.5/31.3/28.0 | 31.6/38.9/26.5 |
| 18 | VILA-1.5 | NVIDIA & MIT | 7B | 8 | 2024/05/01 | 29.9/35.4/25.8 | 29.2/32.1/26.8 | 23.9/30.0/19.9 | 35.1/38.2/32.5 | 31.7/37.1/27.7 | 28.5/39.5/22.3 |
| 19 | Qwen2-VL | Alibaba | 7B | 2 fps | 2024/08/30 | 29.6/33.9/26.3 | 27.7/32.5/24.2 | 22.2/28.0/18.4 | 37.0/36.1/38.0 | 30.7/35.5/27.0 | 29.1/37.6/23.8 |
| 20 | VITA | Tencent Youtu Lab & NJU | 8×7B | 16 | 2024/08/12 | 28.9/40.2/22.6 | 30.3/39.1/24.7 | 22.6/31.3/17.7 | 35.6/46.7/28.7 | 28.1/39.9/21.7 | 27.5/44.2/20.0 |
| 21 | PLLaVA | Bytedance & NUS & NYU | 34B | 16 | 2024/04/24 | 28.2/38.4/22.3 | 29.3/34.9/25.2 | 20.9/32.0/15.6 | 35.1/42.5/29.9 | 28.9/40.8/22.3 | 25.6/41.9/18.4 |
| 22 | VideoLLaMA 2 | Alibaba | 72B | 16 | 2024/08/14 | 27.1/30.8/24.2 | 27.3/29.3/25.6 | 19.7/21.7/18.1 | 33.9/37.0/31.3 | 27.7/33.0/23.8 | 26.5/33.1/22.1 |
| 23 | InternVL2 | Shanghai AI Lab | 7B | 8 | 2024/07/18 | 26.9/29.5/24.7 | 27.3/27.4/27.1 | 20.6/23.8/18.1 | 33.3/33.5/33.0 | 26.9/30.2/24.2 | 25.7/32.7/21.2 |
| 24 | VideoChat2-Mistral | Shanghai AI Lab | 7B | 16 | 2024/05/22 | 26.6/31.0/23.3 | 26.0/28.1/24.3 | 18.8/23.9/15.5 | 33.2/33.6/32.7 | 27.3/32.7/23.4 | 26.5/36.8/20.7 |
| 25 | VideoLLaMA 2 | Alibaba | 7B | 16 | 2024/06/06 | 26.2/31.5/22.4 | 25.1/28.7/22.2 | 20.4/25.5/17.0 | 32.6/35.5/30.2 | 27.5/33.5/23.4 | 24.5/34.1/19.2 |
| 26 | Video-LLaVA | PKU | 7B | 8 | 2024/06/15 | 20.4/28.1/16.0 | 19.4/24.3/16.2 | 15.3/21.2/11.9 | 27.0/33.5/22.7 | 21.2/31.9/15.8 | 18.5/29.4/13.5 |
| 27 | ShareGPT4Video | Shanghai AI Lab | 8B | 16 | 2024/06/11 | 19.5/26.6/15.4 | 19.8/27.8/15.4 | 14.8/22.8/10.9 | 25.2/28.9/22.4 | 19.1/26.2/15.0 | 18.0/27.3/13.4 |

Date: the publication date of open-source models          -: "unknown" (the LLM parameter counts of closed-source models are not public)

Benchmark

Data Examples

See the DREAM-1K Explorer for more videos and human-annotated/model-generated descriptions.

Benchmark Statistics

Figure: DREAM-1K data composition.

Evaluation Approach

We propose AutoDQ as a more interpretable approach to automatic video description evaluation, addressing the shortcomings of traditional automatic metrics (such as CIDEr) for evaluating detailed video descriptions. AutoDQ uses an extraction model to extract events from the two video descriptions, then uses an entailment model to check how many events extracted from one description are entailed by the other. The fraction of events in the reference description (the human annotation) that are entailed by the model-generated description is defined as recall; conversely, the fraction of events in the model-generated description that are entailed by the reference description is defined as precision. We implement both models with ChatGPT (gpt-35-turbo-0125); see the evaluation scripts. The workflow is shown below.

Figure: The AutoDQ workflow.
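The following is a minimal sketch, in Python, of how the per-video AutoDQ scores can be computed from this workflow. The callables `extract_events` and `is_entailed` are hypothetical stand-ins for the ChatGPT-based extraction and entailment models; they are not the API of the released evaluation scripts.

```python
from typing import Callable, List


def autodq_scores(
    reference_desc: str,
    generated_desc: str,
    extract_events: Callable[[str], List[str]],   # assumed helper: description -> list of atomic events
    is_entailed: Callable[[str, str], bool],      # assumed helper: (event, description) -> entailed?
) -> tuple:
    """Compute AutoDQ precision, recall, and F1 for a single video."""
    ref_events = extract_events(reference_desc)   # events in the human-annotated description
    gen_events = extract_events(generated_desc)   # events in the model-generated description

    # Recall: fraction of reference events entailed by the model-generated description.
    recall = (
        sum(is_entailed(e, generated_desc) for e in ref_events) / len(ref_events)
        if ref_events else 0.0
    )
    # Precision: fraction of generated events entailed by the reference description.
    precision = (
        sum(is_entailed(e, reference_desc) for e in gen_events) / len(gen_events)
        if gen_events else 0.0
    )
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

F1 here is the harmonic mean of precision and recall, matching the F1/P/R columns reported in the leaderboard above.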

Citation


@misc{wang2024tarsierrecipestrainingevaluating,
      title={Tarsier: Recipes for Training and Evaluating Large Video Description Models},
      author={Jiawei Wang and Liping Yuan and Yuchen Zhang and Haomiao Sun},
      year={2024},
      eprint={2407.00634},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.00634},
}