[Paper review] Moments in Time Dataset | (ft. Video Action Recognition)

Abstract: A summary of the Moments in Time (MiT) dataset, first released in 2017, for the action recognition task.

Abstract

One million labeled videos for 339 classes, each corresponding to a dynamic event unfolding within 3 seconds.
The average number of labeled videos per class is 1,757, with a median of 2,775.
Meaningful events involve not only people but also objects, animals, and natural phenomena (human and non-human).

Dataset Overview

[Figure: sample videos from the dataset]

Notable Experimental Setup

Data. The authors build a training set of 802,264 videos, with between 500 and 5,000 videos per class across 339 classes. For performance evaluation, they prepare a validation set of 33,900 videos (100 per class). Finally, they withhold a test set of 67,800 videos (200 per class).

$\text{train} : \text{val} : \text{test} = 802{,}264 : 33{,}900 : 67{,}800 \approx 8.87 : 0.38 : 0.75$
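The split proportions can be double-checked with a few lines of arithmetic. This is only a sanity-check sketch; the dictionary name `splits` is made up for the example, and the counts are the ones quoted above.

```python
# Split sizes as quoted in the text above.
splits = {"train": 802_264, "val": 33_900, "test": 67_800}

# Total number of videos across the three splits.
total = sum(splits.values())

# Fraction of the total that each split represents.
fractions = {name: count / total for name, count in splits.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.2%}")
```

The three fractions come out to roughly 88.75%, 3.75%, and 7.50% of the 903,964 videos.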

Preprocessing. This part summarizes how the inputs for both Two-Stream and I3D are prepared. First, RGB frames are extracted from the videos at 25 fps and resized to a standard $340\times256$ pixels. For efficiency, optical flow (OF) between consecutive frames is pre-computed using an off-the-shelf implementation of the TV-L1 optical flow algorithm from OpenCV. For fast computation, the values of the optical flow fields are discretized into integers: the displacement is clipped to a maximum absolute value of 15, and the value range is scaled to 0-255. The $x$ and $y$ displacement fields of every optical flow frame are then stored as two grayscale images to reduce storage. Additionally, to correct for camera motion, they subtract the mean vector from each displacement field in the stack. For augmentation, they use random cropping and normalize the images with the ImageNet mean/std (i.e., the image normalization values follow the ImageNet setting).
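The clipping and rescaling of the flow values described above can be sketched as follows. This is a minimal, dependency-free illustration, not the authors' code: the function names (`quantize_flow`, `encode_field`, `subtract_mean`) are hypothetical, the linear mapping from [-15, 15] to [0, 255] is one plausible reading of the text, and the point in the pipeline where the mean is subtracted is left open here. In practice the same operations would be applied to whole arrays at once.

```python
def quantize_flow(displacement, bound=15.0):
    """Map one displacement value (in pixels) to an integer in 0..255."""
    # Clip the displacement to [-bound, +bound].
    clipped = max(-bound, min(bound, displacement))
    # Assumed linear map: -bound -> 0, +bound -> 255.
    return int(round((clipped + bound) * 255.0 / (2.0 * bound)))


def encode_field(field, bound=15.0):
    """Quantize a 2D displacement field (list of rows) into a grayscale image."""
    return [[quantize_flow(v, bound) for v in row] for row in field]


def subtract_mean(field):
    """Subtract the mean value from a 2D displacement field (camera-motion correction)."""
    values = [v for row in field for v in row]
    mean = sum(values) / len(values)
    return [[v - mean for v in row] for row in field]


# Toy x-displacement field; values beyond +/-15 pixels are clipped.
flow_x = [[-20.0, -15.0, 3.0], [7.5, 15.0, 40.0]]
gray_x = encode_field(flow_x)
centered_x = subtract_mean([[1.0, 3.0], [5.0, 7.0]])
```

Storing each field as 8-bit integers is what allows the $x$ and $y$ components to be saved as ordinary grayscale images.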

Evaluation metric. They use top-1 and top-5 classification accuracy as the scoring metrics. Top-1 accuracy is the percentage of test videos for which the most confident predicted label is correct. Top-5 accuracy is the percentage of test videos for which the ground-truth label is among the 5 highest-ranked predicted labels. Top-5 accuracy is particularly appropriate for video classification because a video may contain multiple actions. For evaluation, they randomly select 10 crops per frame and average the results.
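The two metrics differ only in how many of the top-ranked labels are checked. A minimal sketch, assuming `scores` is a list of per-class score lists and `labels` holds the ground-truth class indices (both names, and the toy data, are made up for illustration):

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices sorted by descending score.
        ranked = sorted(
            range(len(sample_scores)),
            key=lambda c: sample_scores[c],
            reverse=True,
        )
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example with 3 videos and 3 classes (k=2 stands in for top-5 here).
scores = [[0.1, 0.7, 0.2], [0.5, 0.2, 0.3], [0.6, 0.3, 0.1]]
labels = [1, 2, 2]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

In the toy data only the first video is correct at top-1, while the second one is recovered once the second-ranked label is also accepted.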

Baseline models for Video Classification. They report results for three modalities (spatial, temporal, and auditory), as well as for spatiotemporal models such as TSN and TRN.

Spatial modality. They use a 50-layer ResNet (ResNet50) trained on randomly selected RGB frames from each video. Three training setups are compared: training from scratch (ResNet50-scratch), initialization from Places (ResNet50-Places), and initialization from ImageNet (ResNet50-ImageNet). At test time, they average the predictions from 6 equidistant frames.
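The test-time averaging over 6 frames can be sketched as below. The exact rule for choosing the frames is not given in this note, so picking the center of 6 equal segments is an assumption, as are the function names `equidistant_indices` and `average_predictions`. A 3-second clip at 25 fps has 75 frames.

```python
def equidistant_indices(num_frames, num_samples=6):
    """Pick frame indices at the centers of num_samples equal segments (assumed rule)."""
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def average_predictions(per_frame_probs):
    """Average per-frame class probabilities into one video-level prediction."""
    num_frames = len(per_frame_probs)
    num_classes = len(per_frame_probs[0])
    return [
        sum(probs[c] for probs in per_frame_probs) / num_frames
        for c in range(num_classes)
    ]


# Frame indices for a 3-second clip at 25 fps.
indices = equidistant_indices(75)

# Toy per-frame outputs for a 2-class problem.
video_probs = average_predictions([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5]])
```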

Auditory modality. Sound signals contain complementary or even mandatory information for recognizing particular classes, such as cheering or talking. They therefore use raw waveforms as the input modality and fine-tune a SoundNet network, which was pretrained on 2 million unlabeled videos from Flickr, with the output layer changed to predict moment classes (SoundNet).

Temporal modality. Following the Two-Stream paradigm, optical flow is computed between adjacent frames and encoded in Cartesian coordinates as displacements. The flow fields of 5 consecutive frames are stacked to form a 10-channel image (the $x$ and $y$ displacement channels of each frame). The first convolutional layer of a BNInception (Batch Norm-Inception) model is then modified to accept 10 input channels.
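The 10-channel input is simply 5 flow frames with 2 channels each. A minimal sketch of the stacking, where the function name `stack_flow` and the interleaved channel order (x1, y1, x2, y2, ...) are assumptions made for illustration:

```python
def stack_flow(flow_pairs):
    """Stack (x, y) flow fields of consecutive frames into one multi-channel input.

    flow_pairs: list of (flow_x, flow_y) tuples, each field a 2D list (H x W).
    Returns a list of channels ordered x1, y1, x2, y2, ... (assumed ordering).
    """
    channels = []
    for flow_x, flow_y in flow_pairs:
        channels.append(flow_x)
        channels.append(flow_y)
    return channels


# Five consecutive flow frames of a toy 2x2 resolution.
pairs = [([[i, i], [i, i]], [[-i, -i], [-i, -i]]) for i in range(5)]
stacked = stack_flow(pairs)
num_channels = len(stacked)
```

An RGB network expects 3 input channels, which is why the first convolutional layer has to be changed before it can consume this stack.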

Spatial-Temporal modality. Three representative action recognition models (at the time) are introduced: Temporal Segment Networks (TSN), Temporal Relation Networks (TRN), and Inflated 3D Convolutional Networks (I3D).

TSN aims to efficiently capture the long-range temporal structure of videos using a sparse frame-sampling strategy. The TSN spatial stream (TSN-Spatial) is fused with an optical flow stream (TSN-Flow) via average consensus to form the two-stream TSN. The base model for each stream is a BNInception model with three time segments.
TRN learns temporal dependencies between video segments that best characterize a particular action. This “plug-and-play” module can simultaneously model several short- and long-range temporal dependencies to classify actions that unfold at multiple time scales.
I3D inflates the convolutional and pooling kernels of a pretrained 2D network to a third dimension. The inflated 3D kernel is initialized from the 2D model by repeating the weights of the 2D kernel over the temporal dimension. This improves learning efficiency and performance, since 3D models contain far more parameters than their 2D counterparts and a strong initialization greatly improves training.
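The I3D initialization can be illustrated on a single kernel. The sketch below repeats a 2D kernel along time and also divides by the temporal size, a rescaling that comes from the original I3D paper rather than from this note; with it, a clip made of identical frames produces the same response as the 2D kernel on one frame. The function names are hypothetical and plain nested lists stand in for real weight tensors.

```python
def inflate_kernel(kernel_2d, time_size):
    """Inflate a 2D kernel (H x W) into a 3D kernel (T x H x W).

    The 2D weights are repeated time_size times and divided by time_size
    (rescaling as in the I3D paper; not stated in the summary above).
    """
    return [
        [[w / time_size for w in row] for row in kernel_2d]
        for _ in range(time_size)
    ]


def response_2d(kernel, patch):
    """Response of a 2D kernel on one image patch of the same size."""
    return sum(k * p for k_row, p_row in zip(kernel, patch) for k, p in zip(k_row, p_row))


def response_3d(kernel_3d, clip):
    """Response of a 3D kernel on a clip (list of patches, one per time step)."""
    return sum(response_2d(k, frame) for k, frame in zip(kernel_3d, clip))


kernel_2d = [[1.0, -2.0], [0.5, 4.0]]
patch = [[2.0, 1.0], [4.0, 0.5]]

kernel_3d = inflate_kernel(kernel_2d, time_size=4)
static_clip = [patch] * 4  # a "boring" video: the same frame repeated 4 times

out_2d = response_2d(kernel_2d, patch)
out_3d = response_3d(kernel_3d, static_clip)
```

Because `out_3d` matches `out_2d` on a static clip, the inflated network starts from the behavior of the pretrained 2D model instead of from random weights.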

Ensemble. They combine the top-performing model of each modality (spatial + spatiotemporal + auditory). (This part is skipped here.)

Results

Table 2 shows that models trained on the MiT dataset generalize better, which makes them suitable for transfer learning (the authors argue that MiT is well suited as a source of pretrained models for video understanding, much like ImageNet).
This dataset presents a difficult task for the field of computer vision because the labels correspond to different levels of abstraction (a verb like “falling” can apply to many different agents and scenarios and involve objects and scenes of different categories).

Reference

[1] Moments in Time Dataset: one million videos for event understanding, TPAMI2019

[2] Moments in Time Dataset Homepage
[3] My paper annotation

컴돌이 도란뇽's AI Study Notes