# Video Understanding

## Late Fusion Approach
- Fuse the per-frame features at the end of the network: run a 2D CNN on each frame independently, then combine the resulting features.
- The fully connected (FC) layers that perform the fusion see the concatenation of all frame features, so they may overfit.
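A minimal sketch of the late-fusion pattern (the per-frame backbone is replaced by a hypothetical random projection, and names like `W_fc` are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, n_classes = 8, 64, 10              # frames, per-frame feature dim, classes

# Stand-in for a per-frame 2D CNN backbone (hypothetical random features;
# a real model would be e.g. a ResNet applied to each frame).
frame_feats = rng.normal(size=(T, D))    # one D-dim feature vector per frame

# Late fusion: concatenate the per-frame features, then classify with an
# FC layer -- the layer that tends to overfit, since it sees a T*D input.
fused = frame_feats.reshape(-1)          # shape (T*D,)
W_fc = rng.normal(size=(n_classes, T * D)) * 0.01
logits = W_fc @ fused                    # shape (n_classes,)
```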
## Early Fusion Approach
- Stack the frames along the channel dimension, giving an input with \(3T\) channels.
- Fuse the features using 2D convolutions: the first layer maps the \(3T\) input channels to \(D\) output channels (a \(3T \times D\) weight at each spatial position).
- One layer of temporal processing may not be enough!
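A shape-level sketch of early fusion, with a 1×1 convolution written as an `einsum` standing in for the first 2D conv layer (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, D = 8, 16, 16, 32

clip = rng.normal(size=(T, 3, H, W))         # T RGB frames

# Early fusion: stack the frames along the channel dimension.
stacked = clip.reshape(T * 3, H, W)          # shape (3T, H, W)

# The first 2D conv mixes all 3T input channels at once -- shown here as
# a 1x1 convolution, i.e. a (D x 3T) weight applied at every pixel.
W1 = rng.normal(size=(D, 3 * T)) * 0.1
out = np.einsum('dc,chw->dhw', W1, stacked)  # shape (D, H, W)
```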
## Comparison

### General Comparison

### 2D Conv (Early Fusion) vs 3D Conv (3D CNN)
See the slides for details.

- 2D (early fusion): the first layer sees all of time at once and then slides only over space. No temporal shift-invariance! It needs to learn separate filters for the same motion at different times in the clip.
- 3D: temporal shift-invariant, since each filter also slides over time!
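The difference can be checked numerically in a single temporal dimension: a sliding temporal filter (the 3D-conv behaviour) is shift-equivariant, while one weight per time step over the whole clip (the early-fusion behaviour) is not. A small sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=16)          # a 1-D "video" signal over time
k = rng.normal(size=3)           # a small temporal filter

# 3D-conv behaviour along time: sliding the filter over time means that
# shifting the input by 2 steps shifts the output by 2 steps.
y = np.convolve(x, k, mode='valid')
y_shift = np.convolve(np.roll(x, 2), k, mode='valid')
# Away from the clip edges: y_shift[2:] == y[:-2]

# Early-fusion behaviour: a single 16-dim weight vector over the whole
# clip has no such structure -- the same motion at a different time
# generally produces a different response.
w = rng.normal(size=16)
r1, r2 = w @ x, w @ np.roll(x, 2)            # differ in general
```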
## C3D: The VGG of 3D CNNs
## Measuring Motion: Optical Flow
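The assumption behind optical flow can be stated briefly (standard background, not spelled out in the notes): optical flow assigns each pixel a displacement \((u, v)\) to the next frame. Under the brightness-constancy assumption,

\[
I(x + u,\; y + v,\; t + 1) \approx I(x, y, t),
\]

a first-order Taylor expansion gives the optical-flow constraint

\[
\frac{\partial I}{\partial x}\, u + \frac{\partial I}{\partial y}\, v + \frac{\partial I}{\partial t} \approx 0 .
\]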
## Two Stream Networks
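A tiny sketch of score-level fusion of the two streams. The per-stream class scores here are random stand-ins for a spatial stream (RGB frames) and a temporal stream (stacked optical-flow fields); averaging the softmax outputs is one of the fusion schemes used in the original two-stream paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 10

# Hypothetical class scores from the two streams.
spatial_scores = rng.normal(size=n_classes)    # stream on RGB frames
temporal_scores = rng.normal(size=n_classes)   # stream on optical flow

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Fuse by averaging the per-stream class distributions.
fused = (softmax(spatial_scores) + softmax(temporal_scores)) / 2
pred = int(fused.argmax())
```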
## Modeling long-term temporal structure
- Sometimes we don't backprop into the CNN, to save memory: pretrain it and use it as a fixed feature extractor, e.g. a pretrained C3D that produces clip features.
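A sketch of that recipe: clip features are computed once by a frozen backbone (here a hypothetical random projection standing in for a pretrained C3D), and only a small softmax head is trained on top, so no backbone activations need to be kept for backprop:

```python
import numpy as np

rng = np.random.default_rng(0)
n_clips, feat_dim, n_classes = 32, 128, 5

# Frozen backbone (hypothetical): features are computed ONCE and cached.
clips = rng.normal(size=(n_clips, 512))
backbone = rng.normal(size=(feat_dim, 512)) * 0.05
feats = clips @ backbone.T                   # cached (n_clips, feat_dim)

labels = rng.integers(0, n_classes, size=n_clips)
W = np.zeros((n_classes, feat_dim))          # only the head is trained

for _ in range(50):                          # softmax-regression head
    logits = feats @ W.T
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    p[np.arange(n_clips), labels] -= 1       # dL/dlogits for cross-entropy
    W -= 0.1 * (p.T @ feats) / n_clips       # gradient step on head only
```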
## Recurrent Convolutional Network
Rarely used in practice because of its time complexity: the recurrence processes frames one after another, which is bad for parallelization.
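The bottleneck is visible in the recurrence itself: step \(t\) cannot start until step \(t-1\) has produced its hidden state. An illustrative sketch (sizes and weight names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 16

feats = rng.normal(size=(T, D))          # per-frame CNN features
Wh = rng.normal(size=(D, D)) * 0.1
Wx = rng.normal(size=(D, D)) * 0.1

# h_t = tanh(Wh h_{t-1} + Wx x_t): an inherently sequential loop over
# time -- unlike a convolution, the T steps cannot run in parallel.
h = np.zeros(D)
for t in range(T):
    h = np.tanh(Wh @ h + Wx @ feats[t])
```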
## Spatio-Temporal Self-Attention (Non-Local Block)
Trick: add non-local blocks to a pretrained 3D CNN, initialize the last layer of each non-local block to all zeros (so the block initially acts as the identity), and fine-tune the network.
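A simplified non-local (self-attention) block showing why the zero initialization works: with the output projection set to zero, the residual branch vanishes and the block is exactly the identity, so the pretrained network's behaviour is preserved at the start of fine-tuning (weight names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 8                          # N = T*H*W positions, D channels

x = rng.normal(size=(N, D))

# Simplified non-local block: y = x + softmax(Q K^T) V Wz.
Wq = rng.normal(size=(D, D)) * 0.1
Wk = rng.normal(size=(D, D)) * 0.1
Wv = rng.normal(size=(D, D)) * 0.1
Wz = np.zeros((D, D))                # zero-init output projection

attn = np.exp((x @ Wq) @ (x @ Wk).T)
attn /= attn.sum(1, keepdims=True)   # row-normalized attention weights
y = x + (attn @ (x @ Wv)) @ Wz       # residual branch is exactly zero
```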
## Inflating 2D Networks to 3D (I3D)
Refer to the slides for details. Trick: pretrain a 2D CNN and inflate it to a 3D CNN by repeating its weights \(T\) times along the temporal dimension (scaled by \(1/T\), so a temporally constant input gives the same response as the original 2D network).
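A sketch of the inflation trick on a single filter (illustrative sizes): repeating the kernel \(T\) times and dividing by \(T\) makes the 3D response on a clip of identical frames match the 2D response on one frame.

```python
import numpy as np

rng = np.random.default_rng(0)
Cin, k, T = 3, 3, 4

w2d = rng.normal(size=(Cin, k, k))            # a 2D filter from a pretrained CNN

# I3D inflation: repeat along a new temporal axis and rescale by 1/T.
w3d = np.repeat(w2d[:, None], T, axis=1) / T  # shape (Cin, T, k, k)

frame = rng.normal(size=(Cin, k, k))          # one receptive-field patch
clip = np.repeat(frame[:, None], T, axis=1)   # same frame repeated T times

resp_2d = (w2d * frame).sum()                 # 2D filter on one frame
resp_3d = (w3d * clip).sum()                  # inflated filter on the clip
# resp_3d equals resp_2d for this temporally constant input
```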
Created: April 1, 2024, 13:06:08