I3D
I3D, or Inflated 3D ConvNet, is a convolutional neural network architecture for video action recognition. It was introduced by João Carreira and Andrew Zisserman in 2017 and extends 2D convolutional networks into the spatiotemporal domain. The key idea is to take an established 2D architecture, such as an Inception network, inflate its filters and pooling kernels with an additional temporal dimension, and initialize the resulting 3D weights from a model pretrained on image data. This lets the model learn both spatial and temporal features from video while reusing what was learned from large-scale image datasets.
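The inflation step can be illustrated with a minimal NumPy sketch. It assumes a weight layout of (out_channels, in_channels, height, width) and follows the commonly described scheme of repeating the 2D weights along the new time axis and dividing by the temporal depth, so that a clip made of one repeated frame produces the same response as the 2D filter does on that frame. The function name `inflate_kernel` and the array shapes are illustrative assumptions, not taken from any particular library.

```python
import numpy as np


def inflate_kernel(w2d: np.ndarray, time_depth: int) -> np.ndarray:
    """Inflate 2D conv weights (out_c, in_c, kh, kw) to 3D (out_c, in_c, t, kh, kw).

    Hypothetical helper: the weights are copied `time_depth` times along a new
    temporal axis and rescaled by 1 / time_depth.
    """
    if w2d.ndim != 4:
        raise ValueError("expected weights of shape (out_c, in_c, kh, kw)")
    if time_depth < 1:
        raise ValueError("time_depth must be at least 1")
    # Insert the temporal axis, then repeat the 2D kernel along it.
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], time_depth, axis=2)
    # Rescale so that summing over time recovers the original 2D kernel.
    return w3d / time_depth


# Check the rescaling on a static clip (the same patch repeated over time).
rng = np.random.default_rng(0)
w2d = rng.standard_normal((4, 3, 3, 3))      # 4 filters, 3 input channels, 3x3
w3d = inflate_kernel(w2d, time_depth=3)      # shape (4, 3, 3, 3, 3)

patch = rng.standard_normal((3, 3, 3))       # one (in_c, kh, kw) image patch
clip = np.repeat(patch[:, np.newaxis, :, :], 3, axis=1)  # (in_c, t, kh, kw)

response_2d = np.einsum("oihw,ihw->o", w2d, patch)
response_3d = np.einsum("oithw,ithw->o", w3d, clip)
assert np.allclose(response_2d, response_3d)
```

The rescaling is what makes the image-pretrained weights a sensible starting point: on a video with no motion, the inflated filter behaves like the original 2D filter, and training on real video then adjusts the temporal slices independently.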
The architecture uses standard Inception modules adapted to 3D convolutions, with multiple parallel branches that capture features at different scales.
I3D achieved strong results on benchmark video datasets such as UCF-101 and HMDB-51 and became a common baseline for action recognition.