TimeSformer
TimeSformer is a neural network architecture designed for video understanding that applies the Transformer framework to spatiotemporal data. It extends the concept of Vision Transformers (ViT) to video by representing a clip as a sequence of tokens formed from patches sampled across multiple frames. Each frame is divided into non-overlapping patches, which are flattened and projected to a latent dimension. A class token and learnable spatial and temporal position embeddings are added, enabling the model to incorporate both where a patch is located and when it appears in the sequence.
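As a rough illustration of this tokenization step, the following PyTorch sketch assumes ViT-Base-style hyperparameters (16×16 patches, a 768-dimensional embedding, 8 input frames); the module name and defaults are illustrative rather than taken from the reference implementation.

```python
import torch
import torch.nn as nn

class VideoPatchEmbed(nn.Module):
    """Illustrative video tokenizer: splits each frame into non-overlapping
    patches, flattens and projects them, then adds a class token plus
    learnable spatial and temporal position embeddings. Hyperparameters
    are assumptions for this sketch, not the paper's exact configuration."""

    def __init__(self, img_size=224, patch_size=16, num_frames=8,
                 in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # patches per frame
        self.num_frames = num_frames
        # A strided convolution is the standard way to flatten + project patches.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed_space = nn.Parameter(
            torch.zeros(1, self.num_patches, embed_dim))   # where a patch is
        self.pos_embed_time = nn.Parameter(
            torch.zeros(1, num_frames, embed_dim))          # when it appears

    def forward(self, x):
        # x: (batch, frames, channels, height, width)
        B, T, C, H, W = x.shape
        x = self.proj(x.reshape(B * T, C, H, W))       # (B*T, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)               # (B*T, N, D)
        x = x + self.pos_embed_space                   # spatial position
        x = x.reshape(B, T, self.num_patches, -1)
        x = x + self.pos_embed_time[:, :, None, :]     # temporal position
        x = x.reshape(B, T * self.num_patches, -1)     # one token sequence
        cls = self.cls_token.expand(B, -1, -1)
        return torch.cat([cls, x], dim=1)              # (B, 1 + T*N, D)
```

With these defaults, an input of shape (2, 8, 3, 224, 224) yields a token sequence of shape (2, 1569, 768), i.e. one class token plus 8 × 196 patch tokens.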
A central contribution of TimeSformer is the idea of Divided Space-Time Attention, an efficient alternative to full joint space-time self-attention, in which every token attends to every other token across all frames and the cost therefore grows quadratically in the total number of patches in the clip. Divided attention instead applies temporal and spatial attention in sequence within each block: a patch first attends to patches at the same spatial location in the other frames, then to the patches within its own frame. In the original study, this scheme was both cheaper to compute and more accurate than joint space-time attention.
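The sketch below illustrates the divided scheme as a single PyTorch block, under the simplifying assumptions that the class token is omitted and patch tokens arrive in frame-major order; names and layer sizes are illustrative, not the reference code.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Sketch of one divided space-time attention block. The class token
    and some details (e.g. the temporal output projection) are omitted;
    the real model also routes the class token through both steps."""

    def __init__(self, dim=768, num_heads=12, num_frames=8, num_patches=196):
        super().__init__()
        self.T, self.N = num_frames, num_patches
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (batch, T*N, dim), patch tokens in frame-major order
        B, _, D = x.shape
        # Temporal attention: each patch attends to the same spatial
        # location in other frames (sequence length T, batch B*N).
        xt = x.reshape(B, self.T, self.N, D).permute(0, 2, 1, 3)
        xt = xt.reshape(B * self.N, self.T, D)
        xt_n = self.norm_t(xt)
        xt = xt + self.attn_t(xt_n, xt_n, xt_n, need_weights=False)[0]
        # Spatial attention: each patch attends to patches within its
        # own frame (sequence length N, batch B*T).
        xs = xt.reshape(B, self.N, self.T, D).permute(0, 2, 1, 3)
        xs = xs.reshape(B * self.T, self.N, D)
        xs_n = self.norm_s(xs)
        xs = xs + self.attn_s(xs_n, xs_n, xs_n, need_weights=False)[0]
        x = xs.reshape(B, self.T * self.N, D)
        return x + self.mlp(self.norm_mlp(x))
```

Because each attention call runs over a sequence of length T or N rather than T·N, the per-block attention cost drops from O((T·N)²) to O(T²·N + N²·T).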
In evaluation, TimeSformer variants demonstrated competitive performance on standard video action recognition benchmarks, achieving strong results on Kinetics-400 and Kinetics-600, where the model matched or exceeded contemporaneous 3D-convolutional networks while being substantially faster to train, and reporting competitive accuracy on Something-Something-V2 and Diving-48.
Limitations cited in early work include sensitivity to pretraining data size, as the model depends on large-scale image pretraining (e.g., ImageNet-21K) to reach its reported accuracy, and potential challenges in capturing fine-grained temporal dynamics on motion-centric benchmarks where appearance cues alone are insufficient.