ViViT
ViViT, short for Video Vision Transformer, is a family of transformer-based architectures for video understanding, particularly action recognition. It extends the Vision Transformer (ViT) approach from images to video by applying patch-based representations to video clips and modeling spatial and temporal dynamics with transformer layers. ViViT was introduced by Arnab, Dehghani, et al. in 2021, presenting a scalable framework for processing video data with self-attention mechanisms.
The core idea of ViViT is to treat a video clip as a sequence of tokens derived from spatio-temporal patches, or "tubelets". Each tubelet spans several frames and a small spatial region, and is linearly projected into an embedding vector; the resulting token sequence, with added positional embeddings, is then processed by transformer encoder layers. Because joint spatio-temporal self-attention is expensive for long token sequences, the paper also proposes factorised variants that separate attention over the spatial and temporal dimensions.
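The tubelet tokenization step can be sketched as a simple array reshaping. The snippet below is an illustrative sketch, not the paper's implementation: the function name and the tubelet sizes (2 frames, 16×16 pixels) are assumptions chosen for the example, and the linear projection that would follow is omitted.

```python
import numpy as np

def tubelet_tokens(video, t=2, p=16):
    """Split a video into non-overlapping spatio-temporal patches
    ("tubelets") of shape t x p x p and flatten each into a token.

    video: array of shape (T, H, W, C); T, H and W are assumed to be
    divisible by t, p and p respectively. Illustrative sketch only.
    """
    T, H, W, C = video.shape
    nt, nh, nw = T // t, H // p, W // p
    # Carve the clip into an (nt, nh, nw) grid of t x p x p x C tubelets.
    x = video.reshape(nt, t, nh, p, nw, p, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # (nt, nh, nw, t, p, p, C)
    # Flatten each tubelet into one token vector of length t*p*p*C.
    return x.reshape(nt * nh * nw, t * p * p * C)

# A 16-frame 224x224 RGB clip yields (16/2) * (224/16)**2 = 1568 tokens,
# each of dimension 2 * 16 * 16 * 3 = 1536.
video = np.zeros((16, 224, 224, 3), dtype=np.float32)
tokens = tubelet_tokens(video)
print(tokens.shape)  # (1568, 1536)
```

In the full model, each flattened tubelet would be mapped by a learned linear projection to the transformer's embedding dimension before positional embeddings are added.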
ViViT has been evaluated on standard benchmarks such as Kinetics-400 and Kinetics-600 and has shown competitive performance relative to 3D convolutional baselines for action recognition.