ViViT
ViViT, short for Video Vision Transformer, is a family of transformer-based architectures for video understanding, particularly action recognition. It extends the Vision Transformer (ViT) approach from images to video by applying patch-based representations to video clips and modeling spatial and temporal dynamics with transformer layers. ViViT was introduced by Arnab, Dehghani, et al. in 2021, presenting a scalable framework for processing video data with self-attention mechanisms.
The core idea of ViViT is to treat a video clip as a sequence of tokens derived from spatio-temporal patches, or "tubelets". Each tubelet spans several frames and a small spatial region, and is linearly projected into an embedding vector; the resulting token sequence, with added positional embeddings, is then processed by transformer encoder layers. Because joint spatio-temporal self-attention is expensive for long token sequences, the paper also proposes factorised variants that separate attention over the spatial and temporal dimensions.
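The tubelet tokenization step can be sketched as a simple array reshaping. The snippet below is an illustrative sketch, not the paper's implementation: the function name and the tubelet sizes (2 frames, 16×16 pixels) are assumptions chosen for the example, and the linear projection that would follow is omitted.

```python
import numpy as np

def tubelet_tokens(video, t=2, p=16):
    """Split a video into non-overlapping spatio-temporal patches
    ("tubelets") of shape t x p x p and flatten each into a token.

    video: array of shape (T, H, W, C); T, H and W are assumed to be
    divisible by t, p and p respectively. Illustrative sketch only.
    """
    T, H, W, C = video.shape
    nt, nh, nw = T // t, H // p, W // p
    # Carve the clip into an (nt, nh, nw) grid of t x p x p x C tubelets.
    x = video.reshape(nt, t, nh, p, nw, p, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # (nt, nh, nw, t, p, p, C)
    # Flatten each tubelet into one token vector of length t*p*p*C.
    return x.reshape(nt * nh * nw, t * p * p * C)

# A 16-frame 224x224 RGB clip yields (16/2) * (224/16)**2 = 1568 tokens,
# each of dimension 2 * 16 * 16 * 3 = 1536.
video = np.zeros((16, 224, 224, 3), dtype=np.float32)
tokens = tubelet_tokens(video)
print(tokens.shape)  # (1568, 1536)
```

In the full model, each flattened tubelet would be mapped by a learned linear projection to the transformer's embedding dimension before positional embeddings are added.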
ViViT has been evaluated on standard benchmarks such as Kinetics-400 and Kinetics-600 and has shown competitive performance relative to 3D convolutional baselines for action recognition.