Vision transformers
The vision transformer, commonly abbreviated ViT, is a neural network architecture that adapts the Transformer model from natural language processing to image recognition. It treats a fixed-size image as a sequence of patches, converts each patch to an embedding vector, and processes the resulting sequence with a standard Transformer encoder.
An image is divided into non-overlapping patches (for example 16×16 pixels); each patch is flattened and projected through a trainable linear layer to produce a patch embedding. Position embeddings are added so the model retains spatial information, and a learnable classification token is typically prepended to the sequence before it enters the encoder.
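The patch-embedding step above can be sketched in NumPy. This is a minimal illustration, not a trained model: the projection matrix and position embeddings are random stand-ins for learned weights, and the image size (224×224), patch size (16×16), and embedding dimension (64) are assumptions chosen for the example.

```python
import numpy as np

# Assumed sizes for illustration: a 224x224 RGB image and 16x16 patches.
image_size, patch_size, channels, embed_dim = 224, 16, 3, 64
grid = image_size // patch_size          # 14 patches per side
num_patches = grid * grid                # 14 * 14 = 196 patches

rng = np.random.default_rng(0)
image = rng.standard_normal((image_size, image_size, channels))

# Split the image into non-overlapping patches and flatten each one
# into a vector of length 16 * 16 * 3 = 768.
patches = (
    image.reshape(grid, patch_size, grid, patch_size, channels)
    .transpose(0, 2, 1, 3, 4)
    .reshape(num_patches, patch_size * patch_size * channels)
)

# Linear projection to the embedding dimension (random matrix standing in
# for the trainable weights of the real model).
projection = rng.standard_normal((patches.shape[1], embed_dim))
embeddings = patches @ projection        # shape: (196, 64)

# Add position embeddings so the encoder can recover patch order.
position_embeddings = rng.standard_normal((num_patches, embed_dim))
tokens = embeddings + position_embeddings

print(tokens.shape)  # (196, 64)
```

The resulting sequence of 196 token vectors is what the Transformer encoder consumes, exactly as a sentence of word embeddings would be in NLP.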
ViT models are typically pretrained on large-scale image datasets and then fine-tuned on smaller target tasks; because the architecture has weaker image-specific inductive biases than convolutional networks, it benefits strongly from large amounts of training data.
ViT can serve as a backbone for object detection, semantic segmentation, and other vision tasks, often by combining its patch-level features with task-specific heads or decoder modules.
Originating from the 2020 work by Dosovitskiy et al. at Google Research, the Vision Transformer has inspired numerous follow-up architectures, such as DeiT and the Swin Transformer, and has become a standard model family in computer vision.