Vision Transformers
Vision Transformers (ViT) are a neural network architecture that applies the transformer model, originally developed for natural language processing, to computer vision tasks. Unlike traditional convolutional neural networks (CNNs), which process images through hierarchical feature extraction using convolutional layers, ViTs treat images as sequences of patches.
The core idea behind ViTs is to divide an input image into a grid of fixed-size patches, flatten each patch, and linearly project it into an embedding; the resulting sequence of patch embeddings is then processed by a standard transformer encoder.
This departure from the localized receptive fields of CNNs allows ViTs to capture long-range dependencies across the entire image, even in the earliest layers.
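The patch-extraction and embedding step described above can be sketched as follows. This is a minimal NumPy illustration, not an actual ViT implementation: the 224x224 image size, 16x16 patch size, and 768-dimensional embedding match the commonly cited ViT-Base configuration, but the projection weights here are random placeholders rather than learned parameters.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    nh, nw = H // patch_size, W // patch_size
    # Carve the image into an (nh, nw) grid of patch_size x patch_size blocks,
    # then flatten each block into a single vector (one "token" per patch).
    patches = image.reshape(nh, patch_size, nw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(nh * nw, patch_size * patch_size * C)

# A 224x224 RGB image with 16x16 patches yields 196 tokens of dimension 768.
img = np.zeros((224, 224, 3))
tokens = image_to_patches(img, 16)
print(tokens.shape)  # (196, 768)

# A linear projection (learned in a real ViT; random here) maps each flattened
# patch to the model dimension before positional embeddings and a class token
# are added and the sequence enters the transformer encoder.
d_model = 768
W_embed = np.random.randn(tokens.shape[1], d_model) * 0.02
embedded = tokens @ W_embed
print(embedded.shape)  # (196, 768)
```

In practice the reshape/transpose trick above is equivalent to the strided convolution with stride equal to the patch size that many implementations use for patch embedding.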