ViLBERT
ViLBERT, short for Vision-and-Language BERT, is a model and pretraining framework for learning joint representations of images and text. Introduced in 2019 by Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee, in a collaboration spanning Georgia Tech, Facebook AI Research, and Oregon State University, ViLBERT aims to produce task-agnostic visual and linguistic features that can be fine-tuned for a range of vision-and-language tasks, including visual question answering, referring expression comprehension, and caption-based image retrieval. The approach emphasizes a modular design in which cross-modal fusion happens through attention mechanisms.
Architecture: ViLBERT is a two-stream model consisting of separate Transformer-based encoders for the visual and linguistic inputs. The visual stream operates on region features produced by a pretrained object detector, the linguistic stream on word tokens, and the two streams exchange information through co-attentional transformer layers in which each stream's queries attend over the other stream's keys and values (see the sketch below).
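A minimal PyTorch sketch of one such co-attentional block, assuming for simplicity that both streams share a hidden size (the two streams need not share a width in the paper's configuration); `CoAttentionLayer` and its dimensions are illustrative, not the reference implementation:

```python
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """One co-attentional block: each stream attends to the other.

    Queries come from one modality; keys and values come from the
    other, so linguistic features are conditioned on visual ones
    and vice versa.
    """

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        # Cross-attention in both directions (batch-first tensors).
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)

    def forward(self, txt: torch.Tensor, img: torch.Tensor):
        # Text queries attend over image keys/values, and vice versa.
        txt_attended, _ = self.txt_to_img(query=txt, key=img, value=img)
        img_attended, _ = self.img_to_txt(query=img, key=txt, value=txt)
        # Residual connections keep each stream's own representation.
        txt = self.norm_txt(txt + txt_attended)
        img = self.norm_img(img + img_attended)
        return txt, img

# Usage: a batch of 2 with 20 text tokens and 36 detected regions.
txt = torch.randn(2, 20, 768)
img = torch.randn(2, 36, 768)
layer = CoAttentionLayer()
txt_out, img_out = layer(txt, img)
```

Stacking several such blocks, interleaved with ordinary per-stream Transformer layers, yields the two-stream encoder described above.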
Pretraining and objectives: ViLBERT is pretrained on image-caption pairs with an image-text matching (alignment prediction) objective and a masked language modeling objective over the text tokens, alongside a masked region modeling objective in which the model predicts, for masked image regions, a class distribution matching the output of the object detector. A sketch of how these losses can be combined follows.
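An illustrative PyTorch sketch of the combined pretraining loss; the tensor names, shapes, and vocabulary sizes are hypothetical placeholders standing in for the model's outputs, not the released training code:

```python
import torch
import torch.nn.functional as F

batch, seq_len, regions, vocab, det_classes = 2, 20, 36, 30522, 1601

# Hypothetical model outputs.
token_logits = torch.randn(batch, seq_len, vocab)         # masked-word predictions
region_logits = torch.randn(batch, regions, det_classes)  # masked-region predictions
alignment_logit = torch.randn(batch)                      # image-text match score

# Hypothetical targets.
token_labels = torch.full((batch, seq_len), -100)  # -100 marks unmasked positions
token_labels[:, 3] = 42                            # one masked word per example
region_label_dists = torch.softmax(torch.randn(batch, regions, det_classes), dim=-1)
is_aligned = torch.ones(batch)                     # 1 = caption matches image

# Masked language modeling: cross-entropy at masked positions only.
mlm = F.cross_entropy(token_logits.flatten(0, 1), token_labels.flatten(),
                      ignore_index=-100)

# Masked region modeling: KL divergence between the model's predicted
# class distribution and the detector's distribution. (Applied to all
# regions here for brevity; in practice only masked regions contribute.)
mrm = F.kl_div(F.log_softmax(region_logits, dim=-1), region_label_dists,
               reduction="batchmean")

# Image-text matching: binary classification of caption-image alignment.
itm = F.binary_cross_entropy_with_logits(alignment_logit, is_aligned)

loss = mlm + mrm + itm
```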
Impact and influence: ViLBERT helped establish a foundation for multitask vision-and-language pretraining and influenced subsequent vision-and-language models that adopt the same pretrain-then-fine-tune recipe.