ViLBERT

ViLBERT, short for Vision-and-Language BERT, is a model and pretraining framework designed for learning joint representations of images and text. Introduced in 2019 by Jiasen Lu, Dhruv Batra, Devi Parikh, and colleagues at Facebook AI Research, ViLBERT aims to produce task-agnostic visual and linguistic features that can be fine-tuned for a range of vision-and-language tasks, including visual question answering, image captioning, and referring expression comprehension. The approach emphasizes modular design with cross-modal fusion through attention mechanisms.

Architecture: ViLBERT is built as a two-stream model consisting of separate Transformer-based encoders for visual and linguistic inputs. The visual stream processes region-level features extracted from a pretrained object detector such as Faster R-CNN, while the language stream processes tokenized text. Subsequent co-attentional transformer layers enable cross-modal attention, allowing the two streams to exchange information and produce joint representations that capture intermodal relationships. The architecture supports task-agnostic pretraining, after which the model is fine-tuned for specific downstream tasks.
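
To make the co-attention mechanism concrete, the sketch below shows a single cross-modal block in which language queries attend over image-region keys and values, and vice versa. It is a minimal illustration under assumptions made here for clarity (a shared 768-dimensional hidden size, PyTorch's nn.MultiheadAttention, and the class name CoAttentionBlock), not ViLBERT's exact implementation, which interleaves co-attentional layers with ordinary Transformer layers in each stream.

```python
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    """Minimal co-attentional block: each stream attends to the other.

    Hidden size, head count, and the single shared dimensionality are
    placeholder choices for this sketch, not ViLBERT's configuration.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Language queries attend over image keys/values, and vice versa.
        self.txt_attends_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)

    def forward(self, txt: torch.Tensor, img: torch.Tensor):
        # txt: (batch, n_tokens, dim) language-stream states
        # img: (batch, n_regions, dim) visual-stream states (e.g. detector region features)
        txt_ctx, _ = self.txt_attends_img(query=txt, key=img, value=img)
        img_ctx, _ = self.img_attends_txt(query=img, key=txt, value=txt)
        # Residual connection + layer norm, as in a standard Transformer sublayer.
        return self.norm_txt(txt + txt_ctx), self.norm_img(img + img_ctx)


# Toy usage: 20 text tokens and 36 detected regions, both projected to 768-d.
block = CoAttentionBlock()
txt_out, img_out = block(torch.randn(2, 20, 768), torch.randn(2, 36, 768))
print(txt_out.shape, img_out.shape)  # torch.Size([2, 20, 768]) torch.Size([2, 36, 768])
```

In the published model the two streams use different hidden sizes and depths, so a faithful implementation would project one stream into the other's dimensionality (or give each attention block its own key/value projections); the shared dimension above only keeps the sketch short.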

Pretraining and objectives: Pretraining combines image-text matching and masked language modeling with a masked region modeling objective that aligns visual regions with their linguistic context. These objectives help the model learn compatible representations across modalities and improve performance on diverse vision-and-language (V+L) tasks.
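
A rough picture of how these objectives might be combined into a single pretraining loss is sketched below. The function name, tensor shapes, and equal weighting of the three terms are assumptions for illustration; the KL-divergence term reflects the fact that, in the ViLBERT paper, the target for a masked region is the object detector's class distribution for that region rather than raw pixels.

```python
import torch.nn.functional as F


def pretraining_loss(mlm_logits, mlm_labels,
                     region_logits, region_targets, region_mask,
                     itm_logits, itm_labels):
    """Illustrative combination of the three pretraining terms (sketch only)."""
    # Masked language modeling: cross-entropy over the vocabulary; unmasked
    # token positions carry the label -100 so they are ignored.
    mlm = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(),
                          ignore_index=-100)

    # Masked region modeling: match the predicted class distribution of each
    # masked region to the detector's soft class distribution (KL divergence),
    # averaged over masked regions only (region_mask is 1.0 where masked).
    kl = F.kl_div(F.log_softmax(region_logits, dim=-1), region_targets,
                  reduction="none").sum(-1)
    mrm = (kl * region_mask).sum() / region_mask.sum().clamp(min=1.0)

    # Image-text matching / alignment prediction: binary classification of
    # whether the text actually describes the image.
    itm = F.cross_entropy(itm_logits, itm_labels)

    return mlm + mrm + itm
```

In a full setup the logits would come from small prediction heads on top of the fused representations, and masking would be applied to a fraction of tokens and regions during data preparation; those details are omitted here.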

Impact and influence: ViLBERT helped establish a foundation for multitask vision-and-language pretraining and influenced subsequent models such as LXMERT, VisualBERT, and UNITER. It contributed to improvements on benchmarks such as VQA and on tasks such as image captioning, and it highlighted the effectiveness of separate modality streams fused through cross-modal attention.
