DETRs

DETRs, or Detection Transformers, are a family of end-to-end object detectors that apply transformer architectures to visual data. They were introduced by Carion et al. in the 2020 paper End-to-End Object Detection with Transformers, and they marked a shift toward single-stage object detection that requires no hand-designed post-processing such as non-maximum suppression.

In a typical DETR, a convolutional neural network backbone (such as ResNet with a feature pyramid) extracts a feature map from the input image. A transformer encoder processes this map, and a decoder attends to a fixed set of learned object queries to produce a corresponding set of predictions. Each prediction consists of a class label and a bounding box. Training relies on a set-based global loss obtained by solving a bipartite matching problem (Hungarian algorithm), ensuring a one-to-one correspondence between predictions and ground-truth objects and removing the need for non-maximum suppression during inference.

Several variants have been proposed to improve training efficiency and performance. Deformable DETR introduces deformable attention to focus computation on a sparse set of spatial locations, enabling faster convergence and better handling of high-resolution images with multi-scale features. Other variants, such as Conditional DETR and SMCA-based DETR, aim to improve training stability and accuracy.

Applications of DETRs include general object detection and related tasks such as instance segmentation and panoptic segmentation, with DETR-based models achieving competitive results on benchmarks like COCO. While original DETR models could require longer training and substantial computational resources, newer variants mitigate these constraints and offer improved convergence and performance.

DETRs represent a notable development in computer vision, illustrating how transformer architectures can be integrated into detection tasks and adapted through variants to balance accuracy, speed, and data requirements.
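The set-based matching that underlies DETR's training loss can be illustrated with a small example. The sketch below is not DETR's actual implementation: it uses a brute-force search over permutations (practical for tiny inputs only; real implementations use an O(n³) Hungarian solver such as scipy.optimize.linear_sum_assignment), and the cost function is a toy combination of class mismatch and L1 box distance, simplified from DETR's full matching cost (class probability, L1 box loss, and generalized IoU).

```python
from itertools import permutations

def match_cost(pred, gt):
    # Toy pairwise cost: class-mismatch penalty plus L1 box distance.
    # DETR's real cost also includes a generalized-IoU term.
    class_cost = 0.0 if pred["label"] == gt["label"] else 1.0
    box_cost = sum(abs(p - g) for p, g in zip(pred["box"], gt["box"]))
    return class_cost + box_cost

def hungarian_match(preds, gts):
    """Find the minimum-cost one-to-one assignment of ground-truth
    objects to predictions by exhaustive search (illustrative only)."""
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[i], gts[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    # best_perm[j] = index of the prediction matched to ground truth j;
    # predictions left unmatched are trained toward a "no object" class.
    return best_perm, best_cost

# Toy example: three predictions, two ground-truth objects.
preds = [
    {"label": "cat", "box": (0.9, 0.9, 0.1, 0.1)},
    {"label": "dog", "box": (0.5, 0.5, 0.2, 0.2)},
    {"label": "cat", "box": (0.1, 0.1, 0.2, 0.2)},
]
gts = [
    {"label": "cat", "box": (0.1, 0.1, 0.2, 0.2)},
    {"label": "dog", "box": (0.5, 0.5, 0.2, 0.2)},
]
assignment, cost = hungarian_match(preds, gts)
```

Here the matcher pairs each ground-truth object with the prediction that agrees with it exactly (predictions 2 and 1), leaving prediction 0 unmatched; the loss is then computed per matched pair, which is what removes the need for non-maximum suppression at inference time.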
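Deformable attention's core idea — sampling the feature map at a few offset locations around a reference point instead of attending over every spatial position — can be sketched in a few lines. This is a minimal single-level, single-channel illustration with hand-picked offsets and weights; in Deformable DETR the offsets and attention weights are predicted from the query embedding, per head, and sampling spans multiple feature levels.

```python
def bilinear_sample(feature_map, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at a
    fractional location (y, x), clamped to the map's bounds."""
    h, w = len(feature_map), len(feature_map[0])
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bot = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bot * dy

def deformable_attention(feature_map, ref_point, offsets, weights):
    """Aggregate features from a sparse set of offset locations around a
    reference point, rather than attending to every spatial position."""
    ry, rx = ref_point
    out = 0.0
    for (oy, ox), weight in zip(offsets, weights):
        out += weight * bilinear_sample(feature_map, ry + oy, rx + ox)
    return out

# Toy 2x2 feature map, one sampling point at the reference itself.
fmap = [[0.0, 1.0], [2.0, 3.0]]
out = deformable_attention(fmap, ref_point=(0.5, 0.5),
                           offsets=[(0.0, 0.0)], weights=[1.0])
```

Because each query touches only a handful of sampled locations, the cost no longer scales with the full spatial resolution of the feature map, which is what enables the faster convergence and multi-scale, high-resolution processing noted above.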