Cross-attention

Cross-attention is a mechanism used in transformer architectures to fuse information from two different sequences or modalities. In a cross-attention layer, the queries come from one source (for example, the current decoding context), while the keys and values come from another source (for example, the encoder output or a set of feature representations). This contrasts with self-attention, where queries, keys, and values all derive from the same sequence.
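
To make the contrast concrete, the following is a minimal NumPy sketch of single-head cross-attention in which the queries come from one sequence and the keys and values from another; the function name, the decoder_states/encoder_states variables, and the random projection matrices are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def cross_attention(queries_from, keys_values_from, d_k=64, seed=0):
    """Scaled dot-product cross-attention: Q from one sequence, K and V from another."""
    rng = np.random.default_rng(seed)
    # Illustrative projection matrices (in a real model these are learned).
    W_q = rng.standard_normal((queries_from.shape[-1], d_k)) / np.sqrt(queries_from.shape[-1])
    W_k = rng.standard_normal((keys_values_from.shape[-1], d_k)) / np.sqrt(keys_values_from.shape[-1])
    W_v = rng.standard_normal((keys_values_from.shape[-1], d_k)) / np.sqrt(keys_values_from.shape[-1])

    Q = queries_from @ W_q          # (len_a, d_k)  queries from the first source
    K = keys_values_from @ W_k      # (len_b, d_k)  keys from the second source
    V = keys_values_from @ W_v      # (len_b, d_k)  values from the second source

    scores = Q @ K.T / np.sqrt(d_k)                        # (len_a, len_b)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the second sequence
    return weights @ V                                     # (len_a, d_k)

# Example: 5 positions of a decoding context attending to 8 encoder positions.
decoder_states = np.random.default_rng(1).standard_normal((5, 32))
encoder_states = np.random.default_rng(2).standard_normal((8, 32))
print(cross_attention(decoder_states, encoder_states).shape)  # (5, 64)
```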

In multi-head cross-attention, each head computes its own Q from the first source and K, V from the second. The attention weights are computed as softmax(QK^T / sqrt(d_k)), and the output is the weighted sum of V, concatenated across heads and projected back to the model dimension. This allows the model to attend to different aspects of the second sequence while processing the first.
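
A hedged sketch of the multi-head case is shown below, again in NumPy with made-up dimensions and randomly initialized projections standing in for learned parameters; it follows the formula above, applying softmax(QK^T / sqrt(d_k)) per head and then concatenating and projecting back to the model dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(x_a, x_b, num_heads=4, d_model=32, seed=0):
    """x_a provides the queries; x_b provides the keys and values."""
    assert d_model % num_heads == 0
    d_k = d_model // num_heads
    rng = np.random.default_rng(seed)
    # One set of illustrative projections per head, plus an output projection.
    W_q = rng.standard_normal((num_heads, d_model, d_k)) / np.sqrt(d_model)
    W_k = rng.standard_normal((num_heads, d_model, d_k)) / np.sqrt(d_model)
    W_v = rng.standard_normal((num_heads, d_model, d_k)) / np.sqrt(d_model)
    W_o = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

    head_outputs = []
    for h in range(num_heads):
        Q = x_a @ W_q[h]                              # (len_a, d_k)
        K = x_b @ W_k[h]                              # (len_b, d_k)
        V = x_b @ W_v[h]                              # (len_b, d_k)
        weights = softmax(Q @ K.T / np.sqrt(d_k))     # softmax(QK^T / sqrt(d_k))
        head_outputs.append(weights @ V)              # weighted sum of V
    # Concatenate heads and project back to the model dimension.
    return np.concatenate(head_outputs, axis=-1) @ W_o   # (len_a, d_model)

x_a = np.random.default_rng(1).standard_normal((5, 32))   # first sequence (queries)
x_b = np.random.default_rng(2).standard_normal((8, 32))   # second sequence (keys/values)
print(multi_head_cross_attention(x_a, x_b).shape)          # (5, 32)
```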

Common contexts for cross-attention include encoder-decoder models such as neural machine translation, where the decoder attends to encoder representations to generate each target token. It is also central to multimodal transformers that align text with images, audio, or other modalities, enabling conditional generation or cross-modal reasoning.
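
As one illustration of the encoder-decoder case, the sketch below wires up cross-attention with PyTorch's torch.nn.MultiheadAttention; the dimensions and the encoder_out/decoder_states tensors are placeholder assumptions, not a complete translation model.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

# Illustrative shapes: batch of 2, source length 10, target length 7.
encoder_out = torch.randn(2, 10, d_model)     # keys and values: encoder representations
decoder_states = torch.randn(2, 7, d_model)   # queries: current decoding context

# Queries come from the decoder; keys and values come from the encoder output.
context, attn_weights = cross_attn(query=decoder_states,
                                   key=encoder_out,
                                   value=encoder_out)
print(context.shape)       # torch.Size([2, 7, 512])
print(attn_weights.shape)  # torch.Size([2, 7, 10])  one alignment row per target position
```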

Challenges and considerations include computational complexity, which is quadratic in the lengths of the interacting sequences, and potential alignment ambiguity between the sources. Variants and optimizations, such as limiting attention to specific segments or applying efficient attention mechanisms, are areas of ongoing research.
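
One simple way to limit attention to specific segments is to mask the score matrix before the softmax, as in the hedged NumPy sketch below; the window layout and names are invented for illustration and do not correspond to a particular published variant.

```python
import numpy as np

def masked_cross_attention(Q, K, V, allowed):
    """Cross-attention where each query may only attend to positions marked True in `allowed`."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (len_a, len_b)
    scores = np.where(allowed, scores, -np.inf)      # block positions outside the segment
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 16))    # 4 query positions
K = rng.standard_normal((10, 16))   # 10 key/value positions
V = rng.standard_normal((10, 16))

# Each query attends only to a local window of the second sequence (an illustrative segmentation).
allowed = np.zeros((4, 10), dtype=bool)
for i in range(4):
    allowed[i, i * 2 : i * 2 + 4] = True

print(masked_cross_attention(Q, K, V, allowed).shape)  # (4, 16)
```

Note that this masking sketch still materializes the full quadratic score matrix; efficiency-oriented variants aim to avoid that cost rather than merely zeroing parts of it.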
