Strided batched

Strided batched is a data layout and interface for performing a batch of identical linear algebra operations where the matrices for each batch item are stored in memory with a fixed stride between consecutive items. This pattern is commonly used for batched matrix-matrix multiplications (GEMMs) and other batched BLAS operations on accelerators, notably GPUs. In a strided batched GEMM, for i from 0 to batchCount-1, the operation is C_i = alpha * op(A_i) * op(B_i) + beta * C_i, where A_i, B_i, and C_i are matrices of fixed shapes, and op denotes optional transposition or conjugate transposition.
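The per-item operation can be sketched as a reference loop in NumPy. This is a minimal illustration of the semantics only; the function name and shapes are invented for this sketch, and real strided batched APIs take a base pointer plus an explicit element stride rather than a 3-D array:

```python
import numpy as np

def gemm_strided_batched(alpha, A, B, beta, C, trans_a=False, trans_b=False):
    """Reference semantics only: C_i = alpha * op(A_i) @ op(B_i) + beta * C_i.

    Here the fixed batch stride is implicit in the leading axis of each
    3-D array; the loop makes the per-item operation explicit.
    """
    for i in range(C.shape[0]):          # i = 0 .. batchCount-1
        a_i = A[i].T if trans_a else A[i]
        b_i = B[i].T if trans_b else B[i]
        C[i] = alpha * (a_i @ b_i) + beta * C[i]
    return C

# Illustrative shapes: 4 batch items, each a (3x5) @ (5x2) product.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3, 5))
B = rng.standard_normal((4, 5, 2))
out = gemm_strided_batched(2.0, A, B, 0.0, np.zeros((4, 3, 2)))
```

Because every batch item has the same shape and the same scalars, the loop body is identical for all i, which is what lets libraries fuse the whole batch into one kernel launch.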

Typical parameterization (as in cuBLAS GEMM Strided Batched) includes: m, n, k defining the matrix dimensions; alpha and beta scalars; pointers to the first A, B, and C matrices; leading dimensions lda, ldb, ldc; and strides strideA, strideB, strideC that specify the distance in elements between the starts of A_i, B_i, and C_i for consecutive batch indices. The batchCount parameter indicates how many batch items to process. For example, A_i is stored starting at baseA + i * strideA, B_i at baseB + i * strideB, and C_i at baseC + i * strideC. When matrices are stored contiguously (column-major, with no transposition), strideA = lda * k, strideB = ldb * n, and strideC = ldc * n.

Strided batched contrasts with pointer-based batched methods, which pass arrays of pointers to A_i, B_i, and C_i. Strided storage enables better memory coalescing and lower pointer overhead, but requires uniform shapes and regular strides. It is widely supported in high-performance libraries and is commonly used in GPU-accelerated workloads, such as deep learning and scientific computing, where many independent GEMMs of identical sizes must be computed efficiently.
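To make the layout concrete, here is a small NumPy sketch; the sizes, the helper name matrix_at, and the variable names are invented for illustration. It packs a batch contiguously in column-major order, recovers A_i from the offset baseA + i * strideA, and contrasts this with the pointer-based style of one reference per item:

```python
import numpy as np

m, n, k, batch_count = 3, 2, 4, 5         # illustrative sizes
lda = m                                    # tight leading dimension
stride_a = lda * k                         # contiguous case: one full A per item

# One flat buffer holding A_0..A_4 back to back, column-major within each item.
flat_a = np.arange(batch_count * stride_a, dtype=np.float64)

def matrix_at(flat, i, stride, ld, rows, cols):
    """View of the i-th matrix, starting at element offset i * stride."""
    block = flat[i * stride : i * stride + ld * cols]
    # Column-major: each column occupies ld consecutive elements.
    return block.reshape(cols, ld).T[:rows, :]

A_2 = matrix_at(flat_a, 2, stride_a, lda, m, k)

# Pointer-based batching instead passes one reference per item; the matrices
# may live anywhere in memory, at the cost of an indirection per batch index.
a_list = [matrix_at(flat_a, i, stride_a, lda, m, k) for i in range(batch_count)]
```

Because the strided form needs only a base address plus two integers per operand, a kernel can compute every item's address with pure arithmetic, with no pointer table to load.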