rocblas_sgemm_strided_batched

rocblas_sgemm_strided_batched is the single-precision, strided batched general matrix-matrix multiply (GEMM) routine in rocBLAS, the BLAS library of the ROCm platform. A single call computes batch_count independent results: each C_i is set to alpha times the product of op(A_i) and op(B_i), plus beta times the existing C_i, where op() optionally transposes its argument. The "strided" aspect means that the matrices of each operand are laid out in memory at a fixed offset from one another: A_i starts strideA elements after A_(i-1), and likewise for B_i with strideB and C_i with strideC. This lets one function call process the whole batch.
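The strided layout can be illustrated with a small host-side sketch. The helper names batch_matrix and element_at are hypothetical and are not part of rocBLAS; the sketch only assumes the column-major storage that BLAS-style libraries use, with strides and leading dimensions counted in elements.

```cpp
#include <cstdint>

// Hypothetical helper: pointer to the i-th matrix of a strided batch.
// The stride is counted in elements, not bytes.
inline const float* batch_matrix(const float* base, std::int64_t stride, int i) {
    return base + static_cast<std::int64_t>(i) * stride;
}

// Hypothetical helper: element (row, col) of a column-major matrix
// whose columns are ld elements apart.
inline float element_at(const float* mat, int ld, int row, int col) {
    return mat[row + static_cast<std::int64_t>(col) * ld];
}
```

For example, with two 2×2 matrices stored back to back (ld = 2, stride = 4), element (1, 1) of the second matrix is the last value in the buffer.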

Signature overview

rocblas_status rocblas_sgemm_strided_batched(rocblas_handle handle,
    rocblas_operation transa,
    rocblas_operation transb,
    int m, int n, int k,
    const float* alpha,
    const float* A, int lda, long long strideA,
    const float* B, int ldb, long long strideB,
    const float* beta,
    float* C, int ldc, long long strideC,
    int batch_count);

Key parameters

- handle: rocBLAS library context.

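- transa, transb: specify whether each A_i and B_i is used as stored, transposed, or conjugate-transposed. For real single-precision data, conjugate-transpose is equivalent to transpose.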

- m, n, k: matrix dimensions for the operation C_i = op(A_i) × op(B_i), where op(A_i) is (m×k) and op(B_i) is (k×n), resulting in a C_i of size (m×n).

- alpha, beta: pointers to scalar multipliers applied to the product and to the existing C_i, respectively.

- A, B, C: pointers to the batched matrices in device memory.

- lda, ldb, ldc: leading dimensions of A, B, and C.

- strideA, strideB, strideC: offsets between consecutive A_i, B_i, and C_i in elements.

- batch_count: number of matrices in the batch.
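To make the roles of these parameters concrete, here is a minimal CPU reference sketch of the same computation. It is not the rocBLAS implementation: the function name sgemm_strided_batched_ref is hypothetical, the transpose options are reduced to plain bools, alpha and beta are passed by value, and all pointers refer to host memory. It assumes column-major storage.

```cpp
#include <cstdint>

// Hypothetical CPU reference for C_i = alpha * op(A_i) * op(B_i) + beta * C_i.
// Column-major storage; strides and leading dimensions are in elements.
void sgemm_strided_batched_ref(bool transA, bool transB,
                               int m, int n, int k, float alpha,
                               const float* A, int lda, std::int64_t strideA,
                               const float* B, int ldb, std::int64_t strideB,
                               float beta,
                               float* C, int ldc, std::int64_t strideC,
                               int batch_count) {
    for (int b = 0; b < batch_count; ++b) {
        // Locate the b-th matrix of each operand.
        const float* Ab = A + b * strideA;
        const float* Bb = B + b * strideB;
        float* Cb = C + b * strideC;
        for (int col = 0; col < n; ++col) {
            for (int row = 0; row < m; ++row) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p) {
                    // op(A)(row, p): read A(row, p), or A(p, row) when transposed.
                    const std::int64_t ia = transA
                        ? p + static_cast<std::int64_t>(row) * lda
                        : row + static_cast<std::int64_t>(p) * lda;
                    // op(B)(p, col): read B(p, col), or B(col, p) when transposed.
                    const std::int64_t ib = transB
                        ? col + static_cast<std::int64_t>(p) * ldb
                        : p + static_cast<std::int64_t>(col) * ldb;
                    acc += Ab[ia] * Bb[ib];
                }
                float& c = Cb[row + static_cast<std::int64_t>(col) * ldc];
                // As in reference BLAS, C is not read when beta is zero.
                c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * c;
            }
        }
    }
}
```

A reference like this is slow, but it can be useful for checking GPU results on small inputs.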

Usage notes

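- To keep consecutive matrices from overlapping, each stride should be at least the storage size of a single matrix: strideA ≥ lda × (number of stored columns of A_i), and likewise for B and C. For C this means strideC ≥ ldc × n.

- Leading dimensions must be at least the number of rows of the matrix as stored (column-major): lda ≥ max(1, m) if A is not transposed and lda ≥ max(1, k) if it is; ldb ≥ max(1, k) if B is not transposed and ldb ≥ max(1, n) if it is; ldc ≥ max(1, m).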

- Suitable for large-scale linear algebra workloads on AMD GPUs; performance benefits come from batched execution and memory coalescing.

Related variants include rocblas_dgemm_strided_batched and other gemm variants for complex types.
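The size constraints from the usage notes can be checked on the host before the call is made. The following is a minimal sketch under the assumption of column-major storage, where a leading dimension must cover the rows of the matrix as stored. All helper names are hypothetical and not part of rocBLAS.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helpers: minimum leading dimensions for column-major storage.
// A is stored as m x k when not transposed, and as k x m when transposed.
inline int min_lda(bool transA, int m, int k) { return std::max(1, transA ? k : m); }
// B is stored as k x n when not transposed, and as n x k when transposed.
inline int min_ldb(bool transB, int k, int n) { return std::max(1, transB ? n : k); }
// C is always stored as m x n.
inline int min_ldc(int m) { return std::max(1, m); }

// Smallest stride (in elements) that keeps consecutive matrices from overlapping.
inline std::int64_t min_stride(int ld, int stored_cols) {
    return static_cast<std::int64_t>(ld) * stored_cols;
}

// A sufficient element count for a buffer holding the whole batch:
// the last matrix starts at (batch_count - 1) * stride.
inline std::int64_t batch_elements(int ld, int stored_cols,
                                   std::int64_t stride, int batch_count) {
    if (batch_count <= 0) return 0;
    return (batch_count - 1) * stride + static_cast<std::int64_t>(ld) * stored_cols;
}
```

Here stored_cols is k for a non-transposed A (m if transposed), n for a non-transposed B (k if transposed), and n for C. Multiplying the result of batch_elements by sizeof(float) gives a byte count that could be used for a device allocation.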