rocblas_sgemm_strided_batched
rocblas_sgemm_strided_batched is a single-precision batched general matrix-matrix multiply (GEMM) routine in the ROCm library rocBLAS. It uses a strided batched interface to compute multiple independent matrices C_i, each equal to alpha times the product of A_i and B_i (with optional transpositions) plus beta times the existing C_i. The “strided” aspect means that consecutive matrices of each operand lie a fixed number of elements apart in memory (strideA for the A_i, strideB for the B_i, strideC for the C_i), so a single function call can process the whole batch.
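To make the semantics concrete, the following is a minimal CPU reference sketch in C of what one call computes, assuming column-major storage and real data (so only "no transpose" and "transpose" matter). The names `ref_sgemm_strided_batched` and `op_t` are hypothetical and not part of rocBLAS; alpha and beta are passed by value here for simplicity, whereas the rocBLAS routine takes them as pointers and runs on the GPU.

```c
#include <stddef.h>

/* Hypothetical stand-in for rocblas_operation; for real (float) data,
   conjugate-transpose is the same as transpose, so two cases suffice. */
typedef enum { OP_NONE, OP_TRANSPOSE } op_t;

/* Element (r, c) of op(X), where X is column-major with leading dimension ld. */
static float op_elem(const float* X, int ld, op_t op, int r, int c)
{
    return op == OP_NONE ? X[r + (size_t)c * ld] : X[c + (size_t)r * ld];
}

/* Reference for C_i = alpha * op(A_i) * op(B_i) + beta * C_i, i = 0..batch_count-1,
   with A_i = A + i*strideA, B_i = B + i*strideB, C_i = C + i*strideC.
   op(A_i) is m x k, op(B_i) is k x n, C_i is m x n. */
void ref_sgemm_strided_batched(op_t transa, op_t transb, int m, int n, int k,
                               float alpha,
                               const float* A, int lda, long long strideA,
                               const float* B, int ldb, long long strideB,
                               float beta,
                               float* C, int ldc, long long strideC,
                               int batch_count)
{
    for (int b = 0; b < batch_count; ++b) {
        /* Locate batch entry b purely by pointer arithmetic on the strides. */
        const float* Ai = A + b * strideA;
        const float* Bi = B + b * strideB;
        float* Ci = C + b * strideC;
        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < m; ++i) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p)
                    acc += op_elem(Ai, lda, transa, i, p) * op_elem(Bi, ldb, transb, p, j);
                Ci[i + (size_t)j * ldc] = alpha * acc + beta * Ci[i + (size_t)j * ldc];
            }
        }
    }
}
```

Because each B_i is found at B + i × strideB, passing strideB = 0 in this sketch multiplies every A_i by the same B.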
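rocblas_status rocblas_sgemm_strided_batched(rocblas_handle handle,
rocblas_operation transa, rocblas_operation transb,
rocblas_int m, rocblas_int n, rocblas_int k,
const float* alpha,
const float* A, rocblas_int lda, rocblas_stride strideA,
const float* B, rocblas_int ldb, rocblas_stride strideB,
const float* beta,
float* C, rocblas_int ldc, rocblas_stride strideC,
rocblas_int batch_count);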
- handle: rocBLAS library context.
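- transa, transb: specify whether A and/or B are used as stored, transposed, or conjugate-transposed (rocblas_operation_none, rocblas_operation_transpose, rocblas_operation_conjugate_transpose); for real float data the last two are equivalent.
- m, n, k: matrix dimensions for the operation C_i = alpha × op(A_i) × op(B_i) + beta × C_i, where op(A_i) is m×k, op(B_i) is k×n, and C_i is m×n.
- alpha, beta: pointers to the scalar multipliers applied to the product and to the existing C_i, respectively; whether they refer to host or device memory is governed by the handle's pointer mode (rocblas_set_pointer_mode).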
- A, B, C: pointers to the batched matrices in device memory.
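- lda, ldb, ldc: leading dimensions of A, B, and C, which are stored in column-major order.
- strideA, strideB, strideC: distance, in elements, from the start of one matrix to the start of the next, so A_i begins at A + i × strideA (likewise for B_i and C_i).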
- batch_count: number of matrices in the batch.
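- To keep batch entries from overlapping, each stride should be at least the storage size of one matrix, e.g. strideC ≥ ldc × n. A stride of 0 for A or B reuses the same matrix for every batch entry.
- Leading dimensions must satisfy lda ≥ max(1, m) when transa is rocblas_operation_none and lda ≥ max(1, k) otherwise; ldb ≥ max(1, k) when transb is rocblas_operation_none and ldb ≥ max(1, n) otherwise; and ldc ≥ max(1, m).
- Suited to workloads on AMD GPUs that need many independent GEMMs of the same shape; the performance benefit comes from batched execution, which processes the whole batch in a single call instead of launching a separate GEMM for each matrix.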
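The leading-dimension and stride rules above can be checked on the host before a call. The sketch below is a hypothetical helper, not a rocBLAS function (`gemm_leading_dims_ok`, `packed_stride`, and `op_t` are illustrative names); it encodes the column-major minimums and computes the usual packed stride, ld × (number of stored columns), which is enough to keep consecutive matrices from overlapping.

```c
#include <stdbool.h>

/* Hypothetical stand-in for rocblas_operation (real data: none or transpose). */
typedef enum { OP_NONE, OP_TRANSPOSE } op_t;

/* A column-major matrix with `rows` stored rows needs ld >= max(1, rows). */
static bool ld_ok(int ld, int rows)
{
    return ld >= (rows > 1 ? rows : 1);
}

/* Leading-dimension check for C_i = alpha * op(A_i) * op(B_i) + beta * C_i,
   where op(A_i) is m x k and op(B_i) is k x n. */
bool gemm_leading_dims_ok(op_t transa, op_t transb, int m, int n, int k,
                          int lda, int ldb, int ldc)
{
    int rows_a = (transa == OP_NONE) ? m : k;  /* A stored as m x k, or k x m */
    int rows_b = (transb == OP_NONE) ? k : n;  /* B stored as k x n, or n x k */
    return ld_ok(lda, rows_a) && ld_ok(ldb, rows_b) && ld_ok(ldc, m);
}

/* Packed stride between consecutive matrices: ld * stored columns.
   For C this is ldc * n; for A it is lda * k (none) or lda * m (transpose). */
long long packed_stride(int ld, int stored_cols)
{
    return (long long)ld * stored_cols;
}
```

For example, with m = 3, n = 4, k = 5 and no transposition, lda = 3, ldb = 5, ldc = 3 pass the check, and a tightly packed C batch would use strideC = packed_stride(3, 4) = 12.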
Related variants include rocblas_dgemm_strided_batched for double precision, and rocblas_cgemm_strided_batched and rocblas_zgemm_strided_batched for single- and double-precision complex types.