cuBLAS
cuBLAS is a GPU-accelerated implementation of the Basic Linear Algebra Subprograms (BLAS), developed by NVIDIA and distributed as part of the CUDA Toolkit. It provides high-performance implementations of BLAS Level 1, 2, and 3 routines for NVIDIA GPUs, covering vector operations, matrix-vector products, and matrix-matrix multiplications on data held in device memory. The library supports single and double precision, real and complex data types, and offers batched and strided batched variants of some routines so that multiple independent problems can be processed concurrently.
The API revolves around a cuBLAS handle (type cublasHandle_t), which is created with cublasCreate, passed to subsequent library calls, and destroyed with cublasDestroy.
cuBLAS operates exclusively on GPU memory; data must reside in device memory and results are written back