Computes groups of matrix-matrix products with general matrices.
Group API
event gemm_batch(queue &exec_queue, transpose *transa, transpose *transb, std::int64_t *m, std::int64_t *n, std::int64_t *k, T *alpha, T **A, std::int64_t *lda, T **B, std::int64_t *ldb, T *beta, T **C, std::int64_t *ldc, std::int64_t group_count, std::int64_t *group_size, const vector_class<event> &dependencies = {});
Strided API
event gemm_batch(queue &exec_queue, transpose transa, transpose transb, std::int64_t m, std::int64_t n, std::int64_t k, T alpha, T *a, std::int64_t lda, std::int64_t stridea, T *b, std::int64_t ldb, std::int64_t strideb, T beta, T *c, std::int64_t ldc, std::int64_t stridec, std::int64_t batch_size, const vector_class<event> &dependencies = {});
void gemm_batch(queue &exec_queue, transpose transa, transpose transb, std::int64_t m, std::int64_t n, std::int64_t k, T alpha, buffer<T,1> &a, std::int64_t lda, std::int64_t stridea, buffer<T,1> &b, std::int64_t ldb, std::int64_t strideb, T beta, buffer<T,1> &c, std::int64_t ldc, std::int64_t stridec, std::int64_t batch_size);
gemm_batch supports the following precisions and devices.
| T | Devices Supported |
|---|---|
| float | Host, CPU, and GPU |
| half | Host, CPU, and GPU |
| double | Host, CPU, and GPU |
| std::complex<float> | Host, CPU, and GPU |
| std::complex<double> | Host, CPU, and GPU |
The gemm_batch routines perform a series of matrix-matrix operations with general matrices. They are similar to their gemm counterparts, but perform matrix-matrix operations with groups of matrices, where all matrices within a group share the same parameters.
The operation for the strided API is defined as
for i = 0 … batch_size – 1
    A, B, and C are the matrices at offsets i * stridea, i * strideb, and i * stridec in a, b, and c
    C = alpha * op(A) * op(B) + beta * C
end for
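The strided loop above can be sketched in plain C++ for the column-major, non-transposed case. This is a minimal reference sketch of the semantics only; `gemm_batch_strided_ref` is a hypothetical helper, not the library entry point, and it ignores transposition and device execution.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical reference sketch of strided gemm_batch semantics,
// assuming column-major storage and no transposition of A or B.
void gemm_batch_strided_ref(std::int64_t m, std::int64_t n, std::int64_t k,
                            float alpha,
                            const std::vector<float>& a, std::int64_t lda, std::int64_t stridea,
                            const std::vector<float>& b, std::int64_t ldb, std::int64_t strideb,
                            float beta,
                            std::vector<float>& c, std::int64_t ldc, std::int64_t stridec,
                            std::int64_t batch_size) {
    for (std::int64_t i = 0; i < batch_size; ++i) {
        const float* A = a.data() + i * stridea;  // matrix i inside buffer a
        const float* B = b.data() + i * strideb;  // matrix i inside buffer b
        float* C = c.data() + i * stridec;        // matrix i inside buffer c
        // C = alpha * A * B + beta * C, column-major indexing
        for (std::int64_t col = 0; col < n; ++col)
            for (std::int64_t row = 0; row < m; ++row) {
                float acc = 0.0f;
                for (std::int64_t p = 0; p < k; ++p)
                    acc += A[row + p * lda] * B[p + col * ldb];
                C[row + col * ldc] = alpha * acc + beta * C[row + col * ldc];
            }
    }
}
```

For example, with m = n = k = 2, batch_size = 2, each A the 2x2 identity, alpha = 1, and beta = 0, the result buffer c equals b.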
The operation for the group API is defined as
idx = 0
for i = 0 … group_count – 1
    m, n, k, alpha, beta, lda, ldb, ldc, and group_size are taken at position i in their respective arrays
    for j = 0 … group_size – 1
        A, B, and C are the matrices at position idx in their respective arrays
        C = alpha * op(A) * op(B) + beta * C
        idx := idx + 1
    end for
end for
where:
op(X) is one of op(X) = X, op(X) = X^T, or op(X) = X^H
alpha and beta are scalars
A, B, and C are matrices
The a, b, and c buffers contain all the input matrices. The stride between consecutive matrices is given by the stride parameters, each of which must be at least the exact size of one matrix (tightly packed) but may be larger. The total number of matrices in each of the a, b, and c buffers is given by the batch_size parameter.
Here, op(A) is m x k, op(B) is k x n, and C is m x n.
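The idx bookkeeping in the group loop above, which maps groups to positions in the flat pointer arrays, can be sketched as follows. `first_index_of_group` is a hypothetical helper for illustration, not part of the API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of how the group API walks its pointer arrays: the matrices of
// all groups are laid out consecutively, and idx advances across groups
// exactly as in the pseudocode above.
std::vector<std::int64_t> first_index_of_group(const std::vector<std::int64_t>& group_size) {
    std::vector<std::int64_t> first(group_size.size());
    std::int64_t idx = 0;
    for (std::size_t i = 0; i < group_size.size(); ++i) {
        first[i] = idx;        // group i occupies positions [idx, idx + group_size[i])
        idx += group_size[i];  // same idx advance as the loop above
    }
    return first;
}
```

For example, group sizes {3, 1, 2} give first indices {0, 3, 4}, so the matrices of group 2 sit at positions 4 and 5 of the A, B, and C pointer arrays.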
Strided API
transa: Specifies op(A), the transposition operation applied to the matrices A. See Data Types for more details.
transb: Specifies op(B), the transposition operation applied to the matrices B. See Data Types for more details.
m: Number of rows of op(A) and C. Must be at least zero.
n: Number of columns of op(B) and C. Must be at least zero.
k: Number of columns of op(A) and rows of op(B). Must be at least zero.
alpha: Scaling factor for the matrix-matrix products.
a: Buffer holding the input matrices A. Must have size at least stridea*batch_size.
lda: Leading dimension of the A matrices. If matrices are stored using column major layout, lda must be at least m if A is not transposed, and at least k if A is transposed. If matrices are stored using row major layout, lda must be at least k if A is not transposed, and at least m if A is transposed. It must be positive.
stridea: Stride between the different A matrices. If matrices are stored using column (respectively, row) major layout, stridea must be at least lda*k (respectively, lda*m) if A is not transposed and at least lda*m (respectively, lda*k) if A is transposed.
b: Buffer holding the input matrices B. Must have size at least strideb*batch_size.
ldb: Leading dimension of the B matrices. If matrices are stored using column major layout, ldb must be at least k if B is not transposed, and at least n if B is transposed. If matrices are stored using row major layout, ldb must be at least n if B is not transposed, and at least k if B is transposed. It must be positive.
strideb: Stride between the different B matrices. If matrices are stored using column (respectively, row) major layout, strideb must be at least ldb*n (respectively, ldb*k) if B is not transposed and at least ldb*k (respectively, ldb*n) if B is transposed.
beta: Scaling factor for the matrices C.
c: Buffer holding the input/output matrices C. Must have size at least stridec*batch_size.
ldc: Leading dimension of the C matrices. If matrices are stored using column major layout, ldc must be at least m. If matrices are stored using row major layout, ldc must be at least n. It must be positive.
stridec: Stride between the different C matrices. If matrices are stored using column (respectively, row) major layout, stridec must be at least ldc*n (respectively, ldc*m).
batch_size: Specifies the number of matrix multiply operations to perform.
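For the column-major, non-transposed case, the minimum leading dimensions and strides described above reduce to simple products. A small illustration of that arithmetic (`min_strides_colmajor_nn` is a hypothetical helper, not part of the API):

```cpp
#include <cassert>
#include <cstdint>

// Minimum strides for tightly packed, column-major, non-transposed
// batched GEMM: lda = m, ldb = k, ldc = m, and the stride of each
// buffer is its leading dimension times its number of columns.
struct Strides { std::int64_t stridea, strideb, stridec; };

Strides min_strides_colmajor_nn(std::int64_t m, std::int64_t n, std::int64_t k) {
    std::int64_t lda = m, ldb = k, ldc = m;  // leading-dimension minimums
    return {lda * k, ldb * n, ldc * n};      // stride minimums from the text
}
```

For example, m = 3, n = 4, k = 5 gives stridea = 15, strideb = 20, and stridec = 12, so a must hold at least 15*batch_size elements.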
Group API
transa: Array of size group_count. Each element i specifies op(A), the transposition operation applied to the A matrices of group i. See Data Types for more details.
transb: Array of size group_count. Each element i specifies op(B), the transposition operation applied to the B matrices of group i. See Data Types for more details.
m: Array of size group_count of numbers of rows of op(A) and C. Each must be at least zero.
n: Array of size group_count of numbers of columns of op(B) and C. Each must be at least zero.
k: Array of size group_count of numbers of columns of op(A) and rows of op(B). Each must be at least zero.
alpha: Array of size group_count of scaling factors for the matrix-matrix products.
A: Array of size total_batch_count of pointers to the A matrices. If matrices are stored using column (respectively, row) major layout, the array allocated for an A matrix of group i must be of size at least ldai*ki (respectively, ldai*mi) if A is not transposed and at least ldai*mi (respectively, ldai*ki) if A is transposed.
lda: Array of size group_count of leading dimensions of the A matrices. If matrices are stored using column major layout, ldai must be at least mi if A is not transposed, and at least ki if A is transposed. If matrices are stored using row major layout, ldai must be at least ki if A is not transposed, and at least mi if A is transposed. Each must be positive.
B: Array of size total_batch_count of pointers to the B matrices. If matrices are stored using column (respectively, row) major layout, the array allocated for a B matrix of group i must be of size at least ldbi*ni (respectively, ldbi*ki) if B is not transposed and at least ldbi*ki (respectively, ldbi*ni) if B is transposed.
ldb: Array of size group_count of leading dimensions of the B matrices. If matrices are stored using column major layout, ldbi must be at least ki if B is not transposed, and at least ni if B is transposed. If matrices are stored using row major layout, ldbi must be at least ni if B is not transposed, and at least ki if B is transposed. Each must be positive.
beta: Array of size group_count of scaling factors for the matrices C.
C: Array of size total_batch_count of pointers to the C matrices. If matrices are stored using column (respectively, row) major layout, the array allocated for a C matrix of group i must be of size at least ldci*ni (respectively, ldci*mi).
ldc: Array of size group_count of leading dimensions of the C matrices. If matrices are stored using column major layout, ldci must be at least mi. If matrices are stored using row major layout, ldci must be at least ni. Each must be positive.
group_count: Number of groups. Must be at least 0.
group_size: Array of size group_count. The element group_size[i] is the number of matrices in group i. Each element must be at least 0.
dependencies: List of events to wait for before starting computation, if any. If omitted, defaults to no dependencies.
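The total_batch_count used for the sizes of the pointer arrays above is the sum of all group sizes. A one-line sketch of that relation (`total_batch_count` here is a hypothetical helper, not an API parameter you pass):

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// total_batch_count = group_size[0] + ... + group_size[group_count - 1]
std::int64_t total_batch_count(const std::vector<std::int64_t>& group_size) {
    return std::accumulate(group_size.begin(), group_size.end(), std::int64_t{0});
}
```

So with group sizes {3, 1, 2}, the A, B, and C pointer arrays must each hold 6 pointers.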
Strided API
c: Output buffer, overwritten by batch_size matrix multiply operations of the form alpha*op(A)*op(B) + beta*C.
Group API
C: Output array of pointers to the C matrices, overwritten by total_batch_count matrix multiply operations of the form alpha*op(A)*op(B) + beta*C.
If beta = 0, matrix C does not need to be initialized before calling gemm_batch.