OpenMP* Offload for Intel® oneAPI Math Kernel Library

You can use Intel® oneAPI Math Kernel Library (oneMKL) and OpenMP* offload to run standard oneMKL computations on Intel GPUs. The oneMKL features that support OpenMP offload are listed in the mkl_omp_offload.f90 interface module file.

The OpenMP offload feature of oneMKL lets you run oneMKL computations on Intel GPUs by calling the standard oneMKL APIs within an omp dispatch construct. For example, the standard Fortran BLAS interface for single-precision real matrix multiplication is:

subroutine sgemm ( transa, transb, m, n, k, alpha, a, lda,        &
                   b, ldb, beta, c, ldc ) BIND(C)
  character*1, intent(in) :: transa, transb
  integer, intent(in)     :: m, n, k, lda, ldb, ldc
  real, intent(in)        :: alpha, beta
  real, intent(in)        :: a( lda, * ), b( ldb, * )
  real, intent(inout)     :: c( ldc, * )
end subroutine sgemm

If sgemm is called outside of an omp dispatch construct, or if offload is disabled, the CPU implementation is dispatched. If the same function is called within an omp dispatch construct and offload is possible, the GPU implementation is dispatched. By default, execution of the oneMKL function within a dispatch construct is synchronous. To request asynchronous execution, add the nowait clause to the dispatch construct. In that case, the application must handle synchronization itself using standard OpenMP synchronization functionality, for example the omp taskwait construct.
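The asynchronous case can be sketched as follows. This is a minimal illustration rather than one of the shipped oneMKL examples: the program name, the matrix sizes, and the constant fill values are assumptions chosen for brevity, and it presumes a compiler and oneMKL installation with OpenMP offload enabled.

```fortran
include "mkl_omp_offload.f90"

program sgemm_nowait_sketch   ! hypothetical program name
use onemkl_blas_omp_offload_lp64
implicit none

character(len=1) :: transa = 'N', transb = 'N'
integer :: m = 5, n = 3, k = 10
integer :: lda, ldb, ldc
real :: alpha = 1.0, beta = 0.0
real, allocatable :: a(:,:), b(:,:), c(:,:)

! leading dimensions for non-transposed, column-major storage
lda = m
ldb = k
ldc = m
allocate(a(lda,k), b(ldb,n), c(ldc,n))

! illustrative data (assumption): with these values every entry of
! alpha*a*b is 1.0 * 2.0 * k
a = 1.0
b = 2.0
c = 0.0

!$omp target data map(a,b,c)

! nowait requests asynchronous execution: the call may return
! before the GPU computation has finished
!$omp dispatch nowait
call sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)

! independent host work could go here, provided it does not touch c

! wait for the outstanding sgemm before leaving the data region,
! so that c is complete when it is copied back to the host
!$omp taskwait

!$omp end target data

print *, 'c(1,1) =', c(1,1)

deallocate(a, b, c)
end program sgemm_nowait_sketch
```

Placing the taskwait before the end of the target data region is a deliberate choice in this sketch: the results are copied back to the host when the region closes, so the computation needs to have completed by then.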

To offload to a device, an argument to the oneMKL function must be mapped to device memory if it represents a return value (marked with intent(out) or intent(inout) in the subroutine declaration) or if it points to an array of data (such as a matrix or vector, even if it is an input array). You must map these arguments to the device using the omp target data construct before calling the oneMKL routine.
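As a sketch of these mapping rules applied to sgemm: the input arrays a and b are mapped even though they are read-only, the intent(inout) array c is mapped in both directions, and the scalar arguments are passed without mapping. The explicit to and tofrom map types are standard OpenMP but are a choice made for this illustration. The program name, the fill values, and the c_ref comparison array are likewise assumptions.

```fortran
include "mkl_omp_offload.f90"

program sgemm_map_sketch   ! hypothetical program name
use onemkl_blas_omp_offload_lp64
implicit none

character(len=1) :: transa = 'N', transb = 'N'
integer :: m = 5, n = 3, k = 10
integer :: lda, ldb, ldc
real :: alpha = 1.0, beta = 1.0
real, allocatable :: a(:,:), b(:,:), c(:,:), c_ref(:,:)

lda = m
ldb = k
ldc = m
allocate(a(lda,k), b(ldb,n), c(ldc,n), c_ref(ldc,n))

! illustrative data (assumption)
a = 1.0
b = 2.0
c = 3.0
c_ref = c

! CPU reference: called outside a dispatch construct
call sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c_ref, ldc)

! a and b are input arrays: they still have to be mapped (host to device)
! c is intent(inout) and beta is nonzero: its initial values are needed on
! the device and its result is needed on the host, hence tofrom
! the scalars (m, n, k, alpha, beta, lda, ldb, ldc) are not mapped
!$omp target data map(to: a, b) map(tofrom: c)

!$omp dispatch
call sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)

!$omp end target data

! compare the offloaded result with the CPU reference
print *, 'max |c - c_ref| =', maxval(abs(c - c_ref))

deallocate(a, b, c, c_ref)
end program sgemm_map_sketch
```

Using map(to: ...) for the read-only inputs avoids copying a and b back to the host when the data region ends. A plain map(a,b,c) is also valid and simply transfers all three arrays in both directions.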

In Fortran, the OpenMP offload interfaces have stricter type checking than the standard Fortran interfaces for the same functions. For BLAS functions and BLAS-like extensions, you can bypass this stricter type checking by changing the module that is loaded. For instance, in the example below, replace use onemkl_blas_omp_offload_lp64 with use onemkl_blas_omp_offload_lp64_no_array_check.

Example

Examples for using the OpenMP offload for oneMKL are in the Intel® oneAPI Math Kernel Library (oneMKL) installation directory, under:

examples/f_offload

include "mkl_omp_offload.f90"

program sgemm_example
use onemkl_blas_omp_offload_lp64
use common_blas  

character*1 :: transa = 'N', transb = 'N'
integer :: i, j, m = 5, n = 3, k = 10
integer :: lda, ldb, ldc
real :: alpha = 1.0, beta = 1.0
real,allocatable :: a(:,:), b(:,:), c(:,:)

! initialize leading dimension and allocate and initialize arrays
lda = m
…
allocate(a(lda,k))
…
 
! initialize matrices
call sinit_matrix(transa, m, k, lda, a)
…

! Calling sgemm on the CPU using standard oneMKL Fortran interface
call sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)

! map the a, b and c matrices on the device memory
!$omp target data map(a,b,c)

! Calling sgemm on the GPU using standard oneMKL Fortran interface within a dispatch construct
!$omp dispatch
call sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)

!$omp end target data

! Free memory
deallocate(a)
…
stop
end program