Distributed DFT#
Starting with the Intel® oneAPI Math Kernel Library (oneMKL) 2025.2 release, a DPC++ interface for computing distributed Discrete Fourier Transforms is introduced. It is designed to perform FFTs on a collection of SYCL GPU devices (single- or multi-node), where each individual GPU device is accessible within its respective process. To organize communication between processes, the interface uses the Message Passing Interface (MPI). This interface declares the oneapi::mkl::experimental::dft namespace, which contains:

- the scoped enumerations oneapi::mkl::experimental::dft::distributed_config_param and oneapi::mkl::experimental::dft::distributed_config_value;
- the oneapi::mkl::experimental::dft::distributed_descriptor class template;
- the oneapi::mkl::experimental::dft::compute_forward and oneapi::mkl::experimental::dft::compute_backward function templates.
This new interface closely resembles the single-process DPC++ interface and thus re-uses the scoped enumerations defined in the oneapi::mkl::dft namespace for configuring and executing a DFT distributed across multiple processes. For a DFT whose forward domain and floating-point format are represented by the values dom and prec (known at compile time) of respective types oneapi::mkl::dft::domain and oneapi::mkl::dft::precision (see the scoped enumerations), the desired global transform and its general configuration are communicated uniformly across the processes via an object of the oneapi::mkl::experimental::dft::distributed_descriptor<prec, dom> class. Once successfully committed across all involved processes to the desired DFT configuration and to a user-provided local sycl::queue instance, that distributed_descriptor object can be used as an argument to the appropriate compute function(s), along with the relevant local chunks of input and output data.
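Since prec and dom are template parameters, the descriptor type for a given transform is fixed at compile time. The following is a minimal sketch of how such a type could be spelled for a single-precision complex transform; the header name and the illustrative lengths are assumptions, not part of the specification above.

```cpp
#include <mpi.h>
#include <vector>
#include <cstdint>
// Header name assumed; check the oneMKL release notes for the exact include.
#include <oneapi/mkl/dft.hpp>

// Compile-time selection of precision and forward domain, as described above.
using desc_t = oneapi::mkl::experimental::dft::distributed_descriptor<
    oneapi::mkl::dft::precision::SINGLE,
    oneapi::mkl::dft::domain::COMPLEX>;

void make_descriptor() {
    // Each process constructs the same global 2D transform over the same
    // communicator; 128 x 128 is an illustrative size only.
    std::vector<std::int64_t> lengths{128, 128};
    desc_t dist_desc(MPI_COMM_WORLD, lengths);
}
```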
The distributed DFT DPC++ interface computes a DFT in five steps:

1. Each process creates a oneapi::mkl::experimental::dft::distributed_descriptor object dist_desc for the targeted global DFT problem with a call to the relevant parameterized constructor, e.g., distributed_descriptor<prec, dom> dist_desc(MPI_COMM_WORLD, lengths);, wherein prec and dom are specialization values of types oneapi::mkl::dft::precision and oneapi::mkl::dft::domain, respectively. dist_desc captures the configuration of the global transform, such as the dimensionality (or rank), length(s), number of transforms, layout of the input and output data (defined by strides, distances, and possibly other configuration parameters), scaling factors, etc. All configuration settings are assigned default values in this call and may be modified thereafter. By default, distributed_descriptor objects within each process are initialized for the in-place calculation of an unbatched (\(M = 1\)), unscaled (\(\sigma_{\delta} = 1,\ \forall \delta\)) global DFT of the forward domain, precision, and length(s) set at construction.
2. Optionally adjust the configuration of dist_desc by calling its relevant configuration-setting member function(s) as many times as needed, including for the data-distribution configuration. The value associated with (almost) any configuration parameter can be obtained with the appropriate configuration-querying member function(s); default values are returned unless the queried configuration parameter was previously set. The configurations defining the global transform must be set uniformly across all processes (except for the custom distribution configuration); otherwise, the behavior is undefined.
3. Commit dist_desc with a call to its commit member function; that is, make the object ready to compute the global transform. All the dist_desc objects across the processes must be successfully committed before the global transform can be performed. Once the objects are committed, the configuration parameters of the global DFT are considered frozen for computation purposes: changing any of them after committing the object effectively invalidates it for computation until the commit member function is called again. The commit member function takes a sycl::queue object built upon the sycl::device object that is mapped to a physical device by MPI.
4. Use the committed dist_desc to query the local size of the device memory allocations needed for the respective domain (forward or backward) within each process, and initialize the input data. The configuration-querying member function(s) can be used for this.
5. Use the committed dist_desc to call the appropriate oneapi::mkl::experimental::dft::compute_forward or oneapi::mkl::experimental::dft::compute_backward functions as needed to compute the desired global transform(s). These functions require no arguments other than a committed distributed descriptor object and the device-accessible input and output data.
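The five steps above can be sketched as follows. This is an illustrative outline only: the header name, the configuration-querying parameter names, and the in-place compute call are assumptions marked in the comments, and the code requires Intel® MPI with GPU support plus a supported GPU to run.

```cpp
#include <mpi.h>
#include <vector>
#include <cstdint>
#include <sycl/sycl.hpp>
// Header name assumed; check the oneMKL documentation for your release.
#include <oneapi/mkl/dft.hpp>

namespace edft = oneapi::mkl::experimental::dft;

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Step 1: each process creates the descriptor for the same global
    // 64 x 64 x 64 double-precision real transform (sizes illustrative).
    std::vector<std::int64_t> lengths{64, 64, 64};
    edft::distributed_descriptor<oneapi::mkl::dft::precision::DOUBLE,
                                 oneapi::mkl::dft::domain::REAL>
        dist_desc(MPI_COMM_WORLD, lengths);

    // Step 2 (optional): adjust the configuration, uniformly across processes,
    // via the descriptor's configuration-setting member functions.

    // Step 3: commit on a queue built upon the device mapped to this process.
    sycl::queue q{sycl::gpu_selector_v};
    dist_desc.commit(q);

    // Step 4: query the local allocation size for each domain and initialize
    // the input data. The parameter name below is an assumption, shown for
    // shape only:
    // std::int64_t local_size = 0;
    // dist_desc.get_value(edft::distributed_config_param::..., &local_size);
    // double* local_data = sycl::malloc_device<double>(local_size, q);

    // Step 5: compute the forward transform on the local chunk (in-place by
    // default), e.g.:
    // edft::compute_forward(dist_desc, local_data);

    MPI_Finalize();
    return 0;
}
```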
Supported functionality and limitations#
- Only 2D and 3D transforms are supported, with the following limitations:
  - Each dimension must have a length greater than or equal to the number of processes.
  - Batching is not supported.
  - Only the default packed layouts are supported.
- Only Intel® Data Center GPU Max Series devices are supported.
- All processes must be provided the same MPI communicator.
- Currently, only Intel® MPI is supported.
- The environment variable I_MPI_OFFLOAD must be set to 1 for the interface to be functional; otherwise, an exception is thrown.
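A typical launch under Intel® MPI could therefore look like the following configuration fragment; the rank count and binary name are placeholders.

```shell
# Enable GPU support in Intel MPI (required; otherwise an exception is thrown).
export I_MPI_OFFLOAD=1

# Launch one rank per GPU device; "./my_dist_dft" is a placeholder binary name.
mpirun -n 4 ./my_dist_dft
```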
Note
The mapping of the available SYCL devices to the processes is controlled by MPI.