This recipe focuses on a step-by-step approach to optimize a SYCL application running on the Intel® GPU platform using Intel® Advisor.
This recipe describes how to analyze application performance using the Intel Advisor GPU Roofline. By following the Intel Advisor built-in recommendations and making iterative changes to the source code, you can improve application performance by 1.63x compared to the baseline result. The sections below describe each optimization step in detail.
This section lists the hardware and software used to produce the specific result shown in this recipe:
Intel® Advisor, available for download at https://www.intel.com/content/www/us/en/developer/tools/oneapi/advisor.html
QuickSilver sample application, available for download from GitHub* at https://github.com/oneapi-src/Velocity-Bench/tree/main/QuickSilver/SYCL/src
QuickSilver example input files, available at https://github.com/oneapi-src/Velocity-Bench/tree/main/QuickSilver/Examples
Available for download at standalone components catalog
$ source <oneapi-install-dir>/setvars.sh
$ icpx --version
$ advisor --version
If all is set up correctly, you should see a version of each tool.
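The environment check above can also be scripted so that a missing tool is reported explicitly. The following is a minimal sketch; the helper name `check_tools` is hypothetical and not part of Intel Advisor or the oneAPI toolkit.

```shell
# Hypothetical helper: report which of the named tools are on PATH.
# Returns 0 if all tools are found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

After sourcing setvars.sh, `check_tools icpx advisor` should report both tools as found.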
$ git clone https://github.com/oneapi-src/Velocity-Bench.git
$ cd Velocity-Bench/QuickSilver/SYCL/
$ mkdir build
$ cd build
$ CXX=icpx cmake ..
$ make -sj
You should see the qs executable in the current directory.
$ QS_DEVICE=GPU ./qs -i ../../Examples/AllScattering/scatteringOnly.inp
In the console output, look for Figure Of Merit, which is the performance metric for this application (higher is better).
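To track this metric across runs, the value can be pulled out of a saved console log. The sketch below assumes that the output contains a line with the text "Figure Of Merit" followed by a numeric value; the exact layout of that line is an assumption, and the helper name `extract_fom` is hypothetical.

```shell
# Hypothetical helper: read console output on stdin and print the first
# numeric field found on the line that contains "Figure Of Merit".
# Assumes the value is a plain or scientific-notation number.
extract_fom() {
    awk '/Figure Of Merit/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?$/) {
                print $i
                exit
            }
        }
    }'
}
```

For example, the run above could be piped through it as `QS_DEVICE=GPU ./qs -i ../../Examples/AllScattering/scatteringOnly.inp | extract_fom`.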
$ QS_DEVICE=GPU advisor -collect roofline --profile-gpu -gpu-sampling-interval=0.1 --project-dir=qs_base_run -- ./qs_base -i ../../Examples/AllScattering/scatteringOnly.inp
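The same collection is repeated after each source change later in this recipe, with only the project directory and the executable name changing. A small wrapper can keep those command lines consistent. This is a sketch only: the function name `roofline_cmd` is hypothetical, and it uses exactly the options shown in the command above, printing the command rather than running it.

```shell
# Hypothetical helper: print the GPU Roofline collection command for a
# given project directory and executable. The options are copied from
# the command shown in this recipe; nothing is executed here.
roofline_cmd() {
    project_dir=$1
    binary=$2
    # Default input file is the one used throughout this recipe.
    input=${3:-../../Examples/AllScattering/scatteringOnly.inp}
    printf 'advisor -collect roofline --profile-gpu -gpu-sampling-interval=0.1 --project-dir=%s -- %s -i %s\n' \
        "$project_dir" "$binary" "$input"
}
```

The printed line can then be run with, for example, `QS_DEVICE=GPU sh -c "$(roofline_cmd qs_base_run ./qs_base)"`.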
In the example below, the Recommendations tab contains two suggestions for the GPU kernels running on Intel® Data Center GPU Max 1550. They refer to:
[[intel::reqd_sub_group_size(<SIMD_width>)]]
Locate the main.cc.dp.cpp file in the QuickSilver source and make the necessary changes as follows:
$ cd <quicksilver>/SYCL/src
$ vi main.cc.dp.cpp
The example in the figure below shows how to change main.cc.dp.cpp to set the SIMD width to 16.
$ make -sj
$ QS_DEVICE=GPU advisor -collect roofline --profile-gpu -gpu-sampling-interval=0.1 --project-dir=qs_simd_16_change -- ./qs_simd_16 -i ../../Examples/AllScattering/scatteringOnly.inp
The GPU Roofline Regions tab also includes the Register Spilling metric. Register spilling can cause significant performance degradation, especially when spills occur inside hot loops. When variables are not promoted to registers, accesses to them significantly increase memory traffic. In this example, the metric has a value of 1408 B.
By default, small register mode (128 GRF) is used. To avoid register spilling, use large register mode (256 GRF) instead. To do so, do the following:
"-fsycl-targets=spir64 -Xs \"-options -ze-opt-large-register-file\" "
$ make -sj
$ QS_DEVICE=GPU advisor -collect roofline --profile-gpu -gpu-sampling-interval=0.1 --project-dir=qs_larger_grf_change -- ./qs_simd_16 -i ../../Examples/AllScattering/scatteringOnly.inp
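To quantify the effect of each change, the Figure Of Merit from an optimized run can be divided by the Figure Of Merit from the baseline run. The sketch below shows that arithmetic; the function name `speedup` is hypothetical, and the numbers in the usage note are placeholders rather than measured results.

```shell
# Hypothetical helper: print optimized / baseline as a ratio with two
# decimal places. Higher Figure Of Merit is better, so a ratio above
# 1.00 means the optimized run is faster.
speedup() {
    awk -v base="$1" -v opt="$2" 'BEGIN {
        if (base + 0 <= 0) exit 1
        printf "%.2f\n", opt / base
    }' || { echo "speedup: baseline must be a positive number" >&2; return 1; }
}
```

For example, `speedup 100 163` prints 1.63, which is how a 1.63x improvement over a baseline of 100 would appear.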
The current work-group size, as seen in the Roofline view, is now 16. You can experiment with increasing this value to 32 or 64 and observe the performance implications.
The GPU Roofline Regions tab has one more recommendation, which concerns private memory usage. You can try using local memory instead of global memory for variables.