Debug the Offload Process
Run with Different Runtimes or Compute Devices
When an offload program fails to run correctly or produces incorrect results, a relatively quick sanity check is to run the application on a different runtime (OpenCL™ vs. Level Zero) or compute device (CPU vs. GPU), using LIBOMPTARGET_PLUGIN and OMP_TARGET_OFFLOAD for OpenMP* applications, and ONEAPI_DEVICE_SELECTOR for SYCL* applications. Errors that reproduce across runtimes mostly eliminate the runtime as the problem. Errors that reproduce on all available devices mostly eliminate bad hardware as the problem.
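For example, before re-running the application you might switch runtimes and devices like this (a sketch; the variable values are the ones named above, and your binary name will differ):

```shell
# OpenMP offload: choose the plugin (runtime) and the offload policy.
export LIBOMPTARGET_PLUGIN=OPENCL       # or LEVEL0
export OMP_TARGET_OFFLOAD=MANDATORY     # fail instead of silently falling back to host

# SYCL: choose runtime and device with the device selector.
export ONEAPI_DEVICE_SELECTOR=opencl:cpu   # alternatives: level_zero:gpu, opencl:gpu

# ./myapp <args>    # placeholder for your application
```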
Debug CPU Execution
Offload code has two options for CPU execution: a “host” implementation, or the CPU version of OpenCL. A “host” implementation is a truly native implementation of the offloaded code, meaning it can be debugged like any of the non-offloaded code. The CPU version of OpenCL, while it goes through the OpenCL runtime and code-generation process, eventually ends up as normal parallel code running under a TBB runtime. Again, this provides a familiar debugging environment with familiar assembly and parallelism mechanisms. Pointers have meaning through the entire stack, and data can be directly inspected. There are also no memory limits beyond the usual limits for any operating system process.
Finding and fixing errors in CPU offload execution may solve errors seen in GPU offload execution with less pain, and without requiring use of a system with an attached GPU or other accelerator.
For OpenMP applications, to get a “host” implementation, remove the “target” or “device” constructs, replacing them with normal host OpenMP code. If LIBOMPTARGET_PLUGIN=OPENCL is set and offload to the GPU is disabled, then the offloaded code runs under the OpenMP runtime with TBB providing parallelism.
For SYCL applications, with ONEAPI_DEVICE_SELECTOR=host the “host” device is actually single-threaded, which may help you determine if threading issues, such as data races and deadlocks, are the source of execution errors. Setting ONEAPI_DEVICE_SELECTOR=opencl:cpu uses the CPU OpenCL runtime, which also uses TBB for parallelism.
Debug GPU Execution Using Intel® Distribution for GDB* on compatible GPUs
Intel® Distribution for GDB* is extensively documented in Get Started with Intel® Distribution for GDB on Linux* Host | Windows* Host. Useful commands are briefly described in the Intel® Distribution for GDB Reference Sheet. However, since debugging applications with GDB* on a GPU differs slightly from the process on a host (some commands are used differently and you might see some unfamiliar output), some of those differences are summarized here.
The Debugging with Intel® Distribution for GDB on Linux OS Host Tutorial shows a sample debug session where we start a debug session of a SYCL program, define a breakpoint inside the kernel, run the program to offload to the GPU, print the value of a local variable, switch to the SIMD lane 5 of the current thread, and print the variable again.
As in normal GDB*, for a command <CMD>, use the help <CMD> command of GDB to read the information text for <CMD>. For example:
(gdb) help info threads
Display currently known threads.
Usage: info threads [OPTION]... [ID]...
If ID is given, it is a space-separated list of IDs of threads to display.
Otherwise, all threads are displayed.
Options:
-gid
Show global thread IDs.
Referencing Inferiors, Threads, and SIMD Lanes in GDB*
The threads of the application can be listed using the debugger. The printed information includes the thread ids and the locations that the threads are currently stopped at. For the GPU threads, the debugger also prints the active SIMD lanes.
In the example referenced above, you may see some unfamiliar formatting used when threads are displayed via the GDB “info threads” command:
Id Target Id Frame
1.1 Thread <id omitted> <frame omitted>
1.2 Thread <id omitted> <frame omitted>
* 2.1:1 Thread 1073741824 <frame> at array-transform.cpp:61
2.1:[3 5 7] Thread 1073741824 <frame> at array-transform.cpp:61
2.2:[1 3 5 7] Thread 1073741888 <frame> at array-transform.cpp:61
2.3:[1 3 5 7] Thread 1073742080 <frame> at array-transform.cpp:61
Here, GDB is displaying the threads with the following format:
<inferior_number>.<thread_number>:<SIMD Lane/s>
So, for example, the thread id “2.3:[1 3 5 7]” refers to SIMD lanes 1, 3, 5, and 7 of thread 3 running on inferior 2.
An “inferior” in the GDB terminology is the process that is being debugged. In the debug session of a program that offloads to the GPU, there will typically be two inferiors; one “native” inferior representing a host part of the program (inferior 1 above), and another “remote” inferior representing the GPU device (inferior 2 above). Intel® Distribution for GDB automatically creates the GPU inferior - no extra steps are required.
When you print the value of an expression, the expression is evaluated in the context of the current thread’s current SIMD lane. You can switch the thread as well as the SIMD lane to change the context, using the “thread” command with arguments such as “thread 3:4”, “thread :6”, or “thread 7”. The first command switches to thread 3, SIMD lane 4. The second command switches to SIMD lane 6 within the current thread. The third command switches to thread 7. The default lane selected will be either the previously selected lane, if it is still active, or the first active lane within the thread.
The “thread apply” command may be similarly broad or focused (which can make it easier to limit the output from, for example, a command to inspect a variable). For more details and examples about debugging with SIMD lanes, see the Debugging with Intel® Distribution for GDB on Linux OS Host Tutorial.
More information about threads and inferiors in GDB can be found from https://sourceware.org/gdb/current/onlinedocs/gdb/Threads.html and https://sourceware.org/gdb/current/onlinedocs/gdb/Inferiors-Connections-and-Programs.html#Inferiors-Connections-and-Programs.
Controlling the Scheduler
By default, when a thread hits a breakpoint, the debugger stops all the threads before displaying the breakpoint hit event to the user. This is the all-stop mode of GDB. In the non-stop mode, the stop event of a thread is displayed while the other threads run freely.
In all-stop mode, when a thread is resumed (for example, resumed normally with the continue command, or for stepping with the next command), all the other threads are also resumed. If you have breakpoints set in threaded applications, this can quickly become confusing, as the next thread that hits the breakpoint may not be the thread you are following.
You can control this behavior using the set scheduler-locking command to prevent other threads from resuming when the current thread is resumed. This is useful to avoid interference from other threads while only the current thread executes instructions. Type help set scheduler-locking for the available options, and see https://sourceware.org/gdb/current/onlinedocs/gdb/Thread-Stops.html for more information. Note that SIMD lanes cannot be resumed individually; they are resumed together with their underlying thread.
In non-stop mode, by default, only the current thread is resumed. To resume all threads, pass the “-a” flag to the continue command.
Dumping Information on One or More Threads/Lanes (Thread Apply)
Commands for inspecting the program state are typically executed in the context of the current thread’s current SIMD lane. Sometimes it is desirable to inspect a value in multiple contexts. For such needs, use the thread apply command. For instance, the following executes the print element command for SIMD lanes 3-5 of thread 2.5:
(gdb) thread apply 2.5:3-5 print element
Similarly, the following runs the same command in the context of SIMD lane 3, 5, and 6 of the current thread:
(gdb) thread apply :3 :5 :6 print element
Stepping GPU Code After a Breakpoint
To stop inside the kernel that is offloaded to the GPU, simply define a
breakpoint at a source line inside the kernel. When a GPU thread hits
that source line, the debugger stops the execution and shows the
breakpoint hit. To single-step a thread over a source-line, use the
step
or next
commands. The step
commands steps into
functions while next
steps over calls. Before stepping, we recommend
to set scheduler-locking step
to prevent intervention of other
threads.
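A debugger session using these commands might look like the following sketch (the breakpoint location reuses the array-transform.cpp example shown earlier):

```
(gdb) break array-transform.cpp:61
(gdb) run
...
(gdb) set scheduler-locking step
(gdb) next
(gdb) step
```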
Building a SYCL Executable for Use with Intel® Distribution for GDB*
Much like when you want to debug a host application, you need to set some additional flags to create a binary that can be debugged on the GPU. See Get Started with Intel® Distribution for GDB on Linux* Host for details.
For a smooth debug experience when using the just-in-time (JIT) compilation flow, enable debug information emission from the compiler via the -g flag, and disable optimizations via the -O0 flag, for both the host part and the JIT-compiled kernels of the application. The flags for the kernels are taken at link time. For example:
Compile your program using:
icpx -fsycl -g -O0 -c myprogram.cpp
Link your program using:
icpx -fsycl -g -O0 myprogram.o
If you are using CMake to configure the build of your program, use the Debug type for CMAKE_BUILD_TYPE, and append -O0 to the CMAKE_CXX_FLAGS_DEBUG variable. For example:
set (CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -O0")
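Put together, a minimal CMake sketch of those two settings might be:

```cmake
# Request a debug build (adds -g) and additionally disable optimizations,
# which applies to both the host code and the JIT-compiled kernels.
set(CMAKE_BUILD_TYPE Debug)
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -O0")
```

Equivalently, the settings can be passed on the command line: cmake -DCMAKE_BUILD_TYPE=Debug ..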
Applications that are built for debugging may take a little longer to start up than when built with the usual “release” level of optimization. Thus, your program may appear to run a little more slowly when started in the debugger. If this causes problems, developers of larger applications may want to use ahead-of-time (AOT) compilation, so that the offload code is compiled when the program is built rather than when it is run (warning: the build itself may also take longer when using -g -O0). For more information, see the Compilation Flow Overview.
When doing ahead-of-time compilation for GPU, you must use a device type
that fits your target device. Run the following command to see the
available GPU device options on your current machine:
ocloc compile --help
Additionally, the debug mode for the kernel must be enabled. The following example AOT compilation command targets the KBL device:
dpcpp -g -O0 -fsycl-targets=spir64_gen-unknown-unknown-sycldevice \
-Xs "-device kbl -internal_options -cl-kernel-debug-enable -options -cl-opt-disable" myprogram.cpp
Building an OpenMP* Executable for use with Intel® Distribution for GDB*
Compile and link your program using the -g -O0
flags. For example:
icpx -fiopenmp -O0 -fopenmp-targets=spir64 -c -g myprogram.cpp
icpx -fiopenmp -O0 -fopenmp-targets=spir64 -g myprogram.o
Set the following environment variables to disable optimizations and enable debug info for the kernel:
export LIBOMPTARGET_OPENCL_COMPILATION_OPTIONS="-g -cl-opt-disable"
export LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-g -cl-opt-disable"
Debugging GPU Execution
A common issue with offload programs is that they may fail to run at all, instead producing a generic OpenCL™ error with little additional information. The Intercept Layer for OpenCL™ Applications, along with the onetrace, ze_tracer, and cl_tracer tools, can be used to get more information about these errors, often helping the developer identify the source of the problem.
Intercept Layer for OpenCL™ Applications
Using this library, in particular the BuildLogging, ErrorLogging, and USMChecking=1 options, you can often find the source of the error.
Create a clintercept.conf file in the home directory with the following content:
SimpleDumpProgramSource=1
CallLogging=1
LogToFile=1
//KernelNameHashTracking=1
BuildLogging=1
ErrorLogging=1
USMChecking=1
//ContextCallbackLogging=1
// Profiling knobs
KernelInfoLogging=1
DevicePerformanceTiming=1
DevicePerformanceTimeLWSTracking=1
DevicePerformanceTimeGWSTracking=1
Run the application with cliloader as follows:
<OCL_Intercept_Install_Dir>/bin/cliloader/cliloader -d ./<app_name> <app_args>
Review the following results in the ~/CLIntercept_Dump/<app_name> directory:
clintercept_report.txt: Profiling results
clintercept_log.txt: Log of OpenCL™ calls used to debug OpenCL issues
The following snippet is from an example log file generated by a program that returned the runtime error CL_INVALID_ARG_VALUE (-50):
...
<<<< clSetKernelArgMemPointerINTEL -> CL_SUCCESS
>>>> clGetKernelInfo( _ZTSZZ10outer_coreiP5mesh_i16dpct_type_1c0e3516dpct_type_60257cS2_S2_S2_S2_S2_S2_S2_S2_fS2_S2_S2_S2_iENKUlRN2cl4sycl7handlerEE197->45clES6_EUlNS4_7nd_itemILi3EEEE225->13 ): param_name = CL_KERNEL_CONTEXT (1193)
<<<< clGetKernelInfo -> CL_SUCCESS
>>>> clSetKernelArgMemPointerINTEL( _ZTSZZ10outer_coreiP5mesh_i16dpct_type_1c0e3516dpct_type_60257cS2_S2_S2_S2_S2_S2_S2_S2_fS2_S2_S2_S2_iENKUlRN2cl4sycl7handlerEE197->45clES6_EUlNS4_7nd_itemILi3EEEE225->13 ): kernel = 0xa2d51a0, index = 3, value = 0x41995e0
mem pointer 0x41995e0 is an UNKNOWN pointer and no device support shared system pointers!
ERROR! clSetKernelArgMemPointerINTEL returned CL_INVALID_ARG_VALUE (-50)
<<<< clSetKernelArgMemPointerINTEL -> CL_INVALID_ARG_VALUE
In this example, the following values help with debugging the error:
ZTSZZ10outer_coreiP5mesh
index = 3, value = 0x41995e0
Using this data, you can identify which kernel had the problems, what argument was problematic, and why.
onetrace, ze_tracer, and cl_tracer
Similar to Intercept Layer for OpenCL™ Applications, the onetrace, ze_tracer and cl_tracer tools can help find the source of errors detected by the Level Zero and OpenCL™ runtimes.
To use the onetrace or ze_tracer tools to root-cause Level Zero issues (cl_tracer would be used the same way to root-cause OpenCL issues):
Use Call Logging mode to run the application. Redirecting the tool output to a file is optional, but recommended.
./onetrace -c ./<app_name> <app_args> [2> log.txt]
The command for ze_tracer is the same - just substitute “ze_tracer” for “onetrace”.
Review the call trace to figure out the error (log.txt). For example:
>>>> [102032049] zeKernelCreate: hModule = 0x55a68c762690 desc = 0x7fff865b5570 {29 0 0 GEMM} phKernel = 0x7fff865b5438 (hKernel = 0)
<<<< [102060428] zeKernelCreate [28379 ns] hKernel = 0x55a68c790280 -> ZE_RESULT_SUCCESS (0)
...
>>>> [102249951] zeKernelSetGroupSize: hKernel = 0x55a68c790280 groupSizeX = 256 groupSizeY = 1 groupSizeZ = 1
<<<< [102264632] zeKernelSetGroupSize [14681 ns] -> ZE_RESULT_SUCCESS (0)
>>>> [102278558] zeKernelSetArgumentValue: hKernel = 0x55a68c790280 argIndex = 0 argSize = 8 pArgValue = 0x7fff865b5440
<<<< [102294960] zeKernelSetArgumentValue [16402 ns] -> ZE_RESULT_SUCCESS (0)
>>>> [102308273] zeKernelSetArgumentValue: hKernel = 0x55a68c790280 argIndex = 1 argSize = 8 pArgValue = 0x7fff865b5458
<<<< [102321981] zeKernelSetArgumentValue [13708 ns] -> ZE_RESULT_ERROR_INVALID_ARGUMENT (2013265924)
>>>> [104428764] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 2 argSize = 8 pArgValue = 0x7ffe289c7e60
<<<< [104442529] zeKernelSetArgumentValue [13765 ns] -> ZE_RESULT_SUCCESS (0)
>>>> [104455176] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 3 argSize = 4 pArgValue = 0x7ffe289c7e2c
<<<< [104468472] zeKernelSetArgumentValue [13296 ns] -> ZE_RESULT_SUCCESS (0)
...
The example log data shows:
The Level Zero API call that causes the problem (zeKernelSetArgumentValue)
The problem reason (ZE_RESULT_ERROR_INVALID_ARGUMENT)
The argument index (argIndex = 1)
The invalid value location (pArgValue = 0x7fff865b5458)
The kernel handle (hKernel = 0x55a68c790280), which provides the name of the kernel for which this issue is observed (GEMM)
More information can be obtained by omitting the “redirection to file” option and dumping all the output (application output plus tool output) into one stream. Dumping to one stream may help determine where the error occurs relative to the application output (for example, you can find that the error happens between application initialization and the first phase of computations):
Level Zero Matrix Multiplication (matrix size: 1024 x 1024, repeats 4 times)
Target device: Intel® Graphics [0x3ea5]
...
>>>> [104131109] zeKernelCreate: hModule = 0x55af5f39ca10 desc = 0x7ffe289c7f80 {29 0 0 GEMM} phKernel = 0x7ffe289c7e48 (hKernel = 0)
<<<< [104158819] zeKernelCreate [27710 ns] hKernel = 0x55af5f3ca600 -> ZE_RESULT_SUCCESS (0)
...
>>>> [104345820] zeKernelSetGroupSize: hKernel = 0x55af5f3ca600 groupSizeX = 256 groupSizeY = 1 groupSizeZ = 1
<<<< [104360082] zeKernelSetGroupSize [14262 ns] -> ZE_RESULT_SUCCESS (0)
>>>> [104373679] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 0 argSize = 8 pArgValue = 0x7ffe289c7e50
<<<< [104389443] zeKernelSetArgumentValue [15764 ns] -> ZE_RESULT_SUCCESS (0)
>>>> [104402448] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 1 argSize = 8 pArgValue = 0x7ffe289c7e68
<<<< [104415871] zeKernelSetArgumentValue [13423 ns] -> ZE_RESULT_ERROR_INVALID_ARGUMENT (2013265924)
>>>> [104428764] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 2 argSize = 8 pArgValue = 0x7ffe289c7e60
<<<< [104442529] zeKernelSetArgumentValue [13765 ns] -> ZE_RESULT_SUCCESS (0)
>>>> [104455176] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 3 argSize = 4 pArgValue = 0x7ffe289c7e2c
<<<< [104468472] zeKernelSetArgumentValue [13296 ns] -> ZE_RESULT_SUCCESS (0)
...
Matrix multiplication time: 0.0427564 sec
Results are INCORRECT with accuracy: 1
...
Matrix multiplication time: 0.0430995 sec
Results are INCORRECT with accuracy: 1
...
Total execution time: 0.381558 sec
Correctness
Offload code is often used for kernels that can efficiently process large amounts of information on the attached compute device, or that generate large amounts of information from some input parameters. When such kernels run without crashing, you may not learn that they are producing incorrect results until much later in program execution.
In these cases, it can be difficult to identify which kernel is producing incorrect results. One technique for finding that kernel is to run the program twice, once using a purely host-based implementation and once using an offload implementation, capturing the inputs and outputs of every kernel (often to individual files). Then compare the results and see which kernel call produces unexpected results, within a certain epsilon: the offload hardware may have a different order of operations or native precision that causes the results to differ from the host code in the last digit or two.
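The comparison step can be sketched in plain C++ (the function name and the tolerance are illustrative, not from the original text):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Check that a kernel's offload results match the host results within an
// epsilon; the device may legitimately differ in the last digit or two.
bool results_match(const std::vector<double> &host,
                   const std::vector<double> &device,
                   double eps = 1e-6) {
  if (host.size() != device.size())
    return false;
  for (std::size_t i = 0; i < host.size(); ++i) {
    // Scale the tolerance so large and small magnitudes are treated fairly.
    double scale = std::max({1.0, std::fabs(host[i]), std::fabs(device[i])});
    if (std::fabs(host[i] - device[i]) > eps * scale)
      return false;
  }
  return true;
}
```

Dumping each kernel's inputs and outputs to individual files and running such a comparison per kernel call narrows the search to the first call that diverges.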
Once you know which kernel is producing incorrect results, and you are working with a compatible GPU, use Intel Distribution for GDB to determine the reason. See the Debugging with Intel® Distribution for GDB on Linux OS Host Tutorial for basic information and links to more detailed documentation.
Both SYCL and OpenMP* also allow the use of standard language print mechanisms (printf for SYCL and C++ OpenMP offload, print *, ... for Fortran OpenMP offload) within offloaded kernels, which you can use to verify correct operation while they run. Print the thread and SIMD lane the output is coming from, and consider adding synchronization mechanisms to ensure printed information is in a consistent state when printed. Examples of how to do this in SYCL using the stream class can be found in the Intel oneAPI GPU Optimization Guide. A similar approach works for OpenMP offload.
For more information about using OpenMP directives to add parallelism to your application, see Offload and Optimize OpenMP* Applications with Intel Tools (https://www.intel.com/content/www/us/en/developer/tools/oneapi/training/offload-optimize-openmp-applications.html).
Tip
Using printf can be verbose in SYCL kernels. To simplify, add the following macro:
#ifdef __SYCL_DEVICE_ONLY__
#define CL_CONSTANT __attribute__((opencl_constant))
#else
#define CL_CONSTANT
#endif
#define PRINTF(format, ...) { \
static const CL_CONSTANT char _format[] = format; \
sycl::ONEAPI::experimental::printf(_format, ## __VA_ARGS__); }
Usage example: PRINTF("My integer variable:%d\n", (int) x);
Failures
Just-in-time (JIT) compilation failures that occur at runtime due to incorrect use of the SYCL or OpenMP* offload languages will cause your program to exit with an error.
In the case of SYCL, if you cannot find these errors using ahead-of-time compilation of your SYCL code, then selecting the OpenCL backend, setting SimpleDumpProgramSource and BuildLogging, and using the Intercept Layer for OpenCL™ Applications may help identify the kernel containing the syntax error.
Logic errors can also result in crashes or error messages during execution. Such issues can include:
Passing a buffer that belongs to the wrong context to a kernel
Passing the “this” pointer to a kernel rather than a class element
Passing a host buffer rather than a device buffer
Passing an uninitialized pointer, even if it is not used in the kernel
Using the Intel® Distribution for GDB* (or even the native GDB), if you watch carefully, you can record the addresses of all contexts created and verify that the address being passed to an offload kernel belongs to the correct context. Likewise, you can verify that the address of a variable passed matches that of the variable itself, and not its containing class.
It may be easier to track buffers and addresses using the Intercept Layer for OpenCL™ Applications, or onetrace/cl_tracer with the appropriate backend. When using the OpenCL backend, setting CallLogging, BuildLogging, ErrorLogging, and USMChecking and running your program should produce output that explains what error in your code caused the generic OpenCL error to be produced.
Using onetrace or ze_tracer’s Call Logging or Device Timeline should give additional enhanced error information to help you better understand the source of generic errors from the Level Zero backend. This can help locate many of the logic errors mentioned above.
If the code is giving an error when offloading to a device using the Level Zero backend, try using the OpenCL backend. If the program works, report an error against the Level Zero backend. If the error reproduces in the OpenCL backend to the device, try using the OpenCL CPU backend. In OpenMP offload, this can be specified by setting OMP_TARGET_OFFLOAD to CPU. For SYCL, this can be done by setting ONEAPI_DEVICE_SELECTOR=opencl:cpu. Debugging with everything on the CPU can be easier, and removes complications caused by data copies and translation of the program to a non-CPU device.
As an example of a logic issue that can get you in trouble, consider what is captured by the lambda function used to implement the parallel_for in this SYCL code snippet.
class MyClass {
private:
  int *data;
  int factor;
  :
  void run() {
    :
    auto data2 = data;      // local copies, safe to capture by value
    auto factor2 = factor;
    {
      dpct::get_default_queue_wait().submit([&](cl::sycl::handler &cgh) {
        auto dpct_global_range = grid * block;
        auto dpct_local_range = block;
        cgh.parallel_for<dpct_kernel_name<class kernel_855a44>>(
            cl::sycl::nd_range<1>(
                cl::sycl::range<1>(dpct_global_range.get(0)),
                cl::sycl::range<1>(dpct_local_range.get(0))),
            [=](cl::sycl::nd_item<1> item_ct1) {
              kernel(data, b, factor, LEN, item_ct1); // This blows up
            });
      });
    }
  } // run
}; // MyClass
In the above code snippet, the program crashes because [=] copies by value all variables used inside the lambda. In the example it may not be obvious that “factor” is really “this->factor” and “data” is really “this->data”, so “this” is the variable that is captured for the use of “data” and “factor” above. OpenCL or Level Zero will crash with an illegal-arguments error in the “kernel(data, b, factor, LEN, item_ct1)” call.
The fix is the use of the local variables auto data2 and auto factor2. “auto factor2 = factor” becomes “int factor2 = this->factor”, so using factor2 inside the lambda with [=] captures an “int”. We would rewrite the inner call as “kernel(data2, b, factor2, LEN, item_ct1);”.
Note
This issue is commonly seen when migrating CUDA* kernels. You can also resolve the issue by keeping the same CUDA kernel launch signature and placing the command group and lambda inside the kernel itself.
Using the Intercept Layer for OpenCL™ Applications, onetrace, or ze_tracer, you would see that the kernel was called with two identical addresses, and the extended error information would tell you that you are trying to copy a non-trivial data structure to the offload device.
Note that if you are using unified shared memory (USM), and “MyClass” is allocated in USM, the above code will work. However, if only “data” is allocated in USM, then the program will crash for the above reason.
In this example, note that you can also re-declare the variables in local scope with the same name so that you don’t need to change everything in the kernel call.
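The capture pitfall itself is plain C++ and can be demonstrated without any SYCL machinery (a sketch; the class and member names mirror the snippet above):

```cpp
#include <functional>

// Inside a member function, [=] captures the "this" pointer rather than
// copies of the members, so the lambda reads this->factor at call time.
struct MyClass {
  int factor = 10;

  std::function<int()> capture_this() {
    return [=] { return factor; };  // really this->factor
  }
  std::function<int()> capture_copy() {
    auto factor2 = factor;           // plain int, copied now
    return [=] { return factor2; };  // captures the int, not "this"
  }
};
```

Calling the first lambda after modifying the object (or after the object is gone, as with a device copy) reads through the captured pointer; the second lambda carries its own int and is safe to ship to a kernel.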
Intel® Inspector can also help diagnose these sorts of failures. If you set the following environment variables and then run Memory Error Analysis on offload code using the CPU device, Intel Inspector will flag many of the above issues:
OpenMP*
export OMP_TARGET_OFFLOAD=CPU
export OMP_TARGET_OFFLOAD=MANDATORY
export LIBOMPTARGET_PLUGIN=OPENCL
SYCL
export ONEAPI_DEVICE_SELECTOR=opencl:cpu
Or initialize your queue with a CPU selector to force use of the OpenCL CPU device:
cl::sycl::queue Queue(cl::sycl::cpu_selector{});
Both
export CL_CONFIG_USE_VTUNE=True
export CL_CONFIG_USE_VECTORIZER=false
Note
A crash can occur when optimizations are turned on during the compilation process. If turning off optimizations causes your crash to disappear, use -g -[optimization level] for debugging. For more information, see the Intel oneAPI DPC++/C++ Compiler Developer Guide and Reference.