Tutorial: Debugging with Intel® Distribution for GDB*

Multi-Device Debugging

Debugging applications on systems with multiple GPUs and/or sub-devices is supported by the Intel® Distribution for GDB (aka gdb-oneapi), with some important restrictions and limitations.

  • When debugging an application that includes GPU “offload kernels,” each kernel uses an entire GPU sub-device, even if that kernel only utilizes a subset of the sub-device.

  • When a kernel being debugged is paused (at a breakpoint, single-stepping, etc.), the kernel remains in place on the GPU, preventing other kernels from using the GPU sub-device.

Enabling debugging (ZET_ENABLE_PROGRAM_DEBUGGING=1) of your application’s offload kernels blocks parallel execution of kernels on the sub-device, so your application may take longer to run. While the kernel being debugged is paused, the GPU may appear to be hung.

There are three multi-device debug scenarios to be aware of:

  1. An application submits kernels to multiple devices.

  2. Multiple applications submit kernels to different devices or sub-devices.

  3. Multiple applications submit kernels to the same sub-device.

The number and type of GPUs available in a system can be listed using the sycl-ls command. The output below shows a system that has two GPU cards, which are available for use by “offload” kernels running on either the OpenCL™ backend or the Intel® oneAPI Level Zero backend.

$ sycl-ls
[opencl:gpu:0] Intel(R) OpenCL HD Graphics, Intel(R) Graphics [0x0bd5] 3.0 [22.39.24347.8]
[opencl:gpu:1] Intel(R) OpenCL HD Graphics, Intel(R) Graphics [0x0bd5] 3.0 [22.39.24347.8]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]

Note

As of the 2023.0 oneAPI product release, debugging GPU kernels with the Intel® Distribution for GDB (gdb-oneapi) is only supported on Level Zero backends. Debugging GPU kernels on OpenCL backends is no longer supported by the gdb-oneapi debugger. The ONEAPI_DEVICE_SELECTOR environment variable can be used to restrict which GPU devices, sub-devices and backends are used by your application during a debugging session.

The example below shows the output of the sycl-ls command when the ONEAPI_DEVICE_SELECTOR environment variable is set to level_zero:* (in this example, restricting the application’s offload kernels to any GPU devices available to the Level Zero backend):

$ export ONEAPI_DEVICE_SELECTOR=level_zero:*
$ sycl-ls
Warning: ONEAPI_DEVICE_SELECTOR environment variable is set to level_zero:*.
To see the correct device id, please unset ONEAPI_DEVICE_SELECTOR.

[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]
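
Because only Level Zero devices can be debugged, a script may want to confirm that at least one such GPU is visible before starting a session. The following is a minimal sketch, not an official tool: the function name count_level_zero_gpus is hypothetical, and the pattern assumes the sycl-ls listing format shown in this section.

```shell
#!/bin/sh
# Hypothetical helper: counts Level Zero GPU entries in sycl-ls output
# read from stdin. The pattern matches the listing format shown in this
# section; other sycl-ls releases may label the backend differently.
count_level_zero_gpus() {
    grep -c '^\[ext_oneapi_level_zero:gpu:[0-9]*]'
}

# Sample input copied from the listing in this section. On a real
# system, pipe in the output of sycl-ls instead.
count_level_zero_gpus <<'EOF'
[opencl:gpu:0] Intel(R) OpenCL HD Graphics, Intel(R) Graphics [0x0bd5] 3.0 [22.39.24347.8]
[opencl:gpu:1] Intel(R) OpenCL HD Graphics, Intel(R) Graphics [0x0bd5] 3.0 [22.39.24347.8]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24347]
EOF
# prints 2
```

Note that grep -c exits with a non-zero status when no line matches, so the function can also be used directly as a guard in an if statement.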

Scenario 1: An Application Uses Multiple Devices

The debugger supports debugging a program that offloads multiple kernels to multiple GPU devices and/or sub-devices. Each sub-device appears in the debugger as a separate inferior. The auto-attach feature initializes the devices for debugging and creates the corresponding inferiors.

A possible output is as follows:

$ gdb-oneapi -q --args ./multi-device
Reading symbols from ./multi-device...
(gdb) break get_transformed
Breakpoint 1 at 0x40431a: file multi-device.cpp, line 27.
(gdb) run
Starting program: /path/to/multi-device
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
intelgt: gdbserver-ze started for process 581849.
[New Thread 0x7fffe4645700 (LWP 581871)]
[Switching to Thread 2.97 lane 0]

Thread 2.97 hit Breakpoint 1, with SIMD lanes [0-15], get_transformed (data=1, device_idx=0) at multi-device.cpp:27
27        return data * 3 + 11 * (device_idx + 1);

We can check the devices’ inferiors using the following command:

info inferiors

The output below lists five inferiors: the host process (inferior 1) and four device inferiors, one for each sub-device. Devices are enumerated in the format [<pci-location>].<sub-device-id>.

  Num  Description              Connection                                  Executable
  1    process 581849           1 (native)                                  /path/to/multi-device
* 2    device [3a:00.0].0       2 (remote | gdbserver-ze --attach - 581849)
  3    device [3a:00.0].1       2 (remote | gdbserver-ze --attach - 581849)
  4    device [9a:00.0].0       2 (remote | gdbserver-ze --attach - 581849)
  5    device [9a:00.0].1       2 (remote | gdbserver-ze --attach - 581849)
Type "info devices" to see details of the devices.
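
When post-processing saved debugger output, the device description can be split into its two parts. The sketch below is illustrative only: parse_device is a hypothetical helper, not a gdb-oneapi command, and it assumes the [<pci-location>].<sub-device-id> format described above, with the sub-device suffix omitted when the device has no sub-devices exposed.

```shell
#!/bin/sh
# Hypothetical helper: splits a device description such as
# "[3a:00.0].1" into its PCI location and sub-device id.
parse_device() {
    desc=$1
    loc=${desc#\[}        # drop the leading "["
    loc=${loc%%\]*}       # drop "]" and everything after it
    case "$desc" in
        *\].*) sub=${desc##*\].} ;;   # text after "]." is the sub-device id
        *)     sub="-" ;;             # no sub-device suffix present
    esac
    printf '%s %s\n' "$loc" "$sub"
}

parse_device "[3a:00.0].1"   # prints 3a:00.0 1
parse_device "[9a:00.0]"     # prints 9a:00.0 -
```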

We can display further information using the following command:

info devices

A possible output is shown below:

  Location   Sub-device   Vendor Id   Target Id   Cores   Device Name
* [3a:00.0]  0            0x8086      0x0bd5      512     Intel(R) Graphics [0x0bd5]
  [3a:00.0]  1            0x8086      0x0bd5      512     Intel(R) Graphics [0x0bd5]
  [9a:00.0]  0            0x8086      0x0bd5      512     Intel(R) Graphics [0x0bd5]
  [9a:00.0]  1            0x8086      0x0bd5      512     Intel(R) Graphics [0x0bd5]

Note

Switching between the inferiors and threads is the same as explained in the Basic Debugging section.

Applications can be limited to a specific set of GPU devices and sub-devices by using the ZE_AFFINITY_MASK environment variable. For example, running the debug session above with ZE_AFFINITY_MASK=0.0 set produces the following output:

(gdb) info inferiors
  Num  Description              Connection                                  Executable
  1    process 581966           1 (native)                                  /path/to/multi-device
* 2    device [3a:00.0]         2 (remote | gdbserver-ze --attach - 581966)
Type "info devices" to see details of the devices.

(gdb) info devices
  Location   Sub-device   Vendor Id   Target Id   Cores   Device Name
* [3a:00.0]  -            0x8086      0x0bd5      512     Intel(R) Graphics [0x0bd5]

See the Level Zero Specification Environment Variables documentation for more details about the usage of the ZE_AFFINITY_MASK environment variable.
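
The mask values used in this tutorial have the form <device>.<sub-device>; the Level Zero specification also allows several entries separated by commas. The following sketch composes such a value and rejects malformed entries. The function name ze_mask is hypothetical and not part of gdb-oneapi or Level Zero.

```shell
#!/bin/sh
# Hypothetical helper: joins "<device>" or "<device>.<sub-device>"
# entries with commas to form a ZE_AFFINITY_MASK value.
ze_mask() {
    mask=""
    for entry in "$@"; do
        # Accept only "N" (whole device) or "N.M" (one sub-device).
        case "$entry" in
            "" | *[!0-9.]* | .* | *. | *.*.*)
                echo "ze_mask: invalid entry '$entry'" >&2
                return 1 ;;
        esac
        mask="${mask:+$mask,}$entry"
    done
    printf '%s\n' "$mask"
}

ze_mask 0.0        # prints 0.0
ze_mask 0.0 1.1    # prints 0.0,1.1
```

A session restricted to sub-device 0 of GPU 0 could then be started with ZE_AFFINITY_MASK=$(ze_mask 0.0) gdb-oneapi -q --args ./multi-device.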

Scenario 2: Multiple Applications Use Different Devices and Sub-Devices

Simultaneous debugging of applications, where each application runs under a separate instance of the debugger, is supported. For example, the Array Transform application from the Basic Debugging section can be started to utilize sub-device 0 of GPU 0 as follows:

$ ZE_AFFINITY_MASK=0.0 gdb-oneapi array-transform
...
(gdb) run gpu
...

While this first application is being debugged (e.g., GPU threads have hit a breakpoint and the application’s state is under investigation), another process, owned by the same or a different user, can freely use another sub-device and/or GPU. The example below uses sub-device 1 of GPU 0 (note the change in the affinity mask compared to the previous example):

$ ZE_AFFINITY_MASK=0.1 gdb-oneapi array-transform
...
(gdb) run gpu
...

As long as the applications use different sub-devices, simultaneous debugging works.

As an alternative to using the ZE_AFFINITY_MASK above, the applications may also select GPUs and sub-devices programmatically.

Scenario 3: Multiple Applications Use the Same Sub-Device

A restriction applies when different applications use the same sub-device. In this case, the kernel submitted by the application under debug occupies the entire sub-device for the duration of the debug session, until the kernel finishes. No other kernel can run on the same sub-device while a kernel is being debugged. Hence, other applications submitting kernels to that sub-device may appear to be waiting indefinitely.

When debugging an MPI application, it is recommended to assign at most one rank to a sub-device. Assigning more than one rank to a sub-device serializes the ranks: during an interactive debug session, the ranks waiting in the queue stay paused until the kernel being debugged finishes.
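
One way to follow this recommendation is to derive a distinct ZE_AFFINITY_MASK value for each rank in a launcher wrapper script. The sketch below assumes two GPUs with two sub-devices each, as in Scenario 1. The function name rank_to_mask is hypothetical, and how the node-local rank is obtained depends on the MPI launcher in use.

```shell
#!/bin/sh
# Hypothetical helper: maps a node-local MPI rank to one sub-device.
# Usage: rank_to_mask <local-rank> <num-devices> <sub-devices-per-device>
rank_to_mask() {
    rank=$1 devices=$2 subs=$3
    total=$((devices * subs))
    # Refuse to oversubscribe: more than one rank per sub-device would
    # serialize the ranks during an interactive debug session.
    if [ "$rank" -ge "$total" ]; then
        echo "rank_to_mask: rank $rank exceeds $total sub-devices" >&2
        return 1
    fi
    printf '%s.%s\n' "$((rank / subs))" "$((rank % subs))"
}

# Assumed topology: 2 GPUs with 2 sub-devices each.
for r in 0 1 2 3; do
    echo "rank $r -> ZE_AFFINITY_MASK=$(rank_to_mask "$r" 2 2)"
done
# prints:
# rank 0 -> ZE_AFFINITY_MASK=0.0
# rank 1 -> ZE_AFFINITY_MASK=0.1
# rank 2 -> ZE_AFFINITY_MASK=1.0
# rank 3 -> ZE_AFFINITY_MASK=1.1
```

In a wrapper script, the first argument would come from whatever environment variable the MPI launcher uses to expose the node-local rank; consult the launcher's documentation for its name. Failing on an out-of-range rank, rather than wrapping around, keeps a misconfigured job from silently placing two ranks on one sub-device.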