Debug a SYCL* Application on a GPU
Contents
Tutorial: Debugging with Intel® Distribution for GDB*
Debug a SYCL* Application on a GPU#
Use a simple SYCL application named Array
Transform application
to perform basic debugging operations, such as break
, run
,
print
, continue
, info
, disassemble
, and next
.
This tutorial describes how to interact with SIMD lanes, as additional
thread elements. The application being debugged is instructed to run on
a GPU by setting the ONEAPI_DEVICE_SELECTOR=level_zero:gpu
environment variable.
The debug array transform application used in this tutorial can be found
in the Intel oneAPI sample
repo or by way of
the oneapi-cli sample browser tool. After you have installed and
initialized the Intel oneAPI Base Toolkit (sourced setvars.sh),
run oneapi-cli --help
in your terminal command line. The sample includes
a build script to create an application that can be debugged and run on
either a CPU or a GPU (the compiler debug flags are set during the
build).
Before you proceed, make sure you have completed all necessary setup steps described in the Get Started Guide.
Basic Debugging#
Note
For your convenience, all common Intel Distribution for GDB commands used in examples below are provided in the reference sheet.
Consider the array-transform.cpp
example again:
52 h.parallel_for(data_range, [=](id<1> index) {
53 size_t id0 = GetDim(index, 0);
54 int element = in[index]; // breakpoint-here
55 int result = element + 50;
56 if (id0 % 2 == 0) {
57 result = result + 50; // then-branch
58 } else {
59 result = -1; // else-branch
60 }
61 out[index] = result;
62 });
If you have not already done so, start the debugger:
gdb-oneapi array-transform
Start the debugger, set two breakpoints inside the kernel (one for each conditional branch) as follows:
break 57
Expected output:
Breakpoint 1 at 0x40583c: file /path/to/array-transform.cpp, line 57.
break 59
Expected output:
Breakpoint 2 at 0x40584a: file /path/to/array-transform.cpp, line 59.
Note
Do not expect your output to exactly match that provided in the tutorial. The output may vary due to the nature of parallelism and different machine properties. The ellipsis […] denotes output omitted for brevity.
Note
The default choice for offloading is a Level Zero GPU device, if it is
available. If you have not yet set the ONEAPI_DEVICE_SELECTOR
environment variable, you can do so from GDB by typing:
set env ONEAPI_DEVICE_SELECTOR=level_zero:gpu
To display the value of the variable, execute:
show env ONEAPI_DEVICE_SELECTOR
To start the program, execute:
run
You should see the following output:
Starting program: /path/to/array-transform
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
intelgt: gdbserver-ze started for process 8194.
[New Thread 0x7fffed706700 (LWP 8213)]
[SYCL] Using device: [Intel(R) Data Center GPU Flex Series 140 [0x56c1]] from [Intel(R) Level-Zero]
[Switching to Thread 1.129 lane 1]
Thread 2.129 hit Breakpoint 2, with SIMD lanes [1 3 5 7], main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:59
59 result = -1; // else-branch
(gdb)
The debugger has a mechanism called “Auto-Attach”
that spawns an instance of gdbserver-ze
to listen to and control the
GPU for debug. In the example above, the auto-attach mechanism is
triggered and the gdbserver-ze
is added to the debugger as an
inferior. An inferior in GDB represents the unit under debug. In
our case, the host application process and the GPU device each
correspond to an inferior.
Check the presence of gdbserver-ze
as follows:
info inferiors
Expected output:
(gdb) info inferiors
Num Description Connection Executable
1 process 8194 1 (native) /path/to/array-transform
* 2 device [37:00.0] 2 (remote | gdbserver-ze --attach - 8194)
Type "info devices" to see details of the devices.
Execute the info devices
command to see the further details of the device.
info devices
Expected output:
Location Sub-device Vendor Id Target Id Cores Device Name
* [37:00.0] - 0x8086 0x56c1 128 Intel(R) Data Center GPU Flex Series 140 [0x56c1]
Note
The auto-attach feature sets schedule-multiple
to on
, which
allows all threads of all inferiors to be resumed during the same session.
For example, when you run the continue
command, all inferiors
will continue.
The breakpoint event is received from Inferior 2, which represents the GPU. The thread ID 2.129:1 points to the thread 129 of the inferior 2 and indicates that the first active SIMD lane is now in focus.
The breakpoint at line 59 is hit first. The order of branch execution is defined by the Intel® Graphics Compiler.
Note
The behavior of the debugger may vary if the compiled code is optimized.
For the best debugging experience, the compiler flags -g -O0
are recommended.
Check which SIMD lanes are currently active with the following command.
The -stopped
flag filters out GPU threads that are currently
unavailable (e.g. not utilized by the program).
We recommend using it to obtain a more concise output.
We also recommend using the with print frame-arguments none --
prefix
to reduce the overhead of the command, which can be noticeably large because
of having to fetch the state of a large number of GPU threads.
with print frame-arguments none -- info threads -stopped
In the example, thread 2.129 has 4 active SIMD lanes: 1, 3, 5, and 7. The asterisk ‘*’ marks the current SIMD lane. See the expected output below.
Note
SIMD lane enumeration starts from 0.
(gdb) with print frame-arguments none -- info threads -stopped
Id Target Id Frame
1.1 Thread 0x7ffff598fb80 (LWP 8194) "array-transform" [...]
1.2 Thread 0x7fffed706700 (LWP 8213) "array-transform" [...]
* 2.129:1 Thread 1.129 <frame> at array-transform.cpp:59
2.129:[3 5 7] Thread 1.129 <frame> at array-transform.cpp:59
2.137:[1 3 5 7] Thread 1.137 <frame> at array-transform.cpp:59
2.145:[1 3 5 7] Thread 1.145 <frame> at array-transform.cpp:59
2.153:[1 3 5 7] Thread 1.153 <frame> at array-transform.cpp:59
2.193:[1 3 5 7] Thread 1.193 <frame> at array-transform.cpp:59
2.201:[1 3 5 7] Thread 1.201 <frame> at array-transform.cpp:59
2.209:[1 3 5 7] Thread 1.209 <frame> at array-transform.cpp:59
2.217:[1 3 5 7] Thread 1.217 <frame> at array-transform.cpp:59
To switch the focus to a different SIMD lane, use the
thread <thread_ID>
command. Thread ID is specified by a
triple: inferior.thread:lane
. See examples of working with
particular lanes:
thread 2.129:3
Example output:
[Switching to thread 2.129:3 (Thread 1.129 lane 3)] #0 main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:59 59 result = -1; // else-branch
print element
Example output:
$1 = 111
thread 2.129:5
Example output:
[Switching to thread 2.129:5 (Thread 1.129 lane 5)] #0 main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:59 59 result = -1; // else-branch
print element
Example output:
$2 = 113
Note
To filter threads for a specific device or sub-device one needs to:
Obtain the corresponding inferior number via
info inferiors
command:(gdb) info inferiors Num Description Connection Executable 1 process 25855 1 (native) [...] * 2 device [3a:00.0].0 2 (remote | gdbserver-ze --attach - 25855) 3 device [3a:00.0].1 2 (remote | gdbserver-ze --attach - 25855) Type "info devices" to see details of the devices.
Run the
info threads
command and supply the obtained inferior numbers, followed by a star-wildcard thread range.*
:(gdb) with print frame-arguments none -- info threads 2.* 3.* Id Target Id Frame 2.1:[0-15] Thread 1.1 <frame> at array-transform.cpp:53 2.2:[0-15] Thread 1.2 <frame> at array-transform.cpp:53 ... 3.1:[0-15] Thread 2.1 <frame> at array-transform.cpp:53 3.2:[0-15] Thread 2.2 <frame> at array-transform.cpp:53 ...
Note
In the thread ID, the inferior number can be skipped. In this case, the current inferior ID is used. The thread number can also be skipped in case of switching to a lane in the current thread. Thus, the command below can be used to switch to the desired SIMD lane:
thread :7
Expected output:
[Switching to thread 2.129:7 (Thread 1.129 lane 7)]
#0 main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:59
59 result = -1; // else-branch
As you are now inside the kernel running on the GPU, you can look into the assembly code and GPU registers, for example, to understand the cause of unexpected application behavior. Get the GPU assembly code to inspect generated instructions by executing the following command:
disassemble
See an example output below:
Dump of assembler code for function _ZZZ4mainENKUlRT_E_clIN4sycl3_V17handlerEEEDaS0_ENKUlNS4_2idILi1EEEE_clES7_:
0xffff8000ffe87200 <+0>: (W) shr (1|M16) a0.2<1>:ud r126.7<0;1,0>:ud 0x4:ud {F@1}
0xffff8000ffe87210 <+16>: (W) add (1|M16) r126.0<1>:ud r125.2<0;1,0>:ud 0x0:ud
0xffff8000ffe87220 <+32>: (W) send.ugm (1|M16) null r126 r125:1 a0.2 0x4200C504 {ExBSO,A@1,$0} // wr:1+1, rd:0; store.ugm.d32x8t.a32.ss[a0.2]
0xffff8000ffe87230 <+48>: (W) mov (1|M16) r125.3<1>:ud r125.2<0;1,0>:ud {$0.src}
0xffff8000ffe87240 <+64>: (W) add (1|M16) r125.2<1>:ud r125.2<0;1,0>:ud 0x180:ud
0xffff8000ffe87250 <+80>: (W) add (1|M16) r126.0<1>:ud r125.3<0;1,0>:ud 0x40:ud {I@2}
0xffff8000ffe87260 <+96>: (W) send.ugm (1|M16) null r126 r60:4 a0.2 0x4200E504 {ExBSO,A@1,$1} // wr:1+4, rd:0; store.ugm.d32x32t.a32.ss[a0.2]
0xffff8000ffe87270 <+112>: (W) add (1|M16) r126.0<1>:ud r125.3<0;1,0>:ud 0xC0:ud {$1.src}
0xffff8000ffe87280 <+128>: (W) send.ugm (1|M16) null r126 r64:4 a0.2 0x4200E504 {ExBSO,A@1,$2} // wr:1+4, rd:0; store.ugm.d32x32t.a32.ss[a0.2]
To learn more about GEN assembly and registers, refer to the “Introduction to GEN assembly” article.
To display a list of GPU registers, run the following command:
info registers
You can use registers to see the state of the application or inspect arithmetic instructions, such as which operands are used and where the result is located.
Additionally, you can inspect the execution mask ($emask
register),
which shows active lanes. To print the result in binary format, use the
/t
format flag as follows:
print/t $emask
Example output:
$3 = 10101010
Recall that you have stopped at line 59, the else-branch of the
condition that checks evenness of the work-item index. Hence, every
other SIMD lane is inactive, as indicated by the $emask
bit pattern.
To move forward and stop at the then-branch, set the scheduler-locking
mode to step and execute the next
command. The
set scheduler-locking step
command keeps the other threads stopped
while the current thread is stepping:
set scheduler-locking step
next
You should see the following output:
[Switching to SIMD lane 0]
Thread 2.129 hit Breakpoint 1, with SIMD lanes [0 2 4 6], main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:57
57 result = result + 50; // then-branch
Due to the breakpoint event, the SIMD lane focus switches to the first active lane in the then-branch, which is SIMD lane 0. Other threads of inferior 2 stayed at the line 59:
with print frame-arguments none -- info threads -stopped
Example output:
Id Target Id Frame
1.1 Thread 0x7ffff598fb80 (LWP 8194) "array-transform" [...]
1.2 Thread 0x7fffed706700 (LWP 8213) "array-transform" [...]
* 2.129:0 Thread 1.129 <frame> at array-transform.cpp:57
2.129:[2 4 6] Thread 1.129 <frame> at array-transform.cpp:57
2.137:[1 3 5 7] Thread 1.137 <frame> at array-transform.cpp:59
2.145:[1 3 5 7] Thread 1.145 <frame> at array-transform.cpp:59
2.153:[1 3 5 7] Thread 1.153 <frame> at array-transform.cpp:59
2.193:[1 3 5 7] Thread 1.193 <frame> at array-transform.cpp:59
2.201:[1 3 5 7] Thread 1.201 <frame> at array-transform.cpp:59
2.209:[1 3 5 7] Thread 1.209 <frame> at array-transform.cpp:59
2.217:[1 3 5 7] Thread 1.217 <frame> at array-transform.cpp:59
Since the thread is vectorized, you can also inspect the vector of a local variable:
x /8dw &result
Example output:
0xffffd556ab1627e0: 158 -1 160 -1
0xffffd556ab1627f0: 162 -1 164 -1
SIMD Lanes#
To investigate the program state from the point of view of SIMD lanes
without switching, use the thread apply
command.
You can specify a SIMD lane as a number:
thread apply 2.129:2 print element
Example output:
Thread 2.129:2 (Thread 1.129 lane 2):
$5 = 110
You can also specify a SIMD lane as a range. In this case, only active SIMD lanes from the range are considered:
thread apply 2.129:2-5 print element
Example output:
Thread 2.129:2 (Thread 1.129 lane 2):
$11 = 110
warning: SIMD lane 3 is inactive in thread 2.129
Thread 2.129:4 (Thread 1.129 lane 4):
$12 = 112
warning: SIMD lane 5 is inactive in thread 2.129
To denote all active SIMD lanes, use the wildcard:
thread apply 2.129:* print element
Example output:
Thread 2.129:0 (Thread 1.129 lane 0):
$13 = 108
Thread 2.129:2 (Thread 1.129 lane 2):
$14 = 110
Thread 2.129:4 (Thread 1.129 lane 4):
$15 = 112
Thread 2.129:6 (Thread 1.129 lane 6):
$16 = 114
To apply the command to all active SIMD lanes of all threads, use
the all-lanes
parameter:
thread apply all-lanes print element
Example output:
Thread 2.217:7 (Thread 1.217 lane 7):
$17 = 155
Thread 2.217:5 (Thread 1.217 lane 5):
$18 = 153
[...]
Thread 2.129:2 (Thread 1.129 lane 2):
$47 = 110
Thread 2.129:0 (Thread 1.129 lane 0):
$48 = 108
Thread 1.2 (Thread 0x7fffed706700 (LWP 8213) "array-transform"):
No symbol "element" in current context.
You can mix SIMD lane ranges with thread ranges and the thread wildcard. For example, to apply the command to all active lanes of all threads of inferior 2, you can use any of the following commands:
thread apply 2.127-129:*
thread apply 2.*:*
If the current inferior is 2, the inferior number can be skipped:
thread apply 127-129:*
thread apply *:*
If you need a formatted output for a set of threads, thread apply
might
be used together with the printf
command, as in the following examples.
A more compact output in comparison to thread apply *:* print element:
(gdb) thread apply *:* -q printf "%d.%d:%d element=%d\n",$_inferior,$_thread,$_simd_lane,element 2.129:0 element=108 2.129:1 element=109 2.129:2 element=110 2.129:3 element=111 2.129:4 element=112 2.129:5 element=113 2.129:6 element=114 2.129:7 element=115 2.137:0 element=124 2.137:1 element=125 2.137:2 element=126 2.137:3 element=127 2.137:4 element=128 2.137:5 element=129 2.137:6 element=130 2.137:7 element=131 2.145:0 element=140 <...>
In the above command,
-q
flag is used to suspend the standard thread information, usually printed bythread apply
. To print the thread context in a compact way, three convenience variables were used:$_inferior
to get the inferior number;$_thread
to get the thread number within the inferior;$_simd_lane
to get the SIMD lane.
To get a more hierarchical view, one can supply
thread apply *
, which applies a command to all threads of the current inferior, with the commandthread apply :* <printing command>
. The latter applies the printing command to every active SIMD lane of a thread, selected by the formerthread apply *
. The result might look as the following:(gdb) thread apply * -s thread apply :* -q printf "dim0{%d}=%d \n",$_simd_lane,id0 Thread 2.129:0 (Thread 1.129 lane 0): dim0{0}=8 dim0{1}=9 dim0{2}=10 dim0{3}=11 dim0{4}=12 dim0{5}=13 dim0{6}=14 dim0{7}=15 Thread 2.137:0 (Thread 1.137 lane 0): dim0{0}=24 dim0{1}=25 dim0{2}=26 dim0{3}=27 dim0{4}=28 dim0{5}=29 dim0{6}=30 dim0{7}=31 Thread 2.145:0 (Thread 1.145 lane 0): dim0{0}=40 <...>
Work-Item Coordinates#
The GPGPU execution model defines a work-item as one of parallel executions of a kernel function.
Use the convenience variables $_thread_workgroup
, $_workitem_local_id
,
and $_workitem_global_id
to get the coordinates of the work-item processed by
the current context, defined by the current thread and its current lane.
(gdb) print $_thread_workgroup
$1 = {<x>: 0, <y>: 0, <z>: 0}
(gdb) print $_workitem_local_id
$2 = {<x>: 56, <y>: 0, <z>: 0}
(gdb) print $_workitem_global_id
$3 = {<x>: 56, <y>: 0, <z>: 0}
Please note that the above convenience variables show work-item coordinates using X-Y-Z notation, as per execution model of the device, while SYCL execution model defines coordinates in notation of dimensions 1-2-3. SYCL RT often performs an optimization, such that SYCL dimensions are transposed and 1-2-3 corresponds to Z-Y-X.
Note
The coordinates are available only for work-items that are currently being processed. If a work-item has not yet been started or was already finished, we cannot find a thread which has processed it.
Find a Specific Work-item#
Using the convenience variables you can find a thread and its lane, which works on a specific work-item.
The first option to find the work-item is to define a conditional breakpoint. However, for a program with many threads, it could take time, till the breakpoint is hit. In the following example, we set the conditional breakpoint for the work-item with the global ID
{37,0,0}
:(gdb) break 54 if $_workitem_global_id=={37,0,0} Breakpoint 1 at 0x407093: file /home/gta/sources/oneAPI-samples/Tools/ApplicationDebugger/array-transform/src/array-transform.cpp, line 54. (gdb) run Starting program: /home/gta/sources/oneAPI-samples/Tools/ApplicationDebugger/array-transform/build/array-transform [...] [SYCL] Using device: [Intel(R) Data Center GPU Flex 140] from [Intel(R) Level-Zero] [Switching to Thread 1.209 lane 5] Thread 2.209 hit Breakpoint 1.2, with SIMD lane 5, main::{lambda(auto:1&)#1}::operator()<sycl::_V1::handler>(sycl::_V1::handler&) const::{lambda(sycl::_V1::id<1>)#1}::operator()(sycl::_V1::id<1>) const (this=0xffffd556ab1e5810, index=...) at array-transform.cpp:54 54 int element = in[index]; // breakpoint-here (gdb) print $_workitem_global_id $1 = {<x>: 37, <y>: 0, <z>: 0} (gdb)
The second option is to use the
thread apply
command and store the found thread ID and lane number into the convenience variables$thr
and$lane
. The$found
variable shows whether the search was successful. In the following example, we search for a work-item with the global ID{47,0,0}
, and then switch to the found thread and lane:(gdb) thread apply *:* -q -s set $found=($_workitem_global_id == {47,0,0}) ? ($thr=$_thread) && ($lane = $_simd_lane) : $found (gdb) print $found $191 = true (gdb) thread $thr:$lane [Switching to thread 2.145:7 (Thread 1.145 lane 7)] #0 main::[...] 54 int element = in[index]; // breakpoint-here (gdb) print $_workitem_global_id $192 = {<x>: 47, <y>: 0, <z>: 0}
Filter Threads by a Work-group#
By combining thread apply
and eval
, we can filter threads by
a specific expression. In the following, we filter by
$_thread_workgroup=={0,0,0}
.
First, we construct a convenience variable $ids
that holds a stringified
list of qualified ids (<inferior num>.<thread num>
), which belong to
the work-group:
(gdb) set $ids=""
(gdb) thread apply * -s -q eval "set $ids=($_thread_workgroup == {0,0,0}) ? \"%s %d.%d\" : \"%s\"",$ids,$_inferior,$_thread, $ids
(gdb) print $ids
$2 = " 2.129 2.137 2.145 2.153 2.193 2.201 2.209 2.217"
Note that the variable $ids
must be initialized with an empty string
first. The eval
GDB command is used here to append the list of already
found ids to the newly found one, or leave it without change, if
the condition does not hold.
Now the convenience variable $ids
contains the list of filtered thread ids.
To call info threads
for these ids, we need to use eval
again, since
the info threads
command cannot take a list of threads stored
in a convenience variable:
(gdb) eval "info threads %s", $ids
Id Target Id Frame
* 2.129:0 Thread 1.129 main::[...] at array-transform.cpp:54
2.129:[1-7] Thread 1.129 main::[...] at array-transform.cpp:54
2.137:[0-7] Thread 1.137 main::[...] at array-transform.cpp:54
2.145:[0-7] Thread 1.145 main::[...] at array-transform.cpp:54
2.153:[0-7] Thread 1.153 main::[...] at array-transform.cpp:54
2.193:[0-7] Thread 1.193 main::[...] at array-transform.cpp:54
2.201:[0-7] Thread 1.201 main::[...] at array-transform.cpp:54
2.209:[0-7] Thread 1.209 main::[...] at array-transform.cpp:54
2.217:[0-7] Thread 1.217 main::[...] at array-transform.cpp:54
Breakpoint Actions#
You can define a set of actions for a breakpoint to be executed when the breakpoint is hit. By default, the actions are executed in the context of the SIMD lane selected after the hit.
Quit the current debugging session and start a new one:
quit
gdb-oneapi array-transform
Define two temporary breakpoints with actions for the if and else branches:
Set a temporary breakpoint:
tbreak 59
Example output:
Temporary breakpoint 1 at 0x40584a: file /path/to/array-transform.cpp, line 59.
Define an action:
commands
When you are asked to type commands, enter the following:
print element end
When you are done with each command, finish with the
end
keyword.
Set another temporary breakpoint:
tbreak 57
Example output:
Temporary breakpoint 2 at 0x40583c: file /path/to/array-transform.cpp, line 57.
Define an action to be executed for all SIMD lines by adding the
/a
modifier:commands /a
When you are asked to type commands, enter the following:
print element end
Start the program:
run
Example output:
[...]
Thread 2.129 hit Temporary breakpoint 1, with SIMD lanes [1 3 5 7], main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:59
59 result = -1; // else-branch
$1 = 109
Continue to hit both breakpoints:
continue
Example output:
Continuing.
Thread 2.129 hit Temporary breakpoint 2, with SIMD lanes [0 2 4 6], main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:57
57 result = result + 50; // then-branch
$2 = 108
$3 = 110
$4 = 112
$5 = 114
The action for the breakpoint at the else branch was executed for a single SIMD lane 1, while the action at the then branch was executed for all active SIMD lanes.
Note
For conditional breakpoints, the actions are executed only for SIMD lanes that meet the condition.
Conditional Breakpoints#
Quit the debugging session and start the program from the beginning:
quit
ONEAPI_DEVICE_SELECTOR=level_zero:gpu gdb-oneapi array-transform
This time set a breakpoint at line 57 with the condition
element==106
:
break 57 if element == 106
Example output:
Breakpoint 1 at 0x40583c: file /path/to/array-transform.cpp, line 57.
Run the program (execute the run
command) and check if the
output looks as follows:
Starting program: <path_to_array-transform>
[...]
[Switching to Thread 1.193 lane 6]
Thread 2.193 hit Breakpoint 1, with SIMD lane 6, main::{lambda(auto:1&)#1}::operator()[...] at array-transform.cpp:57
57 result = result + 50; // then-branch
(gdb)
The condition is true for the lane 6 in thread 2.193.
Note
A breakpoint condition is evaluated only for active SIMD lanes,
meaning that (gdb) break 57 if element == 107
does not cause a
stop, since element == 107
is true for the lane 7 in thread 2.193,
which is inactive at line 57.