Control Binary Execution on Multiple CPU Cores

Environment Variables

The following environment variables control the placement of SYCL* or OpenMP* threads on multiple CPU cores during program execution. Use these variables if you are using the OpenCL™ runtime CPU device to offload to a CPU.

Table 3 SYCL* or OpenMP* environment variables

Environment Variable

Description

DPCPP_CPU_CU_AFFINITY

Sets thread affinity to CPU cores. The supported values and their meanings are:

  • close - threads are pinned to CPU cores successively through available cores.

  • spread - threads are spread across the available cores.

  • master - threads are placed on the same cores as the master thread. If DPCPP_CPU_CU_AFFINITY is set, the master thread is pinned as well; otherwise, the master thread is not pinned.

This environment variable is similar to the OMP_PROC_BIND variable used by OpenMP.

Default: Not set
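As an illustration, the affinity policy is selected by exporting the variable before launching the application; the binary name ./sycl_app below is a placeholder, not a real tool:

```shell
# Pin worker threads to consecutive logical cores (close policy).
export DPCPP_CPU_CU_AFFINITY=close

# Alternatively, spread threads across the available cores:
#   export DPCPP_CPU_CU_AFFINITY=spread

# Launch the SYCL application with the chosen affinity
# (./sycl_app is a placeholder for your binary):
# ./sycl_app

echo "DPCPP_CPU_CU_AFFINITY=$DPCPP_CPU_CU_AFFINITY"
```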

DPCPP_CPU_SCHEDULE

Specifies the algorithm used by the scheduler for scheduling work-groups. Currently, the SYCL runtime uses Intel® oneAPI Threading Building Blocks (Intel® oneTBB) for scheduling, and the value selects the partitioner used by the Intel oneTBB scheduler. The supported values and their meanings are:

  • dynamic - Intel oneTBB auto_partitioner. It performs sufficient splitting to balance load.

  • affinity - Intel oneTBB affinity_partitioner. It improves cache affinity compared to auto_partitioner through its choice of mapping subranges to worker threads.

  • static - Intel oneTBB static_partitioner. It distributes range iterations among worker threads as uniformly as possible. The Intel oneTBB partitioner relies on a grain size to control chunking. The grain size is 1 by default, meaning every work-group can be executed independently.

Default: dynamic
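For example, to request uniform distribution of work-groups across worker threads, select the static partitioner before launching the application (./sycl_app is a placeholder for your binary):

```shell
# Select the Intel oneTBB static_partitioner, which distributes
# range iterations among worker threads as uniformly as possible.
export DPCPP_CPU_SCHEDULE=static

# ./sycl_app   # placeholder for your SYCL binary

echo "DPCPP_CPU_SCHEDULE=$DPCPP_CPU_SCHEDULE"
```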

DPCPP_CPU_NUM_CUS

Sets the number of threads used for kernel execution.

To avoid oversubscription, the maximum value of DPCPP_CPU_NUM_CUS should be the number of hardware threads. If DPCPP_CPU_NUM_CUS is 1, all work-groups are executed sequentially by a single thread, which is useful for debugging.

This environment variable is similar to the OMP_NUM_THREADS variable used by OpenMP.

Default: Not set. Determined by Intel oneTBB.
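A common way to apply the oversubscription guidance above is to cap the thread count at the hardware thread count; the sketch below assumes a Linux system where `nproc` reports the number of available logical CPUs:

```shell
# Cap worker threads at the hardware thread count to avoid
# oversubscription (nproc reports available logical CPUs on Linux).
export DPCPP_CPU_NUM_CUS=$(nproc)

# For debugging, run all work-groups sequentially on one thread:
#   export DPCPP_CPU_NUM_CUS=1

echo "DPCPP_CPU_NUM_CUS=$DPCPP_CPU_NUM_CUS"
```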

DPCPP_CPU_PLACES

Specifies the places at which thread affinity is set. The value is one of { sockets | numa_domains | cores | threads }.

This environment variable is similar to the OMP_PLACES variable used by OpenMP.

If the value is numa_domains, the Intel oneTBB NUMA API is used. This is analogous to OMP_PLACES=numa_domains in the OpenMP 5.1 specification. Each Intel oneTBB task arena is bound to a NUMA node, and the SYCL nd-range is uniformly distributed across the task arenas.

Use DPCPP_CPU_PLACES together with DPCPP_CPU_CU_AFFINITY.

Default: cores
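Since the two variables are meant to be combined, a typical invocation sets both the place granularity and the affinity policy together (./sycl_app is a placeholder for your binary):

```shell
# Bind affinity at socket granularity, then pin threads
# successively through the cores of each socket.
export DPCPP_CPU_PLACES=sockets
export DPCPP_CPU_CU_AFFINITY=close

# ./sycl_app   # placeholder for your SYCL binary

echo "DPCPP_CPU_PLACES=$DPCPP_CPU_PLACES DPCPP_CPU_CU_AFFINITY=$DPCPP_CPU_CU_AFFINITY"
```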

See the Intel oneAPI DPC++/C++ Compiler Developer Guide and Reference for more information about all supported environment variables.

Example 1: Hyper-threading Enabled

Assume a machine with 2 sockets, 4 physical cores per socket, and each physical core has 2 hyper threads.

  • S<num> denotes the socket number that has 8 cores specified in a list

  • T<num> denotes the Intel® oneAPI Threading Building Blocks (Intel® oneTBB) thread number

  • “-” means unused core

export DPCPP_CPU_NUM_CUS=16
   export DPCPP_CPU_PLACES=sockets
   DPCPP_CPU_CU_AFFINITY=close:    S0:[T0 T1 T2 T3 T4 T5 T6 T7]        S1:[T8 T9 T10 T11 T12 T13 T14 T15]
   DPCPP_CPU_CU_AFFINITY=spread:   S0:[T0 T2 T4 T6 T8 T10 T12 T14]     S1:[T1 T3 T5 T7 T9 T11 T13 T15]
   DPCPP_CPU_CU_AFFINITY=master:   S0:[T0 T1 T2 T3 T4 T5 T6 T7]        S1:[T8 T9 T10 T11 T12 T13 T14 T15]


   export DPCPP_CPU_PLACES=cores
   DPCPP_CPU_CU_AFFINITY=close:    S0:[T0 T8 T1 T9 T2 T10 T3 T11]     S1:[T4 T12 T5 T13 T6 T14 T7 T15]
   DPCPP_CPU_CU_AFFINITY=spread:   S0:[T0 T8 T2 T10 T4 T12 T6 T14]    S1:[T1 T9 T3 T11 T5 T13 T7 T15]
   DPCPP_CPU_CU_AFFINITY=master:   S0:[T0 T1 T2 T3 T4 T5 T6 T7]       S1:[T8 T9 T10 T11 T12 T13 T14 T15]


   export DPCPP_CPU_PLACES=threads
   DPCPP_CPU_CU_AFFINITY=close:    S0:[T0 T1 T2 T3 T4 T5 T6 T7]       S1:[T8 T9 T10 T11 T12 T13 T14 T15]
   DPCPP_CPU_CU_AFFINITY=spread:   S0:[T0 T2 T4 T6 T8 T10 T12 T14]    S1:[T1 T3 T5 T7 T9 T11 T13 T15]
   DPCPP_CPU_CU_AFFINITY=master:   S0:[T0 T1 T2 T3 T4 T5 T6 T7]       S1:[T8 T9 T10 T11 T12 T13 T14 T15]


export DPCPP_CPU_NUM_CUS=8
   DPCPP_CPU_PLACES=sockets, cores, and threads all produce the same bindings:
   DPCPP_CPU_CU_AFFINITY=close:    S0:[T0 - T1 - T2 - T3 -]     S1:[T4 - T5 - T6 - T7 -]
   DPCPP_CPU_CU_AFFINITY=spread:   S0:[T0 - T2 - T4 - T6 -]     S1:[T1 - T3 - T5 - T7 -]
   DPCPP_CPU_CU_AFFINITY=master:   S0:[T0 T1 T2 T3 T4 T5 T6 T7] S1:[]

Example 2: Hyper-threading Disabled

Assume a machine with 2 sockets, 4 physical cores per socket, and hyper-threading disabled.

  • S<num> denotes the socket number that has 4 cores specified in a list

  • T<num> denotes the Intel oneTBB thread number

  • “-” means unused core

export DPCPP_CPU_NUM_CUS=8
   DPCPP_CPU_PLACES=sockets, cores, and threads all produce the same bindings:
   DPCPP_CPU_CU_AFFINITY=close:    S0:[T0 T1 T2 T3]     S1:[T4 T5 T6 T7]
   DPCPP_CPU_CU_AFFINITY=spread:   S0:[T0 T2 T4 T6]     S1:[T1 T3 T5 T7]
   DPCPP_CPU_CU_AFFINITY=master:   S0:[T0 T1 T2 T3]     S1:[T4 T5 T6 T7]


export DPCPP_CPU_NUM_CUS=4
   DPCPP_CPU_PLACES=sockets, cores, and threads all produce the same bindings:
   DPCPP_CPU_CU_AFFINITY=close:    S0:[T0 - T1 -]       S1:[T2 - T3 -]
   DPCPP_CPU_CU_AFFINITY=spread:   S0:[T0 - T2 -]       S1:[T1 - T3 -]
   DPCPP_CPU_CU_AFFINITY=master:   S0:[T0 T1 T2 T3]     S1:[- - - -]