Turn on/off main thread pinning.
Syntax
I_MPI_PIN=<arg>
Arguments
<arg> | Binary indicator |
enable | yes | on | 1 | Enable main thread pinning. This is the default value. |
disable | no | off | 0 | Disable main thread pinning. |
Description
Set this environment variable to control the main thread pinning feature of the Intel® MPI Library.
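For example, to disable main thread pinning for a single run, a command of the following form can be used:
> mpiexec -genv I_MPI_PIN=disable -n <# of main threads> <executable>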
Define a processor subset and the mapping rules for MPI main threads within this subset.
Syntax
I_MPI_PIN_PROCESSOR_LIST=<value>
The environment variable value has the following syntax forms:
1. <proclist>
2. [<procset>][:[grain=<grain>][,shift=<shift>][,preoffset=<preoffset>][,postoffset=<postoffset>]]
3. [<procset>][:map=<map>]
The following paragraphs provide detailed descriptions of the values for these syntax forms.
The postoffset keyword has the alias offset.
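For illustration, the three syntax forms might be set as follows (the values are arbitrary placeholders, not tuning recommendations):
> mpiexec -genv I_MPI_PIN_PROCESSOR_LIST=0,3 -n <# of main threads> <executable>
> mpiexec -genv I_MPI_PIN_PROCESSOR_LIST=all:grain=2,shift=2 -n <# of main threads> <executable>
> mpiexec -genv I_MPI_PIN_PROCESSOR_LIST=allcores:map=bunch -n <# of main threads> <executable>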
The second form of the pinning procedure has three steps:
1. Cyclic shift of the source processor list by the preoffset*grain value.
2. Round robin shift of the list derived in the first step by the shift*grain value.
3. Cyclic shift of the list derived in the second step by the postoffset*grain value.
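As a sketch of the first step in isolation: on a hypothetical node with eight logical processors, grain=1 and preoffset=2 (with the default shift and no postoffset) would turn the source list 0,1,...,7 into 2,3,...,7,0,1, so that rank i is pinned to CPU (i+2) mod 8. This assumes the cyclic shift advances the head of the list.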
The grain, shift, preoffset, and postoffset parameters have a unified definition style.
This environment variable is available for both Intel® and non-Intel microprocessors, but it may perform additional optimizations for Intel microprocessors that it does not perform for non-Intel microprocessors.
Syntax
I_MPI_PIN_PROCESSOR_LIST=<proclist>
Arguments
<proclist> | A comma-separated list of logical processor numbers and/or ranges of processors. The main thread with the i-th rank is pinned to the i-th processor in the list. The processor numbers should not exceed the number of processors on a node. |
<l> | Processor with logical number <l>. |
<l>-<m> | Range of processors with logical numbers from <l> to <m>. |
<k>,<l>-<m> | Processors <k>, as well as <l> through <m>. |
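For example, the following hypothetical setting pins five consecutive ranks to CPU 0, CPUs 2 through 4, and CPU 7, in that order:
> mpiexec -genv I_MPI_PIN_PROCESSOR_LIST=0,2-4,7 -n 5 <executable>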
Syntax
I_MPI_PIN_PROCESSOR_LIST=[<procset>][:[grain=<grain>][,shift=<shift>][,preoffset=<preoffset>][,postoffset=<postoffset>]]
Arguments
<procset> | Specify a processor subset based on the topological numeration. The default value is allcores. |
all | All logical processors. Specify this subset to define the number of CPUs on a node. |
allcores | All cores (physical CPUs). Specify this subset to define the number of cores on a node. This is the default value. If Intel® Hyper-Threading Technology is disabled, allcores equals all. |
allsocks | All packages/sockets. Specify this subset to define the number of sockets on a node. |
<grain> | Specify the pinning granularity cell for a defined <procset>. The minimal <grain> value is a single element of the <procset>. The maximal <grain> value is the number of <procset> elements in a socket. The <grain> value must be a multiple of the <procset> value. Otherwise, the minimal <grain> value is assumed. The default value is the minimal <grain> value. |
<shift> | Specify the granularity of the round robin scheduling shift of the cells for the <procset>. <shift> is measured in the defined <grain> units. The <shift> value must be a positive integer. Otherwise, no shift is performed. The default value is no shift, which is equal to a normal increment of 1. |
<preoffset> | Specify the cyclic shift of the processor subset <procset> defined before the round robin shifting by the <preoffset> value. The value is measured in the defined <grain> units. The <preoffset> value must be a non-negative integer. Otherwise, no shift is performed. The default value is no shift. |
<postoffset> | Specify the cyclic shift of the processor subset <procset> derived after round robin shifting by the <postoffset> value. The value is measured in the defined <grain> units. The <postoffset> value must be a non-negative integer. Otherwise, no shift is performed. The default value is no shift. |
The following table displays the values for <grain>, <shift>, <preoffset>, and <postoffset> options:
<n> | Specify an explicit value of the corresponding parameter. <n> is a non-negative integer. |
fine | Specify the minimal value of the corresponding parameter. |
core | Specify the parameter value equal to the amount of the corresponding parameter units contained in one core. |
cache1 | Specify the parameter value equal to the amount of the corresponding parameter units that share an L1 cache. |
cache2 | Specify the parameter value equal to the amount of the corresponding parameter units that share an L2 cache. |
cache3 | Specify the parameter value equal to the amount of the corresponding parameter units that share an L3 cache. |
cache | The largest value among cache1, cache2, and cache3. |
socket | sock | Specify the parameter value equal to the amount of the corresponding parameter units contained in one physical package/socket. |
half | mid | Specify the parameter value equal to socket/2. |
third | Specify the parameter value equal to socket/3. |
quarter | Specify the parameter value equal to socket/4. |
octavo | Specify the parameter value equal to socket/8. |
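As an illustrative sketch, on a node with Intel® Hyper-Threading Technology enabled, a setting of the following form makes each grain span the logical CPUs of one core, so that consecutive ranks fill the hardware threads of one core before moving to the next core (this assumes the topological numeration places the hardware threads of a core adjacently):
> mpiexec -genv I_MPI_PIN_PROCESSOR_LIST=all:grain=core -n <# of main threads> <executable>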
Syntax
I_MPI_PIN_PROCESSOR_LIST=[<procset>][:map=<map>]
Arguments
<map> | The mapping pattern used for main thread placement. |
bunch | The main threads are mapped as close as possible on the sockets. |
scatter | The main threads are mapped as remotely as possible so as not to share common resources: FSB, caches, and cores. |
spread | The main threads are mapped consecutively, avoiding the sharing of common resources where possible. |
Description
Set the I_MPI_PIN_PROCESSOR_LIST environment variable to define the processor placement. To avoid conflicts with different shell versions, the environment variable value may need to be enclosed in quotes.
This environment variable is valid only if I_MPI_PIN is enabled.
The I_MPI_PIN_PROCESSOR_LIST environment variable has the following syntax variants:
Explicit processor list. This comma-separated list is defined in terms of logical processor numbers. The relative node rank of a main thread is an index into the processor list, such that the i-th main thread is pinned to the i-th list member. This permits the definition of any main thread placement on the CPUs.
For example, main thread mapping for I_MPI_PIN_PROCESSOR_LIST=p0,p1,p2,...,pn is as follows:
Rank on a node | 0 | 1 | 2 | ... | n-1 | n |
Logical CPU | p0 | p1 | p2 | ... | p(n-1) | pn |
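For instance, the hypothetical setting I_MPI_PIN_PROCESSOR_LIST=4,5,6,7 on a node with at least eight logical CPUs pins rank 0 to CPU 4, rank 1 to CPU 5, rank 2 to CPU 6, and rank 3 to CPU 7.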
grain/shift/offset mapping. This method provides a cyclic shift of a defined grain along the processor list, with steps equal to shift*grain and a single shift by offset*grain at the end. This shifting action is repeated shift times.
For example: grain = 2 logical processors, shift = 3 grains, offset = 0.
Legend:
gray - MPI main thread grains
A) red - processor grains chosen on the 1st pass
B) cyan - processor grains chosen on the 2nd pass
C) green - processor grains chosen on the final 3rd pass
D) Final map table ordered by MPI ranks
A)
0 1 | | | 2 3 | | | ... | 2n-2 2n-1 | | |
0 1 | 2 3 | 4 5 | 6 7 | 8 9 | 10 11 | ... | 6n-6 6n-5 | 6n-4 6n-3 | 6n-2 6n-1 |
B)
0 1 | 2n 2n+1 | | 2 3 | 2n+2 2n+3 | | ... | 2n-2 2n-1 | 4n-2 4n-1 | |
0 1 | 2 3 | 4 5 | 6 7 | 8 9 | 10 11 | ... | 6n-6 6n-5 | 6n-4 6n-3 | 6n-2 6n-1 |
C)
0 1 | 2n 2n+1 | 4n 4n+1 | 2 3 | 2n+2 2n+3 | 4n+2 4n+3 | ... | 2n-2 2n-1 | 4n-2 4n-1 | 6n-2 6n-1 |
0 1 | 2 3 | 4 5 | 6 7 | 8 9 | 10 11 | ... | 6n-6 6n-5 | 6n-4 6n-3 | 6n-2 6n-1 |
D)
0 1 | 2 3 | … | 2n-2 2n-1 | 2n 2n+1 | 2n+2 2n+3 | … | 4n-2 4n-1 | 4n 4n+1 | 4n+2 4n+3 | … | 6n-2 6n-1 |
0 1 | 6 7 | … | 6n-6 6n-5 | 2 3 | 8 9 | … | 6n-4 6n-3 | 4 5 | 10 11 | … | 6n-2 6n-1 |
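As a concrete reading of the final map D), take n=2, that is, twelve logical processors: ranks 0 and 1 are pinned to CPUs 0 and 1, ranks 2 and 3 to CPUs 6 and 7, ranks 4 and 5 to CPUs 2 and 3, ranks 6 and 7 to CPUs 8 and 9, ranks 8 and 9 to CPUs 4 and 5, and ranks 10 and 11 to CPUs 10 and 11.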
Predefined mapping scenario. In this case popular main thread pinning schemes are defined as keywords selectable at runtime. There are two such scenarios: bunch and scatter.
In the bunch scenario the main threads are mapped proportionally to sockets, as closely as possible. This mapping makes sense for partial processor loading, that is, when the number of main threads is less than the number of processors.
In the scatter scenario the main threads are mapped as remotely as possible so as not to share common resources: FSB, caches, and cores.
In the example, there are two sockets, four cores per socket, one logical CPU per core, and two cores per shared cache.
Legend:
gray - MPI main threads
cyan - 1st socket processors
green - 2nd socket processors
Same color defines a processor pair sharing a cache
0 | 1 | 2 | | 3 | 4 | | |
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
bunch scenario for 5 processes
0 | 4 | 2 | 6 | 1 | 5 | 3 | 7 | |
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
scatter scenario for full loading
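On the example topology, the two placements above could be requested with commands of the following form (a sketch; the actual placement depends on the topology detected at runtime):
> mpiexec -genv I_MPI_PIN_PROCESSOR_LIST=allcores:map=bunch -n 5 <executable>
> mpiexec -genv I_MPI_PIN_PROCESSOR_LIST=allcores:map=scatter -n 8 <executable>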
To pin the main thread to CPU0 and CPU3 on each node globally, use the following command:
> mpiexec -genv I_MPI_PIN_PROCESSOR_LIST=0,3 -n <# of main threads> <executable>
To pin the main thread to different CPUs on each node individually (CPU0 and CPU3 on host1; CPU0, CPU1, and CPU3 on host2), use the following command:
> mpiexec -host host1 -env I_MPI_PIN_PROCESSOR_LIST=0,3 -n <# of main threads> <executable> :^
-host host2 -env I_MPI_PIN_PROCESSOR_LIST=0,1,3 -n <# of main threads> <executable>
To print extra debug information about main thread pinning, use the following command:
> mpiexec -genv I_MPI_DEBUG=4 -m -host host1 -env I_MPI_PIN_PROCESSOR_LIST=0,3 -n <# of main threads> <executable> :^
-host host2 -env I_MPI_PIN_PROCESSOR_LIST=0,1,3 -n <# of main threads> <executable>
Note
If the number of main threads is greater than the number of CPUs used for pinning, the thread list is wrapped around to the start of the processor list.
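For example, with I_MPI_PIN_PROCESSOR_LIST=0,3 and four ranks per node, ranks 0 and 1 are pinned to CPUs 0 and 3, and ranks 2 and 3 wrap around to CPUs 0 and 3 again.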