Command-line Control#
You can control all aspects of the Intel(R) MPI Benchmarks through the command line. The general command-line syntax is the following:
IMB-MPI1 [-h{elp}]
[-npmin <P_min>]
[-multi <outflag>]
[-off_cache <cache_size[,cache_line_size]>]
[-iter <msgspersample[,overall_vol[,msgs_nonaggr[,iter_policy]]]>]
[-iter_policy <iter_policy>]
[-time <max_runtime per sample>]
[-mem <max. mem usage per process>]
[-msglen <Lengths_file>]
[-map <PxQ>]
[-input <filename>]
[-include] [benchmark1[,benchmark2[,...]]]
[-exclude] [benchmark1[,benchmark2[,...]]]
[-msglog [<minlog>:]<maxlog>]
[-thread_level <level>]
[-sync <mode>]
[-root_shift <mode>] [-imb_barrier]
[benchmark1 [,benchmark2 [,...]]]
The command line is repeated in the output. The options may appear in any order.
Examples:
Get out-of-cache data for PingPong:
mpirun -np 2 IMB-MPI1 PingPong -off_cache -1
Run a very large configuration with the following parameters:
- Maximum iterations: 20
- Maximum run time per message: 1.5 seconds
- Maximum message buffer size: 2 GBytes
mpirun -np 512 IMB-MPI1 -npmin 512 alltoallv -iter 20 -time 1.5 -mem 2
Run the P_Read_shared benchmark with the minimum number of processes set to seven:
mpirun -np 14 IMB-IO P_Read_shared -npmin 7
Run the IMB-MPI1 benchmarks including PingPongAnySource and PingPingAnySource, but excluding the Alltoall and Alltoallv benchmarks. Set the transfer message sizes to 0, 4, 8, 16, 32, 64, 128:
mpirun -np 16 IMB-MPI1 -msglog 2:7 -include PingPongAnySource PingPingAnySource -exclude Alltoall Alltoallv
Run the PingPong, PingPing, PingPongAnySource, and PingPingAnySource benchmarks with the transfer message sizes 0, 2^0, 2^1, 2^2, ..., 2^16:
mpirun -np 4 IMB-MPI1 -msglog 16 PingPong PingPing PingPongAnySource PingPingAnySource
Benchmark Selection Arguments#
Benchmark selection arguments are a sequence of blank-separated strings. Each string is the name of a benchmark in exact spelling, case insensitive.
For example, the string IMB-MPI1 PingPong Allreduce specifies that you want to run the PingPong and Allreduce benchmarks only:
mpirun -np 10 IMB-MPI1 PingPong Allreduce
By default, all benchmarks of the selected component are run.
-npmin Option#
Specifies the minimum number of processes P_min to run all selected benchmarks on. The P_min value after -npmin must be an integer.
Given P_min, the benchmarks run on the following numbers of processes: P_min, 2P_min, 4P_min, ..., the largest 2^k*P_min smaller than P, and finally P.
You may set P_min to 1. If you set P_min > P, Intel(R) MPI Benchmarks interprets this value as P_min = P.
For example, to run the IMB-EXT benchmarks with the minimum number of processes set to five, call:
mpirun -np 11 IMB-EXT -npmin 5
By default, all active processes are selected as described in the Running Intel(R) MPI Benchmarks section.
-multi Option#
Defines whether the benchmark runs in multiple mode. In this mode, MPI_COMM_WORLD is split into several groups, which run simultaneously. The argument after -multi is a meta-symbol <outflag> that can take an integer value of 0 or 1:
- outflag = 0: display only maximum timings (minimum throughputs) over all active groups
- outflag = 1: report on all groups separately. The report may be long in this case.
This flag controls only the output style of the benchmark results; the running procedure is the same for both the -multi 0 and -multi 1 options.
When the number of processes running the benchmark is more than half of the overall number of ranks in MPI_COMM_WORLD, the multiple mode benchmark execution coincides with the non-multiple one, as not more than one process group can be created.
For example, if you run this command:
mpirun -np 16 IMB-MPI1 -multi 0 bcast -npmin 12
the benchmark will run in non-multiple mode, as the benchmarking starts from 12 processes, which is more than half of MPI_COMM_WORLD.
When a benchmark is set to run on a series of different numbers of processes, its launch mode is determined separately for each run based on the number of processes. You can easily tell whether the benchmark is running in multiple mode by looking at the benchmark results header: when the benchmark name is printed with the Multi- prefix, it is a multiple mode run.
For example, in the case of the same Bcast benchmark executed without the -npmin parameter:
mpirun -np 16 IMB-MPI1 -multi 0 bcast
the benchmark will be executed 4 times: for 2, 4 and 8 processes in multiple mode, and for 16 processes in standard (non-multiple) mode. The benchmark results headers will look as follows:
#----------------------------------------------------------------
# Benchmarking Multi-Bcast
# ( 8 groups of 2 processes each running simultaneous )
# Group 0: 0 1
#
# Group 1: 2 3
#
# Group 2: 4 5
#
# Group 3: 6 7
#
# Group 4: 8 9
#
# Group 5: 10 11
#
# Group 6: 12 13
#
# Group 7: 14 15
#
…
#----------------------------------------------------------------
# Benchmarking Multi-Bcast
# ( 4 groups of 4 processes each running simultaneous )
# Group 0: 0 1 2 3
#
# Group 1: 4 5 6 7
#
# Group 2: 8 9 10 11
#
# Group 3: 12 13 14 15
#
…
#----------------------------------------------------------------
# Benchmarking Multi-Bcast
# ( 2 groups of 8 processes each running simultaneous )
# Group 0: 0 1 2 3 4 5 6 7
#
# Group 1: 8 9 10 11 12 13 14 15
#
…
#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 16
For each execution but the last, the header contains:
- the Multi- prefix before the benchmark name
- the list of MPI_COMM_WORLD ranks aggregated in each group
By default, Intel(R) MPI Benchmarks run non-multiple benchmark flavors.
-off_cache cache_size[,cache_line_size] Option#
Use the -off_cache flag to avoid cache reuse. If you do not use this flag (default), the same communications buffer is used for all repetitions of one message size sample. In this case, Intel(R) MPI Benchmarks reuses the cache, so throughput results might be non-realistic.
The argument after -off_cache can be a single number (cache_size), two comma-separated numbers (cache_size,cache_line_size), or -1:
- cache_size is a float for an upper bound of the size of the last level cache, in MB.
- cache_line_size is assumed to be the size of a last level cache line (can be an upper estimate).
- -1 uses the values defined in IMB_mem_info.h. In this case, make sure to define values for cache_size and cache_line_size in IMB_mem_info.h.
The sent/received data is stored in buffers of size ~2x MAX(cache_size, message_size). When repetitively using messages of a particular size, their addresses are advanced within those buffers so that a single message is at least 2 cache lines after the end of the previous message. When these buffers are filled up, they are reused from the beginning.
-off_cache is effective for IMB-MPI1 and IMB-EXT. Avoid using this option for IMB-IO.
Examples:
Use the default values defined in IMB_mem_info.h:
-off_cache -1
2.5 MB last level cache, default line size:
-off_cache 2.5
16 MB last level cache, line size 128:
-off_cache 16,128
The off_cache mode might also be influenced by possible internal caching within the Intel(R) MPI Library, which can complicate the interpretation of results.
Default: no cache control.
-iter Option#
Use this option to control the number of iterations executed by every benchmark.
By default, the number of iterations is controlled through the parameters MSGSPERSAMPLE, OVERALL_VOL, MSGS_NONAGGR, and ITER_POLICY defined in IMB_settings.h.
You can optionally add one or more arguments after the -iter flag to override the default values defined in IMB_settings.h. Use the following guidelines for the optional arguments:
- To override the MSGSPERSAMPLE value, use a single integer.
- To override the OVERALL_VOL value, use two comma-separated integers. The first integer defines the MSGSPERSAMPLE value. The second integer overrides the OVERALL_VOL value.
- To override the MSGS_NONAGGR value, use three comma-separated integers. The first integer defines the MSGSPERSAMPLE value. The second integer overrides the OVERALL_VOL value. The third overrides the MSGS_NONAGGR value.
- To override the -iter_policy argument, enter it after the integer arguments, or right after the -iter flag if you do not use any other arguments.
Examples:
To define MSGSPERSAMPLE as 2000 and OVERALL_VOL as 100, use the following string:
-iter 2000,100
To define MSGS_NONAGGR as 150, you need to define values for MSGSPERSAMPLE and OVERALL_VOL as shown in the following string:
-iter 1000,40,150
To define MSGSPERSAMPLE as 2000 and set the multiple_np policy, use the following string (see -iter_policy):
-iter 2000,multiple_np
-iter_policy Option#
Use this option to set a policy for automatic calculation of the number of iterations. Use one of the following arguments to override the default ITER_POLICY value defined in IMB_settings.h:
| Policy | Description |
|--------|-------------|
| dynamic | Reduces the number of iterations when the maximum run time per sample (see the -time option) is expected to be reached. Using this policy ensures faster execution, but may lead to inaccuracy of the results. |
| multiple_np | Reduces the number of iterations when the message size is getting bigger. Using this policy ensures the accuracy of the results, but may lead to longer execution time. You can control the execution time through the -time option. |
| auto | Automatically chooses which policy to use: applies multiple_np to collective operations where one of the processes acts as the root of the operation (for example, MPI_Bcast), and applies dynamic to all other operations. |
| off | The number of iterations does not change during the execution. |
You can also set the policy through the -iter option. See -iter.
By default, the ITER_POLICY defined in IMB_settings.h is used.
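For example, the following command (an illustrative invocation; the process count and benchmark are arbitrary) runs PingPong with the multiple_np policy:
mpirun -np 2 IMB-MPI1 -iter_policy multiple_np PingPong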
-time Option#
Specifies the number of seconds for the benchmark to run per message size. The argument after -time is a floating-point number.
The combination of this flag with the -iter flag or its default alternative ensures that the Intel(R) MPI Benchmarks always chooses the maximum number of repetitions that conform to all restrictions.
A rough number of repetitions per sample to fulfill the -time request is estimated in preparatory runs that use ~1 second of overhead.
Default: -time is activated. The floating-point value specifying the run-time seconds per sample is set in the SECS_PER_SAMPLE variable defined in IMB_settings.h or IMB_settings_io.h.
-mem Option#
Specifies the number of GB to be allocated per process for the message buffers. If the size is exceeded, a warning is returned, stating how much memory is required for the overall run.
The argument after -mem is a floating-point number.
Default: the memory is restricted by MAX_MEM_USAGE defined in IMB_mem_info.h.
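For example, the following illustrative command limits the message buffers to roughly 0.5 GBytes per process (the values are arbitrary):
mpirun -np 4 IMB-MPI1 -mem 0.5 Alltoall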
-input <File> Option#
Use the ASCII input file to select the benchmarks. For example, the IMB_SELECT_EXT file looks as follows:
#
# IMB benchmark selection file
#
# Every line must be a comment (beginning with #), or it
# must contain exactly one IMB benchmark name
#
#Window
Unidir_Get
#Unidir_Put
#Bidir_Get
#Bidir_Put
Accumulate
With the help of this file, the following command runs only the Unidir_Get and Accumulate benchmarks of the IMB-EXT component:
mpirun .... IMB-EXT -input IMB_SELECT_EXT
-msglen <File> Option#
Enter any set of non-negative message lengths to an ASCII file, line by line, and call the Intel(R) MPI Benchmarks with arguments:
-msglen Lengths
The Lengths value overrides the default message lengths. For IMB-IO, the file defines the I/O portion lengths.
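For example, a Lengths file with the following contents (an illustrative selection of sizes) restricts the run to the message sizes 0, 1024, 4096, and 1048576 bytes:
0
1024
4096
1048576
You can then pass the file on the command line, for example:
mpirun -np 2 IMB-MPI1 -msglen Lengths PingPong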
-map PxQ Option#
Use this option to re-number the ranks for parallel processes in MPI_COMM_WORLD along rows of the matrix:
0      P      …      (Q-2)P      (Q-1)P
1      P+1    …      (Q-2)P+1    (Q-1)P+1
…      …      …      …           …
P-1    2P-1   …      (Q-1)P-1    QP-1
For example, to run Multi-PingPong between two nodes with P processes on each (ppn=P), where each process on one node communicates with its counterpart on the other, call:
mpirun -np <2P> IMB-MPI1 -map <P>x2 -multi 0 PingPong
or:
mpirun -np <2P> IMB-MPI1 -map <P>x2 -multi 1 PingPong
The P*Q product must not be less than the total number of ranks; otherwise a command-line parsing error is issued. The P=1 and Q=1 cases are treated as meaningless and are just ignored.
See the examples below for a more detailed explanation of the -map option.
Example 1. PingPong benchmark with a 4x2 map, 8 ranks in total on 2 nodes.
a) -map 4x2 combined with -multi <outflag>, multiple mode:
mpirun -np 8 IMB-MPI1 -map 4x2 -multi 0 PingPong
The MPI_COMM_WORLD communicator originally consists of 8 ranks:
{ 0, 1, 2, 3, 4, 5, 6, 7 }
The given option -map 4x2 reorders this set of ranks into the following set (in terms of MPI_COMM_WORLD ranks):
{ 0, 4, 1, 5, 2, 6, 3, 7 }
The -multi <outflag> option makes Intel(R) MPI Benchmarks split the communicator into 4 subgroups, 2 ranks in each, with an MPI_Comm_split call. As a result, the communicator looks like this:
{ { 0, 1 }, { 0, 1 }, { 0, 1 }, { 0, 1 } }
In terms of the original MPI_COMM_WORLD rank numbers, this means that there are 4 groups of ranks, and the benchmark is executed simultaneously for each:
Group 1: { 0, 4 }; Group 2: { 1, 5 }; Group 3: { 2, 6 }; Group 4: { 3, 7 }
This grouping is shown in the benchmark output header and can be easily verified:
#-----------------------------------------------------------------------------
# Benchmarking Multi-PingPong
# ( 4 groups of 2 processes each running simultaneous )
# Group 0: 0 4
#
# Group 1: 1 5
#
# Group 2: 2 6
#
# Group 3: 3 7
As can be seen in the output, ranks in the pairs belong to different nodes, so this benchmark execution will measure inter-node communication parameters.
b) -map 4x2 without -multi <outflag>, non-multiple mode:
mpirun -np 8 IMB-MPI1 -map 4x2 PingPong
The same rules of rank number transformation are applied in this case, but since the multiple mode is not set, communicator splitting is not performed. Only two ranks will participate in actual communication, as the PingPong benchmark covers a pair of ranks only. The benchmark will cover only the first group:
Group: { 0, 4 }
and the other ranks from MPI_COMM_WORLD will be idle. This is reflected in the benchmark results output:
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2; rank order (rowwise):
# 0 4
#
# ( 6 additional processes waiting in MPI_Barrier)
Example 2. Biband benchmark with a 2x4 map, 8 ranks in total on 2 nodes.
a) -map 2x4 combined with -multi <outflag>, multiple mode:
mpirun -np 8 IMB-MPI1 -map 2x4 -multi 0 Biband
The MPI_COMM_WORLD communicator originally consists of 8 ranks:
{ 0, 1, 2, 3, 4, 5, 6, 7 }
The given option -map 2x4 reorders this set of ranks into the following set (in terms of MPI_COMM_WORLD ranks):
{ 0, 2, 4, 6, 1, 3, 5, 7 }
The communicator splitting required by the -multi <outflag> option then depends on the number of processes used for execution. In this case, 2-process, 4-process, and 8-process run cycles will be executed:
1) NP=2: the reordered communicator is split into 4 groups of 2 processes because of the multiple mode:
{ { 0, 1}, { 0, 1}, { 0, 1 }, { 0, 1 } }
In terms of MPI_COMM_WORLD ranks, the groups are:
Group 1: { 0, 2 }; Group 2: { 1, 3 }; Group 3: { 4, 6 }; Group 4: { 5, 7 }
#---------------------------------------------------
# Benchmarking Multi-Biband
# ( 4 groups of 2 processes each running simultaneous )
# Group 0: 0 2
#
# Group 1: 1 3
#
# Group 2: 4 6
#
# Group 3: 5 7
#
All the pairs belong to a single node here, so no cross-node benchmarking is performed in this case.
2) NP=4: the reordered communicator is split into 2 groups of 4 processes because of the multiple mode:
{ { 0, 1, 2, 3 }, { 0, 1, 2, 3 } }
In terms of MPI_COMM_WORLD ranks, the groups are:
Group 1: { 0, 2, 4, 6 }; Group 2: { 1, 3, 5, 7 };
#---------------------------------------------------
# Benchmarking Multi-Biband
# ( 2 groups of 4 processes each running simultaneous )
# Group 0: 0 2 4 6
#
# Group 1: 1 3 5 7
#
Execution groups mix ranks from different nodes in this case, and due to the Biband benchmark pair ordering rules (see Biband), only inter-node pairs will be tested.
3) NP=8: no communicator splitting can be performed, since the ranks fit in only a single group:
Group: { 0, 2, 4, 6, 1, 3, 5, 7 }
#---------------------------------------------------
# Benchmarking Biband
# #processes = 8; rank order (rowwise):
# 0 2 4 6
#
# 1 3 5 7
#
The group is spread half-by-half across the 2 execution nodes, but as a result of the reordering, all the pairs in the Biband test (see Biband) appear to be intra-node ones, which is the exact opposite of the default case (no -map option) and of the NP=4 case.
b) -map 2x4 without the -multi <outflag> option, non-multiple mode:
mpirun -np 8 IMB-MPI1 -map 2x4 Biband
The same rules of rank number transformation are applied in this case, but since the multiple mode is not set, no communicator splitting is performed. The set of ranks covered by the benchmark depends on the number of processes used for execution. In this case, 2-process, 4-process, and 8-process run cycles will be executed, and they simply use the first 2, 4, and 8 ranks of the reordered communicator for actual benchmark execution:
1) NP=2: the first 2 ranks of the reordered communicator form the group (in terms of MPI_COMM_WORLD ranks):
Group: { 0, 2 };
#---------------------------------------------------
# Benchmarking Biband
# #processes = 2; rank order (rowwise):
# 0 2
#
# ( 6 additional processes waiting in MPI_Barrier)
2) NP=4: the first 4 ranks of the reordered communicator form the group (in terms of MPI_COMM_WORLD ranks):
Group: { 0, 2, 4, 6 };
#---------------------------------------------------
# Benchmarking Biband
# #processes = 4; rank order (rowwise):
# 0 2 4 6
#
# ( 4 additional processes waiting in MPI_Barrier)
3) NP=8: all the ranks of the reordered communicator form the group (in terms of MPI_COMM_WORLD ranks):
Group: { 0, 2, 4, 6, 1, 3, 5, 7 }
#---------------------------------------------------
# Benchmarking Biband
# #processes = 8; rank order (rowwise):
# 0 2 4 6
#
# 1 3 5 7
#
As can be seen in the output, the NP=2 and NP=4 executions of the Biband test launched with and without the -multi <outflag> option are almost the same. The only difference is that in the non-multiple mode only one group is active, and all other processes are idle. For the NP=8 case, the Biband benchmark executions performed with and without the -multi <outflag> option are completely identical.
-include [[benchmark1] benchmark2 …]#
Specifies the list of additional benchmarks to run. For example, to add the PingPongAnySource and PingPingAnySource benchmarks, call:
mpirun -np 2 IMB-MPI1 -include PingPongAnySource PingPingAnySource
-exclude [[benchmark1] benchmark2 …]#
Specifies the list of benchmarks to be excluded from the run. For example, to exclude Alltoall and Allgather, call:
mpirun -np 2 IMB-MPI1 -exclude Alltoall Allgather
-msglog [<minlog>:]<maxlog>#
This option allows you to control the lengths of the transfer messages. This setting overrides the MINMSGLOG and MAXMSGLOG values. The new message sizes are 0, 2^minlog, ..., 2^maxlog.
For example, if you run the following command line:
mpirun -np 2 IMB-MPI1 -msglog 3:7 PingPong
Intel(R) MPI Benchmarks selects the lengths 0, 8, 16, 32, 64, 128, as shown below:
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[μsec] Mbytes/sec
0 1000 0.70 0.00
8 1000 0.73 10.46
16 1000 0.74 20.65
32 1000 0.94 32.61
64 1000 0.94 65.14
128 1000 1.06 115.16
Alternatively, to specify only the maxlog value, enter:
mpirun -np 2 IMB-MPI1 -msglog 3 PingPong
In this case, Intel(R) MPI Benchmarks selects the lengths 0, 1, 2, 4, 8:
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[μsec] Mbytes/sec
0 1000 0.69 0.00
1 1000 0.72 1.33
2 1000 0.71 2.69
4 1000 0.72 5.28
8 1000 0.73 10.47
-thread_level Option#
This option specifies the desired thread level for MPI_Init_thread(). See the description of MPI_Init_thread() for details. The option is available only if the Intel(R) MPI Benchmarks is built with the USE_MPI_INIT_THREAD macro defined. Possible values for <level> are single, funneled, serialized, and multiple.
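For example, assuming the benchmarks were built with USE_MPI_INIT_THREAD defined, the following illustrative command requests the multiple thread level:
mpirun -np 2 IMB-MPI1 -thread_level multiple PingPong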
-sync Option#
This option is relevant only for benchmarks measuring collective operations. It controls whether all ranks are synchronized after every iteration step by means of the MPI_Barrier operation. The -sync option can take the following arguments:
| Argument | Description |
|----------|-------------|
| 0 \| off \| disable \| no | Disables process synchronization at each iteration step. |
| 1 \| on \| enable \| yes | Enables process synchronization at each iteration step. This is the default value. |
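For example, the following illustrative command disables the per-iteration barrier for the Allreduce benchmark:
mpirun -np 4 IMB-MPI1 -sync 0 Allreduce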
-imb_barrier Option#
Implementation of the MPI_Barrier operation may vary depending on the MPI implementation. Each MPI implementation might use a different algorithm for the barrier, with possibly different synchronization characteristics, so Intel(R) MPI Benchmarks results may vary significantly as a result of MPI_Barrier implementation differences. The internal, MPI-independent barrier function IMB_barrier is provided to make the synchronization effect more reproducible.
Use this option to employ the IMB_barrier function and get consistent results for collective operation benchmarks.
| Argument | Description |
|----------|-------------|
| 0 \| off \| disable \| no | Use the standard MPI_Barrier operation for synchronization. |
| 1 \| on \| enable \| yes | Use the internal barrier implementation for synchronization. |
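For example, the following illustrative command uses the internal IMB_barrier for synchronization in the Allreduce benchmark:
mpirun -np 4 IMB-MPI1 -imb_barrier 1 Allreduce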
-root_shift Option#
This option is relevant only for benchmarks measuring collective operations that utilize the root concept (for example, MPI_Bcast, MPI_Reduce, MPI_Gather). It defines whether the root is changed at every iteration step or not. The -root_shift option can take the following arguments:
| Argument | Description |
|----------|-------------|
| 0 \| off \| disable \| no | Disables root change at each iteration step. Rank 0 acts as a root at each iteration step. This is the default value. |
| 1 \| on \| enable \| yes | Enables root change at each iteration step. The root rank is changed in a round-robin fashion. |
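For example, the following illustrative command rotates the root rank in a round-robin fashion for the Bcast benchmark:
mpirun -np 8 IMB-MPI1 -root_shift on Bcast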
-data_type Option#
Specifies the data type to be used. The -data_type option can take a byte, char, int, float, float16, or bfloat16 argument. The default value is byte.
The option is available for MPI-1 only.
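For example, the following illustrative command runs PingPong with int elements:
mpirun -np 2 IMB-MPI1 -data_type int PingPong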
-red_data_type Option#
Specifies the type of reduction to be used. The -red_data_type
option can take char
, int
, float
, float16
, or bfloat16
argument. The default
value is float
.
The option is available for MPI-1 only.
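For example, the following illustrative command runs Allreduce with int reduction data:
mpirun -np 4 IMB-MPI1 -red_data_type int Allreduce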
-contig_type Option#
Specifies the predefined type to be used.
| Argument | Description |
|----------|-------------|
| base | A simple MPI type (for example, MPI_INT or MPI_CHAR). |
| base_vec | A vector of base. |
| resize | A simple MPI type with an extent(type) = 2*size(type). |
| resize_vec | A vector of resize. |
The option is available for MPI-1 only.
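For example, the following illustrative command runs PingPong with a vector of the base type:
mpirun -np 2 IMB-MPI1 -contig_type base_vec PingPong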
-zero_size Option#
Do not run benchmarks with the message size 0.
| Argument | Description |
|----------|-------------|
| 0 \| off \| disable \| no | Allows running benchmarks with the zero message size. |
| 1 \| on \| enable \| yes | Does not allow running benchmarks with the zero message size. This is the default value. |
The option is available for MPI-1 only.
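For example, the following illustrative command includes the zero message size in a PingPong run:
mpirun -np 2 IMB-MPI1 -zero_size off PingPong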
-mem_alloc_type Option#
| Argument | Description |
|----------|-------------|
| device | Allocates device memory. This is the default value. |
| host | Allocates host memory registered on the GPU device. |
| shared | Allocates shared memory. |
| cpu | Allocates host memory. |
The option is available for MPI-1 with GPU support only.
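For example, assuming a GPU-enabled build of the benchmarks (the IMB-MPI1-GPU binary name is used here for illustration and may differ in your installation), the following command allocates the message buffers in host memory registered on the GPU device:
mpirun -np 2 IMB-MPI1-GPU -mem_alloc_type host PingPong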