IMB-IO Non-blocking Benchmarks
Intel(R) MPI Benchmarks implements blocking and nonblocking modes of the IMB-IO benchmarks as different benchmark flavors. The Read and Write components of the blocking benchmark name are replaced for nonblocking flavors by IRead and IWrite, respectively.
The definitions of blocking and nonblocking flavors are identical, except for their behavior in regard to:
Aggregation. The nonblocking versions only run in the non-aggregate mode.
Synchronism. Only the meaning of an elementary transfer differs from the equivalent blocking benchmark.
Basically, an elementary transfer looks as follows:
time = MPI_Wtime()
for ( i=0; i<n_sample; i++ )
{
    Initiate transfer
    Exploit CPU
    Wait for the end of transfer
}
time = (MPI_Wtime()-time)/n_sample
The Exploit CPU section in the above example is arbitrary. Intel(R) MPI Benchmarks exploits the CPU as described below.
Exploiting CPU
Intel(R) MPI Benchmarks uses the following method to exploit the CPU. A kernel loop is executed repeatedly. The kernel is a fully vectorizable multiplication of a 100x100 matrix with a vector. The function is scalable in the following way:
IMB_cpu_exploit(float desired_time, int initialize);
The input value of desired_time determines the time for the function to execute the kernel loop, with a slight variance. At the very beginning, the function is called with initialize=1 and an input value for desired_time. This determines an Mflop/s rate and a timing t_CPU, as close as possible to desired_time, obtained by running without any obstruction. During the actual benchmarking, IMB_cpu_exploit is called with initialize=0, concurrently with the particular I/O action, and always performs the same type and number of operations as in the initialization step.
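The calibrate-then-replay behavior described above can be sketched as follows. This is a minimal illustration, not the Intel(R) MPI Benchmarks source: the name cpu_exploit_sketch, the helpers, the use of clock() as the timer, and the batch-doubling calibration strategy are all assumptions made for this example.

```c
#include <time.h>

#define N 100                      /* kernel works on a 100x100 matrix */

static float  a[N][N], x[N], y[N];
static long   n_reps = 0;          /* repetitions fixed during calibration */
static double t_cpu  = 0.0;        /* measured dedicated time, in seconds */
static double mflops = 0.0;        /* rate measured during calibration */

/* Assumption: clock() stands in for whatever timer the benchmark uses. */
static double now_sec(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* One kernel pass: y = A * x, i.e. 2*N*N floating-point operations. */
static void kernel(void)
{
    for (int i = 0; i < N; i++) {
        float s = 0.0f;
        for (int j = 0; j < N; j++)
            s += a[i][j] * x[j];
        y[i] = s;
    }
    x[0] = y[0];   /* feed one result back so passes are not optimized away */
}

/* Run the kernel a given number of times and return the elapsed time. */
static double run_passes(long passes)
{
    double t0 = now_sec();
    for (long r = 0; r < passes; r++)
        kernel();
    return now_sec() - t0;
}

/* Hypothetical stand-in for IMB_cpu_exploit(desired_time, initialize). */
void cpu_exploit_sketch(float desired_time, int initialize)
{
    if (initialize) {
        for (int i = 0; i < N; i++) {
            x[i] = 1.0f;
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0f / N;
        }
        /* Grow a trial batch until it is long enough to time reliably. */
        long trial = 64;
        double dt = run_passes(trial);
        while (dt < 0.01 && trial < (1L << 24)) {
            trial *= 2;
            dt = run_passes(trial);
        }
        /* Scale the batch so one call lasts about desired_time. */
        n_reps = (dt > 0.0) ? (long)(desired_time * trial / dt) : trial;
        if (n_reps < 1)
            n_reps = 1;
    }

    /* Every call performs the same type and number of operations. */
    double elapsed = run_passes(n_reps);

    if (initialize && elapsed > 0.0) {
        t_cpu  = elapsed;
        mflops = 2.0 * N * N * (double)n_reps / elapsed / 1.0e6;
    }
}
```

In this sketch, a call with initialize=1 fixes n_reps and records t_cpu and mflops; later calls with initialize=0 ignore desired_time and simply replay the same n_reps kernel passes, which is what makes the dedicated timing comparable to the overlapped one.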
Displaying Results
Three timings are crucial to interpret the behavior of nonblocking I/O, overlapped with CPU exploitation:
t_pure is the time for the corresponding pure blocking I/O action, non-overlapping with CPU activity
t_CPU is the time the IMB_cpu_exploit periods (running concurrently with nonblocking I/O) would use when running dedicated
t_ovrl is the time for the analogous nonblocking I/O action, concurrent with CPU activity (exploiting t_CPU when running dedicated)
A perfect overlap means: t_ovrl = max(t_pure, t_CPU).
No overlap means: t_ovrl = t_pure + t_CPU.
The actual amount of overlap is:
overlap = (t_pure + t_CPU - t_ovrl) / min(t_pure, t_CPU)    (*)
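As a quick check of the (*) formula, the following sketch evaluates it for the two limiting cases and one intermediate case. The function name imb_overlap and the guard against non-positive timings are assumptions of this example, not part of Intel(R) MPI Benchmarks.

```c
/* Estimated overlap per formula (*): 1.0 means perfect overlap, 0.0 means none.
   Hypothetical helper; returns 0.0 when either input time is not positive
   (this guard is an assumption of the sketch). */
double imb_overlap(double t_pure, double t_cpu, double t_ovrl)
{
    double t_min = (t_pure < t_cpu) ? t_pure : t_cpu;
    if (t_min <= 0.0)
        return 0.0;
    return (t_pure + t_cpu - t_ovrl) / t_min;
}
```

With t_pure = 2 and t_CPU = 3, a measured t_ovrl of 3 (the maximum of the two) gives an overlap of 1, a t_ovrl of 5 (their sum) gives 0, and a t_ovrl of 4 gives 0.5.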
The Intel(R) MPI Benchmarks result tables report the timings t_ovrl, t_pure, and t_CPU, together with the estimated overlap obtained by the (*) formula above. At the beginning of a run, the Mflop/s rate corresponding to t_CPU is displayed.