IMB-IO Non-blocking Benchmarks
Intel(R) MPI Benchmarks implements blocking and nonblocking modes of the
IMB-IO benchmarks as different benchmark flavors. The Read and
Write components of the blocking benchmark name are replaced for
nonblocking flavors by IRead and IWrite, respectively.
The definitions of blocking and nonblocking flavors are identical, except for their behavior in regard to:
Aggregation. The nonblocking versions only run in the non-aggregate mode.
Synchronism. Only the meaning of an elementary transfer differs from the equivalent blocking benchmark.
Basically, an elementary transfer looks as follows:
time = MPI_Wtime();
for (i = 0; i < n_sample; i++)
{
    /* Initiate transfer */
    /* Exploit CPU */
    /* Wait for the end of transfer */
}
time = (MPI_Wtime() - time) / n_sample;
The Exploit CPU section in the above example is arbitrary. Intel(R)
MPI Benchmarks exploits CPU as described below.
Exploiting CPU
Intel(R) MPI Benchmarks uses the following method to exploit the CPU. A kernel loop is executed repeatedly. The kernel is a fully vectorizable multiplication of a 100x100 matrix with a vector. The function is scalable in the following way:
IMB_cpu_exploit(float desired_time, int initialize);
The input value of desired_time determines the time for the function
to execute the kernel loop, with a slight variance. At the very
beginning, the function is called with initialize=1 and an input
value for desired_time. This determines an Mflop/s rate and a timing
t_CPU, as close as possible to desired_time, obtained by running
without any obstruction. During the actual benchmarking,
IMB_cpu_exploit is called with initialize=0, concurrently with
the particular I/O action, and always performs the same type and number
of operations as in the initialization step.
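The calibrate-then-replay behavior described above can be sketched in C as follows. This is a minimal illustration, not the Intel(R) MPI Benchmarks source: the name cpu_exploit_sketch, the return value (number of kernel passes), the matrix contents, and the use of clock() for timing are all assumptions made for the example. Only the 100x100 matrix-vector kernel and the meaning of desired_time and initialize come from the text.

```c
#include <assert.h>
#include <time.h>

#define DIM 100  /* kernel size from the text: 100x100 matrix times vector */

static float mat[DIM][DIM], vec_in[DIM], vec_out[DIM];
static volatile float checksum;    /* keeps the compiler from dropping the kernel */
static long calibrated_reps = 0;   /* kernel passes fixed during initialization */

/* One kernel pass: vec_out = mat * vec_in (2*DIM*DIM floating-point operations). */
static void kernel(void)
{
    float total = 0.0f;
    for (int i = 0; i < DIM; i++) {
        float sum = 0.0f;
        for (int j = 0; j < DIM; j++)
            sum += mat[i][j] * vec_in[j];
        vec_out[i] = sum;
        total += sum;
    }
    checksum = total;
}

/* Assumption: CPU time from clock() stands in for the benchmark's own timer. */
static double now_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Hypothetical stand-in for IMB_cpu_exploit; returns the passes executed. */
long cpu_exploit_sketch(float desired_time, int initialize)
{
    if (initialize) {
        /* Fill the operands once with arbitrary, well-behaved values. */
        for (int i = 0; i < DIM; i++) {
            vec_in[i] = 1.0f;
            for (int j = 0; j < DIM; j++)
                mat[i][j] = 1.0f / (float)(i + j + 1);
        }
        /* Calibrate: count kernel passes until desired_time has elapsed. */
        double start = now_seconds();
        long reps = 0;
        do {
            kernel();
            reps++;
        } while (now_seconds() - start < desired_time);
        calibrated_reps = reps;
        return reps;
    }
    /* Replay exactly the calibrated amount of work, regardless of timing. */
    assert(calibrated_reps > 0);  /* initialization must come first */
    for (long r = 0; r < calibrated_reps; r++)
        kernel();
    return calibrated_reps;
}
```

In this sketch, a caller would invoke cpu_exploit_sketch(desired_time, 1) once before benchmarking and cpu_exploit_sketch(desired_time, 0) inside each elementary transfer. An Mflop/s rate could be derived as 2*DIM*DIM times the calibrated pass count, divided by the measured time and by 1e6.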
Displaying Results
Three timings are crucial to interpret the behavior of nonblocking I/O, overlapped with CPU exploitation:
t_pure is the time for the corresponding pure blocking I/O action, non-overlapping with CPU activity
t_CPU is the time the IMB_cpu_exploit periods (running concurrently with nonblocking I/O) would use when running dedicated
t_ovrl is the time for the analogous nonblocking I/O action, concurrent with CPU activity (exploiting t_CPU when running dedicated)
A perfect overlap means: t_ovrl = max(t_pure,t_CPU)
No overlap means: t_ovrl = t_pure+t_CPU.
The actual amount of overlap is:
overlap = (t_pure + t_CPU - t_ovrl) / min(t_pure, t_CPU)    (*)
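As a quick check of the (*) formula, the following C sketch computes the overlap from the three timings. The function name overlap_estimate is hypothetical and is not part of Intel(R) MPI Benchmarks. It assumes all three timings use the same unit and that both t_pure and t_CPU are positive.

```c
#include <assert.h>

/* Overlap estimate per the (*) formula; hypothetical helper for illustration. */
double overlap_estimate(double t_pure, double t_cpu, double t_ovrl)
{
    /* min(t_pure, t_CPU): the most time that could possibly be hidden. */
    double t_min = (t_pure < t_cpu) ? t_pure : t_cpu;
    assert(t_min > 0.0);  /* assumption: both timings are positive */
    return (t_pure + t_cpu - t_ovrl) / t_min;
}
```

With t_pure = 2 and t_CPU = 1, a perfect overlap (t_ovrl = max(2, 1) = 2) gives 1, and no overlap (t_ovrl = 2 + 1 = 3) gives 0, matching the two limiting cases stated above.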
The Intel(R) MPI Benchmarks result tables report the timings
t_ovrl, t_pure, and t_CPU, together with the estimated overlap obtained
by the (*) formula above. At the beginning of a run, the Mflop/s rate
corresponding to t_CPU is displayed.