Use Intel® VTune™ Profiler to identify imbalances and communication issues in MPI-enabled applications.
Content experts: Rupak Roy, Xiao Zhu
This section lists the hardware and software tools used for the performance analysis scenario.
Application: heart_demo sample application
Tools:
Intel® C++ Compiler
Intel® MPI Library 2021.11
Intel VTune Profiler 2024.0 or newer
Intel VTune Profiler - Application Performance Snapshot
Get a free download of the Intel® MPI Library from https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html.
Operating system: Linux*
CPU: Intel® Xeon® Platinum 8480+ Processor (formerly code named Sapphire Rapids)
Build your application with debug symbols so Intel VTune Profiler can correlate performance data with your source code and assembly.
Clone the application GitHub repository to your local system:
git clone https://github.com/CardiacDemo/Cardiac_demo.git
Set up the Intel C++ Compiler and Intel® MPI Library environment:
source <compiler_install_dir>/oneapi/setvars.sh
In the root level of the sample package, create a build directory and change into it:
mkdir build
cd build
Build the application:
mpiicpx ../heart_demo.cpp ../luo_rudy_1991.cpp ../rcm.cpp ../mesh.cpp -g -o heart_demo -O3 -std=c++17 -qopenmp -parallel-source-info=2
The executable heart_demo should be present in the current directory.
Start tuning your MPI application by examining a snapshot of its performance, collected by Application Performance Snapshot in VTune Profiler. With this snapshot, you can understand the general properties of your application. Then focus on problematic areas using appropriate tools.
We begin by preparing a performance snapshot on a set of dual-socket nodes that use Intel® Xeon® Scalable processors (code named Sapphire Rapids). This example uses the Intel® Xeon® Platinum 8480+ Processor with 24 cores per socket. The run is configured with 4 MPI ranks per node and 12 threads per rank. Modify the rank and thread counts in this example to match your own system specification.
To obtain a performance snapshot on four nodes, run this command in an interactive session or in a batch script:
export OMP_NUM_THREADS=12
mpirun -np 16 -ppn 4 aps ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 100
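The rank and thread counts in this command follow from the node layout. The sketch below shows one way to derive them, assuming the values stated in this recipe (2 sockets per node, 24 cores per socket, 4 ranks per node, 4 nodes) and one OpenMP thread per physical core. The variable names are illustrative and are not part of any Intel tool. The script only prints the resulting commands so that you can review them before running anything.

```shell
# Node layout assumed from this recipe; adjust to match your system.
SOCKETS_PER_NODE=2
CORES_PER_SOCKET=24
NODES=4
RANKS_PER_NODE=4

# One OpenMP thread per physical core, split evenly across the ranks on a node.
CORES_PER_NODE=$((SOCKETS_PER_NODE * CORES_PER_SOCKET))
THREADS_PER_RANK=$((CORES_PER_NODE / RANKS_PER_NODE))
TOTAL_RANKS=$((NODES * RANKS_PER_NODE))

# Print the commands instead of running them.
echo "export OMP_NUM_THREADS=${THREADS_PER_RANK}"
echo "mpirun -np ${TOTAL_RANKS} -ppn ${RANKS_PER_NODE} aps ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 100"
```

If the core count is not evenly divisible by the number of ranks per node, the integer division leaves some cores idle, so you may prefer a different rank count in that case.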
When the analysis is complete, you can find profiling data in a directory named aps_result_YYYYMMDD, where YYYYMMDD is the date of collection.
For example, to produce a single-page HTML snapshot of the results collected on December 5, 2023, type:
aps --report ./aps_result_20231205
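If you generate the report on the same day as the collection, you can build the directory name instead of typing the date. This is a minimal sketch: aps_result_dir is a hypothetical helper, and the script assumes that the snapshot was collected today in the current directory and that the directory name carries no extra suffix.

```shell
# Hypothetical helper: build the result directory name for a date given as YYYYMMDD.
aps_result_dir() {
    echo "./aps_result_$1"
}

# Assumes the snapshot was collected today, in the current directory.
result_dir=$(aps_result_dir "$(date +%Y%m%d)")

# Generate the report only if the directory exists and aps is available.
if [ -d "$result_dir" ] && command -v aps >/dev/null 2>&1; then
    aps --report "$result_dir"
else
    echo "No result at $result_dir, or aps is not on PATH"
fi
```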
The aps_report_YYYYMMDD_<stamp>.html file is created in your working directory, where the <stamp> number is used to prevent overwriting existing reports. The report contains information on overall performance, including:
A note at the top of the report highlights the main areas of concern for the application.
The snapshot indicates that this application is bound overall by MPI communication. The application also suffers from:
This snapshot result points to complex issues in the code. To continue investigating the performance issues and isolate the problems, let us run the HPC Performance Characterization analysis in VTune Profiler next.
Most clusters are set up with login and compute nodes. Typically, a user connects to a login node and uses a scheduler to submit a job to the compute nodes, where it executes. In a cluster environment, the most practical way to profile an MPI application with VTune Profiler is to collect data from the command line and then analyze the results in the GUI once the job has completed.
To report MPI-related metrics in a distributed environment, type:
<mpi launcher> [options] vtune [options] -r <results dir> -- <application> [arguments]
Follow these steps to run the HPC Performance Characterization analysis in VTune Profiler from the command line:
Prepare your environment by sourcing the VTune Profiler files. For a default installation using the bash shell, use this command:
source /opt/intel/oneapi/vtune/latest/env/vars.sh
Collect data for the heart_demo application using the hpc-performance analysis. The application uses both OpenMP and MPI. The run uses the configuration described earlier: 16 MPI ranks across 4 compute nodes with the Intel® MPI Library. This example runs on Intel® Xeon® Platinum 8480+ Processors and uses 12 OpenMP threads per MPI rank:
export OMP_NUM_THREADS=12
mpirun -np 16 -ppn 4 vtune -collect hpc-performance -r vtune_mpi -- ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 100
The analysis begins and generates four output directories using the following naming convention: vtune_mpi.<node host name>.
You can select specific MPI ranks to be profiled while running others simultaneously, but without collecting profiling data. For details, see Selective MPI Rank Profiling.
Open one of the collected results in the VTune Profiler user interface:
vtune-gui ./vtune_mpi.node_1
To display the Intel VTune Profiler GUI, you need an X11 manager running on the local system or a VNC session connected to the system. Since each system is different, consult with your local administrator for a recommended method.
The result opens in Intel VTune Profiler and shows the Summary window. This window provides an overview of the application performance. Because heart_demo is an MPI parallel application, the Summary window shows MPI Imbalance information and details regarding the MPI rank in the execution critical path in addition to the usual metrics.
In our example, there is some imbalance and also a significant amount of time spent in serial regions of the code (not shown in the figure).
While you can collect profiles across nodes, the only way to view all MPI data is to load each node result independently. For detailed MPI traces, use Intel® Trace Analyzer and Collector.
In Intel VTune Profiler 2024.0 (and newer versions), the Summary window contains histograms of bandwidth utilization. The metrics show bandwidth and packet rate and indicate the percentage of the execution time for which the code was bound by high bandwidth or packet rate utilization. The histogram shows a maximum DRAM bandwidth utilization of 6 GB/s, which is low. This tells us that there is still room for improvement.
Switch to the Bottom-up tab to get more details. Set the Grouping to have Process at the top level. You should see this view:
Since this code uses both MPI and OpenMP, the Bottom-up window shows metrics related to both runtimes, in addition to the CPU and memory data. In our example, the OpenMP* Imbalance metric is highlighted in red. This hints that threading improvements could help performance.
Review the execution timeline for several metrics at the bottom of the Bottom-up window, including DDR and MCDRAM bandwidth, as well as CPU time. The UPI bandwidth timeline for this code shows continuous utilization at a moderate bandwidth (the scale is in GB/s).
Of more interest is the detailed execution time per thread and the breakdown of these metrics:
In this case you should see that there is little effective time in most of the threads (green) and that the amount of MPI overhead is also small (yellow). This points to potential issues in the threading implementation.
To investigate this further,
Selecting this grouping clarifies the role of each MPI rank and each thread. The top bar for each process shows the average result for all of its child threads. Below that average, each thread is listed with its own thread number and process ID.
In our example, the primary thread takes care of all MPI communication for each MPI rank. This behavior is common in hybrid applications. A significant amount of time is spent in MPI communication (yellow) in the first ten seconds of the execution, likely to set up the problem and distribute data. After that period, there is regular MPI communication, which matches the results observed in the Bandwidth Utilization timeline and the Summary report.
The high amount of spin and overhead (shown in red by default) is noticeable. This indicates issues with the way threading was implemented in the application.
You can configure an analysis in Intel VTune Profiler using the GUI and then save the equivalent command to run the analysis directly from the command line. Use this feature for heavily customized profiles or for quickly building a complex command.
In the Where pane, select Arbitrary Host (not connected) and specify the hardware platform.
In the What pane:
In the How pane, change the default Hotspots analysis to HPC Performance Characterization. Customize the available options.
For Intel MPI, the command line is generated in terms of the -gtool option. Use this option to simplify selective rank profiling syntax.
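To illustrate the -gtool form, the sketch below assembles a command that profiles only a chosen set of ranks while the remaining ranks run unprofiled. It assumes the Intel MPI `-gtool "<tool command>:<rank set>"` syntax. The helper name gtool_command, the rank list, and the result name are placeholders, and the command that the VTune Profiler GUI generates for your configuration may differ.

```shell
# Hypothetical helper: print an mpirun command that attaches vtune to selected ranks only.
# $1 = rank set in Intel MPI notation, for example "0,4,8,12" or "0-3"
gtool_command() {
    ranks=$1
    echo "mpirun -np 16 -ppn 4 -gtool \"vtune -collect hpc-performance -r vtune_mpi:${ranks}\" ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 100"
}

# Example: profile the first rank on each of the four nodes, assuming
# 4 ranks per node placed in consecutive order.
gtool_command "0,4,8,12"
```

The helper prints the command rather than running it, so you can check the rank set before you submit the job.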
Intel VTune Profiler provides informative command line text reports. For example, to obtain a summary report, run:
vtune -report summary -r ./results_dir
A summary of the results prints to the screen. Options to save the output directly to a file and in other formats (csv, xml, html) are also available. For details on the full set of command line options, type vtune -help on the command line or see Intel® VTune™ Profiler Command Line Interface.
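Because an MPI collection produces one result directory per node, it can be convenient to write a summary for each of them in one pass. The sketch below assumes that the results use the vtune_mpi prefix from this recipe and saves each summary as CSV with the -format, -csv-delimiter, and -report-output options. The helper node_results and the output file names are illustrative.

```shell
# Hypothetical helper: list per-node result directories that share a common prefix.
node_results() {
    for dir in "$1".*; do
        [ -d "$dir" ] && echo "$dir"
    done
    return 0
}

# Write one CSV summary per node result, if vtune is available.
# Assumes directory names contain no spaces.
for result in $(node_results ./vtune_mpi); do
    if command -v vtune >/dev/null 2>&1; then
        vtune -report summary -r "$result" -format csv -csv-delimiter comma \
              -report-output "$(basename "$result")_summary.csv"
    else
        echo "vtune not found; would report on $result"
    fi
done
```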
By default, Intel VTune Profiler collects performance statistics for the whole application. Versions 2019.3 and newer of Intel VTune Profiler let you control data collection for MPI applications. There are several advantages to this capability:
Region selection uses the standard MPI_Pcontrol function. Call MPI_Pcontrol(0) to pause data collection and MPI_Pcontrol(1) to resume it.
You can use the API together with the command line option -start-paused to exclude the application initialization phase. In this case, an MPI_Pcontrol(1) call should follow right after initialization to resume data collection. This method of controlling collection requires no changes to the application build process, unlike ITT API calls, which require linking a static ITT API library.
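On the collection side, excluding the initialization phase might look like the following sketch. It reuses the rank layout from earlier in this recipe and assumes that heart_demo has been modified to call MPI_Pcontrol(1) immediately after its initialization phase, which the sample may not do as shipped. The result name vtune_mpi_paused is a placeholder. The command is printed rather than executed.

```shell
# Collection starts paused; the application is assumed to call MPI_Pcontrol(1)
# after its initialization phase to resume collection.
RESULT_PREFIX=vtune_mpi_paused   # placeholder result name
COLLECT_CMD="mpirun -np 16 -ppn 4 vtune -collect hpc-performance -start-paused -r ${RESULT_PREFIX} -- ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 100"

export OMP_NUM_THREADS=12
echo "$COLLECT_CMD"
```

If the application never calls MPI_Pcontrol(1), a collection started with -start-paused stays paused and the result contains little or no data, so check that the call is reached on every profiled rank.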