Analyze Vectorization and Memory Aspects of an MPI Application

Because a distributed HPC application runs on a collection of discrete nodes, optimizing MPI communication across and within nodes is not enough: you must also account for per-node optimizations such as vectorization. This recipe explains how to use the vectorization and memory analysis capabilities and recommendations of Intel® Advisor to analyze an MPI application.

To analyze an MPI application with the Intel Advisor, do the following:

  1. Prerequisites.

  2. Survey your target application.

  3. Collect Trip Counts and FLOP data and review the results.

  4. Review the Roofline report.

  5. [Optional] Run the Dependencies analysis.

  6. [Optional] Run the Memory Access Patterns analysis.

Scenario

You can collect data for MPI applications only with the Intel Advisor CLI, but you can view the results with the standalone GUI, as well as the command line. You can also use the GUI to generate required command lines. For more information about this feature, see Generate Command Lines from GUI in the Intel® Advisor User Guide.

This recipe describes an example workflow for analyzing the Weather Research and Forecasting (WRF) Model*, a popular MPI-based numerical application for weather prediction. Depending on the type of your MPI application, you can collect data on a different number of ranks:

Ingredients

This section lists the hardware and software used to produce the specific result shown in this recipe:

Prerequisites

  1. Set up the environment for the required software:

    source <compilers_installdir>/bin/compilervars.sh intel64
    source <mpi_library_installdir>/intel64/bin/mpivars.sh
    source <advisor_installdir>/advixe-vars.sh
    

    To verify that you successfully set up the tools, you can run the following commands. You should get the product versions.

    mpiicc -v
    mpiifort -v
    mpiexec -V
    advixe-cl --version
    
  2. Set the environment variables required for the WRF application:

    export LD_LIBRARY_PATH=/path_to_IO_libs/lib:$LD_LIBRARY_PATH
    ulimit -s unlimited
    export WRFIO_NCD_LARGE_FILE_SUPPORT=1
    export KMP_STACKSIZE=512M
    export OMP_NUM_THREADS=1
    
  3. Build the application in Release mode. Compiling with the -g flag is recommended so that Intel Advisor can show source names and locations.
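Before launching any collection, it can help to confirm in one step that every tool from the setup above is reachable. The following is a minimal sketch, not part of the Intel tooling: the function name check_tools is hypothetical, and it only tests whether each command name resolves on PATH.

```shell
# Pre-flight check: print every command from the argument list that is
# not found on PATH, one per line. Returns non-zero if any is missing.
# check_tools is an illustrative helper name, not an Intel Advisor command.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call with the tools this recipe relies on (not run automatically):
# check_tools mpiicc mpiifort mpiexec advixe-cl || echo "set up the environment first"
```

If the function prints a tool name, re-run the corresponding source command from step 1 in the same shell session.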

Survey Your Target Application

The first step is to run the Survey analysis on the target application using the Intel Advisor CLI. This analysis type collects high-level details about the target application.

To run the WRF application with 48 ranks on a single node with an Intel® Xeon® processor and attach the Survey analysis to rank 0 only, execute the following command:

mpiexec -genvall -n 48 -ppn 48 -gtool "advixe-cl --collect=survey --project-dir=<project_dir>/project1:0" ./wrf.exe

This command generates a result folder that contains the Survey data for rank 0 only.

You can also run the analysis for a set of ranks, for example 0, 10 through 15, and 47:

mpiexec -genvall -n 48 -ppn 48 -gtool "advixe-cl --collect=survey --project-dir=<project_dir>/project1:0,10-15,47" ./wrf.exe

For details on MPI command syntax with the Intel Advisor, see Analyze MPI Workloads.
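The rank set is the part after the final colon in the -gtool value, so a typo there silently changes which ranks are analyzed. As a sketch under stated assumptions, the helper below validates only the rank-set forms shown in this recipe (single ranks and inclusive ranges separated by commas) and then assembles the -gtool value. The function names valid_rank_set and survey_gtool_arg are hypothetical, and other rank-set forms that your MPI launcher may accept are deliberately rejected.

```shell
# Accept only the rank-set forms shown in this recipe: single ranks and
# inclusive ranges, separated by commas (for example 0,10-15,47).
# ASSUMPTION: any other form is treated as invalid by this sketch.
valid_rank_set() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$'
}

# Build the -gtool value "<tool command>:<rank set>" for a Survey run.
# $1 = project directory, $2 = rank set.
survey_gtool_arg() {
    valid_rank_set "$2" || { echo "invalid rank set: $2" >&2; return 1; }
    printf 'advixe-cl --collect=survey --project-dir=%s:%s\n' "$1" "$2"
}
```

A possible use is to substitute the result into the launch line, for example: mpiexec -genvall -n 48 -ppn 48 -gtool "$(survey_gtool_arg <project_dir>/project1 0,10-15,47)" ./wrf.exe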

Note

Currently, Intel Advisor does not merge results from multiple ranks. If you run an analysis for more than one rank, Intel® Advisor creates a separate result folder for every rank analyzed.
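Because each analyzed rank gets its own result folder, a quick listing shows which ranks actually produced data. This sketch assumes the per-rank folders are named rank.<N> directly under the project directory. That naming is an assumption you should verify against your own project directory, and list_analyzed_ranks is a hypothetical helper name.

```shell
# List the ranks that have a result folder in a project directory.
# ASSUMPTION: per-rank results live in subdirectories named rank.<N>.
# $1 = project directory.
list_analyzed_ranks() {
    for dir in "$1"/rank.*; do
        [ -d "$dir" ] || continue      # skip when the glob matches nothing
        echo "${dir##*/rank.}"         # keep only the rank number
    done | sort -n
}

# Example (not run automatically):
# list_analyzed_ranks <project_dir>/project1
```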

Collect Trip Counts and FLOP Data and Review the Results

After running the Survey analysis, you can view the Survey data collected for your application or collect additional information about trip counts and FLOP. To run the Trip Counts and FLOP analysis for rank 0 only of the WRF application, execute the following command:

mpiexec -genvall -n 48 -ppn 48 -gtool "advixe-cl --collect=tripcounts --flop --project-dir=<project_dir>/project1:0" ./wrf.exe

This command reads the collected survey data and adds details on trip counts and FLOP to it.
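Since the Trip Counts and FLOP collection builds on existing Survey data, both runs need the same project directory and rank set, in that order. The sketch below wraps the two commands from this recipe in one function so the shared values are typed once. The names roofline_collect and DRY_RUN are hypothetical, and setting -ppn equal to -n assumes a single-node run as in this recipe. By default the function only prints the commands.

```shell
# Print (default) or run the Survey and Trip Counts/FLOP collections in
# order, against the same project directory and rank set.
# $1 = number of ranks, $2 = project directory, $3 = rank set, $4 = application.
# ASSUMPTION: single-node run, so -ppn is set to the same value as -n.
# roofline_collect and DRY_RUN are illustrative names, not Advisor features.
roofline_collect() {
    nranks=$1; project=$2; ranks=$3; app=$4
    for analysis in "survey" "tripcounts --flop"; do
        gtool="advixe-cl --collect=$analysis --project-dir=$project:$ranks"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            printf 'mpiexec -genvall -n %s -ppn %s -gtool "%s" %s\n' \
                "$nranks" "$nranks" "$gtool" "$app"
        else
            # Stop at the first failing collection so the second run
            # never starts without Survey data.
            mpiexec -genvall -n "$nranks" -ppn "$nranks" -gtool "$gtool" "$app" || return 1
        fi
    done
}

# Example (prints two command lines, runs nothing):
# roofline_collect 48 <project_dir>/project1 0 ./wrf.exe
```

To execute the collections instead of printing them, call the function with DRY_RUN=0 in the environment.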

You can view the collected results on a remote machine or move the results to a local machine and view them with the Intel Advisor GUI. To visualize results on the local machine, you can do the following:

  1. Pack the result files and the corresponding sources and binaries into a single snapshot file with the .advixeexpz extension:

    advixe-cl --snapshot --project-dir=<project_dir>/project1 --pack --cache-sources --cache-binaries -- <snapshot_name>
  2. Move this snapshot to a local machine and open it with the Intel Advisor GUI.

Note the following in the results generated:

Review the Roofline Chart

Intel Advisor can plot Roofline charts that visualize application performance relative to the system's peak compute performance and memory bandwidth. To generate a Roofline report for an MPI application, you must run the Survey analysis and then the Trip Counts and FLOP analysis, as described in the previous sections. Together, these analyses collect all the data required to plot a Roofline chart for an MPI application.

To open the Roofline report, click the Roofline toggle button in the top-left pane of the analysis results opened in the Intel Advisor GUI. Note the following:

Adjust the Roofline Chart

Because Intel® Advisor analyzed rank 0 only, the dots show the performance of rank 0 and not of the full application. As a result, the distance between the loops and the roofs (the relative positions of the dots) in the Roofline chart suggests poorer performance than the application actually achieves.

To adjust the dot positions, change the number of cores from a drop-down list in the top pane of the Roofline chart to the total number of MPI ranks in the application. For the WRF application, choose 48 cores. Changing the number of cores adjusts the system memory and compute roofs accordingly in the Roofline chart. The relative positions of dots change based on the roofs plotted, but their absolute values do not change.

Comparison of Roofline charts for the WRF application plotted for one core and for 48 cores

Export the Roofline Report (optional)

For MPI applications, the recommended way to get a separate Roofline report to share is to export it as an HTML or SVG file. Do one of the following:

For more information on the Roofline, please refer to the Intel® Advisor Roofline article.

Run the Dependencies Analysis (optional)

Intel Advisor may require more details about your application performance to make useful recommendations. For example, Intel Advisor may recommend running the Dependencies analysis for loops that have the Assumed dependency present message in the Performance Issues column of the Survey report.

To collect the dependencies data, you must choose specific loops to analyze. For MPI applications, choose one of the following:

After you run the Dependencies analysis, the results will be added to the Refinement Reports tab of the analysis results. For the WRF application, the Dependencies analysis confirms that there were no dependencies in the selected loops and the Recommendations tab suggests related optimization steps.

Dependencies report for the WRF application
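As an illustration of selecting loops by ID, the sketch below prints a Dependencies launch line for rank 0. It assumes that --collect=dependencies accepts --mark-up-list in the same way as the MAP command in this recipe, and the loop IDs in the example are placeholders, not values taken from a real Survey report. The function names join_loop_ids and print_dependencies_cmd are hypothetical.

```shell
# Join loop IDs into the comma-separated form used by --mark-up-list.
join_loop_ids() {
    ids=""
    for id in "$@"; do
        ids="${ids:+$ids,}$id"     # add a comma only between entries
    done
    echo "$ids"
}

# Print a Dependencies launch line for rank 0; nothing is executed.
# $1 = project directory, remaining arguments = loop IDs.
# ASSUMPTION: 48 ranks on one node and ./wrf.exe, as elsewhere in the recipe.
print_dependencies_cmd() {
    project=$1; shift
    printf 'mpiexec -genvall -n 48 -ppn 48 -gtool "advixe-cl --collect=dependencies --mark-up-list=%s --project-dir=%s:0" ./wrf.exe\n' \
        "$(join_loop_ids "$@")" "$project"
}

# Example with placeholder loop IDs (not run automatically):
# print_dependencies_cmd <project_dir>/project1 155 200
```

Review the printed line, replace the placeholder IDs with the loops flagged in your own Survey report, and then run it.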

Run the Memory Access Patterns Analysis (optional)

If you want to check your MPI application for memory issues, such as non-contiguous memory accesses and unit stride vs. non-unit stride accesses, run the Memory Access Patterns (MAP) analysis. Intel Advisor may recommend running the MAP analysis for loops that have the Possible inefficient memory access patterns present message in the Performance Issues column of the Survey report. To run the MAP analysis:

  1. Identify loop IDs or source locations to run the deeper analysis on.

  2. Run the MAP analysis for the selected loops (155 and 200 in this case) on rank 0 of the WRF application:

    mpiexec -genvall -n 48 -ppn 48 -gtool "advixe-cl --collect=map --mark-up-list=155,200 --project-dir=<project_dir>/project1:0" ./wrf.exe

After you run the MAP analysis, the results will be added to the Refinement Reports tab of the analysis results. For the WRF application, the MAP analysis reported that all strides for the selected loops are random in nature, which could cause suboptimal memory and vectorization performance. See the messages in the Recommendations tab for potential next steps.

Memory Access Patterns report for the WRF application

Key Take-Aways

See Also

This section lists links to all documents and resources the recipe refers to: