Intel® Advisor Help

Model MPI Application Performance on GPU

You can model your MPI application performance on an accelerator to determine whether it can benefit from offloading to a target device.

Note

For MPI applications, you can collect data only with the advisor command line interface (CLI).
  1. Optional: Generate pre-configured command lines for your application:
    1. Run the Offload collection in a dry-run mode to generate the command lines:

      advisor --collect=offload --dry-run --project-dir=<project-dir> -- ./myApplication [<application-options>]

      This prints a list of commands for each analysis step necessary to get an Offload Modeling result with the specified accuracy level (for the command above, it is low).

      For dry-run accuracy levels and other ways to generate the commands, see Optional: Generate Pre-configured Command Lines.

    2. Copy the generated commands to your preferred text editor and modify them for the MPI-specific syntax, as shown in the example below. See Analyze MPI Applications for details about the syntax.
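
      For example, a generated Survey command of roughly this form:

      advisor --collect=survey --static-instruction-mix --project-dir=<project-dir> -- ./myApplication [<application-options>]

      becomes the following when run under mpiexec with the gtool option (the rank set 0-3 and the process count 4 are sample values; use the ranks and process count that match your run):

      mpiexec -gtool "advisor --collect=survey --static-instruction-mix --project-dir=<project-dir>:0-3" -n 4 ./myApplication [<application-options>]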
  2. Run the Intel Advisor analyses with the advisor CLI to collect metrics for your application running on a host device. For example, the full collection workflow with the Intel® MPI Library gtool option and mpiexec is as follows (a filled-in example follows the parameter descriptions below):

    mpiexec -gtool "advisor --collect=survey --static-instruction-mix --project-dir=<project-dir>:<ranks-set>" -n <N> ./myApplication [<application-options>]

    mpiexec -gtool "advisor --collect=tripcounts --flop --enable-cache-simulation --target-device=<target-gpu> --stacks --data-transfer=light --project-dir=<project-dir>:<ranks-set>" -n <N> ./myApplication [<application-options>]

    mpiexec -gtool "advisor --collect=dependencies --select markup=gpu_generic --loop-call-count-limit=16 --project-dir=<project-dir>:<ranks-set>" -n <N> ./myApplication [<application-options>]

    where:

    • <ranks-set> is the set of MPI ranks to run the analysis for. Separate ranks with a comma, or use a dash "-" to set a range of ranks. Use all to analyze all the ranks.
    • <N> is the number of MPI processes to launch.
    • <target-gpu> is a GPU configuration to model cache for. See --target-device for available configurations.
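
    For example, filled in for four MPI processes, ranks 0 through 3, the ./advi_results project, and the gen12_tgl device (all sample values; use the rank set, process count, project directory, and configuration that match your setup, and see --target-device for the configurations available in your version), the commands might look as follows:

    mpiexec -gtool "advisor --collect=survey --static-instruction-mix --project-dir=./advi_results:0-3" -n 4 ./myApplication

    mpiexec -gtool "advisor --collect=tripcounts --flop --enable-cache-simulation --target-device=gen12_tgl --stacks --data-transfer=light --project-dir=./advi_results:0-3" -n 4 ./myApplication

    mpiexec -gtool "advisor --collect=dependencies --select markup=gpu_generic --loop-call-count-limit=16 --project-dir=./advi_results:0-3" -n 4 ./myApplication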
  3. Model performance of your application on a target device for a single rank with one of the following (see the filled-in example after the parameter descriptions below):
    • With the advisor CLI:

      advisor --collect=projection --mpi-rank=<n> --config=<target-gpu> --project-dir=<project-dir>

    • With the analyze.py script:

      advisor-python <APM>/analyze.py <project-dir> --mpi-rank <n> --config <target-gpu>

    where:

    • <APM> is an environment variable for the path to the Offload Modeling scripts. For Linux* OS, replace it with $APM; for Windows* OS, replace it with %APM%.
    • <n> is the rank number to model performance for.

      Instead of --mpi-rank=<n>, you can specify the path to a rank folder in the project directory. This is supported only by the analyze.py script:

      advisor-python <APM>/analyze.py <project-dir>/rank.<n> [--options]

    • <target-gpu> is a GPU configuration to model the application performance for. Make sure to specify the same target as for the Trip Counts step. See --config for available configurations.
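
    For example, assuming the ./advi_results project directory, MPI rank 0, and the gen12_tgl configuration (sample values to replace with your own), the two variants look as follows:

      advisor --collect=projection --mpi-rank=0 --config=gen12_tgl --project-dir=./advi_results

      advisor-python $APM/analyze.py ./advi_results --mpi-rank 0 --config gen12_tgl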

View the Results

For Offload Modeling, the reports are generated automatically after you run performance modeling. You can either open a result project file (*.advixeproj) located in the <project-dir> using the Intel Advisor GUI or view an HTML/CSV report in the respective rank directory at <project-dir>/rank.<n>/e<NNN>/pp<NNN>/data.0.
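
For example, assuming the ./advi_results project and rank 4 (sample values), you can open the result for that rank in the GUI by passing the rank directory to the advisor-gui command:

  advisor-gui ./advi_results/rank.4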

Model Performance for Multi-Rank MPI

By default, Offload Modeling is optimized to model performance for a single-rank MPI application. For multi-rank MPI applications, do one of the following:

Scale Target Device Parameters

By default, Offload Modeling assumes that one MPI process is mapped to one GPU tile. You can change this mapping to model a different distribution of MPI ranks across the target device. To do this, set the number of tiles per MPI process by scaling the Tiles_per_process target device parameter in a command line or in a TOML configuration file. The parameter sets the fraction of a GPU tile that corresponds to a single MPI process and accepts values from 0.01 to 12.0.

The number of tiles per process you set automatically adjusts:

  • the number of execution units (EUs)
  • SLM, L1, and L3 sizes and bandwidth
  • memory bandwidth
  • PCIe* bandwidth

Consider the following value examples:

Tiles_per_process Value    Number of MPI Ranks per Tile
1 (default)                1
12 (maximum)               1/12
0.25                       4
0.125                      8

Info: In the commands below, replace myApplication with your application executable path and name before running a command. If your application requires additional command line options, add them after the executable name.

To run the Offload Modeling with a scaled tile-per-process parameter:

Method 1. Scale the parameter during the analysis. This is a one-time change applied only to the analysis you run it with.

  1. Run the collect.py script in dry-run mode to generate command lines with the cache configuration adjusted to the specified number of tiles per process. For example, to generate commands for the ./advi_results project and model performance with 0.25 tiles per process, which corresponds to four MPI ranks per tile:
    advisor-python $APM/collect.py ./advi_results --dry-run --set-parameter scale.Tiles_per_process=0.25 -- ./myApplication

    You can specify any value from 0.01 to 12.0 for the scale.Tiles_per_process parameter.

    This command generates a set of command lines for the Offload Modeling workflow that runs the collection with the advisor CLI with parameters adjusted for the configuration.

  2. Copy the generated commands to your preferred text editor and modify them for the MPI-specific syntax. See the list above for command templates.
  3. Optional: If you have not collected performance data for your application, run the Survey analysis using the generated and modified Survey command.
  4. From your text editor, copy the modified command for the Trip Counts analysis and run it from the shell. For example, the command from the previous step should look as follows if run for the Intel® MPI Library:
    mpiexec -gtool "advisor --collect=tripcounts --project-dir=./advi_results --flop --ignore-checksums --data-transfer=medium --stacks --profile-jit --cache-sources --enable-cache-simulation --cache-config=8:1w:4k/1:192w:3m/1:16w:8m" -n 4 ./myApplication

    This command adjusts metrics for the new cache configuration.

  5. Run the performance modeling for one MPI rank with the number of tiles per MPI process specified. For example, with the advisor CLI for MPI rank 4:
    advisor --collect=projection --project-dir=./advi_results --set-parameter scale.Tiles_per_process=0.25 --mpi-rank=4

    Important

    Make sure to specify the same value for the --set-parameter scale.Tiles_per_process as for the Trip Counts step.

    The report for the specified MPI rank will be generated in the project directory. Proceed to view the results.

Method 2. Create a custom configuration file to use with any device configuration.

  1. Scale the parameter with one of the following:
    • Create a TOML file, for example, my_config.toml. Specify the parameter as follows (a filled-in example appears after this procedure):
      [scale]
      Tiles_per_process = <float>

      where <float> is a fraction of a GPU tile that corresponds to a single MPI process.

    • Use scalers in the legacy Offload Modeling HTML report:
      1. Run the Performance Modeling for your application without scaling.
      2. Go to <project-dir>/rank.<N>/e<NNN>/report/ and open the legacy Offload Modeling HTML report report.html.
      3. In the Summary tab, set the MPI Tile per Process scaler in the configuration pane to a desired value.
      4. Click the Download configuration file icon to save the current configuration as scalers.toml.
  2. Re-run the performance modeling with the custom TOML file. For example, with my_config.toml:
    advisor --collect=projection --config=gen12_tgl --custom-config=./my_config.toml --mpi-rank=4 --project-dir=./advi_results

The report for the specified MPI rank will be generated in the project directory. Proceed to view the results.
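
For example, a my_config.toml that maps four MPI ranks to one tile would contain the following (0.25 is a sample value; use the fraction that matches your mapping):

  [scale]
  Tiles_per_process = 0.25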

Ignore MPI Time

For multi-rank MPI workloads, time spent in MPI runtime can differ from rank to rank, which causes differences in the whole application time and Offload Modeling projections. If MPI time is significant and you see differences between ranks, you can exclude time spent in MPI routines from the analysis.

  1. Collect performance data for your application using the advisor CLI.
  2. Run the performance modeling with time in MPI calls ignored using the --ignore option. For example, with the advisor CLI:
    advisor --collect=projection --project-dir=./advi_results --ignore=MPI --mpi-rank=4

In the generated report, all per-application performance modeling metrics are recalculated based on the application self time, with time spent in MPI calls excluded from the analysis. This should make the modeling results more consistent across ranks.

Note

This option affects only metrics for a whole program in the Summary tab. Metrics for individual regions are not recalculated.

See Also