Intel® Advisor Help

Run Offload Modeling Perspective from Command Line

Intel® Advisor provides several methods to run the Offload Modeling perspective. These methods vary in simplicity and flexibility: the advisor command line interface (CLI) is the most flexible, the collect.py and analyze.py scripts are simpler but moderately flexible, and the run_oa.py script is the simplest but the least flexible.

Note

You can also run Offload Modeling using a different combination of the Python scripts and/or the advisor CLI. For example, you can use the advisor CLI to collect performance data and run analyze.py to model performance, or run run_oa.py for the data collection and the first modeling and then run analyze.py to remodel performance for a different configuration.

Note

You can run the Python* scripts with Python 3.6 or 3.7 or with the advisor-python command line interface of Intel Advisor.

The Python script methods do not support MPI applications.

Prerequisites

Set up the Intel Advisor environment variables to make the advisor and advisor-python command line tools available in your current terminal session.

Use advisor Command Line Interface

This method is the most flexible and can analyze MPI applications.
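
For example, a minimal sketch of profiling an MPI application wraps the collection command with an MPI launcher. The mpirun syntax and the myMpiApplication binary here are illustrative assumptions; see the Intel Advisor documentation on analyzing MPI applications for the supported launch syntax:

mpirun -n 4 advisor --collect=survey --project-dir=./advi --stackwalk-mode=online --static-instruction-mix -- myMpiApplication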

Note

In the commands below:
  • Replace <APM> with $APM on Linux OS or with %APM% on Windows OS.
  • Options in square brackets ([--<option>]) are optional and recommended if you want to change how data is collected or how application performance is modeled. See advisor Command Line Interface for syntax details.

You can generate command lines for your application and configuration in the Intel Advisor GUI, copy them to the clipboard, and run them one by one from the command line. The generated commands might require you to add certain options and steps (for example, loop markup) to complete the flow.

Run the perspective as follows:

  1. Run the Survey analysis to collect basic performance metrics:

    advisor --collect=survey --project-dir=<project-dir> --stackwalk-mode=online --static-instruction-mix -- <target-application> [<target-options>]

    where:

    • --stackwalk-mode=online is an option to analyze stacks during collection. The online mode is recommended for profiling native applications executed on a CPU.

      To profile a DPC++, C++/Fortran with OpenMP target, or OpenCL application running on a CPU, set this option to offline to analyze stacks after collection.

    • --static-instruction-mix is an option to collect static instruction mix data. This option is recommended.
  2. Run the Trip Counts and FLOP analysis to analyze loop call count and model data transfers on the target device:

    advisor --collect=tripcounts --project-dir=<project-dir> --flop --target-device=<target> [--enable-cache-simulation] [--stacks] [--data-transfer=<mode>] [--profile-jit] -- <target-application> [<target-options>]

    where:

    • --flop is an option to collect data about floating-point and integer operations, memory traffic, and mask utilization metrics for AVX-512 platforms.
    • --target-device=<target> is a specific target graphics processing unit (GPU) to model cache for. For example, gen11_icl (default), gen12_dg1, or gen9_gt3. See target-device for a full list of possible values.

      Important

      Make sure to specify the same target device here as you specify with --config=<config> for the --collect=projection step.
    • --enable-cache-simulation is an option to enable modeling cache behavior.
    • --stacks is an option to enable advanced collection of call stack data.
    • --data-transfer=<mode> is an option to enable modeling data transfers between host and target devices:
      • Use off (default) to disable data transfer modeling.
      • Use light to model only data transfers.
      • Use medium to model data transfers, attribute memory objects, and track accesses to stack memory.
      • Use full to additionally enable the data reuse analysis.
    • --profile-jit is an option to analyze DPC++, C++/Fortran with OpenMP target, or OpenCL code regions running on a CPU.
  3. Optional: Check for loop-carried dependencies in selected loops:

    advisor --collect=dependencies --project-dir=<project-dir> --select markup=gpu_generic --loop-call-count-limit=16 [--select=<string>] [--filter-reductions] -- <target-application> [<target-options>]

    where:

    • --loop-call-count-limit=16 is the maximum number of call instances to analyze assuming similar runtime properties over different call instances. This value is recommended.
    • --select markup=gpu_generic selects the loops that are profitable for offloading to a target device for the Dependencies analysis.

      For more information about markup options, see Loop Markup to Minimize Overhead.

      Note

      The generic markup strategy is recommended if you have an application that does not use DPC++, C++/Fortran with OpenMP target, or OpenCL, and you want to run the Dependencies analysis for it.
    • --filter-reductions is an option to mark all potential reductions with a specific diagnostic.

    Information about loop-carried dependencies is important for modeling performance of scalar loops. See Check How Assumed Dependencies Affect Modeling.

  4. Model application performance with the projection analysis:

    advisor --collect=projection --project-dir=<project-dir> --config=<config> [--no-assume-dependencies] [--data-reuse-analysis] [--assume-hide-taxes] [--jit] [--custom-config=<path>]

    where:

    • --config=<config> is a target GPU configuration to model performance for. For example, gen11_icl (default), gen12_dg1, or gen9_gt3. See config for a full list of possible values.
    • --no-assume-dependencies is an option to assume that a loop does not have dependencies if a loop dependency type is unknown. Use this option if your application contains parallel and/or vectorized loops and you did not run the Dependencies analysis.
    • --data-reuse-analysis is an option to analyze potential data reuse between code regions when offloaded to a target GPU.

      Important

      Make sure to use --data-transfer=full when collecting Trip Counts data with --collect=tripcounts for this option to work correctly (see the sketch after this list).
    • --assume-hide-taxes is an option to assume that an invocation tax is paid only the first time a kernel is launched.
    • --custom-config=<path> is a path to a custom .toml configuration file with additional modeling parameters. For details, see Advanced Modeling Configurations.
    • --jit is an option to model performance of DPC++, C++/Fortran with OpenMP target, or OpenCL code regions running on a CPU.
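
For example, to make the data reuse modeling work end to end, combine --data-transfer=full in the Trip Counts step with --data-reuse-analysis in the projection step. The following is a minimal sketch, assuming the ./advi project directory and the myApplication binary used in the example below:

advisor --collect=tripcounts --project-dir=./advi --flop --target-device=gen12_dg1 --data-transfer=full -- myApplication
advisor --collect=projection --project-dir=./advi --config=gen12_dg1 --data-reuse-analysis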

See advisor Command Line Interface Reference for more options.

Example

Collect performance data, check for dependencies in potentially profitable loops, and model application performance and data transfers on Intel® Iris® Xe MAX graphics (the gen12_dg1 configuration):

advisor --collect=survey --project-dir=./advi --stackwalk-mode=online --static-instruction-mix -- myApplication
advisor --collect=tripcounts --project-dir=./advi --flop --enable-cache-simulation --target-device=gen12_dg1 --stacks --data-transfer=light -- myApplication
advisor --collect=dependencies --project-dir=./advi --select markup=gpu_generic --filter-reductions --loop-call-count-limit=16 -- myApplication
advisor --collect=projection --project-dir=./advi --config=gen12_dg1
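
If your application has DPC++, C++/Fortran with OpenMP target, or OpenCL code regions running on a CPU, the same flow uses offline stack walking and the JIT-related options described above. This is a sketch under the same assumptions:

advisor --collect=survey --project-dir=./advi --stackwalk-mode=offline --static-instruction-mix -- myApplication
advisor --collect=tripcounts --project-dir=./advi --flop --target-device=gen12_dg1 --profile-jit -- myApplication
advisor --collect=projection --project-dir=./advi --config=gen12_dg1 --jit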

Run the collect.py and analyze.py Scripts

collect.py automates profiling and allows you to run all analysis steps with one command, while analyze.py models the performance of your application on a target device. This method is moderately flexible: it is simpler than the advisor CLI, but it does not support MPI applications.

Note

In the commands below:
  • Replace <APM> with $APM on Linux OS or with %APM% on Windows OS.
  • Options in square brackets ([--<option>]) are optional and recommended if you want to change how data is collected or how application performance is modeled. See advisor Command Line Interface for syntax details.

Run the scripts as follows:

  1. Collect application performance metrics with collect.py:

    advisor-python <APM>/collect.py <project-dir> [--collect=<collect-mode>] [--config=<config-file>] [--markup=<markup-mode>] [--jit] -- <target> [<target-options>]

    where:

    • --collect=<collect-mode> is an option to specify what data is collected for your application:
      • Use basic to collect only Survey and Trip Counts and FLOP data; analyze data transfer between host and device memory, attribute memory objects to loops, and track accesses to stack memory. This value corresponds to the Medium accuracy level in the Intel Advisor graphical user interface (GUI).
      • Use refinement to collect only Dependencies data. Data transfers are not analyzed in this mode.
      • Use full (default) to collect Survey, Trip Counts and FLOP, and Dependencies data, analyze data transfer between host and device memory and potential data reuse, attribute memory objects to loops, and track accesses to stack memory. This value corresponds to the High accuracy level in the Intel Advisor GUI.

      See Check How Dependencies Affect Modeling for details on when you need to collect dependency data.

    • --config=<config-file> is a target GPU configuration to model performance for. For example, gen11_icl (default), gen12_dg1, or gen9_gt3.

      Important

      Make sure to specify the same configuration file for collect.py and for analyze.py.
    • --markup=<markup-mode> is an option to select which loops to collect Trip Counts and FLOP and/or Dependencies data for. This option decreases collection overhead. By default, it is set to generic to analyze all loops profitable for offloading.
    • --jit is an option to model performance of DPC++, C++/Fortran with OpenMP target, or OpenCL code regions running on a CPU.
  2. Model performance of your application on a target GPU device with a selected configuration with analyze.py:

    advisor-python <APM>/analyze.py <project-dir> [--config=<config-file>] [--assume-parallel] [--data-reuse-analysis] [--jit]

    where:

    • --config=<config-file> is a target GPU configuration to model performance for. For example, gen11_icl (default), gen12_dg1, or gen9_gt3.

      Important

      Make sure to specify the same configuration file for collect.py and for analyze.py.
    • --assume-parallel is an option to assume that a loop does not have dependencies if there is no information about the loop dependency type, for example, because you did not run the Dependencies analysis (that is, you ran collect.py with --collect=basic). For details, see Check How Dependencies Affect Modeling. A sketch using this option follows this list.
    • --data-reuse-analysis is an option to analyze potential data reuse between code regions when offloaded to a target GPU.

      Important

      Make sure to use --collect=full when running the analyses with collect.py or --data-transfer=full when running the Trip Counts analysis with the advisor CLI (a sketch follows the example below).
    • --jit is an option to model performance of DPC++, C++/Fortran with OpenMP target, or OpenCL code regions running on a CPU.
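
For example, if you skip the Dependencies analysis, collect only the basic data and tell the model to assume parallelism where the dependency type is unknown. This is a minimal sketch, assuming the ./advi project directory and the myApplication binary from the examples:

advisor-python $APM/collect.py ./advi --collect=basic --config=gen12_dg1 -- myApplication
advisor-python $APM/analyze.py ./advi --config=gen12_dg1 --assume-parallel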

See collect.py Script and analyze.py Script reference for a detailed option description and a full list of available options.

Example

Collect performance data and model application performance on a target GPU with the Intel® Iris® Xe MAX graphics (gen12_dg1 configuration) on Linux OS:

advisor-python $APM/collect.py ./advi --config=gen12_dg1 -- myApplication
advisor-python $APM/analyze.py ./advi --config=gen12_dg1
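
To additionally model data reuse between offloaded regions, keep the default full collection mode and enable the data reuse analysis at the modeling step (a sketch under the same assumptions):

advisor-python $APM/collect.py ./advi --collect=full --config=gen12_dg1 -- myApplication
advisor-python $APM/analyze.py ./advi --config=gen12_dg1 --data-reuse-analysis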

Run the run_oa.py Script

This method is the simplest but the least flexible one, and it does not support analysis of MPI applications. Use it to run all collection and modeling steps with one script.

Note

In the command below:
  • Replace <APM> with $APM on Linux OS or with %APM% on Windows OS.
  • Options in square brackets ([--<option>]) are optional and recommended if you want to change how data is collected or how application performance is modeled. See advisor Command Line Interface for syntax details.

Run the script as follows:

advisor-python <APM>/run_oa.py <project-dir> [--collect=<collect-mode>] [--config=<config-file>] [--markup=<markup-mode>] [--data-reuse-analysis] [--jit] -- <target> [<target-options>]

where the options have the same meaning as the collect.py and analyze.py options described above.

See run_oa.py Script reference for a full list of available options.

Example

Run the full collection and modeling with the run_oa.py script with default gen11_icl configuration on Linux OS:

advisor-python $APM/run_oa.py ./advi -- myApplication
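
A sketch with non-default options, assuming the same gen12_dg1 target and myApplication binary as in the earlier examples:

advisor-python $APM/run_oa.py ./advi --collect=basic --config=gen12_dg1 --jit -- myApplication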

View the Results

Intel Advisor provides several ways to work with the Offload Modeling results generated from the command line.

View Results in CLI

After you run Performance Modeling with advisor --collect=projection or analyze.py, the result summary is printed to the terminal or command prompt. In this summary report, you can view the selected target device, the measured and estimated (accelerated) execution times, the estimated speedup, and the top regions recommended for offloading.

For example:

Info: Selected accelerator to analyze: Intel Gen9 GT2 Integrated Accelerator 24EU 1150MHz.
Info: Baseline Host: Intel® Core™ i7-9700K CPU @ 3.60GHz, GPU: Intel ® .
Info: Binary Name: 'CFD'.

Measured CPU Time: 44.858s    Accelerated CPU+GPU Time: 15.425s
Speedup for Accelerated Code: 3.8x    Number of Offloads: 5    Fraction of Accelerated Code: 60%

Top Offloaded Regions
--------------------------------------------------------------------------------------------------------------------------------------------------
 Location                                                | Time on Baseline | Time on Target | Speedup   | Bound by               | Data Transfer
--------------------------------------------------------------------------------------------------------------------------------------------------
 [loop in compute_flux_ser at euler3d_cpu_ser.cpp:226]   |          36.576s |         9.103s |     4.02x | L3_BW                  |     12.091MB
 [loop in time_step_ser at euler3d_cpu_ser.cpp:361]      |           1.404s |         0.319s |     4.40x | L3_BW                  |     10.506MB
 [loop in compute_step_factor_ser at euler3d_cpu_ser.... |           0.844s |         0.158s |     5.35x | Compute                |      4.682MB
 [loop in main at euler3d.cpp:848]                       |           1.046s |         0.906s |     1.15x | Dependency             |     31.863MB
 [loop in Intel::OpenCL::TaskExecutor::in_order_execu... |           0.060s |         0.012s |     4.98x | Dependency             |      0.303MB
--------------------------------------------------------------------------------------------------------------------------------------------------

See Accelerator Metrics reference for more information about the metrics reported.

View Results in GUI

When you run the Intel Advisor CLI or the Python scripts, an .advixeproj project is created automatically in the directory specified with --project-dir. The project stores all the collected results and analysis configurations, and you can open and explore it interactively in the Intel Advisor GUI.

To open the project in GUI, you can run the following command from a command prompt:

advisor-gui <project-dir>
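
For example, to open the project created in the earlier examples (assuming the ./advi project directory):

advisor-gui ./advi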

Note

If the report does not open, click Show Result on the Welcome pane.

You first see a Summary report that includes the most important information about the measured performance on a baseline device and the modeled performance on a target device.

Offload Modeling Summary in GUI

View an Interactive HTML Report

When you run the Intel Advisor CLI or the Python scripts, an additional set of CSV metric reports and an interactive HTML report are generated in the <project-dir>/e<NNN>/pp<NNN>/data.0 directory. These reports are lightweight and easy to share because they do not require the Intel Advisor GUI.

The HTML report is similar to the GUI project but also reports additional metrics. It contains a list of regions profitable for offloading and their performance metrics, such as offload data transfer traffic, estimated number of cycles on a target device, estimated speedup, and compute versus memory-bound characterization.

Offload Modeling HTML report

Save a Read-only Snapshot

A snapshot is a read-only copy of a project result, which you can view at any time using the Intel Advisor GUI. To save an active project result as a read-only snapshot:

advisor --snapshot --project-dir=<project-dir> [--cache-sources] [--cache-binaries] -- <snapshot-path>

where:

  • --cache-sources is an option to add application source code to the snapshot.

  • --cache-binaries is an option to add application binaries to the snapshot.

  • <snapshot-path> is a path and a name for the snapshot. For example, if you specify /tmp/new_snapshot, the snapshot is saved in the /tmp directory as new_snapshot.advixeexpz. If you skip this argument, the snapshot is saved in the current directory as snapshotXXX.advixeexpz.
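
For example, a sketch that packs sources and binaries into a snapshot of the project from the earlier examples (the /tmp/new_snapshot path is illustrative):

advisor --snapshot --project-dir=./advi --cache-sources --cache-binaries -- /tmp/new_snapshot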

To open the result snapshot in the Intel Advisor GUI, you can run the following command:

advisor-gui <snapshot-path>

You can visually compare the saved snapshot against the current active result or other snapshot results.

Next Steps

See Identify Code Regions to Offload to understand the results. This section is GUI-focused, but you can still use it for interpreting command-line results.

For details about metrics reported, see Accelerator Metrics.

See Also