Result Interpretation
By default, the GPU-to-GPU performance modeling results are generated in <project-dir>/e<NNN>/pp<NNN>/data.0. To view the results, go to this directory, or to the directory that you specified with the out-dir option, and open the interactive HTML report report.html.
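If several runs have accumulated under the project directory, the report path pattern above can be located programmatically. The sketch below is an illustration only; the ./advi_results project directory name is an assumption, not part of the product.

```python
import glob
import os

# Assumed project directory name for illustration; substitute your own
# project directory here.
project_dir = "./advi_results"

# Reports are generated under <project-dir>/e<NNN>/pp<NNN>/data.0,
# so glob for report.html and pick the most recently modified one.
pattern = os.path.join(project_dir, "e*", "pp*", "data.0", "report.html")
reports = sorted(glob.glob(pattern), key=os.path.getmtime)
if reports:
    print("Open this file in a browser:", reports[-1])
else:
    print("No report found under", project_dir)
```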
The structure and controls of the HTML report generated for the GPU-to-GPU performance modeling are similar to those of the HTML report for the CPU-to-GPU offload modeling, but the content is different: for the GPU-to-GPU modeling, Intel Advisor models performance only for the GPU-enabled parts of your application.
The report includes the following tabs: Summary, Offloaded Regions, Non-Offloaded Regions, Call Tree, Configuration, and Logs. You can switch between the tabs using the links in the top left.
Note
The Non-Offloaded Regions tab shows only GPU kernels that cannot be modeled, for example, kernels with some required metrics missing. If all kernels are modeled, the tab is empty. For the GPU-to-GPU modeling, estimated speedup is not a reason for not offloading a kernel.
When you open the report, it first shows the Summary tab. In this tab, you can review the summary of the modeling results and the estimated performance metrics for some GPU kernels in your application.

- In the Program Metrics pane, compare the Time on Baseline GPU with the Time on Target GPU and examine the Average Speedup to understand whether the GPU kernels in your application perform better on the target GPU. Time on Baseline GPU includes only the execution time of GPU kernels and ignores the CPU parts of your application. Time on Target GPU includes the estimated execution time for GPU kernels on the target and the offload taxes. In the right-side pie chart, review the components of the time on the target GPU to see where the GPU kernels spend most of the time: executing on the target GPU (Estimated Time on GPU), transferring data between the host device and the target GPU (Data Transfer Tax), or scheduling kernels on the target GPU (Kernel Launch Tax).
- In the Offloads Bounded by pane, examine what the GPU kernels are potentially bounded by on the target GPU. The parameters with the highest percentages are where the GPU kernels spend the most time. Review the detailed metrics for these parameters in other tabs to understand whether you need to optimize your application for them.
- In the Target Device Configuration pane (in the top right), examine the target GPU parameters that were used to model the GPU kernel performance. You can also use the sliders to adjust the parameters and create your custom configuration.
Note
To model performance for a custom configuration, save the configuration settings from the report and rerun the performance modeling step with analyze.py for the new configuration file. For details about using custom configurations, go to the Configuration tab and review the comments.
- In the Top offloaded pane, review the top five GPU kernels with the highest absolute offload gain (in seconds) estimated on the target GPU. The gain is calculated as (Time measured on the baseline GPU - Time estimated on the target GPU). This pane considers all GPU kernels in your application and might also show kernels with an estimated speedup of less than 1. For each kernel in the pane, you can review the speedup, the time on the baseline and target GPUs, the main bounded-by parameters, and the estimated amount of data transferred.
Note
The Top non offloaded pane shows only GPU kernels that cannot be modeled. If all kernels are modeled, the pane is empty. For the GPU-to-GPU modeling, estimated speedup is not a reason for not offloading a kernel.
To see the details about each GPU kernel, go to the Offloaded Regions or the Call Tree tab. These tabs report the same metrics, but the Offloaded Regions tab shows only modeled kernels, while the Call Tree tab shows all kernels, including non-modeled ones.

- In the metrics table, examine the detailed performance metrics for the GPU kernels. The Measured column group shows metrics measured on the baseline GPU. The other column groups show metrics estimated for the target GPU. You can expand column groups to see more metrics.
For example, to find a potential bottleneck, examine the Offload Information column group, focusing on the Bounded by and Total Execution Time metrics. For details about the bounding factor, scroll right to the column group corresponding to the value reported in the Bounded by column, for example, L3 Cache, DRAM, or LLC. Expand the column group and examine the Total <name> Bandwidth Utilization column. The utilization is calculated as the ratio of the average bandwidth of a memory level to its peak bandwidth. A high value means that the kernel heavily uses this memory level, which makes it a potential bottleneck.
You can also review the following data to find bottlenecks:
- If you see a high cache or memory bandwidth utilization (for example, in the L3 Cache, SLM, or LLC column groups), consider optimizing cache/memory traffic to improve performance.
- If you see a high latency in the Offload Information column group, consider optimizing cache/memory latency by scheduling enough parallel work for this kernel to increase thread occupancy.
- If you see a high data transfer tax in the Overhead column group, consider optimizing the data transfers or using unified shared memory (USM).
You can also focus on the most interesting data using the sort and filter controls:
- To filter data in a column, hover over the column title and click the menu icon, or click the right-side Custom filter button. In the filter tab, deselect the values you want to hide from the table or specify filter criteria. For example, you can apply a filter to the Hierarchy column to see only specific kernels of interest and hide all other kernels.
- To configure the table metrics, click the right-side Column configurator button and select the columns to show in the table and/or deselect the columns or column groups to hide. For example, if you want to analyze how well your application uses memory resources on the target GPU, you can show only the memory-related column groups.
- In the right-side Source pane, see the source code associated with a kernel, if available. Select a kernel in the metrics table to see its source.
- In the right-side Memory objects pane, see the details about the memory objects transferred between the host device and the target GPU for a kernel. Select a kernel in the metrics table to see the memory objects data. Examine this pane if you see a high data transfer tax for a kernel. The pane includes two parts.
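Returning to the bounded-by analysis above: the Total <name> Bandwidth Utilization metric is just a ratio of average to peak bandwidth. The sketch below illustrates the calculation with invented numbers; the 310 GB/s and 400 GB/s figures are placeholders, not real device data.

```python
def bandwidth_utilization(average_bw_gbps: float, peak_bw_gbps: float) -> float:
    """Ratio of a memory level's average bandwidth to its peak bandwidth."""
    return average_bw_gbps / peak_bw_gbps

# Invented L3 figures for a single kernel.
util = bandwidth_utilization(average_bw_gbps=310.0, peak_bw_gbps=400.0)
print(f"Total L3 Bandwidth Utilization: {util:.0%}")
# A value close to 100% suggests the kernel is limited by this memory level.
```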
Go to the Configuration tab to review, in a read-only mode, the detailed target device configuration used for modeling. You can also review the comments for each parameter and its possible values.
Go to the Logs tab to see the command line used to run the analyses and all output messages reported to the console during the script execution. This tab reports four types of messages, Error, Warning, Info, and Debug, in the order of their appearance in the console.
Note
By default, only the Error, Warning, and Info messages are shown. To control the types of messages shown, hover over the Severity column header and click the menu icon to open the filters pane.
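The default severity filter described in the note can be mimicked on a plain text log. The messages below are invented for illustration and do not come from a real Intel Advisor run.

```python
# Invented log messages in (severity, text) form.
messages = [
    ("Error",   "required metric is missing for one kernel"),
    ("Warning", "cache simulation data is incomplete"),
    ("Info",    "performance modeling finished"),
    ("Debug",   "intermediate results written to disk"),
]

# The report shows Error, Warning, and Info by default; Debug is hidden.
default_filter = {"Error", "Warning", "Info"}
for severity, text in messages:
    if severity in default_filter:
        print(f"[{severity}] {text}")
```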