Intel® Advisor Help
Prerequisites: Survey with GPU profiling + FLOP (Characterization) with GPU profiling
Identify optimization headroom and performance bottlenecks for your kernel using the GPU Roofline chart that visualizes your application performance against the maximum capabilities of your hardware.
To read the GPU Roofline chart:
The dots on the chart correspond to kernels running on the GPU. In accordance with Amdahl's Law, optimizing the loops that take the largest portion of the program's total run time leads to greater speedups than optimizing the loops that take a smaller portion of it. Use the size and color of a dot to identify kernels that take most of the total GPU time and/or are located very low in the chart.
The roofs above a dot represent the restrictions preventing it from achieving higher performance. A dot cannot exceed the topmost rooflines, as they represent the maximum capabilities of the hardware. The farther a dot is from the topmost roofs, the more room for improvement there is. Double-click a dot on the chart to highlight the roof that limits the performance of your kernel.
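The roofline bound behind the chart can be expressed as a simple formula: attainable performance is the minimum of the peak compute rate and the product of arithmetic intensity and peak memory bandwidth. A minimal sketch, where the peak numbers are illustrative assumptions rather than values for any specific device:

```python
def attainable_gflops(ai, peak_gflops, peak_gbps):
    """Roofline bound: a kernel with arithmetic intensity `ai` (FLOP/byte)
    cannot exceed the compute roof or the memory roof ai * bandwidth."""
    return min(peak_gflops, ai * peak_gbps)

# Illustrative GPU peaks (assumed, not measured from real hardware):
PEAK_GFLOPS = 1000.0   # compute roof, GFLOPS
PEAK_GBPS = 100.0      # memory bandwidth roof, GB/s

# Below the ridge point (AI = 10 FLOP/byte) the bound is the memory roof:
print(attainable_gflops(2.0, PEAK_GFLOPS, PEAK_GBPS))   # 200.0 (memory bound)
# Above it, the bound is the compute roof:
print(attainable_gflops(50.0, PEAK_GFLOPS, PEAK_GBPS))  # 1000.0 (compute bound)
```

A dot sitting well below `attainable_gflops` at its arithmetic intensity signals optimization headroom.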
Explore Performance of Kernel Instances
You can review and compare the performance of instances of your kernel launched with different global and local work sizes.
By default, GPU Roofline collects data for all memory levels. This enables you to examine each kernel at different cache levels and arithmetic intensities and provides precise insights into which cache level causes the performance bottlenecks.
Configure Memory-Level Roofline Chart
Interpret Memory-Level GPU Roofline Data
Double-click a dot on the GPU Roofline chart to examine the relationships between the displayed memory levels and highlight the roof that limits your kernel performance the most. Labeled dots and/or X marks are displayed, representing the arithmetic intensity of the selected kernel at each memory level:
The vertical distance between a memory dot and its respective roofline shows how much you are limited by that memory subsystem. If a dot is close to its roofline, the kernel is limited by the bandwidth of that memory level.
The horizontal distance between memory dots indicates how efficiently the kernel uses the cache. For example, if the L3 and GTI dots are very close on the horizontal axis for a single kernel, the kernel generates similar traffic at the L3 and GTI levels, which means it does not reuse data in the L3 cache efficiently. Improve data reuse in the code to improve application performance.
Arithmetic intensity on the x axis determines the order in which dots are plotted, which can provide some insight into your code's performance. For example, the CARM dot is typically located to the right of the L3 dot because the L3 level counts traffic in whole cache lines read or written, while CARM counts only the bytes actually used in operations; lower traffic means higher arithmetic intensity. To identify room for optimization, check the L3 cache line utilization metric for a given kernel. If the L3 cache line is not utilized well, review memory access patterns in your kernel to improve its elapsed time.
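To see why the dots order this way, consider arithmetic intensity computed against the traffic at each level. A hypothetical sketch, with made-up byte counts: L3 traffic is counted in whole 64-byte cache lines, while CARM counts only the bytes the operations actually use.

```python
CACHE_LINE = 64  # bytes per cache line (a typical value, assumed here)

def arithmetic_intensity(flops, traffic_bytes):
    """AI = operations performed per byte of traffic at a memory level."""
    return flops / traffic_bytes

flops = 1_000_000
bytes_used = 400_000           # CARM: bytes actually consumed by operations
lines_touched = 25_000         # L3: whole cache lines transferred
l3_bytes = lines_touched * CACHE_LINE  # 1,600,000 bytes

ai_carm = arithmetic_intensity(flops, bytes_used)  # 2.5 FLOP/byte
ai_l3 = arithmetic_intensity(flops, l3_bytes)      # 0.625 FLOP/byte

# CARM traffic <= L3 traffic, so the CARM dot sits to the right (higher AI).
assert ai_carm >= ai_l3

# Cache line utilization: fraction of transferred bytes actually used.
print(bytes_used / l3_bytes)  # 0.25 -> poor utilization; check access patterns
```

A utilization near 1.0 would put the CARM and L3 dots close together on the x axis, matching the "good access patterns" case described below.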
Ideally, the CARM and L3 dots are located close to each other, and the GTI dot is far to the right of them. In this case, your kernel has good memory access patterns, utilizes the L3 cache line well, and serves most of its traffic from the L3 cache.
Review and compare the changes in traffic from one memory level to another to identify the memory hierarchy bottleneck for the kernel and determine optimization steps based on this information.
Select a dot on the chart and switch to the GPU Details tab in the right-side pane to examine code analytics for the specific kernel in more detail.
In the Roofline Guidance pane, examine the Roofline chart for the selected kernel with the following data:
If the arrow points to a diagonal line, the kernel is mostly memory bound. If the arrow points to a horizontal line, the kernel is mostly compute bound. Intel® Advisor displays a compute roof limiting the performance of your kernel based on the instruction mix used.
For example, if the kernel is bound by the L3 bandwidth and you optimize its memory access patterns, it can get up to a 5.1x speedup.
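The speedup reported in this pane is, roughly, the ratio of the limiting roof at the kernel's arithmetic intensity to its current measured performance. A sketch with hypothetical numbers chosen to reproduce a 5.1x figure:

```python
def bound_speedup(roof_at_ai_gflops, current_gflops):
    """Upper-bound speedup if the kernel reached its limiting roof.
    Both values are in GFLOPS at the kernel's arithmetic intensity."""
    return roof_at_ai_gflops / current_gflops

# Hypothetical kernel: the L3 roof at its arithmetic intensity is 510 GFLOPS,
# while measured performance is 100 GFLOPS -> up to ~5.1x headroom.
print(round(bound_speedup(510.0, 100.0), 1))  # 5.1
```

This is an upper bound: reaching it assumes the memory access pattern fix removes the L3 bottleneck entirely.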
Review how well your kernel uses the compute and memory bandwidth of your hardware in the OP/S and Bandwidth pane. It indicates the following metrics:
For example, if the kernel utilizes 19% of the L3 bandwidth, and this utilization is higher than that of the other memory levels and of the compute capacity, the L3 bandwidth is the main factor limiting the performance of the kernel.
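Bandwidth utilization is simply the measured traffic rate divided by the corresponding roof. A sketch with assumed numbers that reproduce a 19% figure:

```python
def utilization_pct(measured_gbps, peak_gbps):
    """Fraction of a memory level's peak bandwidth the kernel achieves."""
    return 100.0 * measured_gbps / peak_gbps

# Hypothetical: the kernel moves data through L3 at 38 GB/s
# against an assumed 200 GB/s L3 roof.
print(utilization_pct(38.0, 200.0))  # 19.0
```

Comparing this percentage across memory levels and the compute roofs shows which resource the kernel is closest to saturating.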
In the Memory Metrics pane:
Review the Impacts histogram to see how much time the kernel spends processing requests for each memory level, relative to the total time.
A large value indicates the memory level that bounds the selected kernel. Examine the difference between the two largest bars to estimate how much throughput you can gain by reducing the impact of your main bottleneck. The histogram also suggests a longer-term optimization plan: once you resolve the problems behind the widest bar, the second widest bar becomes your next bottleneck, and so on.
Ideally, you should see the L3 or SLM as the most impactful memory level.
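The two-largest-bars comparison can be sketched as follows, with hypothetical per-level impact values:

```python
# Hypothetical Impacts histogram: % of time spent processing requests
# at each memory level (values are illustrative, not measured).
impacts = {"L3": 55.0, "GTI": 30.0, "SLM": 10.0, "CARM": 5.0}

ranked = sorted(impacts.items(), key=lambda kv: kv[1], reverse=True)
(top_level, top), (second_level, second) = ranked[0], ranked[1]

# If the widest bar (L3) stops being the bottleneck, the next widest (GTI)
# takes over; the gap between them hints at the available throughput gain.
print(top_level, second_level, top - second)  # L3 GTI 25.0
```

Here L3 dominating is the desirable case from the guideline above; a dominant GTI bar would instead point to excessive traffic to deeper memory.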
Review the amount of data that passes through each memory level, reported in the Shares histogram.
Examine the types of instructions that the kernel executes in the Instruction Mix pane. For example, a kernel that mostly executes compute instructions with integer operations is likely compute bound.
Intel Advisor automatically determines the data type used in operations and groups the instructions collected during Characterization analysis by the following categories:
| Category | Instruction Types |
|---|---|
| Compute (FLOP and INTOP) | |
| Memory | LOAD, STORE, SLM_LOAD, SLM_STORE types depending on the argument: send, sendc, sends, sendsc |
| Other | |
| Atomic | |
Get more insights about the instructions used in your kernel in the Instruction Mix Details pane:
In the Performance Characteristics pane, review how effectively the kernel uses GPU resources: the activity of all execution units, the percentage of time when both FPUs are used, and the percentage of cycles with a thread scheduled. Ideally, these effectiveness metrics should be high, indicating that the kernel fully uses the GPU resources.