Intel® Advisor Help

Examine Bottlenecks on GPU Roofline Chart

Accuracy Level

Any

Enabled Analyses

Survey with GPU profiling + FLOP (Characterization) with GPU profiling

Note

Other analyses and properties control the CPU Roofline part of the report, which shows metrics for loops executed on the CPU. You can add the CPU Roofline panes to the main view using the button on the top pane. For details about CPU Roofline data, see CPU / Memory Roofline Insights Perspective.

Result Interpretation

Identify optimization headroom and performance bottlenecks for your kernel using the GPU Roofline chart that visualizes your application performance against the maximum capabilities of your hardware.

Example of a GPU Roofline chart

To read the GPU Roofline chart:

  1. Explore the factors that might limit your kernel performance.
    • Horizontal lines indicate compute capacity limitations preventing kernels from achieving better performance without some form of optimization.
    • Diagonal lines indicate memory bandwidth limitations preventing kernels from achieving better performance without some form of optimization:
      • L3 cache roof: Represents the maximal bandwidth of the L3 cache for your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that fits entirely into L3 cache.
      • SLM cache roof: Represents the maximal bandwidth of the Shared Local Memory for your current graphics hardware. Measured using an optimized sequence of load and store operations that work only with SLM.
      • GTI roof: Represents the maximum bandwidth between the GPU and the rest of the SoC. This estimate is calculated via an analytical formula based on the maximum frequency of your current graphics hardware.
      • DRAM roof: Represents the maximal bandwidth of the DRAM memory available to your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that does not fit in GPU caches.
  2. Identify top hotspots for optimization.

    The dots on the chart correspond to kernels running on the GPU. In accordance with Amdahl's Law, optimizing the loops that take the largest portion of the program's total run time leads to greater speedups than optimizing the loops that take a smaller portion. Use the size and color of a dot to identify kernels that take most of the total GPU time and/or sit very low in the chart. For example:

    • Small green dots take up relatively little time, so they are likely not worth optimizing.
    • Large red dots take up the most time. The best candidates for optimization are large red dots with a large amount of space between them and the topmost roofs.

  3. Identify headroom for optimization.

    The roofs above a dot represent the restrictions preventing it from achieving higher performance. A dot cannot exceed the topmost rooflines, as they represent the maximum capabilities of the hardware. The farther a dot is from the topmost roofs, the more room for improvement there is. Highlight the roof that limits the performance of your kernel by double-clicking a dot on the chart.
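The logic of reading the chart can be summarized numerically with the roofline model: a kernel's attainable performance is the smaller of the compute roof and the product of its arithmetic intensity with the limiting memory roof. The sketch below is a minimal illustration; the roof values are made-up placeholders, not measured Intel hardware limits.

```python
# Minimal roofline-model sketch. Roof values are illustrative
# placeholders, NOT measured hardware limits.
PEAK_GFLOPS = 1000.0          # topmost compute roof, GFLOP/s
ROOFS_GBPS = {                # memory bandwidth roofs, GB/s
    "L3": 800.0,
    "SLM": 600.0,
    "GTI": 150.0,
    "DRAM": 100.0,
}

def attainable_gflops(arithmetic_intensity: float, bandwidth_gbps: float) -> float:
    """Attainable performance = min(compute roof, AI * memory bandwidth roof)."""
    return min(PEAK_GFLOPS, arithmetic_intensity * bandwidth_gbps)

# A kernel with AI = 0.5 FLOP/byte streaming from DRAM sits far below
# the compute roof (memory bound):
print(attainable_gflops(0.5, ROOFS_GBPS["DRAM"]))   # 50.0
# The same kernel working out of L3 cache could reach much higher:
print(attainable_gflops(0.5, ROOFS_GBPS["L3"]))     # 400.0
```

The vertical gap between a dot's actual GFLOP/s and `attainable_gflops` for its limiting roof is the optimization headroom the chart visualizes.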

Explore Performance of Kernel Instances

You can review and compare the performance of instances of your kernel initialized with different global and local work sizes. You can do this in the following ways:

  • GPU Roofline chart:
    1. Click a dot on a Roofline chart and click the + button that appears next to the dot. The dot expands into several dots representing the instances of the selected kernel.
    2. Click a dot representing a kernel instance and view details about its global and local work size in the GPU Details pane.
    3. Hover over dots representing kernel instances to review and compare their performance metrics. Highlight a roofline limiting the performance of a given instance by double-clicking the dot.

  • Grid in the GPU pane:
    1. Expand a source kernel in the grid.
    2. View the information about the work size of the kernel instances by expanding the Work Size column in the grid. To view the count of instances of a given global/local size, expand the Compute Task Details column in the grid and notice the Instance Count metric.
    3. Compare performance metrics for instances of different global and local size using the grid and the GPU Details pane.

Note

Selecting a dot on a chart automatically highlights the respective kernel in the grid and vice versa.
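When comparing instances, the useful figure of merit is throughput (operations per second) rather than raw time, since instances with different work sizes may process the same total work. The sketch below uses hypothetical per-instance records assembled by hand from the GPU Details pane; the field names and values are illustrative, not Advisor output.

```python
# Hypothetical per-instance metrics for one kernel, e.g. copied from the
# GPU Details pane. Names and numbers are illustrative, not Advisor output.
instances = [
    {"global_size": (1024, 1024), "local_size": (16, 16), "time_s": 0.020, "gflop": 2.0},
    {"global_size": (1024, 1024), "local_size": (8, 8),   "time_s": 0.035, "gflop": 2.0},
]

def gflops(inst: dict) -> float:
    """Throughput of one instance in GFLOP/s."""
    return inst["gflop"] / inst["time_s"]

# Both instances do the same total work; the one with the higher
# throughput is the better-performing configuration.
best = max(instances, key=gflops)
print(best["local_size"], gflops(best))  # (16, 16) 100.0
```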

Explore Memory-Level GPU Roofline

By default, GPU Roofline collects data for all memory levels. This enables you to examine each kernel at different cache levels and arithmetic intensities and provides precise insights into which cache level causes the performance bottlenecks.

Configure Memory-Level Roofline Chart

  1. Expand the filter pane in the GPU Roofline chart toolbar.
  2. In the Memory Level section, select the memory levels you want to see metrics for.

    Select memory levels for a GPU Roofline chart

    Note

    By default, GPU Roofline reports data for GTI memory level (for integrated graphics) and HBM/DRAM memory level (for discrete graphics).
  3. Click Apply.

Interpret Memory-Level GPU Roofline Data

Examine the relationships between the displayed memory levels and highlight the roof that limits your kernel performance the most by double-clicking a dot on the GPU Roofline chart. Labeled dots and/or X marks are displayed, representing the arithmetic intensity of the selected kernel at the following memory levels:

  • CARM: Memory traffic generated by all execution units (EUs). Includes traffic between EUs and corresponding GPU cache or direct traffic to main memory. For each retired instruction with memory arguments, the size of each memory operand in bytes is added to this metric.
  • L3: Data transferred directly between execution units and L3 cache.
  • SLM: Memory access to/from Shared Local Memory (SLM), a dedicated structure within the L3 cache.
  • GTI: Represents GTI traffic/GPU memory read bandwidth, the accesses between the GPU, chip uncore (LLC), and main memory. Use this to get a sense of external memory traffic.

The vertical distance between memory dots and their respective roofline shows how much you are limited by a given memory subsystem. If a dot is close to its roof line, it means that the kernel is limited by the bandwidth of this memory level.

The horizontal distance between memory dots indicates how efficiently the kernel uses the cache. For example, if the L3 and GTI dots are very close on the horizontal axis for a single kernel, the kernel transfers roughly as much data through GTI as through L3, which means it does not reuse data in the L3 cache efficiently. Improve data reuse in the code to improve application performance.

Arithmetic intensity on the x axis determines the order in which dots are plotted, which can provide some insight into your code's performance. For example, the CARM dot is typically far to the right of the L3 dot because the L3 cache is accessed in whole cache lines, while CARM traffic counts only the bytes actually used in operations. To identify room for optimization, check the L3 cache line utilization metric for a given kernel. If L3 cache lines are not utilized well, check the memory access patterns in your kernel to improve its elapsed time.

Ideally, the CARM and the L3 dots are located close to each other, and the GTI dot is far to the right of them. In this case, your kernel has good memory access patterns and mostly utilizes the L3 cache. If your kernel utilizes L3 cache lines well, your kernel:

  • Spends less time transferring data between the L3 and CARM memory levels
  • Uses as much data as possible for actual calculations
  • Improves the elapsed time of the kernel and of the entire application
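The relative placement of the dots follows directly from how traffic is counted at each level. The sketch below illustrates this with made-up numbers for one kernel (not Advisor output): L3 traffic is counted in whole 64-byte cache lines, CARM traffic in bytes actually used, so the CARM arithmetic intensity is always at least the L3 one, and their ratio is the cache line utilization.

```python
CACHE_LINE = 64  # bytes per cache line

# Illustrative numbers for one kernel (not Advisor output):
flop = 1.0e9          # total floating-point operations
bytes_used = 4.0e8    # CARM traffic: bytes actually consumed by instructions
lines_moved = 1.0e7   # cache lines transferred between EUs and L3

l3_bytes = lines_moved * CACHE_LINE   # 6.4e8 bytes of L3 traffic
ai_carm = flop / bytes_used           # 2.5 FLOP/byte
ai_l3 = flop / l3_bytes               # 1.5625 FLOP/byte -> L3 dot left of CARM dot
utilization = bytes_used / l3_bytes   # 0.625: only 62.5% of each line is used

# Poor utilization (< 1.0) widens the horizontal gap between the
# CARM and L3 dots and signals inefficient memory access patterns.
print(ai_carm, ai_l3, utilization)
```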

Example of a GPU Roofline chart for all memory levels

Review and compare the changes in traffic from one memory level to another to identify the memory hierarchy bottleneck for the kernel and determine optimization steps based on this information.

Examine Kernel Details

Select a dot on the chart and switch to the GPU Details tab in the right-side pane to examine code analytics for a specific kernel in more detail.

In the Roofline Guidance pane, examine the Roofline chart for the selected kernel.

Review how well your kernel uses the compute and memory bandwidth of your hardware in the OP/S and Bandwidth pane.

Review the metrics in the Memory Metrics pane.

Note

Data in the Memory Metrics pane is based on a dominant type of operations in your code (FLOAT or INT).

Examine the types of instructions that the kernel executes in the Instruction Mix pane. For example, if the kernel mostly executes compute instructions with integer operations, the kernel is mostly compute bound.

Intel Advisor automatically determines the data type used in operations and groups the instructions collected during Characterization analysis by the following categories:

  • Compute (FLOP and INTOP):
    • BASIC COMPUTE: add, addc, mul, rndu, rndd, rnde, rndz, subb, avg, frc, lzd, fbh, fbl, cbit
    • BIT: and, not, or, xor, asr, shr, shl, bfrev, bfe, bfi1, bfi2, ror, rol
    • FMA: mac, mach, mad, madm

      Note

      Intel Advisor counts mac, mach, mad, madm instructions belonging to this class as 2 operations.
    • DIV: INT_DIV_BOTH, INT_DIV_QUOTIENT, INT_DIV_REMAINDER, and FDIV types of extended math function
    • POW extended math function
    • MATH: other function types performed by math instruction
  • Memory: LOAD, STORE, SLM_LOAD, SLM_STORE types depending on the argument: send, sendc, sends, sendsc
  • Other:
    • MOVE: mov, sel, movi, smov, csel
    • CONTROL FLOW: if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt
    • SYNC: wait, sync
    • OTHER: cmp, cmpn, nop, f32to16, f16to32, dim
  • Atomic: SEND
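The note about FMA-class instructions matters when you reason about operation counts yourself: each mac, mach, mad, or madm retires two operations (a multiply and an add). The sketch below shows the counting rule with hypothetical instruction counts; it is an illustration of the weighting, not a reproduction of Advisor's internal accounting.

```python
# Operations contributed per retired instruction. FMA-class instructions
# (mac, mach, mad, madm) count as 2 operations each; the counts passed
# in below are hypothetical, not Advisor output.
OPS_PER_INSTRUCTION = {"add": 1, "mul": 1, "mad": 2, "mac": 2}

def total_ops(mix: dict) -> int:
    """Total operations implied by an instruction mix."""
    return sum(count * OPS_PER_INSTRUCTION[op] for op, count in mix.items())

print(total_ops({"add": 100, "mul": 50, "mad": 200}))  # 100 + 50 + 400 = 550
```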

Get more insights about the instructions used in your kernel using the Instruction Mix Details pane.

In the Performance Characteristics pane, review how effectively the kernel uses the GPU resources: the activity of all execution units, the percentage of time when both FPUs are used, and the percentage of cycles with a thread scheduled. Ideally, these effectiveness metrics should be high, indicating that the kernel uses more of the available GPU resources.

See Also