Intel® Advisor Help
This reference section describes the contents of data columns in reports of the Offload Modeling and GPU Roofline Insights perspectives.
Description: Average percentage of time when both FPUs are used.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > EU Instructions column group.
Description: Percentage of cycles actively executing instructions on all execution units (EUs).
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > EU Array column group.
Description: Additional information about a code region that might help to understand the achieved performance.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane.
Description: Total execution time by atomic throughput, in milliseconds.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Bounded By column group.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Average time spent executing one task instance.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Compute Task Details column group.
Prerequisites for display: Expand the Compute Task Details column.
Description: Average time spent executing one task instance. This metric is only available for the GPU-to-GPU modeling.
Collected during the Survey analysis with enabled GPU profiling in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Prerequisites for display: Expand the Measured column.
Description: Average number of times a loop/function is executed.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Prerequisites for display: Expand the Measured column group.
Description: Rate at which data is transferred to and from GPU, chip uncore (LLC), and main memory, in gigabytes per second.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > GPU Memory column group.
Prerequisite for display: Expand the GPU Memory column. This metric is also shown in the collapsed GPU Memory column.
Description: Rate at which data is transferred between execution units and L3 caches, in gigabytes per second.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > L3 Shader column group.
Prerequisite for display: Expand the L3 Shader column. This metric is also shown in the collapsed L3 Shader column.
Description: Rate at which data is transferred to and from shared local memory (SLM), in gigabytes per second.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions report > GPU pane > SLM column group.
Prerequisites for display: Expand the SLM column. This metric is also shown in the collapsed SLM column.
Description: Host platform that the application is executed on.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Prerequisite for display: Expand the Measured column group.
Description: List of main factors that limit the estimated performance of a code region offloaded to a target device.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Basic Estimated Metrics column group.
Interpretation: This metric shows one or more bottleneck(s) in a code region.
Category | Bottleneck | Description |
---|---|---|
Algorithmic | Dependencies | Data dependencies limit the parallel execution efficiency. Fix the dependencies to offload this code region. |
Algorithmic | Kernel Decomposition | The workload decomposition strategy does not allow scheduling enough parallel threads to use all execution units on a selected target device. |
Algorithmic | Trip Counts | The number of loop iterations is not enough to use all execution units on a selected target device. |
Taxes | Data Transfer | Data transfer tax is greater than the sum of the maximum throughput time and latencies time. |
Taxes | Launch Tax | Kernel launch tax is greater than the sum of the maximum throughput time and latencies time. |
Throughput | Compute | The code region uses full target device capabilities, but the compute time is still high. The time is greater than all other execution time components on a target device. |
Throughput | Global Atomics | Global atomics bandwidth time is greater than all other execution time components on a target device. |
Throughput | Memory Sub-System bandwidth (BW): for example, L3 BW, LLC BW, DRAM BW | Memory sub-system bandwidth time is greater than all other execution time components on a target device. |
Latencies | Latencies | Instruction latency is greater than the maximum throughput time. |
The resulting estimated time is calculated as the sum of four factors: the maximum throughput time, the non-overlapped latency, the data transfer tax, and the kernel submission tax:
Time = max_throughput_bottleneck_time + non_overlapped_latency + data_transfer_time + kernel_submission_taxes_time
The model assumes that throughput-defined times fully overlap, so it chooses only the maximum throughput bottleneck to show in the column. If the impact of the other components is comparable to the throughput component, the column shows the top bottleneck of each factor (one for throughput, one for latency, and one for data transfer or submission taxes). This means the code region is limited by this combination of factors, which are ordered by their impact on the region performance.
Otherwise, for example, if the relative throughput impact is much higher than the latency and data transfer ones, only the maximum throughput bottleneck is shown as dominating over others. If the maximum throughput time is compute, Intel Advisor assumes the algorithmic factors (dependencies, kernel decomposition, trip counts) limit offloading a code region.
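The time formula and the selection of bottlenecks described above can be sketched as follows. This is a minimal illustration, not the Intel Advisor implementation: the `RegionTimes` type, the function names, and the `ratio` threshold used to decide whether a component is "comparable" to the throughput time are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RegionTimes:
    """Hypothetical per-region time components, all in the same time unit."""
    throughput: Dict[str, float]  # for example {"Compute": 0.2, "DRAM BW": 1.0}
    latency: float                # non-overlapped latency
    data_transfer: float          # data transfer tax
    launch_tax: float             # kernel submission tax


def estimated_time(r: RegionTimes) -> float:
    # Throughput components are assumed to overlap fully,
    # so only the largest one contributes to the sum.
    return max(r.throughput.values()) + r.latency + r.data_transfer + r.launch_tax


def bounded_by(r: RegionTimes, ratio: float = 0.5) -> List[str]:
    """List the dominating factors, ordered by impact.

    `ratio` is an assumed threshold: a non-throughput component is listed
    only if it is at least `ratio` times the maximum throughput time.
    """
    top_name, top_time = max(r.throughput.items(), key=lambda kv: kv[1])
    candidates = [
        (top_name, top_time),
        ("Latencies", r.latency),
        ("Data Transfer", r.data_transfer),
        ("Launch Tax", r.launch_tax),
    ]
    shown = [(n, t) for n, t in candidates if n == top_name or t >= ratio * top_time]
    shown.sort(key=lambda nt: nt[1], reverse=True)
    return [n for n, _ in shown]


region = RegionTimes(
    throughput={"Compute": 0.2, "DRAM BW": 1.0},
    latency=0.05,
    data_transfer=1.5,
    launch_tax=0.01,
)
print(estimated_time(region), bounded_by(region))
```

With these made-up numbers, the data transfer tax exceeds the maximum throughput time, so the sketch reports a combined value with Data Transfer listed before DRAM BW.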
For example, the combined Data Transfer, DRAM BW value means the following:
Description: Fraction of global memory traffic used by execution units.
Collected during the Survey analysis with GPU profiling enabled in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > L3 Shader column group.
Prerequisites for display: Expand the L3 Shader column group. This metric is also shown in the collapsed L3 Shader column.
Calculation: Ratio of global memory traffic to the observed cache traffic, where:
Interpretation: If you see a low value, it may indicate that the kernel has an inefficient or not GPU-friendly memory access pattern.
Description: Number of times a loop/function was invoked.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Prerequisites for display: Expand the Measured column group.
Description: Total data transferred to and from execution units, in gigabytes.
Collected during the Characterization analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane.
Description: Estimated execution time assuming an offloaded loop is bound only by compute throughput.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the Accelerated Regions tab > Code Regions pane.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Name of a compute task.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane.
Description: Average time spent executing one task instance. When collapsed, corresponds to the Average column. Expand to see more metrics.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane.
Description: Action that a compute task performs.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane.
Description: Total number of threads started across all execution units for a computing task.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane.
Description: Estimated time cost, in milliseconds, for transferring loop data between host and target platform. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data is reused between code regions on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Bounded By column group.
Prerequisites for collection:
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Estimated time cost, in milliseconds, for transferring loop data between host and target platforms considering data is not reused. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Bounded By column group.
Prerequisite for collection:
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Difference between data transfer time estimated with data reuse and without data reuse, in milliseconds. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Bounded By column group.
Prerequisite for collection:
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Dependency absence or presence in a loop across iterations.
Collected during the Survey and Dependencies analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Prerequisite for display: Expand the Measured column group.
Possible values:
Prerequisites for collection/display:
Some values in this column can appear only if you select specific options when collecting data or run the Dependencies analysis:
For Parallel: Workload and Dependency: <dependency-type>:
For Parallel: User:
For Dependency: User:
For Parallel: Assumed:
For Dependency: Assumed:
Interpretation:
Description: Total data transferred from device to host.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Data Transferred column group.
Prerequisites for display: Expand the Data Transferred column group.
Description: Total time spent on transferring data from device to host.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Data Transferred column group.
Prerequisites for display: Expand the Data Transferred column group.
Description: Summary of estimated DRAM memory usage, including DRAM bandwidth, in gigabytes per second, and total DRAM traffic calculated as sum of read and write traffic.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Description: DRAM Bandwidth. Estimated time, in seconds, spent on reading from DRAM memory and writing to DRAM memory assuming a maximum DRAM memory bandwidth is achieved.
Collected during the Trip Counts analysis (Characterization) and the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Bounded By column group.
Prerequisite for display: Expand the Estimated Bounded By column group.
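One way to read "time assuming a maximum DRAM memory bandwidth is achieved" is as total traffic divided by the peak bandwidth. The sketch below shows that arithmetic only. The function name and parameters are hypothetical, and the exact formula Intel Advisor uses is not given in this section.

```python
def bandwidth_bound_time(read_gb: float, write_gb: float, peak_gb_per_s: float) -> float:
    """Time in seconds to move the given traffic at the peak bandwidth.

    Assumption: traffic is the sum of read and write traffic, in gigabytes.
    """
    if peak_gb_per_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    total_traffic_gb = read_gb + write_gb
    return total_traffic_gb / peak_gb_per_s


# 3 GB read plus 1 GB written at an assumed 8 GB/s peak
print(bandwidth_bound_time(3.0, 1.0, 8.0))
```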
Description: DRAM Bandwidth. Estimated rate at which data is transferred to and from the DRAM, in gigabytes per second.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Estimated DRAM bandwidth utilization, in percent.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Memory Estimations column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Calculation: Ratio of average bandwidth to a maximum theoretical bandwidth.
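The calculation above can be illustrated with a short sketch. The function name, the parameters, and the derivation of the average bandwidth as traffic divided by elapsed time are assumptions made for illustration.

```python
def bandwidth_utilization(traffic_gb: float, elapsed_s: float, peak_gb_per_s: float) -> float:
    """Return utilization in percent: average bandwidth over peak bandwidth."""
    if elapsed_s <= 0 or peak_gb_per_s <= 0:
        raise ValueError("elapsed time and peak bandwidth must be positive")
    average_gb_per_s = traffic_gb / elapsed_s
    return 100.0 * average_gb_per_s / peak_gb_per_s


# 12 GB moved in 2 s against an assumed 24 GB/s theoretical maximum
print(bandwidth_utilization(12.0, 2.0, 24.0))
```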
Description: Total estimated data read from the DRAM memory.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Estimated sum of data read from and written to the DRAM memory.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Total estimated data written to the DRAM memory.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Wall-clock time from beginning to end of computing task execution.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane.
Description: Percentage of cycles on all execution units (EUs) and thread slots when a slot has a thread scheduled.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane.
Description: Summary of data read from a target platform and written to the target platform. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on the target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane.
Prerequisites for collection:
Description: Number of fill instructions used to read data values spilled from GRF into memory (L3 cache).
Collected during the Trip Counts analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Register Spilling column group.
Prerequisites for display: Expand the Register Spilling column group.
Interpretation: A high number of memory spill/fill (or load/store) operations significantly increases memory traffic and decreases the performance.
Description: Summary of floating-point operations in a kernel.
Collected during the Characterization analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > GPU Compute Performance column group.
Aggregation:
You can hover over each value in the cell to see the value description.
Description: Ratio of floating-point operations to bytes transferred to GPU memory.
Collected during the Characterization analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > GPU Compute Performance column group.
Prerequisites for display: Expand the GPU Compute Performance column group. This metric is also shown in the collapsed FLOAT Operations column.
Description: Percentage of time spent in code regions profitable for offloading in relation to the total execution time of the region.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Basic Estimated Metrics column group.
Prerequisites for display: Expand the Basic Estimated Metrics column group.
Interpretation: 100% means there are no non-offloaded child regions, calls to parallel runtime libraries, or system calls in the region.
Description: Estimated data transferred from a target platform to a shared memory by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Data Transfer with Reuse column group.
Prerequisites for collection:
Prerequisite for display: Expand the Estimated Data Transfer with Reuse column group.
Description: Number of giga floating-point operations.
Collected during the Characterization analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > GPU Compute Performance column group.
Instruction types counted: BASIC COMPUTE, FMA, BIT, DIV, POW, MATH.
Prerequisites for display: Expand the GPU Compute Performance column group. This metric is also shown in the collapsed FLOAT Operations column.
Description: Number of giga floating-point operations per second.
Collected during the Characterization analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > GPU Compute Performance column group.
Instruction types counted: BASIC COMPUTE, FMA, BIT, DIV, POW, MATH.
Prerequisites for display: Expand the GPU Compute Performance column group. This metric is also shown in the collapsed FLOAT Operations column.
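The floating-point metrics described above are related by simple ratios. The following sketch uses hypothetical function names and made-up input values to show how GFLOPS and the ratio of floating-point operations to bytes follow from an operation count, a byte count, and an elapsed time.

```python
def gflops(gflop: float, elapsed_s: float) -> float:
    """Giga floating-point operations per second."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return gflop / elapsed_s


def arithmetic_intensity(gflop: float, gbytes: float) -> float:
    """Floating-point operations per byte transferred (FLOP/byte)."""
    if gbytes <= 0:
        raise ValueError("transferred bytes must be positive")
    return gflop / gbytes


# A kernel that performs 8 GFLOP and moves 2 GB in 0.5 s
print(gflops(8.0, 0.5), arithmetic_intensity(8.0, 2.0))
```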
Description: Number of giga integer operations.
Collected during the Characterization analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > GPU Compute Performance column group.
Instruction types counted: BASIC COMPUTE, FMA, BIT, DIV, POW, MATH.
Prerequisites for display: Expand the GPU Compute Performance column group. This metric is also shown in the collapsed INT Operations column.
Description: Number of giga integer operations per second.
Collected during the Characterization analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > GPU Compute Performance column group.
Instruction types counted: BASIC COMPUTE, FMA, BIT, DIV, POW, MATH.
Prerequisites for display: Expand the GPU Compute Performance column group. This metric is also shown in the collapsed INT Operations column.
Description: Total number of work items in all work groups.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Work Size column group.
Description: Total estimated number of work items in a loop after it is offloaded to a target platform.
Collected during the Trip Counts analysis (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Compute Estimates column group.
Prerequisite for display: Expand the Compute Estimates column group.
Description: Total number of work items in a kernel instance on a baseline device. This metric is only available for the GPU-to-GPU modeling.
Collected during the Survey analysis with enabled GPU profiling in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Prerequisite for display: Expand the Measured column group.
Description: Summary of GPU memory usage in a kernel. GPU memory is data transferred to and from GPU, chip uncore (LLC), and main memory.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane.
Aggregation: The column reports the following metrics:
You can hover over each value in the cell to see the value description.
Description: Total number of shader atomic memory accesses.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane.
Description: Total number of shader barrier messages.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane.
Description: Summary of estimated GTI memory usage, including GTI bandwidth, in gigabytes per second, and total GTI traffic calculated as sum of read and write traffic.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Description: Graphics technology interface (GTI) Bandwidth. Estimated time, in seconds, spent on reading from and writing to GTI memory assuming a maximum GTI memory bandwidth is achieved.
Collected during the Trip Counts analysis (Characterization) and the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Bounded By column group.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Graphics technology interface (GTI) Bandwidth. Estimated rate at which data is transferred to and from the GTI, in gigabytes per second.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Graphics technology interface (GTI) bandwidth utilization. Estimated GTI bandwidth utilization, in percent.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Memory Estimations column group in the Code Regions pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Calculation: Ratio of average bandwidth to a maximum theoretical bandwidth.
Description: Total estimated data read from the GTI memory.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Estimated sum of data read from and written to the GTI memory.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Total estimated data written to the GTI memory.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Total data transferred from host to device.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Data Transferred column group.
Prerequisites for display: Expand the Data Transferred column group.
Description: Total time spent on transferring data from host to device.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Data Transferred column group.
Prerequisites for display: Expand the Data Transferred column group.
Description: Percentage of cycles on all execution units (EUs) when no threads are scheduled on an EU.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions report > GPU pane > EU Array column group.
Description: Time spent in system calls and calls to ignored modules or parallel runtime libraries in the code regions recommended for offloading.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Non-User Code Metrics column group.
Prerequisites for collection: From CLI, run the --collect=projection action with the --ignore=<code-to-ignore> action option. For example, to ignore MPI and OpenMP* calls, use the option as follows: --ignore=MPI,OMP.
Prerequisite for display: Expand the Time in Non-User Code column group.
Interpretation: Time in the ignored code parts is not used for the estimations. It does not affect the time estimated for offloaded code regions.
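As a sketch of how the collection prerequisite might be scripted, the helper below assembles a command line with the --collect=projection action and the --ignore option named above. The `advisor` executable name, the `--project-dir` option, and the `./advi_results` path are assumptions not stated in this section, and the function only builds the command without running it.

```python
import shlex
from typing import List


def projection_command(project_dir: str, ignore: List[str]) -> List[str]:
    """Build an argument list for Performance Modeling with ignored code parts."""
    cmd = ["advisor", "--collect=projection", f"--project-dir={project_dir}"]
    if ignore:
        # Multiple values are passed as one comma-separated list, as in --ignore=MPI,OMP
        cmd.append("--ignore=" + ",".join(ignore))
    return cmd


# Print the command for review instead of executing it
print(shlex.join(projection_command("./advi_results", ["MPI", "OMP"])))
```

The resulting list can be passed to `subprocess.run` once the options are confirmed against the installed version.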
Description: Total number of times a task executes on a GPU.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Compute Task Details column group.
Prerequisite for display: Expand the Compute Task Details column group.
Description: Total estimated number of times a loop executes on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Compute Estimates column group.
Prerequisite for display: Expand the Compute Estimates column group.
Description: Total number of times a loop executes on a baseline GPU device.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Description: Ratio of integer operations to transferred bytes.
Collected during the Characterization analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions report > GPU pane > GPU Compute Performance column group.
Instruction types counted: BASIC COMPUTE, FMA, BIT, DIV, POW, MATH.
Prerequisites for display: Expand the GPU Compute Performance column group. This metric is also shown in the INT Operations column when the group is collapsed.
Description: Summary of integer operations used in a kernel.
Collected during the Characterization analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > GPU Compute Performance column group.
Aggregation:
You can hover over each value in the cell to see the value description.
Description: Average rate of instructions per cycle (IPC) calculated for two FPU pipelines.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > EU Instructions column group.
Description: Summary of iteration metrics measured on a baseline device.
Collected during the Trip Counts (Characterization) analysis (for CPU regions) or the Survey analysis with enabled GPU profiling (for GPU regions) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Aggregation: For the CPU-to-GPU modeling, this column reports the following metrics:
For the GPU-to-GPU modeling, this column reports the following metrics:
Description: Total estimated time cost for invoking a kernel when offloading a loop to a target platform. Does not include data transfer costs.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Bounded By column group.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Top uncovered latency in a loop/function, in milliseconds.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Bounded By column group.
Description: L3 Bandwidth. Estimated time, in seconds, spent on reading from L3 cache and writing to L3 cache assuming a maximum L3 cache bandwidth is achieved.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Bounded By column group.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Summary of estimated L3 cache usage, including L3 cache bandwidth (in gigabytes per second) and L3 cache traffic calculated as sum of read and write traffic.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Description: Average estimated rate at which data is transferred to and from the L3 cache, in gigabytes per second.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Estimated L3 cache bandwidth utilization, in percent, calculated as the ratio of average bandwidth to the maximum theoretical bandwidth.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Total estimated data read from the L3 cache.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Estimated sum of data read from and written to the L3 cache.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Total estimated data written to the L3 cache.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Summary of L3 cache usage in a kernel.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane.
Aggregation: The column reports the following metrics:
You can hover over each value in the cell to see the value description and interpretation hints.
Description: Estimated last-level cache (LLC) usage, including LLC cache bandwidth (in gigabytes per second) and total LLC cache traffic, which is a sum of read and write traffic.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Description: Last-level cache (LLC) bandwidth. Estimated time, in seconds, spent on reading from LLC and writing to LLC assuming a maximum LLC bandwidth is achieved.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Bounded By column group.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Estimated rate at which data is transferred to and from the LLC cache, in gigabytes per second.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Estimated LLC cache bandwidth utilization, in percent.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Calculation: Ratio of average bandwidth to a maximum theoretical bandwidth.
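The ratio in the Calculation line can be written out as a minimal sketch. The function name and the sample bandwidth values are hypothetical; only the ratio itself comes from the text above.

```python
def bandwidth_utilization_percent(average_gb_per_s: float,
                                  max_theoretical_gb_per_s: float) -> float:
    """Average bandwidth expressed as a percentage of the theoretical maximum."""
    if max_theoretical_gb_per_s <= 0:
        raise ValueError("maximum theoretical bandwidth must be positive")
    return 100.0 * average_gb_per_s / max_theoretical_gb_per_s


# Hypothetical values: 48 GB/s average against a 192 GB/s theoretical peak.
print(bandwidth_utilization_percent(48.0, 192.0))
```

The same ratio applies to the L3 cache and SLM utilization columns, each with its own average and maximum bandwidth.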
Description: Total estimated data read from the LLC cache.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Estimated sum of data read from and written to the LLC cache.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Total estimated data written to the LLC cache.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Cache or memory load latencies uncovered in a code region, in milliseconds.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Bounded By column group.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Number of work items in one work group.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Work Size column group.
Description: Local memory size used by each thread group.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Compute Task Details column group.
Prerequisite for display: Expand the Compute Task Details column group.
Description: Total estimated number of work items in one work group of a loop executed after being offloaded to a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Compute Estimates column group.
Prerequisite for display: Expand the Compute Estimates column group.
Description: Total number of work items in one work group of a kernel. This metric is only available for the GPU-to-GPU modeling.
Collected during the Survey analysis with enabled GPU profiling in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Prerequisite for display: Expand the Measured column group.
Description: Name and source location of a loop/function in a region, where region is a sub-tree of loops/functions in a call tree.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane.
Description: Size of unique data (variables) spilled from general register file (GRF) per thread, in bytes.
Collected during the Trip Counts analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Register Spilling column group.
Prerequisites for display: Expand the Register Spilling column group. This metric is also shown in the collapsed Register Spilling column.
Interpretation: A higher value indicates that register spilling decreases performance.
Description: Total memory traffic between general register file (GRF) and L3 caused by the register spilling, in percentage of total traffic.
Collected during the Trip Counts analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Register Spilling column group.
Prerequisites for display: Expand the Register Spilling column group.
Interpretation: The lower the ratio is, the better the kernel is optimized. If you see a high value, it means that spill/fill traffic takes up a big part of total traffic and may significantly decrease kernel performance.
Calculation: Ratio of total spill/fill traffic to the total observed cache traffic.
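A minimal sketch of this ratio, expressed as a percentage as in the description. The function name and the traffic values are hypothetical, and returning 0 for a kernel with no observed cache traffic is an assumption made for the example.

```python
def spill_memory_impact_percent(spill_fill_traffic_gb: float,
                                total_cache_traffic_gb: float) -> float:
    """Share of total observed cache traffic caused by register spill/fill."""
    if total_cache_traffic_gb == 0:
        # Assumption: no observed traffic means no measurable impact.
        return 0.0
    return 100.0 * spill_fill_traffic_gb / total_cache_traffic_gb


# Hypothetical kernel: 0.5 GB of spill/fill traffic out of 4 GB total traffic.
print(spill_memory_impact_percent(0.5, 4.0))
```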
Description: Program module name.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Location column group.
Prerequisites for display: Expand the Location column group.
Description: Total time spent transferring data and launching a kernel, in milliseconds.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Bounded By column group.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Conclusion that indicates whether a code region is profitable for offloading to a target platform. In the Top-Down pane, it also reports the node position, such as offload child loops and child functions.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Basic Estimated Metrics column group.
Description: Total estimated time spent in non-offloaded parts of offloaded code regions.
Collected during the Survey and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Time in Non-User Code column group.
Calculation: This column is a sum of the following metrics:
Interpretation: These code parts are located inside offloaded regions, but the performance model assumes these parts are executed on a baseline device. Examples of such code parts are OpenMP* code parts, Data Parallel C++ (DPC++) runtimes, and system calls.
Description: Number of loop iterations or kernel work items executed in parallel on a target device for a loop/function.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Compute Estimates column group.
Description: Estimated number of threads scheduled simultaneously on all execution units (EU).
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Compute Estimates column group.
Prerequisites for display: Expand the Compute Estimates column group.
Description: Performance issues and recommendations for optimizing code regions executed on a GPU.
Collected during the Survey, Characterization, and Performance Modeling analyses in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane.
Interpretation: Click to view the full recommendation text with code examples and recommended fixes in the Recommendations pane of the GPU Roofline Regions tab.
Description: Recommendations for offloading code regions with estimated performance summary and/or potential issues with optimization hints.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane.
Interpretation: Click to view the full recommendation text with examples of using the DPC++ and OpenMP* programming models to offload the code regions and/or fix the performance issue in the Recommendations pane of the Accelerated Regions tab.
Description: Total estimated data transferred to a private memory from a target platform by a loop. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Data Transfers with Reuse column group.
Prerequisite for collection:
Prerequisite for display: Expand the Estimated Data Transfers with Reuse column group.
Description: Private memory size allocated by a compiler to each thread.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Compute Task Details column group.
Prerequisite for display: Expand the Compute Task Details column group.
Description: Estimated data read from a target platform by an offload region, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Data Transfers with Reuse column group.
Prerequisite for collection:
Prerequisite for display: Expand the Estimated Data Transfers with Reuse column group.
Description: Total data read from GPU, chip uncore (LLC), and main memory, in gigabytes.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > GPU Memory column group.
Prerequisites for display: Expand the GPU Memory column group.
Description: Total data read, or filled, from L3 memory due to register spilling, in gigabytes.
Collected during the Trip Counts analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Register Spilling column group.
Prerequisites for display: Expand the Register Spilling column group.
Description: Total data read from the shared local memory (SLM), in gigabytes.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > SLM column group.
Prerequisites for display: Expand the SLM column group.
Description: Rate at which data is read from GPU, chip uncore (LLC), and main memory, in gigabytes per second.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > GPU Memory column group.
Prerequisites for display: Expand the GPU Memory column group.
Description: Rate at which data is read from shared local memory (SLM), in gigabytes per second.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > SLM column group.
Prerequisites for display: Expand the SLM column group.
Description: Estimated data read from a target platform by a code region considering no data is reused between kernels, in megabytes. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Data Transfers with Reuse column group.
Prerequisite for collection:
Prerequisite for display: Expand the Estimated Data Transfers with Reuse column group.
Description: Summary of register spilling impact on kernel performance.
Collected during the Trip Counts analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane.
Interpretation: Register spilling occurs when a thread block (or work item) needs more space in the general register file (GRF) than is available, and data is loaded, or spilled, into memory through the L3 cache. The next time this data is needed, the application has to read, or fill, it from the L3 cache memory, which causes more memory operations. As a result, register spilling in a kernel decreases its performance.
For the best performance, there should be no spills in the kernel.
Aggregation:
Description: Percentage of cycles on all execution units when execution unit send pipeline is actively processed.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > EU Instructions column group.
Description: Number of work items processed by a single GPU thread.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Compute Task Details column group.
Prerequisites for display: Expand the Compute Task Details column group.
Description: Estimated number of work items processed by a single thread on a target platform.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Compute Estimates column group.
Prerequisites for display: Expand the Compute Estimates column group.
Description: Number of work items processed by a single thread on a baseline device. This metric is only available for the GPU-to-GPU modeling.
Collected during the Survey analysis with enabled GPU profiling in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Prerequisites for display: Expand the Measured column group.
Description: Summary of shared local memory (SLM) usage in a kernel.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane.
Aggregation: The column reports the following metrics:
You can hover over each value in the cell to see the value description.
Description: Summary of estimated SLM usage, including SLM bandwidth (in gigabytes per second) and SLM traffic calculated as the sum of read and write traffic.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Description: Shared Local Memory (SLM) bandwidth. Estimated time, in seconds, spent on reading from SLM and writing to SLM assuming a maximum SLM bandwidth is achieved.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Bounded By column group.
Prerequisites for collection:
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Shared Local Memory (SLM) bandwidth. Average estimated rate at which data is transferred to and from the SLM. This is a dynamic value, and depending on the bandwidth value, it can be measured in bytes per second, kilobytes per second, megabytes per second, and so on.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
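To illustrate what a dynamically scaled bandwidth value can look like, here is a minimal sketch that picks a unit based on magnitude. The decimal step of 1000, the unit labels, and the two-digit formatting are assumptions for the example and may differ from how the tool renders the value.

```python
def format_bandwidth(bytes_per_second: float) -> str:
    """Scale a bandwidth value to the largest unit that keeps it below 1000."""
    units = ["B/s", "KB/s", "MB/s", "GB/s", "TB/s"]
    value = float(bytes_per_second)
    for unit in units:
        # Stop at the first unit that fits, or at the largest unit available.
        if value < 1000.0 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1000.0  # assumed decimal step between units


print(format_bandwidth(512))        # small value stays in bytes per second
print(format_bandwidth(2_500_000))  # larger value is scaled to megabytes per second
```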
Description: Estimated shared local memory (SLM) bandwidth utilization, in percent.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Calculation: Ratio of average bandwidth to a maximum theoretical bandwidth.
Description: Total estimated data read from the SLM.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Estimated sum of data read from and written to the shared local memory (SLM).
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Total estimated data written to shared local memory (SLM).
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Source file name and line number.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Location column group.
Interpretation: Use this column to understand where a code region is located.
Description: Number of spill instructions used to load data values from general register file (GRF) into memory (L3 cache).
Collected during the Trip Counts analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Register Spilling column group.
Prerequisites for display: Expand the Register Spilling column group.
Interpretation: A high number of memory spill/fill (or load/store) operations significantly increases memory traffic and decreases the performance.
Description: Percentage of cycles on all execution units (EUs) when at least one thread is scheduled, but the EU is stalled.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > EU Array column group.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Compute Task Details column group.
Prerequisites for display: Expand the Compute Task Details column group.
Description: Estimated speedup for a loop offloaded to a target device, in comparison to the original elapsed time.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Basic Estimated Metrics column group.
Interpretation: If the speedup is more than 1, the code region is recommended for offloading to a target device. If the speedup is equal to or less than 1, the code region is not recommended for offloading.
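The decision rule in the Interpretation can be sketched as follows. Computing the speedup as the original elapsed time divided by the estimated time on the target is an assumption based on the description; the function names and timings are hypothetical.

```python
def estimated_speedup(original_time_s: float, estimated_time_s: float) -> float:
    """Speedup relative to the original elapsed time (assumed ratio)."""
    if estimated_time_s <= 0:
        raise ValueError("estimated time must be positive")
    return original_time_s / estimated_time_s


def is_recommended_for_offload(speedup: float) -> bool:
    """A region is recommended only when the speedup is strictly greater than 1."""
    return speedup > 1.0


# Hypothetical region: 2.0 s on the host, 0.5 s estimated on the target.
speedup = estimated_speedup(2.0, 0.5)
print(speedup, is_recommended_for_offload(speedup))
```

A speedup of exactly 1 falls on the "not recommended" side, matching the rule above.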
Description: The highest estimated time cost and a sum of all other costs for offloading a loop from host to a target platform. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform. A triangle icon in a table cell indicates that this region reused data, which decreases the estimated data transfer tax.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Bounded By column group.
Description: Average percentage of thread slots occupied on all execution units estimated on a target device.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Compute Estimates column group.
Prerequisites for display: Expand the Compute Estimates column group.
Description: Average percentage of thread slots occupied on all execution units measured on a baseline device. This metric is only available for the GPU-to-GPU modeling.
Collected during the Survey analysis with GPU profiling in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Prerequisites for display: Expand the Measured column group.
Description: Estimated number of threads scheduled simultaneously per execution unit (EU).
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Compute Estimates column group.
Prerequisites for display: Expand the Compute Estimates column group.
Description: Top two factors that a loop/function is bounded by, in milliseconds.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Bounded By column group.
Description: Elapsed wall-clock time from beginning to end of loop execution, estimated on a target platform after offloading.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Basic Estimated Metrics column group.
Description: Elapsed wall-clock time from beginning to end of loop execution measured on a host platform.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Description: Estimated time, in seconds, spent on reading from DRAM memory and writing to DRAM memory assuming a maximum DRAM memory bandwidth is achieved.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Estimated time, in seconds, spent on reading from graphics technology interface (GTI) and writing to GTI assuming a maximum GTI bandwidth is achieved.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Estimated time, in seconds, spent on reading from L3 cache and writing to L3 cache assuming a maximum L3 cache bandwidth is achieved.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Estimated time, in seconds, spent on reading from last-level cache (LLC) and writing to LLC assuming a maximum LLC bandwidth is achieved.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Estimated time, in seconds, spent on reading from shared local memory (SLM) and writing to SLM assuming a maximum SLM bandwidth is achieved.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Memory Estimations column group.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.
Description: Estimated data transferred to a target platform from a shared memory by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Data Transfer with Reuse column group.
Prerequisites for collection:
Prerequisite for display: Expand the Estimated Data Transfer with Reuse column group.
Description: Sum of estimated data transferred in both directions between a shared memory and a target platform by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Data Transfer with Reuse column group.
Prerequisites for collection:
Prerequisite for display: Expand the Estimated Data Transfer with Reuse column group.
Description: Sum of the total estimated traffic incoming to a target platform and the total estimated traffic outgoing from the target platform, for an offload loop, in megabytes.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Data Transfer with Reuse column group.
Prerequisites for collection:
Prerequisite for display: Expand the Estimated Data Transfer with Reuse column group.
Calculation: (MappedTo + MappedFrom + 2*MappedToFrom). If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
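The formula in the Calculation line can be sketched directly. The parameter names mirror MappedTo, MappedFrom, and MappedToFrom from the formula; the function name and the sample sizes are hypothetical.

```python
def total_data_transfer_mb(mapped_to_mb: float,
                           mapped_from_mb: float,
                           mapped_to_from_mb: float) -> float:
    """MappedTo + MappedFrom + 2 * MappedToFrom, in megabytes.

    Data mapped in both directions crosses the boundary twice,
    so it is counted twice in the total.
    """
    return mapped_to_mb + mapped_from_mb + 2 * mapped_to_from_mb


# Hypothetical loop: 10 MB mapped to, 4 MB mapped from, 3 MB mapped both ways.
print(total_data_transfer_mb(10.0, 4.0, 3.0))
```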
Description: Total data transferred to and from GPU, chip uncore (LLC), and main memory, in gigabytes.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > GPU Memory column group.
Prerequisite for display: Expand the GPU Memory column. This metric is also shown in the collapsed GPU Memory column.
Description: Total data transferred between execution units and L3 cache, in gigabytes.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > L3 Shader column group.
Prerequisites for display: Expand the L3 Shader column. This metric is also shown in the collapsed L3 Shader column.
Description: Total data transferred to and from the shared local memory (SLM).
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > SLM column group.
Prerequisites for display: Expand the SLM column. This metric is also shown in the collapsed SLM column.
Description: Average data transfer bandwidth between CPU and GPU.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Data Transferred column group.
Prerequisites for display: Expand the Data Transferred column group.
Interpretation: In some cases, such as clEnqueueMapBuffer, data transfers might generate high bandwidth because memory is not copied but shared using L3 cache.
Description: Total data processed on a GPU.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Data Transferred column group.
Description: Total time spent executing a task.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Compute Task Details column group.
Prerequisites for display: Expand the Compute Task Details column group.
Description: Total time spent in Intel® Data Analytics Acceleration Library (Intel® DAAL) calls in an offloaded code region, in seconds.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Time in Non-User Code column group.
Prerequisites for display: Expand the Time in Non-User Code column group.
Interpretation: If the value in the column is more than 0, the code region contains Intel DAAL calls.
Description: Total time spent in Data Parallel C++ (DPC++) calls in an offloaded code region, in seconds.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Time in Non-User Code column group.
Prerequisites for display: Expand the Time in Non-User Code column group.
Interpretation: If the value in the column is more than 0, the code region contains DPC++ calls.
Description: Total time spent in MPI calls in an offloaded code region, in seconds.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Time in Non-User Code column group.
Interpretation: If the value in the column is more than 0, the code region contains MPI calls.
Description: Total time spent in OpenCL™ calls in an offloaded code region, in seconds.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Time in Non-User Code column group.
Prerequisites for display: Expand the Time in Non-User Code column group.
Interpretation: If the value in the column is more than 0, the code region contains OpenCL calls.
Description: Total time spent in OpenMP* calls in an offloaded code region, in seconds.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Time in Non-User Code column group.
Prerequisites for display: Expand the Time in Non-User Code column group.
Interpretation: If the value in the column is more than 0, the code region contains OpenMP calls.
Description: Total time spent in system calls in an offloaded code region, in seconds.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Time in Non-User Code column group.
Prerequisites for display: Expand the Time in Non-User Code column group.
Interpretation: If the value in the column is more than 0, the code region contains system calls.
Description: Total time spent in Intel® oneAPI Threading Building Blocks (oneTBB) calls in an offloaded code region, in seconds.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Time in Non-User Code column group.
Prerequisites for display: Expand the Time in Non-User Code column group.
Interpretation: If the value in the column is more than 0, the code region contains oneTBB calls.
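The Time in Non-User Code columns above share one interpretation rule: a value greater than 0 means the code region contains calls of that kind. A minimal sketch of that rule, assuming the per-region values have been copied by hand into a dictionary. The function name and the dictionary keys are hypothetical and are not part of any Intel Advisor API or export format.

```python
def libraries_called(non_user_times_s):
    """Return the kinds of non-user calls present in a code region.

    non_user_times_s maps a hypothetical column label (for example "MPI")
    to the time spent in those calls, in seconds. A value above 0 is read
    as "the region contains calls of this kind".
    """
    return sorted(name for name, seconds in non_user_times_s.items() if seconds > 0)


# Illustrative values only, not measured data.
region = {"MPI": 0.0, "OpenMP": 0.12, "System": 0.03, "oneTBB": 0.0}
print(libraries_called(region))
```

Sorting the result only keeps the output stable; it does not reflect the column order in the report.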
Description: Total data spilled to and filled from L3 memory due to register spilling, in gigabytes.
Collected during the Trip Counts analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Register Spilling column group.
Prerequisites for display: Expand the Register Spilling column group.
Interpretation: A high value indicates that spill/fill traffic might make up a large part of the total data traffic in the kernel and decrease its performance. See the Memory Impact column to understand what share of the total traffic it represents.
Calculation: A sum of the data spilled from the general register file (GRF) to L3 and the data filled from L3 to the GRF.
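The calculation above can be sketched as a plain sum. The share helper is an added assumption that shows one way to relate spill/fill traffic to total traffic; it is not the documented definition of the Memory Impact column, and both function names are hypothetical.

```python
def total_spill_fill_gb(spilled_gb, filled_gb):
    """Sum of data spilled from the GRF to L3 and filled from L3 to the GRF, in gigabytes."""
    return spilled_gb + filled_gb


def spill_fill_share(spill_fill_gb, total_traffic_gb):
    """Fraction of total traffic attributed to spill/fill (assumed definition)."""
    if total_traffic_gb <= 0:
        return 0.0  # avoid division by zero when no traffic was recorded
    return spill_fill_gb / total_traffic_gb


# Illustrative values only, not measured data.
total = total_spill_fill_gb(1.5, 2.5)
print(total, spill_fill_share(total, 16.0))
```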
Description: Total number of times a loop/function is executed.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Prerequisites for display: Expand the Measured column group.
Description: Sum of the total estimated traffic incoming to a target platform and the total estimated traffic outgoing from the target platform considering no data is reused, in megabytes. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Data Transfer with Reuse column group.
Prerequisite for collection:
Prerequisite for display: Expand the Estimated Data Transfer with Reuse column group.
Calculation: (MappedTo + MappedFrom + 2*MappedToFrom).
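A worked instance of the formula above, as a minimal sketch. The parameter names mirror MappedTo, MappedFrom, and MappedToFrom from the calculation; the function name is hypothetical. Counting MappedToFrom twice presumably reflects that such data crosses to the target platform and back, but that reading is an assumption rather than something stated here.

```python
def total_traffic_without_reuse_mb(mapped_to, mapped_from, mapped_to_from):
    """Total estimated traffic in megabytes: MappedTo + MappedFrom + 2*MappedToFrom."""
    return mapped_to + mapped_from + 2 * mapped_to_from


# Illustrative values only: 100 MB in, 50 MB out, 25 MB in both directions.
print(total_traffic_without_reuse_mb(100.0, 50.0, 25.0))  # 200.0
```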
Description: Loop unroll factor applied by the compiler.
Collected during the Survey in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Prerequisites for display: Expand the Measured column group.
Description: The highest vector instruction set architecture (ISA) used for individual instructions.
Collected during the Survey in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Prerequisites for display: Expand the Measured column group.
Description: Number of elements processed in a single iteration of a vector loop, or the number of elements processed by individual vector instructions, as determined by binary static analysis or an Intel® Compiler.
Collected during the Survey in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Measured column group.
Prerequisites for display: Expand the Measured column group.
Description: Reason why a code region is not recommended for offloading to a target GPU device.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Basic Estimated Metrics column group.
Interpretation: See Investigate Non-Offloaded Code Regions for details about available reasons.
Description: Estimated data written to a target platform by a loop. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Data Transfer with Reuse column group.
Prerequisite for collection:
Prerequisite for display: Expand the Estimated Data Transfer with Reuse column group.
Description: Total data written to GPU, chip uncore (LLC), and main memory, in gigabytes.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > GPU Memory column group.
Prerequisites for display: Expand the GPU Memory column group.
Description: Total data written, or spilled, to L3 memory due to register spilling, in gigabytes.
Collected during the Trip Counts analysis with GPU profiling in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > Register Spilling column group.
Prerequisites for display: Expand the Register Spilling column group.
Description: Total data written to the shared local memory (SLM), in gigabytes.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > SLM column group.
Prerequisites for display: Expand the SLM column group.
Description: Rate at which data is written to GPU, chip uncore (LLC), and main memory, in gigabytes per second.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > GPU Memory column group.
Prerequisites for display: Expand the GPU Memory column group.
Description: Rate at which data is written to shared local memory (SLM), in gigabytes per second.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU Roofline Regions tab > GPU pane > SLM column group.
Prerequisites for display: Expand the SLM column group.
Description: Estimated data written to a target platform by a code region considering no data is reused, in megabytes. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Accelerated Regions tab > Code Regions pane > Estimated Data Transfer with Reuse column group.
Prerequisite for collection:
Prerequisite for display: Expand the Estimated Data Transfer with Reuse column group.