Intel® Advisor Help

Accelerator Metrics

This reference section describes the contents of data columns in reports of the Offload Modeling and GPU Roofline Insights perspectives.

# | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | XYZ

#

2 FPUs Active

Description: Average percentage of time when both FPUs are used.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Instructions column group in the GPU pane of the GPU Roofline Regions tab.

A

Active

Description: Percentage of cycles actively executing instructions on all execution units (EUs).

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Array column group in the GPU pane of the GPU Roofline Regions tab.

Atomic Throughput

Description: Estimated execution time, in milliseconds, assuming an offloaded loop is bound only by atomic throughput.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane.

Prerequisite for display: Expand the Estimated Bounded By column group.

Average Time

Description: Average amount of time spent executing one task instance.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the Compute Task Details column group in the GPU pane of the GPU Roofline Regions tab.

Prerequisites for display: Expand the Compute Task Details column.
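Average Time is the task's total execution time divided by the number of times it ran. A minimal sketch of that relationship (plain Python with hypothetical names and numbers, not an Advisor API):

```python
def average_time_ms(total_time_ms: float, instance_count: int) -> float:
    """Average Time = total time spent in a task / number of task instances."""
    return total_time_ms / instance_count

# e.g. a task that executed 40 times for 120 ms in total
print(average_time_ms(120.0, 40))  # 3.0 ms per instance
```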

B

Bandwidth, GB/sec (GPU Memory)

Description: Rate at which data is transferred to and from GPU, chip uncore (LLC), and main memory, in gigabytes per second.

Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the GPU Memory column group in the GPU pane of the GPU Roofline Regions tab.

Bandwidth, GB/sec (L3 Shader)

Description: Rate at which data is transferred between execution units and L3 caches, in gigabytes per second.

Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the L3 Shader column group in the GPU pane of the GPU Roofline Regions report.

Bandwidth, GB/s (Shared Local Memory)

Description: Rate at which data is transferred to and from shared local memory, in gigabytes per second.

Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the Shared Local Memory column group in the GPU pane of the GPU Roofline Regions report.

C

Compute

Description: Estimated execution time assuming an offloaded loop is bound only by compute throughput.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for display: Expand the Estimated Bounded By column group.

Compute Task

Description: Name of a compute task.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.

Compute Task Details

Description: Average amount of time spent executing one task instance. When collapsed, corresponds to the Average column. Expand to see more metrics.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.

Compute Task Purpose

Description: Action that a compute task performs.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.

Computing Threads Started

Description: Total number of threads started across all execution units for a computing task.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.

D

Data Transfer Tax

Description: Estimated time cost, in milliseconds, for transferring loop data between host and target platform. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisite for display: Expand the Estimated Bounded By column group.

Data Transfer Tax without Reuse

Description: Estimated time cost, in milliseconds, for transferring loop data between host and target platform considering no data is reused. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.

Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for collection:

Prerequisite for display: Expand the Estimated Bounded By column group.

Data Reuse Gain

Description: Difference, in milliseconds, between data transfer time estimated with data reuse and without data reuse. This option is available only if you enabled the data reuse analysis for the Performance Modeling.

Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for collection:

Prerequisite for display: Expand the Estimated Bounded By column group.
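The three data-transfer metrics above are related by a simple difference. A minimal sketch under that reading (plain Python with hypothetical names and values, not an Advisor API):

```python
def data_reuse_gain_ms(tax_without_reuse_ms: float, tax_with_reuse_ms: float) -> float:
    """Data Reuse Gain = Data Transfer Tax without Reuse - Data Transfer Tax."""
    return tax_without_reuse_ms - tax_with_reuse_ms

print(data_reuse_gain_ms(12.0, 7.5))  # 4.5 ms saved by reusing data on the target
```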

Dependency Type

Description: Dependency absence or presence in a loop across iterations.

Collected during the Survey and Dependencies analyses in the Offload Modeling perspective and found in the Measured column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for display: Expand the Measured column group.

Possible values:

Prerequisites for collection/display:

Some values in this column can appear only if you select specific options when collecting data or run the Dependencies analysis:

For Parallel: Workload and Dependency: <dependency-type>:

For Parallel: User:

For Dependency: User:

For Parallel: Assumed:

For Dependencies: Assumed:

Interpretation:

Device

Description: The host platform that the application is executed on.

Collected during the Survey analysis in the Offload Modeling perspective and found in the Measured column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for display: Expand the Measured column group.

DRAM

Description: Summary of DRAM memory usage, including DRAM bandwidth (in gigabytes per second) and total DRAM traffic, which is a sum of read and write traffic.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

DRAM BW (Estimated Bounded By)

Description: DRAM Bandwidth. Estimated execution time, in seconds, assuming an offloaded loop is bound only by DRAM memory throughput.

Collected during the Trip Counts analysis (Characterization) and the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for display: Expand the Estimated Bounded By column group.

DRAM BW (Memory Estimates)

Description: DRAM Bandwidth. Rate at which data is transferred to and from the DRAM, in gigabytes per second.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

DRAM BW Utilization

Description: DRAM bandwidth utilization, in percent.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.
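Bandwidth utilization is the ratio of achieved bandwidth to the platform's peak bandwidth. A sketch of the calculation (the peak value is hypothetical; this is not an Advisor API):

```python
def bw_utilization_percent(achieved_gb_s: float, peak_gb_s: float) -> float:
    """Utilization = achieved bandwidth / peak bandwidth, expressed in percent."""
    return 100.0 * achieved_gb_s / peak_gb_s

# e.g. 38.4 GB/s achieved against an assumed 76.8 GB/s DRAM peak
print(bw_utilization_percent(38.4, 76.8))  # 50.0
```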

DRAM Read Traffic

Description: Total data read from the DRAM memory.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

DRAM Traffic

Description: A sum of data read from and written to the DRAM memory.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

DRAM Write Traffic

Description: Total data written to the DRAM memory.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

E

Elapsed Time

Description: Wall-clock time from beginning to end of computing task execution.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.

EU Threading Occupancy

Description: Percentage of cycles on all execution units (EUs) and thread slots when a slot has a thread scheduled.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.

Estimated Data Transfers with Reuse

Description: Summary of data read from a target platform and written to the target platform. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on the target platform.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

F

FP AI

Description: Ratio of FLOP to the number of transferred bytes.

Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.
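FP AI is a plain ratio: floating-point operations divided by bytes transferred. A sketch with hypothetical numbers (the units cancel when both counts use the same giga- prefix):

```python
def arithmetic_intensity(gflop: float, transferred_gbytes: float) -> float:
    """FP AI = FLOP / bytes transferred."""
    return gflop / transferred_gbytes

print(arithmetic_intensity(40.0, 10.0))  # 4.0 FLOP/byte
```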

Fraction of Offloads

Description: Percentage of time spent in code regions profitable for offloading in relation to the total execution time of the region.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Basic Estimated Metrics column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for display: Expand the Basic Estimated Metrics column group.

Interpretation: 100% means there are no non-offloaded child regions, calls to parallel runtime libraries, or system calls in the region.

From Target

Description: Estimated data transferred from a target platform to a shared memory by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Data Transfer with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisite for display: Expand the Estimated Data Transfer with Reuse column group.

G

GFLOP

Description: Number of giga floating-point operations.

Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.

Instruction types counted during Characterization collection:

GFLOPS

Description: Number of giga floating-point operations per second.

Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.

Instruction types counted during Characterization collection:
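GFLOPS relates GFLOP to elapsed time (GINTOPS relates to GINTOP the same way). A sketch with hypothetical values, not an Advisor API:

```python
def gflops(total_gflop: float, elapsed_s: float) -> float:
    """GFLOPS = giga floating-point operations / elapsed seconds."""
    return total_gflop / elapsed_s

# e.g. 48 GFLOP executed over 2 seconds
print(gflops(48.0, 2.0))  # 24.0
```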

GINTOP

Description: Number of giga integer operations.

Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.

Instruction types counted during Characterization collection:

GINTOPS

Description: Number of giga integer operations per second.

Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.

Instruction types counted during Characterization collection:

Global

Description: Total number of work items in all work groups.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the Work Size column group in the GPU pane of the GPU Roofline Regions tab.

Global Size

Description: Total estimated number of work items in a loop after it is offloaded to a target platform.

Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for display: Expand the Compute Estimates column group.

GPU Shader Atomics

Description: Total number of shader atomic memory accesses.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.

GPU Shader Barriers

Description: Total number of shader barrier messages.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.

H

I

Idle

Description: Percentage of cycles on all execution units (EUs) during which no threads are scheduled on an EU.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Array column group in the GPU pane of the GPU Roofline Regions report.

Instances

Description: Total estimated number of times a loop executes on a target platform.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for display: Expand the Compute Estimates column group.

Instance Count

Description: Total number of times a task is executed.

Collected during the Trip Counts analysis (Characterization) in the GPU Roofline Insights perspective and found in the Compute Task Details column group in the GPU pane of the GPU Roofline Regions report.

Prerequisites for display: Expand the Compute Task Details column.

INT AI

Description: Ratio of INTOP to the number of transferred bytes.

Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions report.

Instruction types counted during Characterization collection:

IPC Rate

Description: Average rate of instructions per cycle (IPC) calculated for two FPU pipelines.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Instructions column group in the GPU pane of the GPU Roofline Regions report.

J

K

Kernel Launch Tax

Description: Total estimated time cost for invoking a kernel when offloading a loop to a target platform. Does not include data transfer costs.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for display: Expand the Estimated Bounded By column group.

L

Latencies

Description: Top uncovered latency in a loop/function, in milliseconds.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.

L3 BW

Description: L3 Bandwidth. Estimated execution time, in seconds, assuming an offloaded loop is bound only by L3 cache throughput.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for display: Expand the Estimated Bounded By column group.

L3 Cache

Description: Summary of L3 cache usage, including L3 cache bandwidth (in gigabytes per second) and L3 cache traffic, which is a sum of read and write traffic.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

L3 Cache BW

Description: Average rate at which data is transferred to and from the L3 cache, in gigabytes per second.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

L3 Cache BW Utilization

Description: L3 cache bandwidth utilization, in percent.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

L3 Cache Line Utilization

Description: L3 cache line utilization for data transfer, in percent.

Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the CARM (EU <-> Data Port) column group in the GPU pane of the GPU Roofline Regions tab.

L3 Cache Read Traffic

Description: Total data read from the L3 cache.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

L3 Cache Traffic

Description: A sum of data read from and written to the L3 cache.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

L3 Cache Write Traffic

Description: Total data written to the L3 cache.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

LLC BW

Description: Last-level cache (LLC) bandwidth. Estimated execution time, in seconds, assuming an offloaded loop is bound only by LLC throughput.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for display: Expand the Estimated Bounded By column group.

LLC Cache

Description: Last-level cache (LLC) usage, including LLC cache bandwidth (in gigabytes per second) and total LLC cache traffic, which is a sum of read and write traffic.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

LLC Cache BW

Description: Rate at which data is transferred to and from the LLC cache, in gigabytes per second.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

LLC Cache BW Utilization

Description: LLC cache bandwidth utilization, in percent.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

LLC Cache Read Traffic

Description: Total data read from the LLC cache.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

LLC Cache Traffic

Description: A sum of data read from and written to the LLC cache.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

LLC Cache Write Traffic

Description: Total data written to the LLC cache.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

Load Latency

Description: Cache or memory load latencies left uncovered in a code region, in milliseconds.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane.

Prerequisite for display: Expand the Estimated Bounded By column group.

Local

Description: Number of work items in one work group.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the Work Size column group in the GPU pane of the GPU Roofline Regions report.

Local Size

Description: Total estimated number of work items in one work group of a loop after it is offloaded to a target platform.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for display: Expand the Compute Estimates column group.
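Global Size and Local Size together determine the number of work groups when the former divides evenly by the latter. A sketch under that assumption (hypothetical sizes, not an Advisor API):

```python
def work_group_count(global_size: int, local_size: int) -> int:
    """Work groups = total work items / work items per group (exact division assumed)."""
    return global_size // local_size

print(work_group_count(1_048_576, 256))  # 4096
```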

Loop/Function

Description: Name and source location of a loop/function in a region, where region is a sub-tree of loops/functions in a call tree.

Collected during the Survey analysis in the Offload Modeling perspective.

M

N

Non-Accelerable Time

Description: Time spent in non-offloaded parts of the code regions recommended for offloading.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Non-User Code Metrics column group in the CPU+GPU pane of the Accelerated Regions tab.

O

Offload Tax

Description: Total time spent for transferring data and launching kernel, in milliseconds.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for display: Expand the Estimated Bounded By column group.
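Offload Tax groups the per-offload overheads; at minimum it covers the data transfer and kernel launch costs described elsewhere in this section. A sketch under that reading (hypothetical values, not an Advisor API):

```python
def offload_tax_ms(data_transfer_tax_ms: float, kernel_launch_tax_ms: float) -> float:
    """Offload Tax = time spent moving data + time spent invoking the kernel."""
    return data_transfer_tax_ms + kernel_launch_tax_ms

print(offload_tax_ms(5.5, 0.5))  # 6.0
```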

Offload Summary

Description: Recommendation that indicates if a loop is profitable for offloading to a target platform.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Basic Estimated Metrics column group in the CPU+GPU pane of the Accelerated Regions tab.

P

Parallel Factor

Description: Number of loop iterations or kernel work items executed in parallel on a target device for a loop/function.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.

Parallel Threads

Description: Estimated number of threads scheduled simultaneously on all execution units (EU).

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for display: Expand the Compute Estimates column group.

Performance Issues

Description: Recommendations for offloading code regions with an estimated performance summary and/or potential issues with optimization hints. Each recommendation also includes examples of using DPC++ and OpenMP* programming models to offload the code regions and/or fix the performance issue.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the CPU+GPU pane of the Accelerated Regions tab.

Private

Description: Total estimated data transferred to a private memory from a target platform by a loop. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Data Transfers with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for collection:

Prerequisite for display: Expand the Estimated Data Transfers with Reuse column group.

Programming Model

Description: Programming model used in a loop/function, if any.

Collected during the Survey analysis in the Offload Modeling perspective and found in the Measured column group in the CPU+GPU pane.

Prerequisite for display: Expand the Measured column group.

Q

R

Read

Description: Estimated data read from a target platform by an offload region, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Data Transfers with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for collection:

Prerequisite for display: Expand the Estimated Data Transfers with Reuse column group.

Read, GB (GPU Memory)

Description: Total amount of data read from GPU, chip uncore (LLC), and main memory, in gigabytes.

Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the GPU Memory column group in the GPU pane of the GPU Roofline Regions tab.

Prerequisites for display: Expand the GPU Memory column group.

Read, GB (Shared Local Memory)

Description: Total amount of data read from the shared local memory, in gigabytes.

Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the Shared Local Memory column group in the GPU pane of the GPU Roofline Regions tab.

Prerequisites for display: Expand the Shared Local Memory column group.

Read, GB/s (GPU Memory)

Description: Rate at which data is read from GPU, chip uncore (LLC), and main memory, in gigabytes per second.

Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the GPU Memory column group in the GPU pane of the GPU Roofline Regions report.

Prerequisites for display: Expand the GPU Memory column group.

Read, GB/s (Shared Local Memory)

Description: Rate at which data is read from shared local memory, in gigabytes per second.

Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the Shared Local Memory column group in the GPU pane of the GPU Roofline Regions report.

Prerequisites for display: Expand the Shared Local Memory column group.

Read without Reuse

Description: Estimated data read from a target platform by a code region considering no data is reused between kernels, in megabytes. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.

Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Data Transfers with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for collection:

Prerequisite for display: Expand the Estimated Data Transfers with Reuse column group.

S

Send Active

Description: Percentage of cycles on all execution units when the EU Send pipeline is actively processed.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Instructions column group in the GPU pane of the GPU Roofline Regions report.

SIMD Width (GPU Roofline)

Description: The number of work items processed by a single GPU thread.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the Compute Task Details column group in the GPU pane of the GPU Roofline Regions report.

SIMD Width (Offload Modeling)

Description: Estimated number of work items processed by a single thread on a target platform.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for display: Expand the Compute Estimates column group.

Stalled

Description: Percentage of cycles on all execution units (EUs) when at least one thread is scheduled, but the EU is stalled.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Array column group in the GPU pane of the GPU Roofline Regions report.
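The Active, Stalled, and Idle metrics partition EU cycles, so the three percentages add up to 100. A sketch of that sanity check (hypothetical values, not an Advisor API):

```python
def idle_percent(active_pct: float, stalled_pct: float) -> float:
    """Idle = 100% - Active - Stalled, since the three states partition EU cycles."""
    return 100.0 - active_pct - stalled_pct

print(idle_percent(62.5, 30.0))  # 7.5
```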

SVM Usage Type

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the Compute Task Details column group in the GPU pane of the GPU Roofline Regions report.

Speed-up

Description: Estimated speedup after a loop is offloaded to a target device, in comparison to the original elapsed time.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Basic Estimated Metrics column group in the CPU+GPU pane of the Accelerated Regions tab.
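Speed-up divides the measured host time by the estimated time on the target. A sketch with hypothetical timings (not an Advisor API):

```python
def estimated_speedup(measured_host_ms: float, estimated_target_ms: float) -> float:
    """Speed-up = measured elapsed time on host / estimated elapsed time on target."""
    return measured_host_ms / estimated_target_ms

print(estimated_speedup(100.0, 20.0))  # 5.0
```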

T

Taxes with Reuse

Description: The highest estimated time cost and a sum of all other costs for offloading a loop from host to a target platform. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform. A triangle icon in a table cell indicates that the region reused data, which decreases the estimated data transfer tax.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.

Thread Occupancy

Description: Average percentage of thread slots occupied on all execution units, estimated for a target device.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for display: Expand the Compute Estimates column group.

Threads per EU

Description: Estimated number of threads scheduled simultaneously per execution unit (EU).

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for display: Expand the Compute Estimates column group.

Throughput

Description: Top two factors that a loop/function is bounded by, in milliseconds.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane.

Time (Estimated)

Description: Estimated elapsed wall-clock time from beginning to end of loop execution on a target platform after offloading.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Basic Estimated Metrics column group in the CPU+GPU pane of the Accelerated Regions tab.

Time (Measured)

Description: Elapsed wall-clock time from beginning to end of loop execution measured on a host platform.

Collected during the Survey analysis in the Offload Modeling perspective and found in the Measured column group in the CPU+GPU pane of the Accelerated Regions tab.

Time by DRAM BW

Description: Loop/function execution time bounded by DRAM bandwidth, in seconds.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

Time by L3 Cache BW

Description: Loop/function execution time bounded by L3 cache bandwidth, in seconds.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

Time by LLC Cache BW

Description: Loop/function execution time bounded by LLC cache bandwidth, in seconds.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisites for display: Expand the Memory Estimations column group.

Time in Ignored

Description: Time spent in system calls and calls to ignored modules or parallel runtime libraries in the code regions recommended for offloading.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Non-User Code Metrics column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for display: Expand the Non-User Code Metrics column group.

Time in MPI

Description: Time spent in MPI calls in the code regions recommended for offloading.

Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Non-User Code Metrics column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for display: Expand the Non-User Code Metrics column group.

To Target

Description: Estimated data transferred to a target platform from a shared memory by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Data Transfer with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisite for display: Expand the Estimated Data Transfer with Reuse column group.

ToFrom Target

Description: Sum of the estimated data transferred to a target platform from a shared memory and from the target platform back to the shared memory by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Data Transfer with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisite for display: Expand the Estimated Data Transfer with Reuse column group.

Total

Description: Sum of the total estimated traffic incoming to a target platform and the total estimated traffic outgoing from the target platform, for an offload loop, in megabytes. It is calculated as (MappedTo + MappedFrom + 2*MappedToFrom). If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Data Transfer with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisites for collection:

Prerequisite for display: Expand the Estimated Data Transfer with Reuse column group.
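
The formula above can be checked with a short sketch; the traffic values below are hypothetical, not taken from a real report:

```python
# Hypothetical per-loop traffic values in megabytes (not from an actual report).
mapped_to = 256.0      # To Target: data transferred to the target platform
mapped_from = 128.0    # From Target: data transferred back from the target
mapped_to_from = 64.0  # ToFrom Target: data transferred in both directions

# Total = MappedTo + MappedFrom + 2*MappedToFrom
# (ToFrom traffic crosses the interconnect twice, so it is counted twice).
total_mb = mapped_to + mapped_from + 2 * mapped_to_from
print(total_mb)  # 512.0
```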

Total, GB (CARM)

Description: Total data transferred to and from execution units, in gigabytes.

Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the CARM (EU <-> Data Port) column group in the GPU pane of the GPU Roofline Regions tab.

Total, GB (GPU Memory)

Description: Total amount of data transferred to and from GPU, chip uncore (LLC), and main memory, in gigabytes.

Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the GPU Memory column group in the GPU pane of the GPU Roofline Regions tab.

Total, GB (L3 Shader)

Description: Total amount of data transferred between execution units and L3 caches, in gigabytes.

Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the L3 Shader column group in the GPU pane of the GPU Roofline Regions report.

Total, GB (Shared Local Memory)

Description: Total amount of data transferred to and from the shared local memory, in gigabytes.

Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the Shared Local Memory column group in the GPU pane of the GPU Roofline Regions tab.

Total, GB/s

Description: Average data transfer bandwidth between CPU and GPU.

Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the Data Transferred column group in the GPU pane of the GPU Roofline Regions tab.

Interpretation: In some cases, for example, clEnqueueMapBuffer, data transfers might generate high bandwidth because memory is not copied but shared using L3 cache.

Total Size

Description: Total data processed on a GPU.

Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the Data Transferred column group in the GPU pane of the GPU Roofline Regions tab.

Total Time

Description: Total amount of time spent executing a task.

Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the Compute Task Details column group in the GPU pane of the GPU Roofline Regions tab.

Total without Reuse

Description: Sum of the total estimated traffic incoming to a target platform and the total estimated traffic outgoing from the target platform considering no data is reused, in megabytes. It is calculated as (MappedTo + MappedFrom + 2*MappedToFrom). This metric is available only if you enabled the data reuse analysis for the Performance Modeling.

Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for collection:

Prerequisite for display: Expand the Estimated Bounded By column group.

U

V

W

Write

Description: Estimated data written to a target platform by a loop. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.

Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for collection:

Prerequisite for display: Expand the Estimated Bounded By column group.

Write, GB (GPU Memory)

Description: Total amount of data written to GPU, chip uncore (LLC), and main memory, in gigabytes.

Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the GPU Memory column group in the GPU pane of the GPU Roofline Regions tab.

Prerequisites for display: Expand the GPU Memory column group.

Write, GB (Shared Local Memory)

Description: Total amount of data written to the shared local memory, in gigabytes.

Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the Shared Local Memory column group in the GPU pane of the GPU Roofline Regions tab.

Prerequisites for display: Expand the Shared Local Memory column group.

Write, GB/s (GPU Memory)

Description: Rate at which data is written to GPU, chip uncore (LLC), and main memory, in gigabytes per second.

Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the GPU Memory column group in the GPU pane of the GPU Roofline Regions tab.

Prerequisites for display: Expand the GPU Memory column group.

Write, GB/s (Shared Local Memory)

Description: Rate at which data is written to shared local memory, in gigabytes per second.

Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the Shared Local Memory column group in the GPU pane of the GPU Roofline Regions tab.

Prerequisites for display: Expand the Shared Local Memory column group.

Write without Reuse

Description: Estimated data written to a target platform by a code region considering no data is reused, in megabytes. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.

Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.

Prerequisite for collection:

Prerequisite for display: Expand the Estimated Bounded By column group.

X, Y, Z