Intel® Advisor Help
This reference section describes the contents of data columns in reports of the Offload Modeling and GPU Roofline Insights perspectives.
Description: Average percentage of time when both FPUs are used.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Instructions column group in the GPU pane of the GPU Roofline Regions tab.
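The GPU metrics described throughout this section come from the Survey and Characterization analyses of the GPU Roofline Insights perspective. As a sketch of how such data is collected from the command line (project directory and application path are placeholders; verify the exact options against the advisor Command Line Interface Reference for your version):

```shell
# Sketch: collect GPU Roofline data in one shortcut command.
# --profile-gpu enables GPU-side profiling; ./advi_results and
# ./myApplication are placeholders for your project directory and binary.
advisor --collect=roofline --profile-gpu --project-dir=./advi_results -- ./myApplication
```

The Survey and Characterization steps can also be run as separate `--collect=survey` and `--collect=tripcounts` actions with `--profile-gpu`.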
Description: Percentage of cycles actively executing instructions on all execution units (EUs).
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Array column group in the GPU pane of the GPU Roofline Regions tab.
Description: Total execution time by atomic throughput, in milliseconds.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Average amount of time spent executing one task instance.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the Compute Task Details column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display: Expand the Compute Task Details column.
Description: Rate at which data is transferred to and from GPU, chip uncore (LLC), and main memory, in gigabytes per second.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the GPU Memory column group in the GPU pane of the GPU Roofline Regions tab.
Description: Rate at which data is transferred between execution units and L3 caches, in gigabytes per second.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the L3 Shader column group in the GPU pane of the GPU Roofline Regions report.
Description: Rate at which data is transferred to and from shared local memory, in gigabytes per second.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the Shared Local Memory column group in the GPU pane of the GPU Roofline Regions report.
Description: Estimated execution time assuming an offloaded loop is bound only by compute throughput.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Name of a compute task.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.
Description: Average amount of time spent executing one task instance. When collapsed, corresponds to the Average column. Expand to see more metrics.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.
Description: Action that a compute task performs.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.
Description: Total number of threads started across all execution units for a computing task.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.
Description: Estimated time cost, in milliseconds, for transferring loop data between host and target platform. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action options.
Prerequisite for display: Expand the Estimated Bounded By column group.
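The collection prerequisite above can be sketched as a CLI workflow (paths are placeholders; only the options named above are used):

```shell
# Sketch: model data transfer costs for Offload Modeling.
# Survey the application, then collect trip counts with data transfer
# simulation, then model performance. ./advi_results and ./myApplication
# are placeholders.
advisor --collect=survey --project-dir=./advi_results -- ./myApplication
advisor --collect=tripcounts --data-transfer=medium --project-dir=./advi_results -- ./myApplication
advisor --collect=projection --project-dir=./advi_results
```

Use `--data-transfer=light` or `--data-transfer=full` instead of `medium` to trade accuracy for collection overhead.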
Description: Estimated time cost, in milliseconds, for transferring loop data between host and target platform considering no data is reused. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for collection:
GUI: From the Analysis Workflow pane, set the Data Transfer Simulation under Characterization to Full and enable the Data Reuse Analysis checkbox under Performance Modeling.
CLI: Use the --data-transfer=full action option with the --collect=tripcounts action and the --data-reuse-analysis option with the --collect=tripcounts and --collect=projection actions.
Prerequisite for display: Expand the Estimated Bounded By column group.
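Combining the options above, a data reuse collection might look like this (a sketch with placeholder paths, using only the flags named in the prerequisite):

```shell
# Sketch: enable data reuse analysis. --data-transfer=full is required;
# --data-reuse-analysis is passed to both the tripcounts and projection actions.
advisor --collect=tripcounts --data-transfer=full --data-reuse-analysis --project-dir=./advi_results -- ./myApplication
advisor --collect=projection --data-reuse-analysis --project-dir=./advi_results
```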
Description: Difference, in milliseconds, between data transfer time estimated with data reuse and without data reuse. This option is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for collection:
GUI: From the Analysis Workflow pane, set the Data Transfer Simulation under Characterization to Full and enable the Data Reuse Analysis checkbox under Performance Modeling.
CLI: Use the --data-transfer=full action option with the --collect=tripcounts action and the --data-reuse-analysis option with the --collect=tripcounts and --collect=projection actions.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Presence or absence of cross-iteration dependencies in a loop.
Collected during the Survey and Dependencies analyses in the Offload Modeling perspective and found in the Measured column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Measured column group.
Possible values:
Prerequisites for collection/display:
Some values in this column can appear only if you select specific options when collecting data or run the Dependencies analysis:
For Parallel: Workload and Dependency: <dependency-type>:
GUI: Enable Dependencies analysis in the Analysis Workflow pane.
CLI: Run advisor --collect=dependencies --project-dir=<project-dir> [<options>] -- <target>. See advisor Command Line Interface Reference for details.
For Parallel: User:
GUI: Go to Project Properties > Performance Modeling. In the Other parameters field, enter --set-parallel=<string>, where <string> is a comma-separated list of loop IDs and/or source locations, to mark them as parallel.
CLI: Specify a comma-separated list of loop IDs and/or source locations with the --set-parallel=<string> option when modeling performance with advisor --collect=projection.
For Dependency: User:
GUI: Go to Project Properties > Performance Modeling. In the Other parameters field, enter --set-dependency=<string>, where <string> is a comma-separated list of loop IDs and/or source locations, to mark them as having dependencies.
CLI: Specify a comma-separated list of loop IDs and/or source locations with the --set-dependency=<string> option when modeling performance with advisor --collect=projection.
For Parallel: Assumed:
GUI: Disable Assume Dependencies under Performance Modeling analysis in the Analysis Workflow pane.
CLI: Use the --no-assume-dependencies option when modeling performance with advisor --collect=projection.
For Dependency: Assumed:
GUI: Enable Assume Dependencies under Performance Modeling analysis in the Analysis Workflow pane.
CLI: Use the --assume-dependencies option when modeling performance with advisor --collect=projection.
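The CLI paths above can be summarized in one sketch (loop IDs, source locations, and paths below are placeholders, not values from a real project):

```shell
# Sketch: how the dependency-related values in this column are produced.
# Run the Dependencies analysis (enables Parallel: Workload and
# Dependency: <dependency-type> values):
advisor --collect=dependencies --project-dir=./advi_results -- ./myApplication
# Mark loops as parallel or as having dependencies by ID or source location
# (5, 8, main.cpp:72, util.cpp:114 are placeholders):
advisor --collect=projection --set-parallel=5,main.cpp:72 --project-dir=./advi_results
advisor --collect=projection --set-dependency=8,util.cpp:114 --project-dir=./advi_results
# Control the assumption for loops with unknown dependency status:
advisor --collect=projection --assume-dependencies --project-dir=./advi_results
```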
Interpretation:
Description: Host platform that the application is executed on.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Measured column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Measured column group.
Description: Summary of DRAM memory usage, including DRAM bandwidth (in gigabytes per second) and total DRAM traffic, which is a sum of read and write traffic.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
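This prerequisite, which recurs for all Memory Estimations metrics below, can be sketched as follows (<device> stays a placeholder for a supported target device name; the project directory and binary paths are placeholders too):

```shell
# Sketch: enable cache simulation so that DRAM, L3, and LLC estimates
# can be modeled for the selected target device.
advisor --collect=tripcounts --enable-cache-simulation --target-device=<device> --project-dir=./advi_results -- ./myApplication
```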
Description: DRAM Bandwidth. Estimated execution time, in seconds, assuming an offloaded loop is bound only by DRAM memory throughput.
Collected during the Trip Counts analysis (Characterization) and the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: DRAM Bandwidth. Rate at which data is transferred to and from the DRAM, in gigabytes per second.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: DRAM bandwidth utilization, in percent.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: Total data read from the DRAM memory.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: A sum of data read from and written to the DRAM memory.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: Total data written to the DRAM memory.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: Wall-clock time from beginning to end of computing task execution.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.
Description: Percentage of cycles on all execution units (EUs) and thread slots when a slot has a thread scheduled.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.
Description: Summary of data read from a target platform and written to the target platform. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on the target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action options.
Description: Ratio of FLOP to the number of transferred bytes.
Collected during the FLOP analysis (Characterization) enabled in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.
Description: Percentage of time spent in code regions profitable for offloading, relative to the total execution time of the region.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Basic Estimated Metrics column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for display: Expand the Basic Estimated Metrics column group.
Interpretation: 100% means there are no non-offloaded child regions, calls to parallel runtime libraries, or system calls in the region.
Description: Estimated data transferred from a target platform to a shared memory by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Data Transfer with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action options.
Prerequisite for display: Expand the Estimated Data Transfer with Reuse column group.
Description: Number of giga floating-point operations.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.
Instruction types counted during Characterization collection:
Description: Number of giga floating-point operations per second.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.
Instruction types counted during Characterization collection:
Description: Number of giga integer operations.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.
Instruction types counted during Characterization collection:
Description: Number of giga integer operations per second.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.
Instruction types counted during Characterization collection:
Description: Total number of work items in all work groups.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the Work Size column group in the GPU pane of the GPU Roofline Regions tab.
Description: Total estimated number of work items in a loop executed after being offloaded to a target platform.
Collected during the Trip Counts (Characterization) and Performance Modeling analyses in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Compute Estimates column group.
Description: Total number of shader atomic memory accesses.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.
Description: Total number of shader barrier messages.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.
Description: Percentage of cycles on all execution units (EUs) during which no threads are scheduled on an EU.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Array column group in the GPU pane of the GPU Roofline Regions report.
Description: Total estimated number of times a loop executes on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Compute Estimates column group.
Description: Total number of times a task is executed.
Collected during the Trip Counts analysis (Characterization) in the GPU Roofline Insights perspective and found in the Compute Task Details column group in the GPU pane of the GPU Roofline Regions report.
Prerequisites for display: Expand the Compute Task Details column.
Description: Ratio of INTOP to the number of transferred bytes.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions report.
Instruction types counted during Characterization collection:
Description: Average rate of instructions per cycle (IPC) calculated for two FPU pipelines.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Instructions column group in the GPU pane of the GPU Roofline Regions report.
Description: Total estimated time cost for invoking a kernel when offloading a loop to a target platform. Does not include data transfer costs.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Top uncovered latency in a loop/function, in milliseconds.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Description: L3 Bandwidth. Estimated execution time, in seconds, assuming an offloaded loop is bound only by L3 cache throughput.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Summary of L3 cache usage, including L3 cache bandwidth (in gigabytes per second) and L3 cache traffic, which is a sum of read and write traffic.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Description: Average rate at which data is transferred to and from the L3 cache, in gigabytes per second.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: L3 cache bandwidth utilization, in percent.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: L3 cache line utilization for data transfer, in percent.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the CARM (EU <-> Data Port) column group in the GPU pane of the GPU Roofline Regions tab.
Description: Total data read from the L3 cache.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: A sum of data read from and written to the L3 cache.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: Total data written to the L3 cache.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: Last-level cache (LLC) bandwidth. Estimated execution time, in seconds, assuming an offloaded loop is bound only by LLC throughput.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Last-level cache (LLC) usage, including LLC bandwidth (in gigabytes per second) and total LLC traffic, which is a sum of read and write traffic.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Description: Rate at which data is transferred to and from the LLC cache, in gigabytes per second.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: LLC bandwidth utilization, in percent.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: Total data read from the LLC cache.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: A sum of data read from and written to the LLC cache.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: Total data written to the LLC cache.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: Cache or memory load latencies uncovered in a code region, in milliseconds.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Number of work items in one work group.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the Work Size column group in the GPU pane of the GPU Roofline Regions report.
Description: Total estimated number of work items in one work group of a loop executed after being offloaded to a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Compute Estimates column group.
Description: Name and source location of a loop/function in a region, where region is a sub-tree of loops/functions in a call tree.
Collected during the Survey analysis in the Offload Modeling perspective.
Description: Time spent in non-offloaded parts of the code regions recommended for offloading.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Non-User Code Metrics column group in the CPU+GPU pane of the Accelerated Regions tab.
Description: Total time spent for transferring data and launching kernel, in milliseconds.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Recommendation that indicates if a loop is profitable for offloading to a target platform.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Basic Estimated Metrics column group in the CPU+GPU pane of the Accelerated Regions tab.
Description: Number of loop iterations or kernel work items executed in parallel on a target device for a loop/function.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.
Description: Estimated number of threads scheduled simultaneously on all execution units (EU).
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for display: Expand the Compute Estimates column group.
Description: Recommendations for offloading code regions with an estimated performance summary and/or potential issues with optimization hints. Each recommendation also includes examples of using the DPC++ and OpenMP* programming models to offload the code regions and/or fix the performance issue.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the CPU+GPU pane of the Accelerated Regions tab.
Description: Total estimated data transferred to a private memory from a target platform by a loop. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Data Transfers with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for collection:
GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action options.
Prerequisite for display: Expand the Estimated Data Transfers with Reuse column group.
Description: Programming model used in a loop/function, if any.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Measured column group in the CPU+GPU pane.
Prerequisite for display: Expand the Measured column group.
Description: Estimated data read from a target platform by an offload region, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Data Transfers with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for collection:
GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action options.
Prerequisite for display: Expand the Estimated Data Transfers with Reuse column group.
Description: Total amount of data read from GPU, chip uncore (LLC), and main memory, in gigabytes.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the GPU Memory column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display: Expand the GPU Memory column group.
Description: Total amount of data read from the shared local memory, in gigabytes.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the Shared Local Memory column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display: Expand the Shared Local Memory column group.
Description: Rate at which data is read from GPU, chip uncore (LLC), and main memory, in gigabytes per second.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the GPU Memory column group in the GPU pane of the GPU Roofline Regions report.
Prerequisites for display: Expand the GPU Memory column group.
Description: Rate at which data is read from shared local memory, in gigabytes per second.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the Shared Local Memory column group in the GPU pane of the GPU Roofline Regions report.
Prerequisites for display: Expand the Shared Local Memory column group.
Description: Estimated data read from a target platform by a code region considering no data is reused between kernels, in megabytes. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Data Transfers with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for collection:
GUI: From the Analysis Workflow pane, set the Data Transfer Simulation under Characterization to Full and enable the Data Reuse Analysis checkbox under Performance Modeling.
CLI: Use the --data-transfer=full action option with the --collect=tripcounts action and the --data-reuse-analysis option with the --collect=tripcounts and --collect=projection actions.
Prerequisite for display: Expand the Estimated Data Transfers with Reuse column group.
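The data reuse prerequisite above involves two collection steps. The sketch below composes both command lines under the same assumptions as before (./advi_results and ./myApp are hypothetical names); the commands are printed rather than executed, since they require an Intel Advisor installation.

```shell
# Step 1: Trip Counts collection with full data transfer simulation and
# data reuse analysis enabled, as the CLI prerequisite describes.
step1="advisor --collect=tripcounts --data-transfer=full --data-reuse-analysis --project-dir=./advi_results -- ./myApp"
# Step 2: Performance Modeling (projection) with data reuse analysis.
step2="advisor --collect=projection --data-reuse-analysis --project-dir=./advi_results"
printf '%s\n%s\n' "$step1" "$step2"
```

Note that --data-reuse-analysis must be passed to both the tripcounts and the projection actions.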
Description: Percentage of cycles on all execution units when EU Send pipeline is actively processed.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Instructions column group in the GPU pane of the GPU Roofline Regions report.
Description: The number of work items processed by a single GPU thread.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the Compute Task Details column group in the GPU pane of the GPU Roofline Regions report.
Description: Estimated number of work items processed by a single thread on a target platform.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for display: Expand the Compute Estimates column group.
Description: Percentage of cycles on all execution units (EUs) when at least one thread is scheduled, but the EU is stalled.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Array column group in the GPU pane of the GPU Roofline Regions report.
Description: Estimated speedup after a loop is offloaded to a target device, in comparison to the original elapsed time.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Basic Estimated Metrics column group in the CPU+GPU pane of the Accelerated Regions tab.
Description: The highest estimated time cost and a sum of all other costs for offloading a loop from host to a target platform. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform. A triangle icon in a table cell indicates that this region reused data.
This decreases the estimated data transfer tax.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Description: Average percentage of thread slots occupied on all execution units estimated on a target device.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for display: Expand the Compute Estimates column group.
Description: Estimated number of threads scheduled simultaneously per execution unit (EU).
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for display: Expand the Compute Estimates column group.
Description: Top two factors that a loop/function is bounded by, in milliseconds.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane.
Description: Elapsed wall-clock time from beginning to end of loop execution measured on a host platform.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Measured column group in the CPU+GPU pane of the Accelerated Regions tab.
Description: Elapsed wall-clock time from beginning to end of loop execution, estimated on a target platform after offloading.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Basic Estimated Metrics column group in the CPU+GPU pane of the Accelerated Regions tab.
Description: Loop/function execution time bounded by DRAM bandwidth, in seconds.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
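The cache simulation prerequisite above maps to a command line like the following sketch. The device name gen12_tgl is only an example value for the <device> placeholder (an assumption), and ./advi_results and ./myApp are hypothetical names; the command is composed and printed, not executed.

```shell
# Trip Counts collection with cache simulation enabled for a modeled
# target device, as the CLI prerequisite describes.
step="advisor --collect=tripcounts --enable-cache-simulation --target-device=gen12_tgl --project-dir=./advi_results -- ./myApp"
echo "$step"
```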
Description: Loop/function execution time bounded by L3 cache bandwidth, in seconds.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: Loop/function execution time bounded by last-level cache (LLC) bandwidth, in seconds.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, enable the Cache Simulation checkbox.
CLI: Run the --collect=tripcounts action with the --enable-cache-simulation and --target-device=<device> action options.
Prerequisites for display: Expand the Memory Estimations column group.
Description: Time spent in system calls and calls to ignored modules or parallel runtime libraries in the code regions recommended for offloading.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Non-User Code Metrics column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Non-User Code Metrics column group.
Description: Time spent in MPI calls in the code regions recommended for offloading.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Non-User Code Metrics column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Non-User Code Metrics column group.
Description: Estimated data transferred to a target platform from a shared memory by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Data Transfers with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action options.
Prerequisite for display: Expand the Estimated Data Transfers with Reuse column group.
Description: Sum of estimated data transferred both to/from a shared memory to/from a target platform by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Data Transfers with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action options.
Prerequisite for display: Expand the Estimated Data Transfers with Reuse column group.
Description: Sum of the total estimated traffic incoming to a target platform and the total estimated traffic outgoing from the target platform, for an offload loop, in megabytes. It is calculated as (MappedTo + MappedFrom + 2*MappedToFrom). If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Data Transfers with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action options.
Prerequisite for display: Expand the Estimated Data Transfers with Reuse column group.
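The formula above, (MappedTo + MappedFrom + 2*MappedToFrom), can be checked with a quick calculation. The traffic values below are illustrative numbers in megabytes, not data from any real collection.

```shell
# Hypothetical per-loop traffic values, in megabytes.
mapped_to=12        # data mapped to the target platform
mapped_from=8       # data mapped from the target platform
mapped_to_from=5    # data mapped in both directions, so it counts twice
# Total estimated traffic = MappedTo + MappedFrom + 2*MappedToFrom
total=$(( mapped_to + mapped_from + 2 * mapped_to_from ))
echo "$total"   # prints 30
```

The bidirectional buffers contribute twice because they cross the host-target boundary once in each direction.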
Description: Total data transferred to and from execution units, in gigabytes.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the CARM (EU <-> Data Port) column group in the GPU pane of the GPU Roofline Regions tab.
Description: Total amount of data transferred to and from GPU, chip uncore (LLC), and main memory, in gigabytes.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the GPU Memory column group in the GPU pane of the GPU Roofline Regions tab.
Description: Total amount of data transferred between execution units and L3 caches, in gigabytes.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the L3 Shader column group in the GPU pane of the GPU Roofline Regions report.
Description: Total amount of data transferred to and from the shared local memory, in gigabytes.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the Shared Local Memory column group in the GPU pane of the GPU Roofline Regions tab.
Description: Average data transfer bandwidth between CPU and GPU.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the Data Transferred column group in the GPU pane of the GPU Roofline Regions tab.
Interpretation: In some cases, for example, clEnqueueMapBuffer, data transfers might generate high bandwidth because memory is not copied but shared using L3 cache.
Description: Total data processed on a GPU.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the Data Transferred column group in the GPU pane of the GPU Roofline Regions tab.
Description: Total amount of time spent executing a task.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the Compute Task Details column group in the GPU pane of the GPU Roofline Regions tab.
Description: Sum of the total estimated traffic incoming to a target platform and the total estimated traffic outgoing from the target platform considering no data is reused, in megabytes. It is calculated as (MappedTo + MappedFrom + 2*MappedToFrom). This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for collection:
GUI: From the Analysis Workflow pane, set the Data Transfer Simulation under Characterization to Full and enable the Data Reuse Analysis checkbox under Performance Modeling.
CLI: Use the --data-transfer=full action option with the --collect=tripcounts action and the --data-reuse-analysis option with the --collect=tripcounts and --collect=projection actions.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Estimated data written to a target platform by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for collection:
GUI: From the Analysis Workflow pane, set the Data Transfer Simulation under Characterization to Light, Medium, or Full.
CLI: Use the --data-transfer=[full | medium | light] option with the --collect=tripcounts action.
Prerequisite for display: Expand the Estimated Bounded By column group.
Description: Total amount of data written to GPU, chip uncore (LLC), and main memory, in gigabytes.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the GPU Memory column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display: Expand the GPU Memory column group.
Description: Total amount of data written to the shared local memory, in gigabytes.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the Shared Local Memory column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display: Expand the Shared Local Memory column group.
Description: Rate at which data is written to GPU, chip uncore (LLC), and main memory, in gigabytes per second.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the GPU Memory column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display: Expand the GPU Memory column group.
Description: Rate at which data is written to shared local memory, in gigabytes per second.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the Shared Local Memory column group in the GPU pane of the GPU Roofline Regions tab.
Prerequisites for display: Expand the Shared Local Memory column group.
Description: Estimated data written to a target platform by a code region considering no data is reused, in megabytes. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for collection:
GUI: From the Analysis Workflow pane, set the Data Transfer Simulation under Characterization to Full and enable the Data Reuse Analysis checkbox under Performance Modeling.
CLI: Use the --data-transfer=full action option with the --collect=tripcounts action and the --data-reuse-analysis option with the --collect=tripcounts and --collect=projection actions.
Prerequisite for display: Expand the Estimated Bounded By column group.