Estimate the C++ Application Speedup on a Target GPU

This recipe illustrates how to use Intel® Advisor to check whether offloading your C++ application to a target GPU device is profitable.

  1. Prerequisites.
  2. Compile the C++ Mandelbrot sample.
  3. Run Offload Modeling without Dependencies analysis.
  4. View estimated performance results.
  5. Run Offload Modeling with Dependencies analysis.
  6. Rewrite the code in SYCL.
  7. Compare estimations and real performance on GPU.

Scenario

The Offload Modeling workflow includes the following two steps:

  1. Collect application characterization metrics on CPU: run the Survey analysis, the Trip Counts and FLOP analysis, and optionally, the Dependencies analysis.

  2. Based on the metrics collected, estimate application execution time on a graphics processing unit (GPU) using an analytical model.

Information about loop-carried dependencies is important for application performance modeling because only parallel loops can be offloaded to the GPU. Intel Advisor can get this information from an Intel compiler, from the application call stack tree, or from the Dependencies analysis results. The Dependencies analysis is the most common way, but it adds high overhead to the performance modeling flow.

In this recipe, we first run Offload Modeling assuming that the loops do not contain dependencies and then verify this assumption by running the Dependencies analysis for the profitable loops only.

There are three ways to run the Offload Modeling: from the Intel Advisor graphical user interface (GUI), from the Intel Advisor command line interface (CLI), or using Python* scripts delivered with the product. This recipe uses the CLI to run analyses and the GUI to view and investigate the results.

Ingredients

This section lists the hardware and software used to produce the specific result shown in this recipe:

Prerequisites

Set up environment variables for the tools:

<oneapi-install-dir>\setvars.bat

Compile the C++ Mandelbrot Sample

Consider the following when compiling the C++ version of the Mandelbrot sample:

Run the following command to compile the C++ version of the Mandelbrot sample:

icx.exe /Qm64 /Zi /nologo /W3 /O2 /Ob1 /Oi /D NDEBUG /D _CONSOLE /D _UNICODE /D UNICODE /EHsc /MD /GS /Gy /Zc:forScope /Fe"mandelbrot_base.exe" /TP src\main.cpp src\mandelbrot.cpp src\timer.cpp

For details about icx compiler options, see the Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference.

Run Offload Modeling without Dependencies Analysis

First, get rough performance estimates using a special operating mode of the performance model that ignores potential loop-carried dependencies. In the CLI, use the --no-assume-dependencies command line option to activate this mode.

To model the Mandelbrot application performance on the target GPU with the Gen9 GT2 configuration:

  1. Run Survey analysis to get baseline performance data:

    advisor --collect=survey --stackwalk-mode=online --static-instruction-mix --project-dir=.\advisor_results --search-dir sym=.\x64\Release --search-dir bin=.\x64\Release --search-dir src=. -- .\x64\Release\mandelbrot_base.exe
  2. Run Trip Counts and FLOP analysis to get call count data and model cache for the Gen9 GT2 configuration:

    advisor --collect=tripcounts --flop --stacks --enable-cache-simulation --data-transfer=light --target-device=gen9_gt2 --project-dir=.\advisor_results --search-dir sym=.\x64\Release --search-dir bin=.\x64\Release --search-dir src=. -- .\x64\Release\mandelbrot_base.exe
  3. Model application performance on the GPU with the Gen9 GT2 configuration ignoring assumed dependencies:

    advisor --collect=projection --config=gen9_gt2 --no-assume-dependencies --project-dir=.\advisor_results

    The --no-assume-dependencies option makes the performance model treat loops with an unknown dependency type as parallel, which minimizes the estimated execution time.

The collected results are stored in the advisor_results project that you can open in the GUI.

View Estimated Performance Results

To view the results in the GUI:

  1. Run the following command from the command prompt to open the Intel Advisor GUI:

    advisor-gui
  2. Go to File > Open > Project..., navigate to the advisor_results project directory where you stored results, and open the .advixeproj project file.

  3. If the Offload Modeling report does not open, click Show Result on the Welcome pane.

    The Summary results collected for the advisor_results project should open.

Note

If you do not have the Intel Advisor GUI installed or want to quickly review the results before copying them to a machine with the GUI, you can open the HTML report located at .\advisor_results\e000\pp000\data.0\report.html. See Identify Code Regions to Offload to GPU and Visualize GPU Usage for more information about the HTML report.

Explore Offload Modeling Summary

The Summary tab of the Offload Modeling report shows modeling results in several views:

Offload Modeling summary report for the C++ Mandelbrot sample

For the Mandelbrot application, consider the following data:

Explore Accelerated Regions Report

To open the full Offload Modeling report, do one of the following:

The Accelerated Regions report shows details about all offloaded and non-offloaded code regions. Review the data reported in the following panes:

Run Offload Modeling with Dependencies Analysis

The Dependencies analysis detects loop-carried dependencies, which prevent a loop from being parallelized and offloaded to the GPU. At the same time, this analysis is slow: it adds high runtime overhead, making your target application run 5-100x slower. Run the Dependencies analysis if your code might not be effectively vectorized or parallelized.
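To illustrate the distinction, the following minimal C++ sketch (not taken from the Mandelbrot sample) contrasts a loop whose iterations are independent with a loop that carries a dependency across iterations:

```cpp
#include <cstddef>
#include <vector>

// Each iteration touches only its own index: the iterations are independent,
// so this loop is parallel and a candidate for GPU offload.
std::vector<int> scale_all(const std::vector<int>& in, int factor) {
    std::vector<int> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] * factor;  // no cross-iteration reads or writes
    return out;
}

// Each iteration reads a value produced by the previous one (a loop-carried
// dependency), so the iterations cannot run in parallel as written.
std::vector<int> prefix_sum(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];  // depends on the result of iteration i-1
        out[i] = running;
    }
    return out;
}
```

The Dependencies analysis flags patterns like the second loop, while loops like the first one can safely keep their Parallel dependency type.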

  1. In the CPU+GPU table, expand the loop at mandelbrot.cpp:56 to see its child loops.

  2. Expand the Measured column group.

    The Dependency Type column reports Parallel: Assumed for the mandelbrot.cpp:56 loop and its child loops. This means that Intel Advisor marked these loops as parallel because you used the --no-assume-dependencies option for the performance modeling, but it does not have information about their actual dependency type.

    C++ Mandelbrot sample dependency types

    Note

    If you are sure that the loops in your application are parallel, you can skip the Dependencies analysis. Such loops should have a Parallel: <reason> value in the Dependency Type column, where <reason> is Explicit, Proven, Programming Model, or Workload.
  3. To check if the loops have real dependencies, run the Dependencies analysis.

    1. To minimize the Dependencies analysis overhead, limit the analysis to the loops with the Parallel: Assumed value, for example, by using their loop IDs. Run the following command to get the IDs of those loops:

      advisor --report=survey --project-dir=.\advisor_results -- .\x64\Release\mandelbrot_base.exe

      This command prints the Survey analysis results with loop IDs to the command prompt. The mandelbrot.cpp:57 and mandelbrot.cpp:56 loops have IDs 2 and 3.

      Intel Advisor CLI report with loop IDs for the C++ Mandelbrot sample

    2. Run the Dependencies analysis with the --mark-up-list=2,3 option to analyze only the loops of interest:

      advisor --collect=dependencies --mark-up-list=2,3 --loop-call-count-limit=16 --filter-reductions --project-dir=.\advisor_results -- .\x64\Release\mandelbrot_base.exe
  4. Rerun the performance modeling to get the refined performance estimation:

    advisor --collect=projection --config=gen9_gt2 --project-dir=.\advisor_results
  5. Open the advisor_results project with the refined results in the GUI:

    advisor-gui .\advisor_results

Note

The results of the Dependencies analysis and Offload Modeling are based on the Survey and Trip Counts and FLOP data collected before.

In the Accelerated Regions report, the loop at mandelbrot.cpp:56 and its child loops have the Parallel: Workload value in the Dependency Type column. This means that Intel Advisor did not find loop-carried dependencies and these loops can be offloaded and executed on the GPU.

Rewrite the Code in SYCL

Now you can rewrite the code region at mandelbrot.cpp:56, which Intel Advisor recommends offloading to the target GPU, using the SYCL programming model.
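For context, the hot loop flagged at mandelbrot.cpp:56 computes the classic Mandelbrot escape-time iteration for each pixel. The following standalone C++ sketch shows the per-point computation; the function name and signature are illustrative, while the shipped sample wraps this logic in its MandelParameters class (the Point method called in the snippet below):

```cpp
#include <complex>

// Escape-time iteration for one point c of the complex plane: count how many
// iterations of z = z*z + c stay within |z| <= 2, up to max_iterations.
// Illustrative sketch; not the sample's actual API.
int escape_count(std::complex<float> c, int max_iterations = 100) {
    std::complex<float> z{0.0f, 0.0f};
    for (int i = 0; i < max_iterations; ++i) {
        if (std::abs(z) > 2.0f)  // the point escaped: it is outside the set
            return i;
        z = z * z + c;
    }
    return max_iterations;  // the point is assumed to be inside the set
}
```

Because each pixel's count depends only on its own coordinates, the per-pixel computations are independent, which is why the loop nest is a good offload candidate.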

The SYCL code should include the following actions:

The resulting code should look like the following snippet from the SYCL version of the Mandelbrot sample:

using namespace sycl;

// Create a queue on the default device. Set the SYCL_DEVICE_TYPE environment
// variable to (CPU|GPU|FPGA|HOST) to change the device.
queue q(default_selector{}, dpc_common::exception_handler);
// Declare the data buffer
buffer data_buf(data(), range(rows, cols));
// Submit a command group to the queue
q.submit([&](handler &h) {
  // Get write-only access to the buffer
  accessor b(data_buf, h, write_only);
  // Iterate over the image and write to the data buffer
  h.parallel_for(range<2>(rows, cols), [=](auto index) {
    …
    b[index] = p.Point(c);
  });
});

Make sure your SYCL code (mandel.hpp in the SYCL sample) uses the same image parameter values as the C++ version:

constexpr int row_size = 2048;
constexpr int col_size = 1024;

See SYCL page and oneAPI GPU Optimization Guide for more information.

Compare Estimations and Real Performance on GPU

  1. Compile the SYCL version of the Mandelbrot sample:

    dpcpp.exe /W3 /O2 /nologo /D _UNICODE /D UNICODE /Zi /WX- /EHsc /MD /I"$(ONEAPI_ROOT)\dev-utilities\latest\include" /Fe"mandelbrot_dpcpp.exe" src\main.cpp
  2. Run the compiled mandelbrot application:

    mandelbrot_dpcpp.exe
  3. Review the application output printed to the command prompt. It reports application execution time:

    Parallel time: 0.0121385s

    The Mandelbrot calculation in the offloaded loop takes 12.1 ms on the GPU. This is close to the 12.3 ms execution time predicted by Intel Advisor.
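The reported time comes from the sample's timer utility. A wall-clock measurement of this kind can be sketched with std::chrono; this helper is illustrative, not the actual API of the sample's timer.cpp:

```cpp
#include <chrono>

// Measure the wall-clock duration of a callable, in seconds, similar in
// spirit to the sample's timer utility (illustrative sketch only).
template <typename Func>
double time_seconds(Func&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Timing only the offloaded computation, rather than the whole program, is what makes the measured 12.1 ms directly comparable to the loop-level estimate that Offload Modeling reports.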

Key Take-Aways

See Also