Intel Advisor Cookbook

Visualize Performance Improvements with Roofline Compare

Use the Roofline Compare feature to identify similar loops or functions in different Roofline analysis results and to make informed optimization choices about your code. This section describes how to compare Roofline analysis results to visualize performance improvements in the loops and functions of an application.

  1. Collect baseline Roofline results.

  2. Optimize with the NOALIAS macro.

  3. Re-run the Roofline analysis.

  4. Perform further optimizations.

Scenario

In this recipe, we’ll use the Roofline Compare feature to show us the improvements obtained in each step of a series of optimizations.

Ingredients

This section lists the hardware and software used to produce the specific results shown in this recipe:

Collect Baseline Roofline Results

With the compiler optimization level set to the default, O2, generate a Roofline analysis and save the result using the Snapshot feature. We’ll call this result Snapshot_Baseline. View the Roofline plot, as shown in the image below. As you hover the mouse over the dots, the performance metrics for the loops are displayed. The crosshairs drawn between the loops, which are highlighted as blue horizontal and vertical lines when you hover over them, provide performance metrics for the complete program.

For better visibility of results, we will fix the L1, L2, L3, and DRAM bandwidth to the values shown in the Roofs Settings table, displayed below. Also, because the application uses only single-precision floats, we will turn off the double-precision peaks by clearing their Visible checkboxes. Use the Save button to save the view as a JSON file named Favourable View. We will reuse these settings in later Roofline plots by loading Favourable View.

Roofline plot for Snapshot_Baseline

In the Survey report for Snapshot_Baseline, note the following:

  1. The Elapsed time value in the top left corner. This is the baseline against which subsequent improvements will be measured.

  2. In the Type column, all detected loops are scalar.

  3. In the Why No Vectorization? column, the compiler detected or assumed vector dependence in most of the loops.

Survey report for Snapshot_Baseline

Optimize with the NOALIAS Macro

  1. Click the Why No Vectorization? tab, then click one of the loops for which the compiler previously detected or assumed vector dependence.

  2. Scroll down to the Recommendations section to view suggestions for vectorizing the loop. In the example below, one of the suggestions is to use the restrict keyword.

    Why No Vectorization? tab for Snapshot_Baseline

    restrict ensures that two pointers cannot point to overlapping memory regions. If the compiler knows that there is only one pointer to a memory block, it can produce better vectorized code. In the first optimization, we will try to limit the effect of pointer aliasing by providing some information to the compiler using the NOALIAS macro.

  3. In the Visual Studio* IDE, right-click the vec_samples project in the Solution Explorer, then choose Properties.

  4. Choose Configuration Properties > C/C++ > Command Line. In the Additional Options area, type /DNOALIAS.

  5. Click Apply, then click OK.

  6. Choose Build > Rebuild Solution.
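The effect of the restrict keyword described in step 2 can be sketched in C as follows. This is a minimal illustration, not the sample's actual source: the function name matvec_sketch and the helper macro PTR_QUAL are hypothetical, and the way NOALIAS maps to restrict here is an assumption made for the example. It also assumes the compiler is in a mode that accepts the C99 restrict keyword.

```c
#include <stddef.h>

/* Assumed wiring for illustration: when NOALIAS is defined (for example
   through /DNOALIAS), pointer parameters are qualified with `restrict`.
   PTR_QUAL is a hypothetical helper macro, not part of the sample. */
#ifdef NOALIAS
#define PTR_QUAL restrict
#else
#define PTR_QUAL
#endif

/* Hypothetical matrix-vector kernel: b[i] += sum over j of a[i][j] * x[j].
   Without `restrict`, the compiler must assume that the store to b[i]
   could modify a or x, which can block vectorization of the inner loop.
   With `restrict`, the caller promises the three regions do not overlap. */
void matvec_sketch(size_t rows, size_t cols,
                   const float *PTR_QUAL a,
                   float *PTR_QUAL b,
                   const float *PTR_QUAL x)
{
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            b[i] += a[i * cols + j] * x[j];
        }
    }
}
```

The numerical result is the same with or without NOALIAS as long as the caller really passes non-overlapping arrays; the qualifier only gives the compiler permission to optimize under that promise.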

Re-run the Roofline Analysis

  1. In the Vectorization Workflow pane, click the Collect button below Run Roofline and save a snapshot of the result as Snapshot_NoAlias (preferably in a new directory, though this is not strictly required).

  2. Load the Favourable View JSON file by clicking the menu icon in the top right corner. Once the file is loaded, the roofs are adjusted to match those used for Snapshot_Baseline.

  3. Notice the improvements in the total performance of the program and of the loop in matvec at Multiply.c:60, as shown in the image below.

    Roofline plot for Snapshot_NoAlias

  4. In the Survey report, notice that:

    • The value in the Vector Instruction Set column is probably AVX2/AVX/SSE2, i.e., the default vector Instruction Set Architecture (ISA).

    • The compiler successfully vectorizes two loops: in matvec at Multiply.c:69 and in matvec at Multiply.c:60.

    • Elapsed time improves substantially.

    Survey report for Snapshot_NoAlias

  5. Open the Snapshot_Baseline snapshot.

  6. In Snapshot_Baseline, go to the Roofline plot and click the Compare drop-down list, followed by the + Load result for comparison icon. Intel Advisor lists any snapshots in the same directory as Snapshot_Baseline under Ready for comparison. These snapshots can be used for Roofline comparisons. Select Snapshot_NoAlias using the Load result for comparison option.

    Note

    You can remove a comparison result using the × Clear comparison result(s) icon.

    Use the Load result for comparison icon to add a new Roofline result to compare with the Snapshot_Baseline (Current) result

For the rest of this recipe, we’ll compare optimized snapshots against Snapshot_Baseline. The Current result therefore refers to Snapshot_Baseline. A different shape is used to plot the loops and functions in each snapshot. For example, in the image below, circles represent the Current result, while squares represent the Snapshot_NoAlias result.

For better visibility, we'll use the Filter In Selection feature. Right-click an interesting loop or function in the Roofline plot and select Filter In Selection. The plot then shows only the position of that loop. This feature is especially useful for isolating a single loop in applications with hundreds of loops and functions. In this case, we'll filter in the loop in matvec at Multiply.c:60. To remove the filtering, right-click anywhere in the Roofline plot and choose Clear Filters.

Comparison of the Roofline plots of Snapshot_Baseline with Snapshot_NoAlias

ΔFLOPS (can also be INTOPS or OPS, depending on the data type) denotes the performance difference between the compared loop and the current loop. The figure shows that the computational performance of the compared loop has improved by 6.02 units*, from 2.35 to 8.37 units. In percentage terms:
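A worked sketch of this percentage, assuming it is taken relative to the larger of the two performance values (the convention this section describes for the dashed line). The denominator is that assumption; the numbers are the ones quoted above.

```latex
\[
\frac{\Delta\mathrm{FLOPS}}{\max(\mathrm{FLOPS}_{\mathrm{compared}},\ \mathrm{FLOPS}_{\mathrm{current}})} \times 100\%
= \frac{8.37 - 2.35}{8.37} \times 100\%
= \frac{6.02}{8.37} \times 100\%
\approx 71.9\%
\]
```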

*units can be GFLOPS/GINTOPS/Giga Mixed OPS depending on the data type. In the above result, the units are GFLOPS.

Δt denotes the Total Time difference between the compared loop and the current loop. In the above example, the Total Time of the compared loop is reduced by 2.028 s, from 2.820 s to 0.792 s.

Note that the difference in the example is negative (-2.028) because both Δ metrics (FLOPS and time) are always computed by subtracting the current loop value from the compared loop value. The sign therefore shows whether the selected loop improved or degraded.

In percentage terms, the Total Time difference is:
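A worked sketch of this percentage using the values quoted above, taken relative to the current (baseline) Total Time of 2.820 s, which is also the larger of the two values. The choice of denominator is an assumption; the sign follows the compared-minus-current convention.

```latex
\[
\frac{\Delta t}{t_{\mathrm{current}}} \times 100\%
= \frac{0.792\ \mathrm{s} - 2.820\ \mathrm{s}}{2.820\ \mathrm{s}} \times 100\%
= \frac{-2.028}{2.820} \times 100\%
\approx -71.9\%
\]
```

That is, the Total Time of the loop is reduced by roughly 71.9%.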

The dashed line displays the performance difference (ΔFLOPS in our case) as a percentage of the larger of the two loops' performance values.

The Survey report and Roofline comparison plot side-by-side for Snapshot_NoAlias

Continue to Optimize: Dependencies and More

The /QxHost option tells the compiler to generate instructions for the highest instruction set available on the compilation host processor. Depending on the underlying hardware architecture, rebuilding the solution with the /QxHost command-line option can further improve performance.

The compiler is conservative about data dependencies: when it cannot prove that a loop is free of them, it assumes the worst-case scenario. We can use a refinement report to check for real data dependencies in loops. In earlier results, the compiler did not vectorize the loop in matvec at Multiply.c:82 because of assumed dependencies. If real dependencies are detected, this analysis can provide additional details to help resolve them.

Run a Dependencies Analysis

  1. In the drop column in the Survey report, select the checkbox for the loop in matvec at Multiply.c:82.

  2. In the Vectorization Workflow pane, click the Collect button under Check Dependencies to produce a Dependencies report.

  3. The Dependencies analysis usually takes a while to generate the report. If analysis time is a concern during this exercise, click the Stop button under Check Dependencies once the site coverage progress bar shows 1/1 sites executed. This displays the results collected so far. Note, however, that outside of this recipe, stopping early risks missing some dependencies (for example, when the selected loops are invoked several times).

Assess Dependencies

In the top pane of the Refinement Reports window, notice that Intel Advisor reports a RAW and a WAW dependency in the loop in matvec at Multiply.c:82. The Dependencies Report tab in the bottom pane shows the source of the dependency: addition in the sumx variable.

Dependencies shown in the refinement report

The loop in matvec at Multiply.c:82 did not vectorize because of a reduction dependency caused by the addition in sumx. By running the Dependencies analysis, we verified that the dependency is real. The REDUCTION macro applies an OpenMP* SIMD directive with a reduction clause, so each SIMD lane computes its own partial sum, and the partial sums are combined at the end. (Applying an OpenMP* SIMD directive without a reduction clause would generate incorrect code.)
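The reduction pattern described above can be sketched in C as follows. This is a minimal illustration rather than the sample's actual source: the function name dot_sketch and its parameters are hypothetical, and only the sumx variable and the REDUCTION macro names come from this recipe. It also assumes the build honors OpenMP SIMD directives; where it does not, the pragma is ignored and the loop stays scalar but still correct.

```c
#include <stddef.h>

/* Hypothetical kernel showing the dependency reported on sumx.
   Every iteration reads sumx and then writes it, so iteration j+1
   depends on iteration j (RAW), and successive iterations overwrite
   the same variable (WAW). */
float dot_sketch(size_t n, const float *a, const float *x)
{
    float sumx = 0.0f;

#ifdef REDUCTION
    /* The reduction clause gives each SIMD lane a private partial sum
       and combines the partial sums after the loop. */
#pragma omp simd reduction(+ : sumx)
#endif
    for (size_t j = 0; j < n; j++) {
        sumx += a[j] * x[j];
    }
    return sumx;
}
```

Because the partial sums are combined in a different order than in the scalar loop, floating-point results may differ slightly in the last bits for general inputs.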

  1. Rebuild the solution with the /DREDUCTION option. Re-run the Roofline analysis and save the result as Snapshot_xHost_Reduction.

    Survey report and Roofline plot for Snapshot_xHost_Reduction

  2. Observe that the loop in matvec at Multiply.c:82 is now vectorized. The Elapsed time is also improved.

  3. Open the Snapshot_Baseline result and, using the Roofline Compare feature, add Snapshot_NoAlias and Snapshot_xHost_Reduction for comparison.

The image below shows the results: an overall improvement in performance. Note the triangle and square symbols, which represent loops from Snapshot_xHost_Reduction and Snapshot_NoAlias, respectively. We'll focus specifically on the loop in matvec at Multiply.c:60 using Filter In Selection, as it was the biggest hotspot in Snapshot_Baseline. The latest optimization has pushed the loop further upward in the plot. This shows that the performance of the loop continues to improve, which is reflected in the overall elapsed time of the code.

Comparison of the Roofline plots of Snapshot_Baseline, Snapshot_NoAlias, and Snapshot_xHost_Reduction

Key Takeaways

See Also