Address Compute Capacity Bottlenecks

This topic is part of a tutorial that shows how to use the automated Roofline chart to make prioritized optimization decisions.

Perform the following steps:

Key take-aways from these steps:

Note

These steps use a prepackaged analysis result because of tutorial duration and hardware dependency considerations.

Open a Result Snapshot

Do one of the following:

Focus the Roofline Chart on the Data of Most Interest

  1. Use the display toggles to show the Roofline chart and Survey Report side by side.

  2. On the Intel Advisor toolbar, click the Loops And Functions filter drop-down and choose Loops.

    Intel Advisor: Filters

  3. In the Roofline chart:

    • Select the Use Single-Threaded Loops checkbox.

    • Click the Intel Advisor: Roofline menu control, then deselect the Visibility checkbox for all SP... roofs. (All variables in this sample code are double-precision, so there is no need to clutter the chart with single-precision rooflines.)

      Intel Advisor: Roofline Menu

      In the Point Colorization section, choose Vectiorized/Scalar to differentiate dot colors by scalar (blue) vs. vectorized (orange) instead of runtime (red, yellow, and green).

      Click Intel Advisor: Control to save your changes.

    • Click the Intel Advisor: Roofline numerical zoom control control. In the x-axis fields, backspace over the existing values and enter 0.1 and 0.8. In the y-axis fields, backspace over the existing values and enter 3.1 and 45.5. Click the Intel Advisor: Save control button to save your changes.

Interpret Roofline Chart Data

Intel Advisor: Roofline Chart & Survey Report

In the Roofline chart, notice the position of the blue dot representing the loop in main at roofline.cpp:221: It is positioned below both the Vector Add Peak and Scalar Add Peak rooflines. Why?

The probable reason: It is not vectorized, as indicated by the dot color. We can also quickly verify this scalar status in the first column of the Survey Report: Blue icon = scalar; orange icon = vectorized. (For the purposes of this tutorial, we used a directive to artificially block this loop from vectorizing. Intel Advisor has a wide variety of tools for diagnosing why loops do not vectorize, which you could use if this loop had a real problem.)

The loop in main at roofline.cpp:247 is a vectorized version of the loop in main at roofline.cpp:221.

In the Roofline chart, notice the position of the bottom orange dot representing this loop moved left on the Arithmetic Intensity axis. Theoretically, this is not possible. Arithmetic intensity = Floating-point operations per byte transferred. Any given algorithm has a specific arithmetic intensity that optimization should not change, because this metric is a trait of the algorithm itself. So dots on a Roofline chart move up as performance improves, but not side to side. Usually.

Apparently, compiler optimizations altered the arithmetic intensity of the loop in main at roofline.cpp:247.

How?

In the Code Analytics tab, check the GFLOPS drop-down region for both loops represented by the blue and bottom orange dots.

Intel Advisor: Code Analytics Tab

When we examine these statistics for both loops side by side, we can see the Self FLOP Per Iteration is different for the vectorized loop, but this makes sense: The loop has a vector length of 4, and 8*4=32. What changed is the Data Transfers values, both total and per iteration. The 256 cannot be explained by vectorization, because 48*4=192, not 256.

Metrics

Scalar Loop (Blue Dot)

Vectorized Loop (Bottom Orange Dot)

Self FLOP Per Iteration

8

32

Data Transfers: Total Gigabytes:

318.720

424.960

Data Transfers: Bytes Per Loop Iteration

48

256

These statistics tell us the compiler altered the memory accesses. New unpack and insert instructions - which are also noted in the Code Analytics tab - the probably played a role in the affected memory calculations.

The next unusual thing about the loop in main at roofline.cpp:247: The bottom orange dot representing it on the Roofline chart is positioned barely above the Scalar Add Peak roofline.

The probable reason: The loop is vectorized inefficiently.

How can we verify this?

In the Survey Report, check the value in the Vectorized Loops/Efficiency: 31%. Vectorization is not enough for a loop to rise to the Vector Add Peak roofline; a loop must be efficiently vectorized.

Why is the Efficiency value so low?

One probable reason: Inefficient memory accesses preventing full utilization of VPU/SIMD resources. Array of Structures (AOS) data layouts often cause this sort of problem. Check the Memory Access Patterns Report.

Intel Advisor: Memory Access Patterns Report

Memory access problem confirmed: The majority of memory accesses are not uniform stride. (The presence of insert operations, as noted in the Code Analytics tab, confirms this.) This contributes to vectorization inefficiency.

How can we improve vectorization efficiency?

Reorganizing code to use Structure of Arrays (SOA) data layout instead of Array of Structures (AOS) data layout is a possible optimization technique. In fact, this is one of the Recommendations the Intel Advisor offers to guide developers seeking optimization advice:

Intel Advisor: Recommendations Tab

Of course, this is exactly what we did with the loop in main at roofline.cpp:260 (middle orange dot).

Notice the dot is positioned just under the Vector Add Peak roofline. In the Survey Report, check the Vectorize Loops/Efficiency: 100%.

The dot also skipped a memory bandwidth roofline because, as our familiarity with the sample code tells us, it was never limited by L3 cache, and only partially limited by L2 cache.

So switching the data layout to SOA fixed both the compute capacity and memory bandwidth bottlenecks.

Finally, take a look at the loop in main at roofline.cpp:273 - the top orange dot. See if you can figure out why it is positioned above the Vector Add Peak roofline. Hint: In the Assembly tab, compare the instructions for the two loops represented by the top and middle orange dots.