Address Memory Bandwidth Bottlenecks

This topic is part of a tutorial that shows how to use the automated Roofline chart to make prioritized optimization decisions.

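Perform the following steps:

  • Open a Result Snapshot

  • Focus the Roofline Chart on the Data of Most Interest

  • Interpret Roofline Chart Data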

Key take-aways from these steps:

Note

These steps use a prepackaged analysis result because of tutorial duration and hardware dependency considerations.

Open a Result Snapshot

Do one of the following:

Focus the Roofline Chart on the Data of Most Interest

  1. Use the display toggles to show the Roofline chart and Survey Report side by side.

  2. On the Intel Advisor toolbar, click the Loops And Functions filter drop-down and choose Loops.

    [Figure: Intel Advisor filters]

  3. In the Roofline chart:

    • Select the Use Single-Threaded Loops checkbox.

    • Click the Roofline menu control, then clear the Visibility checkbox for all SP... roofs. (All variables in this sample code are double-precision, so there is no need to clutter the chart with single-precision rooflines.)

      [Figure: Intel Advisor Roofline menu]

      In the Point Colorization section, choose Colors of Point Weight Ranges to differentiate dot colors by runtime (red, yellow, and green).

      Click the save control to save your changes.

    • Click the Roofline numerical zoom control. In the x-axis fields, replace the existing values with 0.1 and 0.4. In the y-axis fields, replace the existing values with 7.4 and 45.5. Click the Save button to save your changes.

Interpret Roofline Chart Data

[Figure: Intel Advisor Roofline chart and Survey Report]

In the Roofline chart, notice the dot representing the loop in main at roofline.cpp:295 (the lower dot). It is positioned above the (offscreen) Scalar Add Peak roofline and on the L2 Bandwidth roofline.

Why is the dot positioned there?

The probable answer: Loop performance is limited by a memory bandwidth bottleneck involving L2 cache.
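The reasoning behind this answer follows the basic roofline model: attainable performance is the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. The following is a minimal sketch of that relationship. The function names are hypothetical, and the caller supplies every value; none of them come from the tutorial result.

```cpp
#include <algorithm>
#include <cassert>

// Attainable performance (GFLOP/s) under the basic roofline model:
// min(compute peak, arithmetic intensity * memory bandwidth).
// Hypothetical helper; all inputs are supplied by the caller.
double attainable_gflops(double peak_gflops,
                         double bandwidth_gb_per_s,
                         double flops_per_byte) {
    assert(peak_gflops > 0.0 && bandwidth_gb_per_s > 0.0 && flops_per_byte > 0.0);
    return std::min(peak_gflops, bandwidth_gb_per_s * flops_per_byte);
}

// True when the bandwidth roof, rather than the compute roof, caps the loop.
bool is_bandwidth_bound(double peak_gflops,
                        double bandwidth_gb_per_s,
                        double flops_per_byte) {
    return bandwidth_gb_per_s * flops_per_byte < peak_gflops;
}
```

For example, with made-up values of a 100 GFLOP/s compute peak, 50 GB/s of bandwidth, and an arithmetic intensity of 0.25 FLOP/byte, the bound is 12.5 GFLOP/s and the loop is bandwidth bound. A dot sitting on a bandwidth roofline corresponds to this case.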

How can we verify this?

  1. Check the Survey Report:

    • Notice the Vectorized Loops/Efficiency value for the loop in main at roofline.cpp:295: 100%.

      This 100% vectorization efficiency is why the dot is above the offscreen Scalar Add Peak roofline.

    • Click the data row for the loop in main at roofline.cpp:295 to view the associated source code in the Source tab.

  2. In the Source tab, scroll to source code lines 89-96 to view the associated data structure definition: Structure of Arrays (SOA).

    [Figure: Intel Advisor Source tab]

SOA is a good data layout for vectorization efficiency; however, our familiarity with the sample code tells us this data layout is preventing the tutorial dataset from fitting into L1 cache and causing many loads from L2 cache.
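To make the trade-off concrete, here is a minimal sketch of an SOA layout and a working-set check. It is not the definition from roofline.cpp: the type PointsSoA, its fields, and the helper functions are all hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical Structure of Arrays (SOA): one contiguous array per field.
struct PointsSoA {
    std::vector<double> x, y, z;
    explicit PointsSoA(std::size_t n) : x(n), y(n), z(n) {}
    std::size_t size() const { return x.size(); }
};

// Unit-stride accesses make this loop easy to vectorize, but every pass
// streams all three arrays from start to end.
void scale_add(PointsSoA& p, double a) {
    const std::size_t n = p.size();
    for (std::size_t i = 0; i < n; ++i) {
        p.z[i] = a * p.x[i] + p.y[i];
    }
}

// Bytes touched by one pass of scale_add over the whole dataset.
std::size_t working_set_bytes(const PointsSoA& p) {
    return 3 * p.size() * sizeof(double);
}

// True when the working set fits in a cache of the given capacity.
bool fits_in_cache(std::size_t bytes, std::size_t cache_bytes) {
    return bytes <= cache_bytes;
}
```

With 4096 elements per array and an assumed 32 KiB L1 data cache, the three arrays occupy 96 KiB, so fits_in_cache returns false and repeated passes must reload data from a lower cache level.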

So the loop in main at roofline.cpp:295 is positioned on the L2 Bandwidth roofline because loop performance is indeed limited by a memory bandwidth bottleneck involving L2 cache.

How can we eliminate this memory bandwidth bottleneck?

Reorganizing code to optimize cache usage is a possible optimization technique.

The loop in main at roofline.cpp:310 does this very thing, which is why the corresponding dot (upper dot in the Roofline chart) is positioned above the L2 Bandwidth roofline:

  1. In the Survey Report, click the data row for the loop in main at roofline.cpp:310.

  2. In the Source tab, scroll to source code lines 97-101 to view the data structure definition for this loop: Array of Structure of Arrays (AOSOA). Because the loop in main at roofline.cpp:310 uses the AOSOA data layout, our familiarity with the sample code tells us the tutorial workload is split into two steps, each with a dataset that fits into L1 cache.
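The general idea behind an AOSOA layout can be sketched as follows. This is an illustration under stated assumptions, not the definition from roofline.cpp: the tile length kTile, the types Tile and PointsAoSoA, and the function accumulate_tiled are hypothetical. Each tile is a small SOA sized to stay in cache while the loop makes repeated passes over it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile length: 3 arrays * 512 doubles = 12 KiB per tile.
constexpr std::size_t kTile = 512;

// One tile is a small Structure of Arrays.
struct Tile {
    double x[kTile];
    double y[kTile];
    double z[kTile];
};

// Array of Structure of Arrays (AOSOA): a contiguous array of tiles.
struct PointsAoSoA {
    std::vector<Tile> tiles;  // elements are value-initialized to zero
    explicit PointsAoSoA(std::size_t n_tiles) : tiles(n_tiles) {}
};

// Makes `reps` passes over each tile before moving to the next one, so the
// repeated passes reuse data that is already resident in cache.
void accumulate_tiled(PointsAoSoA& p, double a, int reps) {
    for (Tile& t : p.tiles) {
        for (int r = 0; r < reps; ++r) {
            for (std::size_t i = 0; i < kTile; ++i) {
                t.z[i] += a * t.x[i] + t.y[i];  // unit stride, vectorizable
            }
        }
    }
}
```

The inner loop keeps the unit-stride access pattern that makes SOA vectorize well, while the per-tile working set (12 KiB here) stays below an assumed 32 KiB L1 data cache regardless of the total dataset size.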