Use Automatic Vectorization

The information below will guide you in setting up the auto-vectorizer.

Vectorization Speedup

Where does the vectorization speedup come from? Consider the following sample code, where A, B, and C are integer arrays:


do I=1,MAX
    C(I)=A(I)+B(I)
end do

If vectorization is not enabled, for example when you compile using the O1, -no-vec (Linux), or /Qvec- (Windows) option, the compiler processes one element per instruction and leaves space in the SIMD registers unused, even though each register can hold three additional integers (assuming 128-bit registers and 32-bit integers). If vectorization is enabled (compiled using O2 or higher options), the compiler may use the otherwise unused register space to perform four additions in a single instruction. The compiler looks for vectorization opportunities whenever you compile at default optimization (O2) or higher.
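The difference between one addition per instruction and four can be pictured with a minimal sketch. It is only a conceptual illustration, not what the compiler literally emits: it assumes four 32-bit integers fit in one 128-bit register, and the names n, lanes, c_scalar, and c_chunked are hypothetical (n stands in for MAX).

```fortran
program vec_width_sketch
  implicit none
  integer, parameter :: n = 1024     ! hypothetical length, chosen as a multiple of lanes
  integer, parameter :: lanes = 4    ! assumption: four 32-bit integers per 128-bit register
  integer :: a(n), b(n), c_scalar(n), c_chunked(n)
  integer :: i

  ! Fill the inputs with known values.
  a = [(i, i = 1, n)]
  b = [(2*i, i = 1, n)]

  ! Scalar form: one addition per loop iteration.
  do i = 1, n
     c_scalar(i) = a(i) + b(i)
  end do

  ! Conceptual vector form: each iteration adds a group of four elements,
  ! which is roughly the work a single SIMD add instruction performs.
  do i = 1, n, lanes
     c_chunked(i:i+lanes-1) = a(i:i+lanes-1) + b(i:i+lanes-1)
  end do

  ! Both forms must produce identical results.
  if (any(c_scalar /= c_chunked)) error stop 'results differ'
  print *, 'both forms agree; c(n) =', c_scalar(n)
end program vec_width_sketch
```

You do not need to write the chunked form yourself. When vectorization is enabled, the compiler may apply an equivalent transformation to the scalar loop.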

Note

Compiling at the default optimization level (O2) or higher enables vectorization for both Intel® microprocessors and non-Intel microprocessors. Vectorization may call library routines that can result in a greater performance gain on Intel® microprocessors than on non-Intel microprocessors. Vectorization can also be affected by certain options, such as -arch or -x (Linux), or /arch or /Qx (Windows).

Tip

This tip is only for the Intel® Fortran (ifort) Classic Compiler. To allow comparisons between vectorized and non-vectorized code, disable vectorization using the -no-vec (Linux) or /Qvec- (Windows) option; enable vectorization using the O2 option.

To learn whether a loop was vectorized, enable generation of the optimization report using the -qopt-report=1 -qopt-report-phase=vec (Linux) or /Qopt-report:1 /Qopt-report-phase:vec (Windows) options. These options generate a separate report in an *.optrpt file that includes optimization messages. In Microsoft Visual Studio, the program source is annotated with the report's messages, or you can read the resulting .optrpt file using a text editor. A message appears for every loop that is vectorized, for example:

ifort /Qopt-report:1 matvec.f90
type matvec.optrpt
…
   LOOP BEGIN at C:\Projects\vec_samples\matvec.f90(38,6)
      remark #15300: LOOP WAS VECTORIZED
   LOOP END

The source line number (38 in the above example) refers to either the beginning or the end of the loop.

To get details about the type of loop transformations and optimizations that took place, use the [Q]opt-report-phase option by itself or along with the [Q]opt-report option.

Linux

To evaluate performance enhancement, run Vectorize VecMatMult:

  1. Download the driver.f90 and matvec.f90 sample sources from the Vectorize VecMatMult src folder on GitHub.

  2. This application multiplies a vector by a matrix using the following loop:

    
    do i=1,size1
       c(i) = c(i) + a(i,j) * b(j)
    end do

  3. Compile and run the application, first without enabling auto-vectorization. The default O2 optimization enables vectorization, so you need to disable it with a separate option.

    ifx -no-vec driver.f90 matvec.f90 -o NoVectMult
    ./NoVectMult

  4. Build and run the application, this time with auto-vectorization.

    ifx driver.f90 matvec.f90 -o VectMult 
    ./VectMult

Windows

To evaluate performance enhancement, run Vectorize VecMatMult:

  1. Select Start > Intel oneAPI <version> > Intel oneAPI Command Prompt for Intel 64 for Visual Studio <version>.

  2. Download the driver.f90 and matvec.f90 sample sources from the Vectorize VecMatMult src folder on GitHub.

  3. This application multiplies a vector by a matrix using the following loop:

    
    do i=1,size1
       c(i) = c(i) + a(i,j) * b(j)
    end do

  4. Compile and run the application, first without enabling auto-vectorization. The default O2 optimization enables vectorization, so you need to disable it with a separate option.

    ifx /Qvec- driver.f90 matvec.f90 /exe:NoVectMult
    NoVectMult

  5. Build and run the application, this time with auto-vectorization.

    ifx driver.f90 matvec.f90 /exe:VectMult
    VectMult

When you compare the timing of the two runs, you may see that the vectorized version runs faster. The non-vectorized version runs only slightly faster than a version compiled with the O1 option.
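If you want to try the comparison without downloading the sample, a minimal self-contained sketch of a timed matrix-vector loop might look like the following. It is not the driver.f90 or matvec.f90 from the sample. The program name, the array sizes, the repeat count, and the matmul cross-check are all assumptions made for illustration.

```fortran
program time_matvec
  use iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: size1 = 512, size2 = 512   ! hypothetical problem sizes
  integer, parameter :: repeats = 1000             ! hypothetical repeat count for measurable time
  real :: a(size1, size2), b(size2), c(size1)
  integer(int64) :: t0, t1, rate
  integer :: i, j, r

  ! Fill the matrix and vector with pseudo-random values in [0, 1).
  call random_number(a)
  call random_number(b)

  call system_clock(t0, rate)
  do r = 1, repeats
     c = 0.0
     do j = 1, size2
        ! Inner loop over i walks down a column of a, so memory access is
        ! contiguous; this is the loop form shown in the steps above.
        do i = 1, size1
           c(i) = c(i) + a(i,j) * b(j)
        end do
     end do
  end do
  call system_clock(t1)

  ! Cross-check against the matmul intrinsic with a loose tolerance,
  ! since the summation order may differ.
  if (maxval(abs(c - matmul(a, b))) > 1.0e-3 * maxval(abs(c))) then
     error stop 'result does not match matmul'
  end if

  print *, 'elapsed seconds:', real(t1 - t0, real64) / real(rate, real64)
  print *, 'checksum:', sum(c)
end program time_matvec
```

Compile this file twice, once with the -no-vec (Linux) or /Qvec- (Windows) option and once without it, then compare the elapsed times that the two executables print. The measured difference depends on the processor, the problem sizes, and other compiler options, so treat the numbers as indicative only.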

Obstacles to Vectorization

The following issues do not always prevent vectorization, but frequently cause the compiler to decide that vectorization would not be worthwhile.
