Explicit SIMD SYCL* Extension

oneAPI provides an Explicit SIMD SYCL extension (ESIMD) for lower-level Intel GPU programming.

ESIMD provides APIs that are similar to Intel's GPU Instruction Set Architecture (ISA), but it enables you to write explicitly vectorized device code. This gives you more control over the generated code and enables you to depend less on compiler optimizations.

The specification, documented ESIMD API headers, and working code examples are available on GitHub*.

Note

This extension is under active development and the APIs are subject to change. There are currently a number of restrictions on the extension, which are specified below.

ESIMD kernels and functions always require a subgroup size of one, which means that the compiler does not vectorize across work items in a subgroup. Instead, you must express the vectorization explicitly in your code. Below is an example that adds the elements of two arrays and writes the results to a third:

// Assumed to be declared earlier: a SYCL queue q with device dev and
// context ctxt, plus the constants Size, VL (vector length), and
// GroupSize, where Size is a multiple of VL.
float *A = static_cast<float *>(malloc_shared(Size * sizeof(float), dev, ctxt));
float *B = static_cast<float *>(malloc_shared(Size * sizeof(float), dev, ctxt));
float *C = static_cast<float *>(malloc_shared(Size * sizeof(float), dev, ctxt));

for (unsigned i = 0; i < Size; ++i) {
  A[i] = B[i] = i;
}

// Many work items are needed. Each work item processes Vector Length (VL)
// elements of data.
cl::sycl::range<1> GlobalRange{Size / VL};
// Number of work items in each workgroup.
cl::sycl::range<1> LocalRange{GroupSize};

cl::sycl::nd_range<1> Range(GlobalRange, LocalRange);

auto e = q.submit([&](handler &cgh) {
  cgh.parallel_for<class Test>(Range, [=](nd_item<1> ndi) SYCL_ESIMD_KERNEL {
    using namespace sycl::INTEL::gpu;

    int i = ndi.get_global_id(0);
    simd<float, VL> va = block_load<float, VL>(A + i * VL);
    simd<float, VL> vb = block_load<float, VL>(B + i * VL);
    simd<float, VL> vc = va + vb;
    block_store<float, VL>(C + i * VL, vc);
  });
});

In the example above, the lambda function passed to parallel_for is marked with the special attribute SYCL_ESIMD_KERNEL. This tells the compiler that the kernel is ESIMD-based, so ESIMD APIs can be used inside it. Here, simd objects and the block_load/block_store intrinsics are used; they are available only in the ESIMD extension.

Fully runnable code samples can be found in the GitHub repo.

Compiling and Running ESIMD Code

To compile code that uses the ESIMD extension, you must use a special compiler switch, -fsycl-explicit-simd:

$ clang++ -fsycl -fsycl-explicit-simd vadd_usm.cpp

The resulting executable can be run only on Intel GPU hardware, such as Intel® UHD Graphics 600 or later. To run it, you must set two environment variables: SYCL_BE=PI_OPENCL and SYCL_PROGRAM_COMPILE_OPTIONS=-vc-codegen. For example:

$ SYCL_BE=PI_OPENCL SYCL_PROGRAM_COMPILE_OPTIONS=-vc-codegen ./a.out

Restrictions

This section lists the main restrictions that apply when using the ESIMD extension.

Note

Some restrictions are not enforced by the compiler; violating them may lead to undefined program behavior.

Features not supported with ESIMD

Unsupported standard SYCL APIs

  • Local accessors
  • Most image APIs
  • Specialization constants
  • Memory access through a raw pointer returned by sycl::accessor::get_pointer()

Other restrictions

  • Only Intel GPU devices are supported.
  • In most cases, the standard and ESIMD Data Parallel C++ (DPC++) kernels cannot co-exist in the same application.