Programming with the Intel® oneAPI Level Zero Backend

This page shows the supported scenarios for multi-card and multi-tile programming with the Intel® oneAPI Level Zero (Level Zero) Backend.

Device Discovery

Root-devices

In this programming model, Intel GPUs are represented as SYCL GPU devices, or root-devices. You can find your root-device with the sycl-ls tool. For example:

sycl-ls

Example output:

[opencl:gpu:0] Intel® OpenCL HD Graphics, Intel® UHD Graphics 630 [0x3e92] 3.0 [21.49.21786]
[opencl:cpu:1] Intel® OpenCL, Intel® Core™ i7-8700K CPU @ 3.70GHz 2.1 [2020.11.11.0.03_160000]
[ext_oneapi_level_zero:gpu:0] Intel® Level-Zero, Intel® UHD Graphics 630 [0x3e92] 1.2 [1.2.21786]
[host:host:0] SYCL host platform, SYCL host device 1.2 [1.2]

sycl-ls shows the devices and platforms of all the SYCL backends, which are seen by the SYCL runtime. The previous example shows the CPU (managed by an OpenCL™ backend) and two GPUs that correspond to the single physical GPU (managed by an OpenCL™ or Level Zero backend). You have two options to filter the observable root-devices:

Option One

Use the environment variable ONEAPI_DEVICE_SELECTOR, which is described in the Environment Variables. For example:

ONEAPI_DEVICE_SELECTOR=ext_oneapi_level_zero sycl-ls

Example output:

[ext_oneapi_level_zero:gpu:0] Intel® Level-Zero, Intel® UHD Graphics 630 [0x3e92] 1.2 [1.2.21786]

Option Two

Use a similar API, as described in the Filter Selector, for example, the filter_selector("ext_oneapi_level_zero") only sees Level Zero operated devices.

If there are multiple GPUs in a system, they are seen as multiple root-devices. On Linux, you will see multiple SYCL root-devices of the same SYCL platform. On Windows, you will see root-devices of multiple different SYCL platforms.

You can use CreateMultipleRootDevices=N NEOReadDebugKeys=1 environment variables to emulate multiple GPU cards. For example:

CreateMultipleRootDevices=2 NEOReadDebugKeys=1 ONEAPI_DEVICE_SELECTOR=ext_oneapi_level_zero sycl-ls

Example output:

[ext_oneapi_level_zero:gpu:0] Intel® Level-Zero, Intel® UHD Graphics 630 [0x3e92] 1.2 [1.2.21786]
[ext_oneapi_level_zero:gpu:1] Intel® Level-Zero, Intel® UHD Graphics 630 [0x3e92] 1.2 [1.2.21786]

Note

CreateMultipleRootDevices is experimental, not validated, and is used for debug/experimental purposes only.

Sub-devices

Some Intel GPU hardware is composed of multiple tiles, where the root-devices can be partitioned into sub-devices that correspond to the physical tiles. For example:

try {
  vector<device> SubDevices = RootDevice.create_sub_devices<
  sycl::info::partition_property::partition_by_affinity_domain>(
  sycl::info::partition_affinity_domain::next_partitionable);
}

Each call to create_sub_devices returns the same sub-devices in their persistent order. Use the ZE_AFFINITY_MASK environment variable to control what sub-devices are exposed by the Level Zero driver. The partition_by_affinity_domain is the only type of partitioning supported for Intel GPUs. The next_partitionable and numa properties are the only partitioning properties supported.

The CreateMultipleSubDevices=N NEOReadDebugKeys=1 environment variables can be used to emulate multiple tiles of a GPU.

Note

CreateMultipleSubDevices is experimental, not validated, and is used for debug/experimental purposes only.

Contexts

Contexts are used for resource isolation and sharing. A SYCL context may consist of one or multiple devices. Both root-devices and sub-devices can be found within a single context, but they need to be from the same SYCL platform. A SYCL kernel_bundle created against a context with multiple devices is built to each of the root-devices in the context. For a context that consists of multiple sub-devices of the same root-device, only a single build (to that root-device) is needed.

Memory

Unified Shared Memory (USM)

You have multiple ways to allocate memory:

Memory allocated against a root-device is accessible by all of its sub-devices (tiles). If you are operating on a context with multiple sub-devices of the same root-device, then you can use malloc_device on that root-device instead of using the slower malloc_host. If you are using malloc_device you need an explicit copy out to the host to see the data located there.

Buffers

SYCL buffers that are created against a context and under the hood are mapped to the Level Zero USM allocation. The mapping details are:

Queues

A SYCL queue is always attached to a single device in a potential multi-device context. The following example scenarios are listed from most to least performant:

Scenario One

Context with a single sub-device in it, where the queue is attached to that sub-device (tile):

For example:

try {
  vector<device> SubDevices = ...;
  for (auto &D : SubDevices) {
    // Each queue is in its own context, no data sharing across them.
    auto Q = queue(D);
    Q.submit([&](handler& cgh) {...});
  }
}

Scenario Two

Context with multiple sub-devices of the same root-device (multi-tile):

For example:

try {
  vector<device> SubDevices = ...;
  auto C = context(SubDevices);
  for (auto &D : SubDevices) {
    // All queues share the same context, data can be shared across queues.
    auto Q = queue(C, D);
    Q.submit([&](handler& cgh) {...});
  }
}

Scenario Three

Context with a single root-device in it, where the queue is attached to that root-device:

For example:

try {
  // The queue is attached to the root-device, driver distributes to sub-devices, if any.
  auto D = device(gpu_selector{});
  auto Q = queue(D);
  Q.submit([&](handler& cgh) {...});
}

Scenario Four

Contexts with multiple root-devices (multi-card):

For example:

try {
  auto P = platform(gpu_selector{});
  auto RootDevices = P.get_devices();
  auto C = context(RootDevices);
  for (auto &D : RootDevices) {
    // Context has multiple root-devices, data can be shared across multi-card (requires explicit copying)
    auto Q = queue(C, D);
    Q.submit([&](handler& cgh) {...});
  }
}

Note

Do not forget to allocate/synchronize your memory for your programming model and algorithm.

Multi-tile Multi-card Examples

For your next steps, you can explore two examples of multi-tile and multi-card programming: