Data Parallelism in C++ using SYCL*
Open, multivendor, multiarchitecture support for productive data-parallel programming in C++ is accomplished via standard C++ with support for SYCL. SYCL (pronounced ‘sickle’) is a royalty-free, cross-platform abstraction layer that enables code for heterogeneous processors to be written in standard ISO C++, with the host code and kernel code for an application contained in the same source file. The DPC++ open source project is adding SYCL support to the LLVM C++ compiler.
Simple Sample Code Using Queue Lambda by Reference
The best way to introduce SYCL is through an example. Since SYCL is based on modern C++, this example uses several features that have been added to C++ in recent years, such as lambda functions and uniform initialization. Even if developers are not familiar with these features, their semantics will become clear from the context of the example. After gaining some experience with SYCL, these newer C++ features will become second nature.
The following application sets each element of an array to the value of its index, so that a[0] = 0, a[1] = 1, etc.
#include <CL/sycl.hpp>
#include <iostream>

constexpr int num = 16;
using namespace sycl;

int main() {
  auto r = range{num};
  buffer<int> a{r};

  queue{}.submit([&](handler& h) {
    accessor out{a, h};
    h.parallel_for(r, [=](item<1> idx) {
      out[idx] = idx;
    });
  });

  host_accessor result{a};
  for (int i = 0; i < num; ++i)
    std::cout << result[i] << "\n";
}
The first thing to notice is that there is just one source file: the host code and the offloaded accelerator code are combined in a single source file. The second thing to notice is that the syntax is standard C++: there aren’t any new keywords or pragmas used to express the parallelism. Instead, the parallelism is expressed through C++ classes. For example, the buffer class represents data that will be offloaded to the device, and the queue class represents a connection from the host to the accelerator.
The logic of the example works as follows. The range and buffer declarations at the top of main() create a buffer of 16 int elements, which have no initial value. This buffer acts like an array. The queue{} expression constructs a queue, which is a connection to an accelerator device. This simple example asks the SYCL runtime to choose a default accelerator device, but a more robust application would probably examine the topology of the system and choose a particular accelerator. Once the queue is created, the example calls the submit() member function to submit work to the accelerator. The parameter to submit() is a lambda function, which executes immediately on the host. The lambda function does two things. First, it creates an accessor named out, which can write elements in the buffer. Second, it calls the parallel_for() function to execute code on the accelerator.
The call to parallel_for() takes two parameters. One parameter is a lambda function, and the other is the range object “r” that represents the number of elements in the buffer. SYCL arranges for this lambda to be called on the accelerator once for each index in that range, i.e. once for each element of the buffer. The lambda simply assigns a value to the buffer element through the out accessor. In this simple example, there are no dependencies between the invocations of the lambda, so the program is free to execute them in parallel in whatever way is most efficient for this accelerator.
After calling parallel_for(), the host part of the code continues running without waiting for the work to complete on the accelerator. However, the next thing the host does is create a host_accessor, which reads the elements of the buffer. The SYCL runtime knows this buffer is written by the accelerator, so the host_accessor constructor blocks until the work submitted by the parallel_for() is complete. Once the accelerator work completes, the host code continues past the host_accessor declaration and uses the result accessor to read values from the buffer.
Additional Resources
This introduction to SYCL is not meant to be a complete tutorial. Rather, it just gives you a flavor of the language. There are many more features to learn, including features that allow you to take advantage of common accelerator hardware such as local memory, barriers, and SIMD. There are also features that let you submit work to many accelerator devices at once, allowing a single application to run work in parallel on many devices simultaneously.
The following resources are useful for learning and mastering SYCL using a oneAPI DPC++ compiler:
Explore SYCL with Samples from Intel provides an overview and links to simple sample applications available from GitHub*.
The DPC++ Foundations Code Sample Walk-Through is a detailed examination of the Vector Add sample code, the DPC++ equivalent to a basic Hello World application.
The oneapi.com site includes a Language Guide and API Reference with descriptions of classes and their interfaces. It also provides details on the four programming models: platform model, execution model, memory model, and kernel programming model.
The DPC++ Essentials training course is a guided learning path for SYCL using Jupyter* Notebooks on Intel® DevCloud.
Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems Using C++ and SYCL is a comprehensive book that introduces and explains key programming concepts and language details about SYCL and heterogeneous programming.