DPCT1117
Contents
DPCT1117#
Message#
There is atomicInc/Dec
operation on <argument 2> and value of <argument 2>
was scaled by migration for performance, refer to DPCT1116. Using the value of
<argument 2> through <argument 1> should also be scaled by dividing or
multiplying <expression>. You may need to adjust the code.
Detailed Help#
An optimized method has been applied to migrate atomicInc
/atomicDec
. To
align with this optimized method, some scaling adjustments are required to ensure
that values remain consistent and accurate after migration. Refer to DPCT1116
for more details.
When the code utilizes the address of an atomic variable, all the assignments of value or reference of value operation through the atomic variable address also need to be scaled by multiplying or dividing a step. Atomic variables also need to be scaled appropriately for all host-side memory operations.
Suggestions to Fix#
For example, this original CUDA* code:
1__device__ unsigned int a1;
2__global__ void kernel(){
3 ...
4 unsigned int old_val = atomicInc(&a1, 0x7fffffff);
5 ...
6}
7int main(){
8 unsigned int result;
9 unsigned int *d_addr;
10 cudaGetSymbolAddress((void **)&d_addr, a1);
11 cudaMemset(d_addr, 2, sizeof(unsigned int));
12 kernel<<<1, 100>>>();
13 ...
14 cudaMemcpy(&result, d_addr, sizeof(unsigned int), cudaMemcpyDeviceToHost);
15 ...
16}
results in the following migrated SYCL* code:
1dpct::global_memory<unsigned int, 0> a1;
2void kernel(unsigned int &a1){
3 ...
4 /*
5 DPCT1116:0: The atomicInc was migrated to dpct::atomic_fetch_add(&a1, 2) / 2 for performance, and 2 is computed by (UINT_MAX + 1) / (‘0x7fffffff’ + 1). This migration requires the initial value of ‘a1’ to be scaled by multiplying 2, and any usage of value of ‘a1’ outside atomic function to be scaled by dividing 2.
6 */
7 unsigned int old_val = dpct::atomic_fetch_add<sycl::access::address_space::generic_space>(&a1, 2) / 2;
8 ...
9}
10int main(){
11 dpct::device_ext &dev_ct1 = dpct::get_current_device();
12 sycl::queue &q_ct1 = dev_ct1.default_queue();
13 unsigned int result;
14 unsigned int *d_addr;
15 /*
16 DPCT1117:3: There is atomicInc/Dec operation on 'a1' and value of 'a1' was scaled by migration for performance, refer to DPCT1116. Using value of 'a1' through 'd_addr' should also be scaled by dividing or multiplying 2. You may need to adjust the code.
17 */
18 *((void **)&d_addr) = a1.get_ptr();
19 q_ct1.memset(d_addr, 2, sizeof(unsigned int)).wait();
20 q_ct1.submit(
21 [&](sycl::handler &cgh) {
22 a1.init();
23 auto a1_ptr_ct1 = a1.get_ptr();
24 cgh.parallel_for(
25 sycl::nd_range<3>(sycl::range<3>(1, 1, 100), sycl::range<3>(1, 1, 100)),
26 [=](sycl::nd_item<3> item_ct1) {
27 kernel1(*a1_ptr_ct1);
28 });
29 });
30 ...
31 q_ct1.memcpy(&result, d_addr, sizeof(unsigned int)).wait();
32 ...
33}
which is rewritten to:
1dpct::global_memory<unsigned int, 0> a1;
2void kernel(unsigned int &a1){
3 ...
4 unsigned int old_val = dpct::atomic_fetch_add<sycl::access::address_space::generic_space>(&a1, 2) / 2;
5 ...
6}
7int main(){
8 dpct::device_ext &dev_ct1 = dpct::get_current_device();
9 sycl::queue &q_ct1 = dev_ct1.default_queue();
10 unsigned int result;
11 unsigned int *d_addr;
12 *((void **)&d_addr) = a1.get_ptr();
13 q_ct1.memset(d_addr, 2 * 2, sizeof(unsigned int)).wait();
14 q_ct1.submit(
15 [&](sycl::handler &cgh) {
16 a1.init();
17 auto a1_ptr_ct1 = a1.get_ptr();
18 cgh.parallel_for(
19 sycl::nd_range<3>(sycl::range<3>(1, 1, 100), sycl::range<3>(1, 1, 100)),
20 [=](sycl::nd_item<3> item_ct1) {
21 kernel1(*a1_ptr_ct1);
22 });
23 });
24 ...
25 q_ct1.memcpy(&result, d_addr, sizeof(unsigned int)).wait();
26 result = result / 2;
27 ...
28}