Bug 948019 - clang llvm-core/clang-20.0.0 generates invalid code when uploading to gpu.
Summary: clang llvm-core/clang-20.0.0 generates invalid code when uploading to gpu.
Status: RESOLVED UPSTREAM
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages (show other bugs)
Hardware: All Linux
Importance: Normal normal
Assignee: LLVM support project
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-01-13 03:58 UTC by Benjamin Schulz
Modified: 2025-01-13 05:21 UTC (History)
1 user (show)

See Also:
Package list:
Runtime testing required: ---


Attachments
an extension of the mdspan class of c++ with gpu offload and some math functions (mdspan.h, 117.02 KB, text/plain)
2025-01-13 04:00 UTC, Benjamin Schulz
Details
a small test program. (main.cpp,2.64 KB, text/x-c++src)
2025-01-13 04:04 UTC, Benjamin Schulz
Details

Description Benjamin Schulz 2025-01-13 03:58:44 UTC
Attached are two files, main.cpp and mdspan.h. mdspan.h is an extension of the C++23 mdspan class: it works with extents stored either on the heap or on the stack, and it contains code for GPU offloading.

It has a member variable, called datastruct, that can be mapped to the GPU.

It also contains various mathematical functions, for example matrix multiplication and LU decomposition.

The matrix multiplication has a flag for GPU offload. If that is set to true, the mdspan's datastruct object is extracted and offloaded, and the matrices are multiplied on the device.

In the main program, two matrices are multiplied on the GPU, and that works when compiled with clang-20.0.0.

Now uncomment the commented-out lines in main.cpp. They create a third mdspan object with entirely different data and then compute an LU decomposition, which I know is possible. A matrix multiplication is involved there, but the flags of the LU decomposition are set so that it (and the multiplication) won't use the GPU.


Used alone, the LU decomposition works.

But together with the two calls to matrix multiply on the GPU, the program crashes with the following output:

ordinary matrix multiplication, on gpu
"PluginInterface" error: Faliure to copy data from device to host. Pointers: host = 0x00007ffd7e0ec408, device = 0x00007f8647801028, size = 1: Error in cuMemcpyDtoHAsync: an illegal memory access was encountered
omptarget error: Copying data from device failed.
omptarget error: Call to targetDataEnd failed, abort target.
omptarget error: Failed to process data after launching the kernel.
omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.

None of this makes much sense. How can a program that offloads a calculation to the GPU fail when I add a function that runs on the CPU and has nothing to do with the GPU functions?
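As an aside for anyone reproducing this: the debugging options that the last log line points to can be switched on through environment variables of the LLVM offload runtime, and on NVIDIA hardware compute-sanitizer can often pinpoint the illegal access behind a cuMemcpyDtoHAsync failure. The binary name arraytest is taken from the link log quoted in this report.

```shell
# Trace data mappings and kernel launches in the offload runtime
# (LIBOMPTARGET_INFO works with release builds of the runtime;
#  LIBOMPTARGET_DEBUG=1 additionally requires a debug build).
LIBOMPTARGET_INFO=-1 ./arraytest

# Locate the illegal device memory access with NVIDIA's sanitizer
compute-sanitizer ./arraytest
```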

[100%] Linking CXX executable arraytest
/usr/bin/cmake -E cmake_link_script CMakeFiles/arraytest.dir/link.txt --verbose=1
nvlink warning : Stack size for entry function '__omp_offloading_10306_1944f12__Z16lu_decompositionIdSt6vectorImSaImEEEvR6mdspanIT_T0_ES7_S7_32matrix_multiplication_parametersmbbm_l3121' cannot be statically determined
nvlink warning : Stack size for entry function '__omp_offloading_10306_1944f12__Z19matrix_multiply_dotIdSt6vectorImSaImEES2_S2_EbR6mdspanIT_T0_ERS3_IS4_T1_ERS3_IS4_T2_Ebmm_l3473' cannot be statically determined
nvlink warning : Stack size for entry function '__omp_offloading_10306_1944f12__Z19matrix_multiply_dotIdSt6vectorImSaImEES2_S2_EbR6mdspanIT_T0_ERS3_IS4_T1_ERS3_IS4_T2_Ebmm_l3432' cannot be statically determined
 


Such warnings are, according to a Google search, printed when an algorithm contains recursion. But there is no recursion in matrix multiply or in the LU decomposition...

What the LU decomposition does do, however, is allocate a matrix that is often larger than the array it works on. But in this case that allocation happens on the CPU. The matrix multiplication function also has a tile option; it then uploads only submatrices to the GPU. If that flag is set, the matrix multiplication fails on its own. Maybe the device mapper insists that the entire array be copied before it can copy anything back? But that would not make much sense for very large data.

Also, in the above example, the LU decomposition does not copy anything to the GPU at all.

How can a working GPU function fail when I merely add code that runs solely on the CPU with entirely different data, and when that added function is known to work successfully on its own?




Reproducible: Always
Comment 1 Benjamin Schulz 2025-01-13 04:00:24 UTC
Created attachment 916411 [details]
an extension of the mdspan class of c++ with gpu offload and some math functions
Comment 2 Benjamin Schulz 2025-01-13 04:04:08 UTC
Created attachment 916412 [details]
a small test program.

as usual, compile with 

clang++ -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda main.cpp -lrt -lm -lc -lstdc++

In order to make clang generate code where the device mapper fails, uncomment the lines

//mdspan<double, std::vector<size_t>> A3(A3_data.data(), true, {rows, cols});
//mdspan<double, std::vector<size_t>> L(L_data.data(), true, {rows, cols});
//mdspan<double, std::vector<size_t>> U(U_data.data(), true, {rows, cols});
//cout<<"LU decomposition of" <<endl;
//printmatrix(A3);
//matrix_multiplication_parameters par;
//lu_decomposition(A3,L,U,par,0,false,false,0);
//printmatrix(L);
//printmatrix(U);

Note that these lines do not use the GPU on their own. The crash appears only together with


size_t rows = 4, cols = 4;

vector<double>A_data(16,0);
vector<double>B_data(16,0);
vector<double>C_data(16,0);

    // Initialize data
    for (size_t i = 0; i < rows * cols; ++i)
    {
        A_data[i] = i + 1; // example initialization
        B_data[i] = i;     // example initialization
    }

mdspan<double, std::vector<size_t>> A(A_data.data(), true, {rows, cols});
mdspan<double, std::vector<size_t>> B(B_data.data(), true, {rows, cols});
mdspan<double, std::vector<size_t>> C(C_data.data(), true, {rows, cols});


cout<<"ordinary matrix multiplication, on gpu"<<endl;
matrix_multiply_dot(A, B, C,true, 0,0);

printmatrix(C);


size_t rows4 = 4, cols4 = 4;

vector<double>A_data4(16,0);
vector<double>B_data4(16,0);
vector<double>C_data4(16,0);

    // Initialize data (loop bound fixed to use rows4/cols4)
    for (size_t i = 0; i < rows4 * cols4; ++i)
    {
        A_data4[i] = i + 1; // example initialization
        B_data4[i] = i;     // example initialization
    }

mdspan<double, std::vector<size_t>> A4(A_data4.data(), true, {rows4, cols4});
mdspan<double, std::vector<size_t>> B4(B_data4.data(), true, {rows4, cols4});
mdspan<double, std::vector<size_t>> C4(C_data4.data(), true, {rows4, cols4});


cout<<"ordinary matrix multiplication, on gpu"<<endl;
matrix_multiply_dot(A4, B4, C4,true, 0,0);

printmatrix(C4);
Comment 3 Sam James 2025-01-13 04:09:12 UTC
Please report this upstream.
Comment 4 Benjamin Schulz 2025-01-13 04:22:03 UTC
I will. I thought some upstream people were already here.
And maybe others can confirm this...


Where is the clang bug tracker for such things?


By the way, gcc claimed something about "alias definitions would be forbidden", but there are no alias definitions in mdspan.h. Now, on my system, after I added memory-mapped files for the Strassen algorithm, gcc has difficulties linking to the symbol unmap. When I get that compiling again, I will try to build it with gcc and see what that does...
Comment 5 Sam James 2025-01-13 04:23:56 UTC
https://github.com/llvm/llvm-project/issues