Attached are two files, main.cpp and mdspan.h. mdspan.h is essentially an extension of the C++23 mdspan class: it works with extents on the heap as well as on the stack, and it contains code for GPU offloading. It has a member variable, called datastruct, that can be mapped to the GPU. It also contains various mathematical functions, for example matrix multiplication and LU decomposition. The matrix multiplication has a flag for GPU offload; if that is set to true, mdspan's datastruct object is extracted and offloaded, and the matrices are multiplied on the device. In the main program, two matrices are multiplied on the GPU, and that works when compiled with clang 20.0.0.

Now uncomment the commented-out lines in main.cpp. They create a third mdspan object with entirely different data, and then perform an LU decomposition which I know is possible. A matrix multiplication is involved there, but the flags of the LU decomposition are set such that neither it nor that multiplication uses the GPU. On its own, the LU decomposition works. Together with the two calls to matrix multiply on the GPU, however, the program crashes with the following output:

ordinary matrix multiplication, on gpu
"PluginInterface" error: Failure to copy data from device to host. Pointers: host = 0x00007ffd7e0ec408, device = 0x00007f8647801028, size = 1: Error in cuMemcpyDtoHAsync: an illegal memory access was encountered
omptarget error: Copying data from device failed.
omptarget error: Call to targetDataEnd failed, abort target.
omptarget error: Failed to process data after launching the kernel.
omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.

None of this makes much sense to me. How can a program that offloads a calculation to the GPU fail when I add a function that runs on the CPU and has nothing to do with the GPU functions?
[100%] Linking CXX executable arraytest
/usr/bin/cmake -E cmake_link_script CMakeFiles/arraytest.dir/link.txt --verbose=1
nvlink warning : Stack size for entry function '__omp_offloading_10306_1944f12__Z16lu_decompositionIdSt6vectorImSaImEEEvR6mdspanIT_T0_ES7_S7_32matrix_multiplication_parametersmbbm_l3121' cannot be statically determined
nvlink warning : Stack size for entry function '__omp_offloading_10306_1944f12__Z19matrix_multiply_dotIdSt6vectorImSaImEES2_S2_EbR6mdspanIT_T0_ERS3_IS4_T1_ERS3_IS4_T2_Ebmm_l3473' cannot be statically determined
nvlink warning : Stack size for entry function '__omp_offloading_10306_1944f12__Z19matrix_multiply_dotIdSt6vectorImSaImEES2_S2_EbR6mdspanIT_T0_ERS3_IS4_T1_ERS3_IS4_T2_Ebmm_l3432' cannot be statically determined

According to a Google search, such warnings are printed when an algorithm contains recursion, but there is no recursion in the matrix multiplication or in the LU decomposition. What the LU decomposition does do is allocate a matrix that is often larger than the array it works on; in this case, however, that allocation happens on the CPU.

The matrix multiplication function also has a tile option, which uploads only submatrices to the GPU. If that flag is on, the matrix multiplication fails even on its own. Could it be that the device mapper insists on copying the entire array before it can copy a part of it back? That would not make much sense for very large data, and in the example above the LU decomposition does not copy to the GPU at all.

How can a working GPU function fail when I merely add code that runs solely on the CPU with entirely different data, when that function is known to work successfully on its own?

Reproducible: Always
Created attachment 916411 [details]
an extension of the C++ mdspan class with GPU offload and some math functions
Created attachment 916412 [details]
a small test program

As usual, compile with:

clang++ -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -lrt -lm -lc -lstdc++

In order to make clang generate code where the device mapper fails, uncomment the lines

mdspan<double, std::vector<size_t>> A3(A3_data.data(), true, {rows, cols});
mdspan<double, std::vector<size_t>> L(L_data.data(), true, {rows, cols});
mdspan<double, std::vector<size_t>> U(U_data.data(), true, {rows, cols});
cout << "LU decomposition of" << endl;
printmatrix(A3);
matrix_multiplication_parameters par;
lu_decomposition(A3, L, U, par, 0, false, false, 0);
printmatrix(L);
printmatrix(U);

that were commented out. Note that they do not use the GPU on their own. The crash appears only together with

size_t rows = 4, cols = 4;
vector<double> A_data(16, 0);
vector<double> B_data(16, 0);
vector<double> C_data(16, 0);

// Example initialization
for (size_t i = 0; i < rows * cols; ++i) {
    A_data[i] = i + 1;
    B_data[i] = i;
}

mdspan<double, std::vector<size_t>> A(A_data.data(), true, {rows, cols});
mdspan<double, std::vector<size_t>> B(B_data.data(), true, {rows, cols});
mdspan<double, std::vector<size_t>> C(C_data.data(), true, {rows, cols});

cout << "ordinary matrix multiplication, on gpu" << endl;
matrix_multiply_dot(A, B, C, true, 0, 0);
printmatrix(C);

size_t rows4 = 4, cols4 = 4;
vector<double> A_data4(16, 0);
vector<double> B_data4(16, 0);
vector<double> C_data4(16, 0);

// Example initialization
for (size_t i = 0; i < rows4 * cols4; ++i) {
    A_data4[i] = i + 1;
    B_data4[i] = i;
}

mdspan<double, std::vector<size_t>> A4(A_data4.data(), true, {rows4, cols4});
mdspan<double, std::vector<size_t>> B4(B_data4.data(), true, {rows4, cols4});
mdspan<double, std::vector<size_t>> C4(C_data4.data(), true, {rows4, cols4});

cout << "ordinary matrix multiplication, on gpu" << endl;
matrix_multiply_dot(A4, B4, C4, true, 0, 0);
printmatrix(C4);
Please report this upstream.
I will do that. I thought some upstream people were already here, and maybe others can confirm this. Where is the clang bug tracker for such things?

By the way, gcc claimed something about "alias definitions would be forbidden", but there are no alias definitions in mdspan.h. On my systems, after I added memory-mapped files for the Strassen algorithm, gcc now has difficulties linking to the symbol unmap. When I get that compiling again, I will try to build it with gcc and see what it does.
https://github.com/llvm/llvm-project/issues