Bug 907967 - dev-util/rocprofiler-5.3.3: Error: failed to load 'gfx803_SimpleConvolution.hsaco' Aborted (core dumped)
Status: RESOLVED WORKSFORME
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: Gentoo Science Related Packages
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-06-07 01:36 UTC by Tom Li
Modified: 2023-06-07 06:56 UTC

See Also:
Package list:
Runtime testing required: ---


Description Tom Li 2023-06-07 01:36:45 UTC
After installing dev-util/rocprofiler-5.3.3, running the command "rocprof-ctrl" causes a crash immediately.

> Error: failed to load 'gfx803_SimpleConvolution.hsaco'
> rocprof-ctrl: /var/tmp/portage/dev-util/rocprofiler-5.3.3/work/rocprofiler-
> rocm-5.3.3/test/util/hsa_rsrc_factory.cpp:589: bool 
> HsaRsrcFactory::LoadAndFinalize(const AgentInfo*, const char*, const char*, 
> hsa_executable_t*, hsa_executable_symbol_t*): Assertion `false' failed.
> Aborted (core dumped)

Checking the source code revealed that dev-util/rocprofiler wants to run some tests from the "rocprofiler-rocm-5.3.3/test" directory, such as "simple_convolution". This directory has its own CMake file, but the test kernels it builds are not installed by the ebuild.
Comment 1 Tom Li 2023-06-07 03:15:45 UTC
The build scripts and "rocprof-ctrl" are completely broken. What a nightmare.

To build the test kernels properly, one needs to enable the "mytest" target by default. I used this patch:

diff -uprN rocprofiler-rocm-5.3.3/test/CMakeLists.txt rocprofiler-rocm-5.3.3.patch/test/CMakeLists.txt
--- rocprofiler-rocm-5.3.3/test/CMakeLists.txt	2022-10-17 20:34:10.000000000 -0000
+++ rocprofiler-rocm-5.3.3.patch/test/CMakeLists.txt	2023-06-07 02:20:52.298949697 -0000
@@ -76,7 +76,7 @@ set ( TEST_NAME simple_convolution )
 set ( KERN_SRC ${TEST_DIR}/${TEST_NAME}/${TEST_NAME}.cpp )
 
 ## Building test kernels
-add_custom_target( mytest
+add_custom_target( mytest ALL
   COMMAND sh -xc "${TEST_DIR}/../bin/build_kernel.sh ${TEST_DIR}/${DUMMY_NAME}/${DUMMY_NAME} ${PROJECT_BINARY_DIR} '${ROCM_ROOT_DIR}' '${GPU_TARGETS}'"
   COMMAND sh -xc "${TEST_DIR}/../bin/build_kernel.sh ${TEST_DIR}/${TEST_NAME}/${TEST_NAME} ${PROJECT_BINARY_DIR} '${ROCM_ROOT_DIR}' '${GPU_TARGETS}'"
 )

Then you need to remove the rocminfo-based GPU target detection from build_kernel.sh:

diff -uprN rocprofiler-rocm-5.3.3/bin/build_kernel.sh rocprofiler-rocm-5.3.3.patch/bin/build_kernel.sh
--- rocprofiler-rocm-5.3.3/bin/build_kernel.sh	2022-10-17 20:34:10.000000000 -0000
+++ rocprofiler-rocm-5.3.3.patch/bin/build_kernel.sh	2023-06-07 02:32:01.969350315 -0000
@@ -22,10 +22,6 @@ if [ -z "$ROCM_DIR" ] ; then
 fi
 
 if [ -z "$TGT_LIST" ] ; then
-  TGT_LIST=`$ROCM_DIR/bin/rocminfo | grep "amdgcn-amd-amdhsa--" | head -n 1 | sed -n "s/^.*amdgcn-amd-amdhsa--\(\w*\).*$/\1/p"`
-fi
-
-if [ -z "$TGT_LIST" ] ; then
   echo "Error: GPU targets not found"
   exit 1
 fi
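
For context, the pipeline removed by this patch asks rocminfo for the first ISA name on the build host and keeps only the gfx part. The sketch below runs the same grep/head/sed chain against a made-up sample (the two "Name:" lines are an assumption about rocminfo's format, not captured output). It shows that the fallback picks whatever GPU the build machine reports first, rather than the targets chosen in the ebuild.

```shell
#!/bin/bash
# Assumed sample of two rocminfo ISA lines; not captured from a real run.
sample='      Name:                    amdgcn-amd-amdhsa--gfx803
      Name:                    amdgcn-amd-amdhsa--gfx900'

# Same pipeline as the removed line, reading the sample instead of rocminfo.
detected=$(printf '%s\n' "$sample" \
  | grep "amdgcn-amd-amdhsa--" \
  | head -n 1 \
  | sed -n "s/^.*amdgcn-amd-amdhsa--\(\w*\).*$/\1/p")

echo "detected target: ${detected}"
```

With the fallback gone, an empty TGT_LIST goes straight to the "GPU targets not found" error instead of being guessed from the host.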


Next, patch the search path via sed in the ebuild:

    local targets="$(get_amdgpu_flags)"
    targets="${targets//;}"
    sed -e "s,ROCM_DIR=\$3,ROCM_DIR=\"/usr\",g" -i bin/build_kernel.sh || die
    sed -e "s,\$ROCM_DIR/amdgcn,\"/usr/lib/amdgcn\",g" -i bin/build_kernel.sh || die
    sed -e "s,TGT_LIST=\$4,TGT_LIST=\"${targets}\",g" -i bin/build_kernel.sh || die
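
To see what these substitutions do, here is a sketch that applies them to a scratch file outside Portage (so without "|| die"). The three input lines are hypothetical stand-ins for the real contents of bin/build_kernel.sh, and the "gfx803;" value is an assumed shape for the get_amdgpu_flags output.

```shell
#!/bin/bash
# Hypothetical stand-in for the relevant lines of bin/build_kernel.sh.
work=$(mktemp -d)
printf '%s\n' 'ROCM_DIR=$3' 'TGT_LIST=$4' 'BITCODE=$ROCM_DIR/amdgcn/bitcode' \
  > "$work/build_kernel.sh"

targets="gfx803;"          # assumed shape of the get_amdgpu_flags output
targets="${targets//;}"    # removes every ';' from the string

# The same three substitutions as in the ebuild snippet.
sed -e "s,ROCM_DIR=\$3,ROCM_DIR=\"/usr\",g" -i "$work/build_kernel.sh"
sed -e "s,\$ROCM_DIR/amdgcn,\"/usr/lib/amdgcn\",g" -i "$work/build_kernel.sh"
sed -e "s,TGT_LIST=\$4,TGT_LIST=\"${targets}\",g" -i "$work/build_kernel.sh"

result=$(cat "$work/build_kernel.sh")
echo "$result"

# Bash behaviour worth knowing: with several targets the names run together.
multi="gfx803;gfx900;"
multi="${multi//;}"
echo "multi-target value: ${multi}"

rm -rf "$work"
```

Note that "${targets//;}" deletes every semicolon, so more than one target is joined into a single word. Whether that matters depends on how build_kernel.sh parses TGT_LIST, which is not shown here.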

Next, include the rocm eclass in the ebuild:

ROCM_VERSION=${PV}
inherit rocm

Even then, it's still broken because the kernels are not automatically installed. One needs to copy

rocprofiler-rocm-5.3.3_build/*.hsaco to /usr/libexec/rocprofiler/

Then, it's still broken, because rocprof-ctrl only searches the current working directory, not any system path...
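
The copy step and the working-directory limitation can be sketched as follows. This is a rough illustration, not a tested workaround: install_kernels and run_from are hypothetical helper names, and the demo uses scratch directories in place of the real build directory and /usr/libexec/rocprofiler/.

```shell
#!/bin/bash
# Copy every *.hsaco from a build directory into a destination directory.
install_kernels() {
  local build_dir=$1 dest=$2
  mkdir -p "$dest" && cp "$build_dir"/*.hsaco "$dest"/
}

# Run a command with the kernel directory as its working directory,
# inside a subshell so the caller's own directory is unchanged.
run_from() {
  local dir=$1; shift
  ( cd "$dir" && "$@" )
}

# Demo with scratch directories standing in for the real paths.
scratch=$(mktemp -d)
mkdir -p "$scratch/build"
touch "$scratch/build/gfx803_SimpleConvolution.hsaco"
install_kernels "$scratch/build" "$scratch/libexec/rocprofiler"
listing=$(run_from "$scratch/libexec/rocprofiler" ls)
echo "$listing"
rm -rf "$scratch"
```

On a real system the equivalent call would be something like "run_from /usr/libexec/rocprofiler rocprof-ctrl", on the assumption that the kernels are looked up relative to the working directory as described above.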

I think disabling all tests from rocprofiler in the source code may be a better option...
Comment 2 Tom Li 2023-06-07 03:55:29 UTC
Applying this patch can disable the missing self-test:

diff -uprN rocprofiler-rocm-5.3.3/test/app/test.cpp rocprofiler-rocm-5.3.3.patch/test/app/test.cpp
--- rocprofiler-rocm-5.3.3/test/app/test.cpp	2022-10-17 20:34:10.000000000 -0000
+++ rocprofiler-rocm-5.3.3.patch/test/app/test.cpp	2023-06-07 03:17:33.672871523 -0000
@@ -51,7 +51,7 @@ void thread_fun(const int kiter, const i
   for (int i = 0; i < kiter; ++i) {
     for (uint32_t n = 0; n < agents_number; ++n) {
       // RunKernel<DummyKernel, TestAql>(0, NULL, agent_info[n], queue[n], diter);
-      RunKernel<SimpleConvolution, TestAql>(0, NULL, agent_info[n], queue[n], diter);
+      // RunKernel<SimpleConvolution, TestAql>(0, NULL, agent_info[n], queue[n], diter);
     }
   }


But it doesn't make much sense to fix that. I opened this bug while investigating a segmentation fault in:

$ rocprof --list-basic
RPL: on '230607_034945' from '/usr' in '/root':
Basic HW counters:
/usr/bin/rocprof: line 389:   574 Segmentation fault      (core dumped) /usr/bin/rocprof-ctrl

So I thought the missing "SimpleConvolution.hsaco" was the culprit and reported this bug. I just realized that it's a red herring! The two crashes are completely unrelated. The true culprit is librocprofiler64.so.

$ HSA_TOOLS_LIB="/usr/lib64/librocprofiler64.so" rocprof-ctrl
> GPU agents :
> agent[0] :
>> Name : gfx803
>> APU : 0
>> HSAIL profile : 0
>> Max Wave Size : 64
>> Max Queue Size : 131072
>> CU number : 36
>> Waves per CU : 40
>> SIMDs per CU : 4
>> SE number : 4
>> Shader Arrays per SE : 1
Segmentation fault (core dumped)

According to the upstream bug report [1], this crash is related to the missing hsa-amd-aqlprofile. And according to Bug 716948,

> For the record, this profiler has long since been deprecated in favour of RCP (https://github.com/GPUOpen-Tools/radeon_compute_profiler). Between that and it being proprietary, I would very much advise against adding it to the tree. And yes, candrews and I *will* eventually get to packaging RCP for Gentoo :-)

So it's pointless to fix this bug at this point.

Closed as WORKSFORME.

[1] https://github.com/RadeonOpenCompute/ROCm/issues/1328
Comment 3 Tom Li 2023-06-07 06:56:03 UTC
In case anyone wants a workaround, I've documented rocprofiler's proprietary hsa-amd-aqlprofile dependency on the Gentoo Wiki: https://wiki.gentoo.org/wiki/Rocprofiler
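
Before running rocprof, it may help to check whether the proprietary library is present at all. The sketch below is only an illustration: find_lib is a hypothetical helper, and both the file name libhsa-amd-aqlprofile64.so and the two search directories are assumptions that may not match a given install.

```shell
#!/bin/bash
# Print the path of the named library in the first directory that has it.
find_lib() {
  local name=$1; shift
  local dir
  for dir in "$@"; do
    if [ -e "$dir/$name" ]; then
      echo "$dir/$name"
      return 0
    fi
  done
  return 1
}

# Assumed file name and search paths; adjust for the actual install.
lib=libhsa-amd-aqlprofile64.so
if path=$(find_lib "$lib" /usr/lib64 /opt/rocm/lib); then
  echo "found: $path"
else
  echo "missing: $lib (per the upstream report, rocprof may crash without it)"
fi
```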