After installing dev-util/rocprofiler-5.3.3, running "rocprof-ctrl" crashes immediately:

> Error: failed to load 'gfx803_SimpleConvolution.hsaco'
> rocprof-ctrl: /var/tmp/portage/dev-util/rocprofiler-5.3.3/work/rocprofiler-rocm-5.3.3/test/util/hsa_rsrc_factory.cpp:589: bool HsaRsrcFactory::LoadAndFinalize(const AgentInfo*, const char*, const char*, hsa_executable_t*, hsa_executable_symbol_t*): Assertion `false' failed.
> Aborted (core dumped)

Checking the source code revealed that dev-util/rocprofiler wants to run some tests from the "rocprofiler-rocm-5.3.3/test" directory, such as "simple_convolution". This directory has its own CMake file, but the tests are not installed by the ebuild.
The build scripts and "rocprof-ctrl" are completely broken. What a nightmare.

To build the test kernels properly, the "mytest" target has to be added to the default build. I used this patch:

diff -uprN rocprofiler-rocm-5.3.3/test/CMakeLists.txt rocprofiler-rocm-5.3.3.patch/test/CMakeLists.txt
--- rocprofiler-rocm-5.3.3/test/CMakeLists.txt 2022-10-17 20:34:10.000000000 -0000
+++ rocprofiler-rocm-5.3.3.patch/test/CMakeLists.txt 2023-06-07 02:20:52.298949697 -0000
@@ -76,7 +76,7 @@ set ( TEST_NAME simple_convolution )
 set ( KERN_SRC ${TEST_DIR}/${TEST_NAME}/${TEST_NAME}.cpp )
 
 ## Building test kernels
-add_custom_target( mytest
+add_custom_target( mytest ALL
   COMMAND sh -xc "${TEST_DIR}/../bin/build_kernel.sh ${TEST_DIR}/${DUMMY_NAME}/${DUMMY_NAME} ${PROJECT_BINARY_DIR} '${ROCM_ROOT_DIR}' '${GPU_TARGETS}'"
   COMMAND sh -xc "${TEST_DIR}/../bin/build_kernel.sh ${TEST_DIR}/${TEST_NAME}/${TEST_NAME} ${PROJECT_BINARY_DIR} '${ROCM_ROOT_DIR}' '${GPU_TARGETS}'"
 )

Then you need to remove the rocminfo-based GPU target lookup from build_kernel.sh:

diff -uprN rocprofiler-rocm-5.3.3/bin/build_kernel.sh rocprofiler-rocm-5.3.3.patch/bin/build_kernel.sh
--- rocprofiler-rocm-5.3.3/bin/build_kernel.sh 2022-10-17 20:34:10.000000000 -0000
+++ rocprofiler-rocm-5.3.3.patch/bin/build_kernel.sh 2023-06-07 02:32:01.969350315 -0000
@@ -22,10 +22,6 @@ if [ -z "$ROCM_DIR" ] ; then
 fi
 
 if [ -z "$TGT_LIST" ] ; then
-  TGT_LIST=`$ROCM_DIR/bin/rocminfo | grep "amdgcn-amd-amdhsa--" | head -n 1 | sed -n "s/^.*amdgcn-amd-amdhsa--\(\w*\).*$/\1/p"`
-fi
-
-if [ -z "$TGT_LIST" ] ; then
   echo "Error: GPU targets not found"
   exit 1
 fi

Next, patch the search paths and the target list via sed in the ebuild:

local targets="$(get_amdgpu_flags)"
targets="${targets//;}"
sed -e "s,ROCM_DIR=\$3,ROCM_DIR=\"/usr\",g" -i bin/build_kernel.sh || die
sed -e "s,\$ROCM_DIR/amdgcn,\"/usr/lib/amdgcn\",g" -i bin/build_kernel.sh || die
sed -e "s,TGT_LIST=\$4,TGT_LIST=\"${targets}\",g" -i bin/build_kernel.sh || die

Finally, include the rocm eclass in the ebuild, which provides get_amdgpu_flags:

ROCM_VERSION=${PV}
inherit rocm
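To sanity-check the three sed substitutions outside of Portage, a minimal sketch like the following may help. It assumes that build_kernel.sh contains the assignments `ROCM_DIR=$3` and `TGT_LIST=$4` and some reference to `$ROCM_DIR/amdgcn`. This is inferred from the sed patterns above, not copied from upstream. The sample file, the `BITCODE_DIR` line, the `patch_build_kernel` helper and the hard-coded `gfx803` target are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: applies the same three substitutions as the ebuild
# snippet to the file given as $1, using the target list given as $2.
# Requires GNU sed (for -i without a backup suffix).
patch_build_kernel() {
    file=$1
    targets=$2
    sed -e "s,ROCM_DIR=\$3,ROCM_DIR=\"/usr\",g" \
        -e "s,\$ROCM_DIR/amdgcn,\"/usr/lib/amdgcn\",g" \
        -e "s,TGT_LIST=\$4,TGT_LIST=\"${targets}\",g" \
        -i "$file"
}

# Stand-in for build_kernel.sh; all three lines are assumptions.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ROCM_DIR=$3
TGT_LIST=$4
BITCODE_DIR=$ROCM_DIR/amdgcn/bitcode
EOF

patch_build_kernel "$sample" "gfx803"

# Show the patched stand-in file.
cat "$sample"
```

If the patched file still shows `$3` or `$4`, the patterns no longer match the script and the ebuild's sed calls would silently do nothing.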
Even then, it's still broken because the kernels are not automatically installed. One needs to copy rocprofiler-rocm-5.3.3_build/*.hsaco to /usr/libexec/rocprofiler/

Then, it's still broken, because rocprof-ctrl only searches the current working directory, not any system path...

I think disabling all tests from rocprofiler in the source code may be a better option...
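Since rocprof-ctrl only looks in the current working directory, one way to try the copied kernels by hand is to start it from the directory they were copied to. A rough sketch, assuming the .hsaco files are in /usr/libexec/rocprofiler as described above; the `run_from_dir` helper is hypothetical and not part of rocprofiler, and it only addresses the kernel lookup:

```shell
#!/bin/sh
# Hypothetical helper: run a command with its working directory set to $1.
# The subshell keeps the caller's own working directory unchanged.
run_from_dir() {
    dir=$1
    shift
    ( cd "$dir" && "$@" )
}

# Example (uncomment on a system where the kernels were copied there):
# run_from_dir /usr/libexec/rocprofiler rocprof-ctrl
```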
This patch disables the self-test that needs the missing kernel:

diff -uprN rocprofiler-rocm-5.3.3/test/app/test.cpp rocprofiler-rocm-5.3.3.patch/test/app/test.cpp
--- rocprofiler-rocm-5.3.3/test/app/test.cpp 2022-10-17 20:34:10.000000000 -0000
+++ rocprofiler-rocm-5.3.3.patch/test/app/test.cpp 2023-06-07 03:17:33.672871523 -0000
@@ -51,7 +51,7 @@ void thread_fun(const int kiter, const i
   for (int i = 0; i < kiter; ++i) {
     for (uint32_t n = 0; n < agents_number; ++n) {
       // RunKernel<DummyKernel, TestAql>(0, NULL, agent_info[n], queue[n], diter);
-      RunKernel<SimpleConvolution, TestAql>(0, NULL, agent_info[n], queue[n], diter);
+      // RunKernel<SimpleConvolution, TestAql>(0, NULL, agent_info[n], queue[n], diter);
     }
   }

But it doesn't make much sense to fix that. I opened this bug while investigating a segmentation fault in:

$ rocprof --list-basic
RPL: on '230607_034945' from '/usr' in '/root':
Basic HW counters:
/usr/bin/rocprof: line 389: 574 Segmentation fault (core dumped) /usr/bin/rocprof-ctrl

So I thought the missing "SimpleConvolution.hsaco" was the culprit and reported this bug. I just realized that it's a red herring! The two crashes are completely unrelated. The true culprit is librocprofiler64.so.

$ HSA_TOOLS_LIB="/usr/lib64/librocprofiler64.so" rocprof-ctrl
> GPU agents :
> agent[0] :
>> Name : gfx803
>> APU : 0
>> HSAIL profile : 0
>> Max Wave Size : 64
>> Max Queue Size : 131072
>> CU number : 36
>> Waves per CU : 40
>> SIMDs per CU : 4
>> SE number : 4
>> Shader Arrays per SE : 1
Segmentation fault (core dumped)

According to the upstream bug report [1], this crash is related to the missing hsa-amd-aqlprofile. And according to Bug 716948,

> For the record, this profiler has long since been deprecated in favour of RCP (https://github.com/GPUOpen-Tools/radeon_compute_profiler). Between that and it being proprietary, I would very much advise against adding it to the tree.
>
> And yes, candrews and I *will* eventually get to packaging RCP for Gentoo :-)

So it's pointless to fix this bug at this point. Closed as WORKSFORME.

[1] https://github.com/RadeonOpenCompute/ROCm/issues/1328
In case anyone wants a workaround, I've documented rocprofiler's proprietary hsa-amd-aqlprofile dependency on the Gentoo Wiki: https://wiki.gentoo.org/wiki/Rocprofiler
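For a quick check of whether the hsa-amd-aqlprofile library is present before running rocprof, something along these lines could be used. The file name `libhsa-amd-aqlprofile64.so` and the directories searched are assumptions and may differ between ROCm versions and installations; `find_lib` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print the first path whose name starts with library
# name $1 in the directories given as the remaining arguments.
# Returns 1 if nothing matches.
find_lib() {
    name=$1
    shift
    for dir in "$@"; do
        # An unmatched glob stays literal, so the -e test filters it out.
        for candidate in "$dir/$name"*; do
            if [ -e "$candidate" ]; then
                printf '%s\n' "$candidate"
                return 0
            fi
        done
    done
    return 1
}

# Assumed library name and search directories.
if path=$(find_lib libhsa-amd-aqlprofile64.so /usr/lib64 /usr/lib /opt/rocm/lib); then
    echo "found: $path"
else
    echo "hsa-amd-aqlprofile not found; rocprof-ctrl may crash as described above"
fi
```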