| Summary: | sys-cluster/openmpi-5.0.6 fails with USE=cuda and dev-util/nvidia-cuda-toolkit-12.6.1 | | |
|---|---|---|---|
| Product: | Gentoo Linux | Reporter: | Benjamin Schulz <schulz.benjamin> |
| Component: | Current packages | Assignee: | Gentoo Cluster Team <cluster> |
| Status: | UNCONFIRMED --- | | |
| Severity: | normal | CC: | foufou33, russell.davie, schulz.benjamin, serhiihatcan, yuyuyak |
| Priority: | Normal | Keywords: | PATCH |
| Version: | unspecified | | |
| Hardware: | All | | |
| OS: | Linux | | |
| See Also: | https://github.com/open-mpi/ompi/issues/12924 https://github.com/open-mpi/ompi/pull/12934 | | |
| Whiteboard: | | | |
| Package list: | | Runtime testing required: | --- |
| Attachments: | build.log as tar.gz | | |
Description
Benjamin Schulz
2024-12-09 06:58:37 UTC
Created attachment 913611
build.log as tar.gz
It compiles with USE=-cuda at least.

aliasing -finline-functions -c coll_cuda_module.c -fPIC -DPIC -o .libs/coll_cuda_module.o
coll_cuda_module.c: In function ‘mca_coll_cuda_comm_query’:
coll_cuda_module.c:107:42: error: assignment to ‘mca_coll_base_module_reduce_local_fn_t’ {aka ‘int (*)(const void *, void *, int, struct ompi_datatype_t *, struct ompi_op_t *, struct mca_coll_base_module_2_4_0_t *)’} from incompatible pointer type ‘int (*)(const void *, void *, size_t, struct ompi_datatype_t *, struct ompi_op_t *, mca_coll_base_module_t *)’ {aka ‘int (*)(const void *, void *, long unsigned int, struct ompi_datatype_t *, struct ompi_op_t *, struct mca_coll_base_module_2_4_0_t *)’} [-Wincompatible-pointer-types]
  107 |     cuda_module->super.coll_reduce_local = mca_coll_cuda_reduce_local;
      |                                          ^
make[2]: *** [Makefile:1556: coll_cuda_module.lo] Error 1
make[2]: Leaving directory '/var/tmp/portage/sys-cluster/openmpi-5.0.6/work/openmpi-5.0.6/ompi/mca/coll/cuda'
make[2]: *** Waiting for unfinished jobs....
make[2]: Entering directory '/var/tmp/portage/sys-cluster/openmpi-5.0.6/work/openmpi-5.0.6/ompi/mca/coll/cuda'
/bin/bash ../../../../libtool --tag=CC --mode=compile x86_64-pc-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include -I../../../../ompi/include -I../../../../oshmem/include -I../../../../ompi/mpiext/cuda/c -I../../../../ompi/mpiext/rocm/c -iquote../../../.. -I/usr/include/pmix -DNDEBUG -march=native -O3 -pipe -fno-strict-aliasing -finline-fu

This looks more like a fault in 5.0.6 than in 12.6.1. I went back to CUDA 12.4.x and still hit the same problem. I tried my luck on their bug tracker and found this:
https://github.com/open-mpi/ompi/issues/12924
and here is their patch (untested):
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi/pull/12934.patch

Creating the directory /etc/portage/patches/sys-cluster/openmpi/ and putting that patch in there works for me. It then compiles with the recent CUDA. So please include it and stabilize.

I had the same problem on an AMD Zen 2 machine, but an Intel laptop worked just fine. The patch worked for me also, thanks Benjamin.

This issue is now fixed in the next version, 5.0.7, as reported upstream on Feb 17 2025:
https://github.com/open-mpi/ompi/issues/12924#issuecomment-2661628176

Thanks Benjamin, for 5.0.6 the patch worked nicely!
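
For readers unfamiliar with the diagnostic in the description: the build fails because a function whose count parameter is a size_t is assigned to a function-pointer slot whose count parameter is an int. GCC 14 and later treat -Wincompatible-pointer-types as an error by default. The sketch below reproduces the shape of the mismatch with simplified, hypothetical types (coll_module, reduce_local_int and friends are illustrative names, not the real Open MPI declarations). It shows one possible way to make the two signatures agree and does not claim to mirror what the upstream patch changes.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real Open MPI declarations. */
typedef int (*reduce_local_int_fn)(const void *in, void *inout, int count);
typedef int (*reduce_local_size_fn)(const void *in, void *inout, size_t count);

/* The module slot expects the variant that takes an int count. */
struct coll_module {
    reduce_local_int_fn reduce_local;
};

/* Implementation written against the size_t variant: adds in[] into inout[]. */
int reduce_local_size(const void *in, void *inout, size_t count)
{
    const unsigned char *src = in;
    unsigned char *dst = inout;
    for (size_t i = 0; i < count; i++) {
        dst[i] = (unsigned char)(dst[i] + src[i]);
    }
    return 0;
}

/* Wrapper whose signature matches the slot exactly. */
int reduce_local_int(const void *in, void *inout, int count)
{
    if (count < 0) {
        return -1; /* reject counts that cannot be converted to size_t */
    }
    return reduce_local_size(in, inout, (size_t)count);
}

void install_reduce_local(struct coll_module *module)
{
    /* module->reduce_local = reduce_local_size;
     * The line above is the mismatch: int (*)(..., size_t) assigned to
     * int (*)(..., int), which is what the build log complains about. */
    module->reduce_local = reduce_local_int; /* signatures agree */
}
```

Calling through a pointer of the wrong type is undefined behaviour in C even where it appears to work, so making the signatures match is preferable to silencing the diagnostic with a cast or with -Wno-error.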