Summary: | [gentoo-science] mlx4 there is a mismatch between the kernel and the userspace libraries | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Vittorio <vitto.giova> |
Component: | Current packages | Assignee: | Gentoo Cluster Team <cluster> |
Status: | RESOLVED NEEDINFO | ||
Severity: | normal | CC: | alexxy, jlec, jsbronder, mschoepf |
Priority: | High | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
Description
Vittorio
2009-10-14 13:10:57 UTC
output of a mpirun launch

mlx4: There is a mismatch between the kernel and the userspace libraries: Kernel does not support XRC. Exiting.
CMA: unable to open RDMA device
[randori:15932] *** Process received signal ***
[randori:15932] Signal: Segmentation fault (11)
[randori:15932] Signal code: Address not mapped (1)
[randori:15932] Failing at address: 0x10c
[randori:15932] [ 0] /lib/libpthread.so.0 [0x7f78b6b78a10]
[randori:15932] [ 1] /usr/lib/libibverbs.so.1(ibv_close_device+0x21) [0x7f78b2d4b8c1]
[randori:15932] [ 2] /usr/lib/librdmacm.so.1 [0x7f78b2f53d2e]
[randori:15932] [ 3] /usr/lib/librdmacm.so.1 [0x7f78b2f53ee1]
[randori:15932] [ 4] /usr/lib/librdmacm.so.1(rdma_create_event_channel+0x12) [0x7f78b2f55e42]
[randori:15932] [ 5] /usr/lib64/mpi/mpi-ompi/usr/lib64/openmpi/mca_btl_openib.so [0x7f78b2b216a2]
[randori:15932] [ 6] /usr/lib64/mpi/mpi-ompi/usr/lib64/openmpi/mca_btl_openib.so [0x7f78b2b265aa]
[randori:15932] [ 7] /usr/lib64/mpi/mpi-ompi/usr/lib64/openmpi/mca_btl_openib.so [0x7f78b2b22014]
[randori:15932] [ 8] /usr/lib64/mpi/mpi-ompi/usr/lib64/openmpi/mca_btl_openib.so [0x7f78b2b10e8d]
[randori:15932] [ 9] /usr/lib64/mpi/mpi-ompi/usr/lib64/libmpi.so.0(mca_btl_base_select+0x1ba) [0x7f78b7b6237a]
[randori:15932] [10] /usr/lib64/mpi/mpi-ompi/usr/lib64/openmpi/mca_bml_r2.so [0x7f78b3362911]
[randori:15932] [11] /usr/lib64/mpi/mpi-ompi/usr/lib64/libmpi.so.0(mca_bml_base_init+0x9f) [0x7f78b7b61b0f]
[randori:15932] [12] /usr/lib64/mpi/mpi-ompi/usr/lib64/openmpi/mca_pml_ob1.so [0x7f78b376d01f]
[randori:15932] [13] /usr/lib64/mpi/mpi-ompi/usr/lib64/libmpi.so.0(mca_pml_base_select+0x20e) [0x7f78b7b6ce7e]
[randori:15932] [14] /usr/lib64/mpi/mpi-ompi/usr/lib64/libmpi.so.0 [0x7f78b7b23ff8]
[randori:15932] [15] /usr/lib64/mpi/mpi-ompi/usr/lib64/libmpi.so.0(PMPI_Init+0x179) [0x7f78b7b46679]
[randori:15932] [16] mpitest(main+0x2b) [0x400b47]
[randori:15932] [17] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f78b68335c6]
[randori:15932] [18] mpitest [0x400a59]
[randori:15932] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 15932 on node randori exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[randori:15929] *** Process received signal ***
[randori:15929] Signal: Segmentation fault (11)
[randori:15929] Signal code: Address not mapped (1)
[randori:15929] Failing at address: 0x7f7fdcec2ae0
Segmentation fault

Hopefully you are the right addresses, otherwise send it back to me.

What do you mean by "right addresses"? If you mean access rights, yes, as I'm running as root. Otherwise, if by address you mean the other nodes of the cluster, yes, they are mapped correctly through opensm.

(In reply to comment #3)
> what do you mean by "right addresses"?

The ones I assigned the bug to. I didn't really get which package gave you the trouble, so I chose, a little blindly, the maintainers to assign the bug to. Perhaps I missed someone.

(In reply to comment #3)
> what do you mean by "right addresses"? if you mean access rights, yes as i'm
> running by root, otherwise if by address you mean the other nodes of the
> clusters, yes they are mapped correctly through opensm
>

Yep, the right address is me. But I can't test the mlx4 driver, since I only have mthca RDMA devices. So what I can recommend is to try another kernel version. Also, which version of libmlx4 did you try?

This bug also applies with the unstable kernel 2.6.31-r4.

Is this still a problem for you? If so, could you share some information about your hardware specification?

(In reply to comment #7)
> Is this still a problem for you? If so, could you share some information about
> your hardware specification?

I was having the same issue. If I recall correctly, I could resolve it by updating the firmware of the InfiniBand cards (we have some Dell OEM cards).
But I also recall that I installed libibverbs and libmlx4 from source. There was some issue with the Gentoo ebuild... All I can actually recall is that this issue had absolutely nothing to do with the kernel, as it was configured correctly. It was rather an issue with the userspace libs... Hope that helps...

Googling a bit about this issue gave me the same information: a firmware upgrade resolved everyone's problem.
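
For anyone checking whether a firmware upgrade is relevant on their own nodes, here is a minimal sketch that prints the firmware version each InfiniBand device reports. It assumes the kernel exposes a `fw_ver` file per device under `/sys/class/infiniband` (e.g. `mlx4_0/fw_ver`); the function name `read_firmware_versions` and its `sysfs_root` parameter are made up for illustration and were not part of this bug report. It only reads the reported version and does not say which firmware release fixes the XRC mismatch.

```python
from pathlib import Path


def read_firmware_versions(sysfs_root="/sys/class/infiniband"):
    """Return {device_name: firmware_version} for each device under sysfs_root.

    Assumption: every device directory contains a plain-text 'fw_ver' file.
    Devices without that file are skipped.
    """
    versions = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # No InfiniBand drivers loaded, or a different sysfs layout.
        return versions
    for dev in sorted(root.iterdir()):
        fw_file = dev / "fw_ver"
        if fw_file.is_file():
            versions[dev.name] = fw_file.read_text().strip()
    return versions


if __name__ == "__main__":
    found = read_firmware_versions()
    if not found:
        print("no InfiniBand devices found in sysfs")
    for name, fw in found.items():
        print(f"{name}: firmware {fw}")
```

Running it on each node before and after the firmware update makes it easy to confirm that the cards actually picked up the new version.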