Summary: | sys-cluster/openmpi-1.5.3-r1: mpirun hangs or fails with Segmentation faults | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Juergen Rose <rose> |
Component: | [OLD] Library | Assignee: | Justin Bronder (RETIRED) <jsbronder> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | cluster |
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: | ex1a.c; Makefile to generate ex1a |
Description
Juergen Rose
2011-07-03 15:47:30 UTC
Created attachment 278935 [details]
ex1a.c
Created attachment 278937 [details]
Makefile to generate ex1a
The same happens with openmpi-1.5.3-r2.

Hello. I tried to reproduce this error but did not succeed. Could you run your example under a debugger (e.g. ddd) to see exactly where the problem comes from? Printf debugging is unreliable, because stdout is buffered and messages may not appear in the order they were printed. ;-]

    # compile with
    mpicc -g ex1a.c -o ex1a
    # then run with
    mpirun -np 2 ddd ex1a

If the problem is not ebuild related, it could also help to build Open MPI by hand with the --enable-debug configure option to locate it.

Hello Christoph,

if I run the program under DDD:

    rose@orca:/home/rose/Txt/src/Test/C/MPI/Ex1a(11)$ mpirun -np 2 ddd ex1a_debug

I see in the first DDD window:

    (gdb) run
    [Thread debugging using libthread_db enabled]
    argc= 1 argv=0x7fffffffc628
    i= 0[ 1] argv[i]=|/home_orca/rose/Txt/src/Test/C/MPI/Ex1a/ex1a_debug|
    before 'MPI_Init(&argc,&argv)'
    [New Thread 0x7ffff40cd700 (LWP 14357)]
    [New Thread 0x7fffef6c6700 (LWP 14358)]
    rc=0 MPI_SUCCESS=0
    Hello, world. I am 0 of 2 on orca
    WE have 2 processes
    Hello 1 Processor 1 at node orca reporting for duty
    rank= 0 numtask= 2 processor_name=orca, before 'sleep(10)'
    rank= 0 numtask= 2 processor_name=orca, after 'sleep(10)'
    rank= 0, before 'MPI_Finalize'
    [Thread 0x7ffff40cd700 (LWP 14357) exited]
    [Thread 0x7fffef6c6700 (LWP 14358) exited]
    [New Thread 0x7fffef6c6700 (LWP 14365)]
    [Thread 0x7fffef6c6700 (LWP 14365) exited]
    [New Thread 0x7fffef6c6700 (LWP 14367)]
    [Thread 0x7fffef6c6700 (LWP 14367) exited]
    Program exited normally.
    (gdb)

In the second DDD window I see:

    (gdb) run
    [Thread debugging using libthread_db enabled]
    argc= 1 argv=0x7fffffffc628
    i= 0[ 1] argv[i]=|/home_orca/rose/Txt/src/Test/C/MPI/Ex1a/ex1a_debug|
    before 'MPI_Init(&argc,&argv)'
    [New Thread 0x7ffff40cd700 (LWP 14362)]
    [New Thread 0x7fffef6c6700 (LWP 14363)]
    rc=0 MPI_SUCCESS=0
    Hello, world. I am 1 of 2 on orca
    rank= 1 numtask= 2 processor_name=orca, before 'sleep(10)'
    rank= 1 numtask= 2 processor_name=orca, after 'sleep(10)'
    rank= 1, before 'MPI_Finalize'
    [Thread 0x7ffff40cd700 (LWP 14362) exited]
    [Thread 0x7fffef6c6700 (LWP 14363) exited]
    [New Thread 0x7fffef6c6700 (LWP 14364)]
    [Thread 0x7fffef6c6700 (LWP 14364) exited]
    [New Thread 0x7fffef6c6700 (LWP 14366)]
    [Thread 0x7fffef6c6700 (LWP 14366) exited]
    Program exited normally.
    (gdb)

So everything seems to be fine. If I run the program compiled without "-g" directly, it hangs at the beginning. I see only:

    rose@orca:/home/rose/Txt/src/Test/C/MPI/Ex1a(12)$ mpirun -np 2 ex1a
    argc= 1 argv=0x7fff0e06d478
    i= 0[ 1] argv[i]=|ex1a|
    ^C^\Verlassen

I can only kill it with ^\. Which versions of gcc and glibc do you use? What else can I test?

Regards
Juergen

If I run the ex1a version compiled with the "-g" flag without the debugger, it hangs as well:

    rose@orca:/home/rose/Txt/src/Test/C/MPI/Ex1a(20)$ mpirun -np 2 ex1a_debug
    argc= 1 argv=0x7fff920e66e8
    i= 0[ 1] argv[i]=|ex1a_debug|
    ^\Verlassen

Some additional information: if I run 'mpirun -np 2 ddd ex1a_debug' with openmpi-1.5.3-r2, mpirun does not finish. I do not get a prompt after quitting ddd and have to kill mpirun with ^C:

    rose@orca:/home/rose/Txt/src/Test/C/MPI/Ex1a(11)$ mpirun -np 2 ddd ex1a_debug
    ^C
    rose@orca:/home/rose/Txt/src/Test/C/MPI/Ex1a(12)$

If I run with openmpi-1.4.3, mpirun finishes after quitting ddd:

    rose@caiman:/home/rose/Txt/src/Test/C/MPI/Ex1a(7)$ mpirun -np 2 ddd ex1a_debug
    rose@caiman:/home/rose/Txt/src/Test/C/MPI/Ex1a(8)$

So I assume that the error is not in ex1a_debug but in the mpirun belonging to openmpi-1.5.3-r2.

Hello Juergen,

The problem seems to be the mpi-threads USE flag, which enables mpi-threads and progress threads at the same time. Maybe the Open MPI ebuild should get a separate USE flag for progress threads in the future, as they have nothing to do with mpi-threads. Disabling mpi-threads has solved the problem for me so far.

Regards
Christoph

Thanks, Christoph, your hint works. After removing the mpi-threads flag from USE in /etc/make.conf, running 'emerge -uvND world', and recompiling my test program, it seems to work correctly.

No longer reproducible with openmpi-1.8.7 (with mpi-threads enabled).
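
Regarding the remark earlier in this thread that stdout is buffered and printf messages can show up late or out of order: if print-style tracing is used anyway, one option is to flush after every message. The following is only a minimal sketch in standard C. The helpers `format_trace` and `trace` are hypothetical names introduced here for illustration; they are not taken from the attached ex1a.c and are not part of Open MPI.

```c
#include <stdio.h>

/* Hypothetical helper, not taken from ex1a.c: format one trace line of the
 * form "[rank N] message\n" into buf. Returns the snprintf result, i.e. the
 * length the full line needs (excluding the terminating NUL), or a negative
 * value on an encoding error. */
int format_trace(char *buf, size_t len, int rank, const char *msg)
{
    return snprintf(buf, len, "[rank %d] %s\n", rank, msg);
}

/* Write the trace line to the given stream and flush it right away, so the
 * text leaves the process even if it hangs or crashes on the next
 * statement. Returns 0 on success, -1 on failure. */
int trace(FILE *out, int rank, const char *msg)
{
    char line[256];
    int n = format_trace(line, sizeof line, rank, msg);

    if (n < 0 || (size_t)n >= sizeof line)
        return -1;                /* formatting failed or line truncated */
    if (fputs(line, out) == EOF)
        return -1;
    return fflush(out) == 0 ? 0 : -1;
}
```

A call such as `trace(stderr, rank, "before MPI_Finalize")` would replace a bare printf in the test program. This only makes each process emit its own messages promptly; the relative order of lines from different ranks as forwarded by mpirun is still not guaranteed, so a debugger remains the more reliable tool for locating a hang.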