Updated ebuild for mpich2 (versions 1.3.2p1 and 1.4rc2). Based on the ebuild currently in portage, with the following modifications:

- Removed support for MPD. MPD is deprecated; this also resolves the conflict reported in bug 342851.
- Removed patches for issues that have since been fixed in mpich2.
- Removed the python logic (it was only required for the MPD launcher).
- Removed redundant configure options.
- Fixed an issue in install(): parallel make install is not supported.

Reproducible: Always
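For context, the install() fix listed above amounts to forcing a serial install. A minimal sketch of what that could look like in the ebuild (illustrative, not the exact code from the attached ebuild):

```bash
# Sketch: mpich2's "make install" is not parallel-safe, so force a
# single make job. ${D} is the standard ebuild image directory.
src_install() {
	emake -j1 DESTDIR="${D}" install || die "make install failed"
}
```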
Created attachment 269065 [details] mpich2-1.3.2_p1.ebuild Ebuild for 1.3.2_p1 (also works for 1.4_rc2, except maybe for ARCH=~)
(In reply to comment #0)
> Updated ebuild for mpich2 (version 1.3.2p1 and 1.4 rc2).

Awesome, but what's been holding me up from pushing an updated ebuild is that a bunch of the tests fail. Have you made any headway there, or is it only failing for me?
No, some tests are failing, but I just checked with the mpich people and some tests are supposed to fail:

- I/O tests: should pass on a local filesystem. Here the issue might be that the tests run in the directory from which make test is called; there might be a sandbox/privilege issue.
- large_message: expected to fail on 'normal' systems.
- non_zero_root, bcastlength: expected to fail for now.

I'll see if there is a way to skip the tests that are expected to fail or that have special system requirements.
A little bit more description on the tests:

large_message: this is only expected to run on systems with a large amount of memory. If you have less than 8GB of memory, disable this test.

bcastlength: this failure is not really an error in MPICH2. If the user writes a wrong program, MPICH2 tries to detect it, though it is not required to. This failure just means that MPICH2 is not able to detect this error in the user program. It can be disabled for now; I will file a ticket to re-enable it once we improve our internal checking in MPICH2.

non-zero-length: this is a performance test. It only makes sense when the system is not shared with any other compute-intensive programs. If something else is running on the system, it will likely show up as a false positive. If this is the case, it might be better for you to disable all performance tests.

I/O tests: Are you running the tests on a local file system and not NFS? If yes, then the failures are not expected. Can you confirm?
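One way the tests above could be skipped from the ebuild is to comment them out of the test suite's testlist files before running the suite. This is a sketch, assuming the mpich2 test driver reads "testlist" files with one test name per line (a throwaway testlist is built here to demonstrate the edit, rather than a real source tree):

```shell
# Sketch: drop known-bad tests by commenting out their testlist lines.
# The testlist format (test name, then process count) is an assumption.
dir=$(mktemp -d)
printf 'large_message 1\nsendrecv 2\nbcastlength 4\n' > "$dir/testlist"

# comment out every line that starts with a known-bad test name
for t in large_message non_zero_root bcastlength; do
    sed -i "s/^$t /#$t /" "$dir/testlist"
done

# large_message and bcastlength are now commented out; sendrecv is kept
cat "$dir/testlist"
```

In a real ebuild this loop would run over the testlist files under test/mpi/ in ${S} during src_prepare().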
After excluding large_message, bcastlength and non_zero_root, the following tests are failing. Comments indicate that gtranksperf may spawn too many processes (I'm using a Q9550), but I'm not seeing any notes for the others.

1.3.2_p1:

Unexpected output in gtranksperf: too much difference in MPI_Group_translate_ranks performance:
Unexpected output in gtranksperf: time1=0.734379 time2=0.138209
Unexpected output in gtranksperf: (fabs(time1-time2)/time2)=4.313543
Unexpected output in gtranksperf: Found 1 errors
Program gtranksperf exited without No Errors
Unexpected output in scancel: In direct memory block for handle type REQUEST, 4 handles are still allocated
Unexpected output in scancel: In direct memory block for handle type COMM, 2 handles are still allocated
Unexpected output in scancel: [0] 24 at [0x00000000011b7398], mpid_vc.c[79]
Unexpected output in scancel: [0] 32 at [0x00000000011b72c8], mpid_vc.c[79]
Unexpected output in scancel: [0] 8 at [0x00000000011b70b8], local_proc.c[91]
Unexpected output in scancel: [0] 8 at [0x00000000011b7008], local_proc.c[90]
Unexpected output in scancel: [0] 32 at [0x00000000011b6e88], mpid_vc.c[79]
Unexpected output in scancel: In direct memory block for handle type REQUEST, 3 handles are still allocated
Unexpected output in pscancel: In direct memory block for handle type REQUEST, 4 handles are still allocated
Unexpected output in pscancel: In direct memory block for handle type COMM, 2 handles are still allocated
Unexpected output in pscancel: [0] 24 at [0x0000000000df4398], mpid_vc.c[79]
Unexpected output in pscancel: [0] 32 at [0x0000000000df42c8], mpid_vc.c[79]
Unexpected output in pscancel: [0] 8 at [0x0000000000df40b8], local_proc.c[91]
Unexpected output in pscancel: [0] 8 at [0x0000000000df4008], local_proc.c[90]
Unexpected output in pscancel: [0] 32 at [0x0000000000df3e88], mpid_vc.c[79]
Unexpected output in pscancel: In direct memory block for handle type REQUEST, 3 handles are still allocated
Unexpected output in cancelrecv: In direct memory block for handle type REQUEST, 1 handles are still allocated
Unexpected output in distgraph1: Fatal error in PMPI_Dist_graph_create_adjacent: Invalid argument, error stack:
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(185): MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=4, sources=0x15a2040, sourceweights=(nil), outdegree=4, destinations=0x15a2200, destweights=(nil), MPI_INFO_NULL, 2, comm_dist_graph=0x7fff180d276c) failed
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(120): Null pointer in parameter sourceweights
Unexpected output in distgraph1: Fatal error in PMPI_Dist_graph_create_adjacent: Invalid argument, error stack:
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(185): MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=4, sources=0x1bdb020, sourceweights=(nil), outdegree=4, destinations=0x1bdad30, destweights=(nil), MPI_INFO_NULL, 2, comm_dist_graph=0x7fffb91999dc) failed
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(120): Null pointer in parameter sourceweights
Unexpected output in distgraph1: Fatal error in PMPI_Dist_graph_create_adjacent: Invalid argument, error stack:
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(185): MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=4, sources=0x1d42020, sourceweights=(nil), outdegree=4, destinations=0x1d41d30, destweights=(nil), MPI_INFO_NULL, 2, comm_dist_graph=0x7fff5cd3769c) failed
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(120): Null pointer in parameter sourceweights
Unexpected output in distgraph1: Fatal error in PMPI_Dist_graph_create_adjacent: Invalid argument, error stack:
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(185): MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=4, sources=0x1157020, sourceweights=(nil), outdegree=4, destinations=0x1156d30, destweights=(nil), MPI_INFO_NULL, 2, comm_dist_graph=0x7ffffb20822c) failed
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(120): Null pointer in parameter sourceweights
Unexpected output in distgraph1: [mpiexec@mejis] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
Unexpected output in distgraph1: APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
Program distgraph1 exited without No Errors
Unexpected output in dgraph_unwgt: Fatal error in PMPI_Dist_graph_create_adjacent: Invalid argument, error stack:
Unexpected output in dgraph_unwgt: PMPI_Dist_graph_create_adjacent(185): MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=2, sources=0x7fff9354a3b0, sourceweights=(nil), outdegree=2, destinations=0x7fff9354a3a0, destweights=(nil), MPI_INFO_NULL, 1, comm_dist_graph=0x7fff9354a3cc) failed
Unexpected output in dgraph_unwgt: PMPI_Dist_graph_create_adjacent(120): Null pointer in parameter sourceweights
Unexpected output in dgraph_unwgt: Fatal error in PMPI_Dist_graph_create_adjacent: Invalid argument, error stack:
Unexpected output in dgraph_unwgt: PMPI_Dist_graph_create_adjacent(185): MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=2, sources=0x7fff6f5868e0, sourceweights=(nil), outdegree=2, destinations=0x7fff6f5868d0, destweights=(nil), MPI_INFO_NULL, 1, comm_dist_graph=0x7fff6f5868fc) failed
Unexpected output in dgraph_unwgt: PMPI_Dist_graph_create_adjacent(120): Null pointer in parameter sourceweights
Unexpected output in dgraph_unwgt: Fatal error in PMPI_Dist_graph_create_adjacent: Invalid argument, error stack:
Unexpected output in dgraph_unwgt: PMPI_Dist_graph_create_adjacent(185): MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=2, sources=0x7fff26aa3940, sourceweights=(nil), outdegree=2, destinations=0x7fff26aa3930, destweights=(nil), MPI_INFO_NULL, 1, comm_dist_graph=0x7fff26aa395c) failed
Unexpected output in dgraph_unwgt: PMPI_Dist_graph_create_adjacent(120): Null pointer in parameter sourceweights
Unexpected output in dgraph_unwgt: [mpiexec@mejis] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
Unexpected output in dgraph_unwgt: APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
Program dgraph_unwgt exited without No Errors
Unexpected output in commcallx: Unexpected communicator
Unexpected output in commcallx: Unexpected communicator
Unexpected output in commcallx: [mpiexec@mejis] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
Unexpected output in commcallx: APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
Program commcallx exited without No Errors
Unexpected output in wincallx: Unexpected window
Unexpected output in wincallx: [mpiexec@mejis] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
Unexpected output in wincallx: APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
Program wincallx exited without No Errors
Unexpected output in badport: [mpiexec@mejis] APPLICATION TIMED OUT
Unexpected output in badport: [proxy:0:0@mejis] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:868): assert (!closed) failed
Unexpected output in badport: [proxy:0:0@mejis] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
Unexpected output in badport: [proxy:0:0@mejis] main (./pm/pmiserv/pmip.c:208): demux engine error waiting for event
Unexpected output in badport: [mpiexec@mejis] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
Unexpected output in badport: [mpiexec@mejis] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): launcher returned error waiting for completion
Unexpected output in badport: [mpiexec@mejis] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:216): launcher returned error waiting for completion
Unexpected output in badport: [mpiexec@mejis] main (./ui/mpich/mpiexec.c:404): process manager error waiting for completion
Program badport exited without No Errors

1.4_rc2:

Unexpected output in scancel: In direct memory block for handle type REQUEST, 3 handles are still allocated
Unexpected output in scancel: In direct memory block for handle type REQUEST, 4 handles are still allocated
Unexpected output in scancel: In direct memory block for handle type COMM, 2 handles are still allocated
Unexpected output in scancel: [0] 32 at [0x0000000000c7b398], mpid_vc.c[81]
Unexpected output in scancel: [0] 40 at [0x0000000000c7b2c8], mpid_vc.c[81]
Unexpected output in scancel: [0] 8 at [0x0000000000c7b0b8], local_proc.c[91]
Unexpected output in scancel: [0] 8 at [0x0000000000c7b008], local_proc.c[90]
Unexpected output in scancel: [0] 40 at [0x0000000000c7ae88], mpid_vc.c[81]
Unexpected output in pscancel: In direct memory block for handle type REQUEST, 3 handles are still allocated
Unexpected output in pscancel: In direct memory block for handle type REQUEST, 4 handles are still allocated
Unexpected output in pscancel: In direct memory block for handle type COMM, 2 handles are still allocated
Unexpected output in pscancel: [0] 32 at [0x0000000001443398], mpid_vc.c[81]
Unexpected output in pscancel: [0] 40 at [0x00000000014432c8], mpid_vc.c[81]
Unexpected output in pscancel: [0] 8 at [0x00000000014430b8], local_proc.c[91]
Unexpected output in pscancel: [0] 8 at [0x0000000001443008], local_proc.c[90]
Unexpected output in pscancel: [0] 40 at [0x0000000001442e88], mpid_vc.c[81]
Unexpected output in cancelrecv: In direct memory block for handle type REQUEST, 1 handles are still allocated
Created attachment 269751 [details] 1.3.2_p1 test failures
Created attachment 269753 [details] 1.4_rc2 test failures
Several of the failures in 1.3.2p1 are fixed in 1.4rc2, so I'm only going to address the one failure type you are still seeing in 1.4rc2: the scancel and pscancel errors are resource leaks that show up in debug builds. These are known issues for a few of the tests (scancel, pscancel, cancelrecv and badport). You can disable them if you like, but these errors are mostly harmless.
+*mpich2-1.4_rc2 (04 May 2011)
+
+  04 May 2011; Justin Bronder <jsbronder@gentoo.org> +mpich2-1.4_rc2.ebuild:
+  Version bump (#362655). Use system hwloc. Use hydra instead of mpd for pm
+  (#145367). Disable more tests as recommended by upstream.
For the debug build, I suggest configuring with --enable-g=dbg. Just using --enable-g is equivalent to --enable-g=all, which adds several internal checks that are mostly meant for developers.

Btw, where can I view the latest version of the ebuild? The above attachment for 1.3.2p1 does not include the patches to disable tests.
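In ebuild terms the suggested change is a one-flag tweak. A sketch, assuming a "debug" USE flag and a myconf variable (both illustrative, not taken from the actual ebuild):

```bash
# Sketch: pick the debug level explicitly. A bare --enable-g means
# --enable-g=all, which adds developer-only internal checks (the
# source of the handle-leak noise in scancel/pscancel), while
# --enable-g=dbg keeps just the debugger support.
if use debug; then
	myconf="${myconf} --enable-g=dbg"
fi
econf ${myconf}
```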
(In reply to comment #10)
> For the debug build, I suggest configuring with --enable-g=dbg. Just using
> --enable-g is equivalent to --enable-g=all which adds several internal checks
> that are mostly meant for developers.

I'll push that change shortly.

> Btw, where can I view the latest version of the ebuild? The above attachment
> for 1.3.2p1 does not include the patches to disable tests.

http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/sys-cluster/mpich2/

I didn't push the 1.3.2_p1 bump, but reviewing the release notes now, that may have been a mistake, as I might be misunderstanding the relationship between 1.3.x and 1.4.x.
If you use --enable-g=dbg, you don't have to disable the scancel, pscancel and cancelrecv tests.

Also, in the Fortran configure options, --enable-f77 is not required (it's on by default). Similarly, --enable-fc is not required; disabling it is only needed for g77. I think the ebuild uploaded by Dries already had this.

Finally, the "if use mpi-threads" part is redundant; only the else branch is required. I think Dries' ebuild also covered this.
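The Fortran simplification described above could be sketched like this (the "fortran" USE flag and myconf variable are illustrative; tc-getFC comes from the toolchain-funcs eclass):

```bash
# Sketch: --enable-f77 and --enable-fc are the defaults, so only the
# disabling cases need to be spelled out. g77 has no Fortran 90
# support, so --disable-fc is the one flag it still needs.
if use fortran; then
	[[ $(tc-getFC) == g77 ]] && myconf="${myconf} --disable-fc"
else
	myconf="${myconf} --disable-f77 --disable-fc"
fi
```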
Also, I'm still seeing some of the I/O tests fail. Could this be due to the emerge features? (i.e. sandbox/userpriv/...)? Which features are you using for the test?
Created attachment 272113 [details, diff] Cleanup patch for unnecessary options in the ebuild
Please consider the attached patch to the ebuild.
Seems I can reproduce the async I/O test problem outside of emerge, so it must be my system (kernel/glibc) or an mpich bug. In any case, it's not related to the ebuild. In case other people encounter the same problem: the mpich ticket is https://trac.mcs.anl.gov/projects/mpich2/ticket/1483
(In reply to comment #14)
> Created attachment 272113 [details, diff]
> Cleanup patch for unnecessary options in the ebuild

I've re-enabled the tests as suggested by the patch. As far as the fix for #293665 goes, it is still very necessary for Gentoo packaging. I've left the majority of the configure options unchanged, mainly because I'm a fan of the verbosity they provide, especially as the defaults are not immediately obvious from 'configure --help'.

(In reply to comment #13)
> Also, I'm still seeing some of the I/O tests fail. Could this be due to the
> emerge features? (i.e. sandbox/userpriv/...)? Which features are you using for
> the test?

Everything passes for me unless I enable romio; then I still get the timeout problems with the I/O tests. I think the failing tests should probably be a different bug though.

FEATURES="assume-digests binpkg-logs ccache distlocks fixlafiles fixpackages news parallel-fetch protect-owned sandbox sfperms sign splitdebug strict unknown-features-warn unmerge-logs unmerge-orphans userfetch"
I agree with leaving the failing I/O tests for another bug. It's actually very interesting, since until now, I was assuming these I/O tests were only failing on my system. Other people seem unable to reproduce these timeouts. Just to be sure: Could you confirm you're seeing async and some of the ixxxx tests hanging? (test/mpi/io directory) If so, the thing you and I have in common is gentoo and probably a very recent glibc (2.13-r2 for my system) and kernel (2.6.38.2).
(In reply to comment #18)
> I agree with leaving the failing I/O tests for another bug.
>
> It's actually very interesting, since until now, I was assuming these I/O tests
> were only failing on my system. Other people seem unable to reproduce these
> timeouts.
>
> Just to be sure: Could you confirm you're seeing async and some of the ixxxx
> tests hanging? (test/mpi/io directory)
>
> If so, the thing you and I have in common is gentoo and probably a very recent
> glibc (2.13-r2 for my system) and kernel (2.6.38.2).

The very same tests, yes, but only with romio support. glibc-2.11.3 and 2.6.38-gentoo-1. I'll move my replies to the upstream bug.
(In reply to comment #17)
> As far as the fix for #293665 goes, it is still very necessary for Gentoo
> packaging.

What is this used for? The conf files that the script updates are currently unused. So maybe you meant to replace these variables in mpicc, etc.?