Bug 362655 - New ebuild for sys-cluster/mpich2 1.3.2p1 and 1.4 rc 2
Summary: New ebuild for sys-cluster/mpich2 1.3.2p1 and 1.4 rc 2
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: New packages
Hardware: All Linux
Importance: Normal enhancement
Assignee: Justin Bronder (RETIRED)
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-04-08 21:40 UTC by Dries Kimpe
Modified: 2011-05-10 06:21 UTC
CC: 2 users

See Also:
Package list:
Runtime testing required: ---


Attachments
mpich2-1.3.2_p1.ebuild (mpich2-1.3.2_p1.ebuild, 3.29 KB, text/plain)
2011-04-08 21:43 UTC, Dries Kimpe
1.3.2_p1 test failures (mpich2-1.3.2_p1.tests, 7.72 KB, text/plain)
2011-04-13 03:29 UTC, Justin Bronder (RETIRED)
1.4_rc2 test failures (mpich2-1.4_rc2.tests, 1.52 KB, text/plain)
2011-04-13 03:30 UTC, Justin Bronder (RETIRED)
Cleanup patch for unnecessary options in the ebuild (mpich2-1.4rc2-ebuild.patch, 2.52 KB, patch)
2011-05-04 18:23 UTC, Pavan Balaji

Description Dries Kimpe 2011-04-08 21:40:53 UTC
Updated ebuild for mpich2 (versions 1.3.2p1 and 1.4rc2).

Based on the ebuild currently in portage, with the following modifications:
- Removed support for MPD, which is deprecated. This also resolves the
  conflict reported in bug 342851.
- Removed patches for issues that have since been fixed in mpich2.
- Removed the Python logic (it was only required for the MPD launcher).
- Removed redundant configure options.
- Fixed an issue in install(): parallel make install is not supported
  (see the sketch below the list).
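
For reference, a minimal sketch of the install fix mentioned in the last item,
assuming the usual emake/DESTDIR pattern from the ebuild currently in portage
(the surrounding install steps are omitted):

src_install() {
    # The upstream install target is not parallel-safe, so force a
    # single make job; everything else stays as in the portage ebuild.
    emake -j1 DESTDIR="${D}" install || die "emake install failed"
    # ... remaining install steps (docs, examples) omitted ...
}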


Reproducible: Always
Comment 1 Dries Kimpe 2011-04-08 21:43:08 UTC
Created attachment 269065 [details]
mpich2-1.3.2_p1.ebuild

Ebuild for 1.3.2_p1 (it also works for 1.4_rc2, except possibly for the ~arch keywording).
Comment 2 Justin Bronder (RETIRED) gentoo-dev 2011-04-11 18:22:25 UTC
(In reply to comment #0)
> Updated ebuild for mpich2 (version 1.3.2p1 and 1.4 rc2).

Awesome, but what has been holding me up from pushing an updated ebuild is that a bunch of the tests fail. Have you made any headway there, or are they only failing for me?
Comment 3 Dries Kimpe 2011-04-12 01:06:34 UTC
No; some tests are failing here as well, but I just checked with the mpich people and some of them are supposed to fail:

- I/O tests: these should pass on a local filesystem. Here the issue might be
  that the tests run in the directory from which make test is called, so there
  might be a sandbox/privilege issue.
- large_message: expected to fail on 'normal' systems.
- Expected to fail for now: non_zero_root, bcastlength.

I'll see if there is a way to skip the tests that are expected to fail or that depend on particular system requirements.
Comment 4 Pavan Balaji 2011-04-12 01:17:12 UTC
A little more detail on these tests:

large_message: this is only expected to run on systems with a large amount of memory. If you have less than 8GB of memory, disable this test.

bcastlength: this failure is not really an error in MPICH2. If the user writes a wrong program, MPICH2 tries to detect it, though it is not required to. This failure just means that MPICH2 is not able to detect this error in the user program. It can be disabled for now; I will file a ticket to re-enable it once we improve our internal checking in MPICH2.

non-zero-length: this is a performance test. It only makes sense when the system is not shared with any other compute-intensive programs; if something else is running on the system, it will likely show up as a false positive. If that is the case, it might be better for you to disable all performance tests.

I/O tests: Are you running the tests on a local file system and not NFS? If yes, then the failures are not expected. Can you confirm?
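
For reference, a hedged sketch of how these tests could be skipped from the
ebuild by pruning the test suite's testlist files; the testlist layout is
assumed to match what the mpich2 tarball ships, and the exact file locations
are not verified here:

src_prepare() {
    # Skip tests that upstream says may fail on typical build hosts
    # (see comments #3 and #4).  The test names come from this bug;
    # the testlist paths are an assumption.
    local t
    for t in large_message non_zero_root bcastlength; do
        find test/mpi -name testlist -exec sed -i -e "/^${t}\b/d" {} + || die
    done
}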
Comment 5 Justin Bronder (RETIRED) gentoo-dev 2011-04-13 03:29:07 UTC
After excluding large_message, bcastlength and non_zero_root,
the following tests are failing:

Comments indicate that gtranksperf may spawn too many
processes (I'm using a Q9550), but I'm not seeing any notes
for the others.

1.3.2_p1:

Unexpected output in gtranksperf: too much difference in MPI_Group_translate_ranks performance:
Unexpected output in gtranksperf: time1=0.734379 time2=0.138209
Unexpected output in gtranksperf: (fabs(time1-time2)/time2)=4.313543
Unexpected output in gtranksperf:  Found 1 errors
Program gtranksperf exited without No Errors
Unexpected output in scancel: In direct memory block for handle type REQUEST, 4 handles are still allocated
Unexpected output in scancel: In direct memory block for handle type COMM, 2 handles are still allocated
Unexpected output in scancel: [0] 24 at [0x00000000011b7398], mpid_vc.c[79]
Unexpected output in scancel: [0] 32 at [0x00000000011b72c8], mpid_vc.c[79]
Unexpected output in scancel: [0] 8 at [0x00000000011b70b8], local_proc.c[91]
Unexpected output in scancel: [0] 8 at [0x00000000011b7008], local_proc.c[90]
Unexpected output in scancel: [0] 32 at [0x00000000011b6e88], mpid_vc.c[79]
Unexpected output in scancel: In direct memory block for handle type REQUEST, 3 handles are still allocated
Unexpected output in pscancel: In direct memory block for handle type REQUEST, 4 handles are still allocated
Unexpected output in pscancel: In direct memory block for handle type COMM, 2 handles are still allocated
Unexpected output in pscancel: [0] 24 at [0x0000000000df4398], mpid_vc.c[79]
Unexpected output in pscancel: [0] 32 at [0x0000000000df42c8], mpid_vc.c[79]
Unexpected output in pscancel: [0] 8 at [0x0000000000df40b8], local_proc.c[91]
Unexpected output in pscancel: [0] 8 at [0x0000000000df4008], local_proc.c[90]
Unexpected output in pscancel: [0] 32 at [0x0000000000df3e88], mpid_vc.c[79]
Unexpected output in pscancel: In direct memory block for handle type REQUEST, 3 handles are still allocated
Unexpected output in cancelrecv: In direct memory block for handle type REQUEST, 1 handles are still allocated
Unexpected output in distgraph1: Fatal error in PMPI_Dist_graph_create_adjacent: Invalid argument, error stack:
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(185):  MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=4, sources=0x15a2040, sourceweights=(nil), outdegree=4, destinations=0x15a2200, destweights=(nil), MPI_INFO_NULL, 2, comm_dist_graph=0x7fff180d276c) failed
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(120): Null pointer in parameter sourceweights
Unexpected output in distgraph1: Fatal error in PMPI_Dist_graph_create_adjacent: Invalid argument, error stack:
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(185):  MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=4, sources=0x1bdb020, sourceweights=(nil), outdegree=4, destinations=0x1bdad30, destweights=(nil), MPI_INFO_NULL, 2, comm_dist_graph=0x7fffb91999dc) failed
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(120): Null pointer in parameter sourceweights
Unexpected output in distgraph1: Fatal error in PMPI_Dist_graph_create_adjacent: Invalid argument, error stack:
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(185):  MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=4, sources=0x1d42020, sourceweights=(nil), outdegree=4, destinations=0x1d41d30, destweights=(nil), MPI_INFO_NULL, 2, comm_dist_graph=0x7fff5cd3769c) failed
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(120): Null pointer in parameter sourceweights
Unexpected output in distgraph1: Fatal error in PMPI_Dist_graph_create_adjacent: Invalid argument, error stack:
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(185):  MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=4, sources=0x1157020, sourceweights=(nil), outdegree=4, destinations=0x1156d30, destweights=(nil), MPI_INFO_NULL, 2, comm_dist_graph=0x7ffffb20822c) failed
Unexpected output in distgraph1: PMPI_Dist_graph_create_adjacent(120): Null pointer in parameter sourceweights
Unexpected output in distgraph1: [mpiexec@mejis] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
Unexpected output in distgraph1: APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
Program distgraph1 exited without No Errors
Unexpected output in dgraph_unwgt: Fatal error in PMPI_Dist_graph_create_adjacent: Invalid argument, error stack:
Unexpected output in dgraph_unwgt: PMPI_Dist_graph_create_adjacent(185):  MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=2, sources=0x7fff9354a3b0, sourceweights=(nil), outdegree=2, destinations=0x7fff9354a3a0, destweights=(nil), MPI_INFO_NULL, 1, comm_dist_graph=0x7fff9354a3cc) failed
Unexpected output in dgraph_unwgt: PMPI_Dist_graph_create_adjacent(120): Null pointer in parameter sourceweights
Unexpected output in dgraph_unwgt: Fatal error in PMPI_Dist_graph_create_adjacent: Invalid argument, error stack:
Unexpected output in dgraph_unwgt: PMPI_Dist_graph_create_adjacent(185):  MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=2, sources=0x7fff6f5868e0, sourceweights=(nil), outdegree=2, destinations=0x7fff6f5868d0, destweights=(nil), MPI_INFO_NULL, 1, comm_dist_graph=0x7fff6f5868fc) failed
Unexpected output in dgraph_unwgt: PMPI_Dist_graph_create_adjacent(120): Null pointer in parameter sourceweights
Unexpected output in dgraph_unwgt: Fatal error in PMPI_Dist_graph_create_adjacent: Invalid argument, error stack:
Unexpected output in dgraph_unwgt: PMPI_Dist_graph_create_adjacent(185):  MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=2, sources=0x7fff26aa3940, sourceweights=(nil), outdegree=2, destinations=0x7fff26aa3930, destweights=(nil), MPI_INFO_NULL, 1, comm_dist_graph=0x7fff26aa395c) failed
Unexpected output in dgraph_unwgt: PMPI_Dist_graph_create_adjacent(120): Null pointer in parameter sourceweights
Unexpected output in dgraph_unwgt: [mpiexec@mejis] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
Unexpected output in dgraph_unwgt: APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
Program dgraph_unwgt exited without No Errors
Unexpected output in commcallx: Unexpected communicator
Unexpected output in commcallx: Unexpected communicator
Unexpected output in commcallx: [mpiexec@mejis] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
Unexpected output in commcallx: APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
Program commcallx exited without No Errors
Unexpected output in wincallx: Unexpected window
Unexpected output in wincallx: [mpiexec@mejis] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
Unexpected output in wincallx: APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
Program wincallx exited without No Errors
Unexpected output in badport: [mpiexec@mejis] APPLICATION TIMED OUT
Unexpected output in badport: [proxy:0:0@mejis] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:868): assert (!closed) failed
Unexpected output in badport: [proxy:0:0@mejis] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
Unexpected output in badport: [proxy:0:0@mejis] main (./pm/pmiserv/pmip.c:208): demux engine error waiting for event
Unexpected output in badport: [mpiexec@mejis] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
Unexpected output in badport: [mpiexec@mejis] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:18): launcher returned error waiting for completion
Unexpected output in badport: [mpiexec@mejis] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:216): launcher returned error waiting for completion
Unexpected output in badport: [mpiexec@mejis] main (./ui/mpich/mpiexec.c:404): process manager error waiting for completion
Program badport exited without No Errors

1.4_rc2:
Unexpected output in scancel: In direct memory block for handle type REQUEST, 3 handles are still allocated
Unexpected output in scancel: In direct memory block for handle type REQUEST, 4 handles are still allocated
Unexpected output in scancel: In direct memory block for handle type COMM, 2 handles are still allocated
Unexpected output in scancel: [0] 32 at [0x0000000000c7b398], mpid_vc.c[81]
Unexpected output in scancel: [0] 40 at [0x0000000000c7b2c8], mpid_vc.c[81]
Unexpected output in scancel: [0] 8 at [0x0000000000c7b0b8], local_proc.c[91]
Unexpected output in scancel: [0] 8 at [0x0000000000c7b008], local_proc.c[90]
Unexpected output in scancel: [0] 40 at [0x0000000000c7ae88], mpid_vc.c[81]
Unexpected output in pscancel: In direct memory block for handle type REQUEST, 3 handles are still allocated
Unexpected output in pscancel: In direct memory block for handle type REQUEST, 4 handles are still allocated
Unexpected output in pscancel: In direct memory block for handle type COMM, 2 handles are still allocated
Unexpected output in pscancel: [0] 32 at [0x0000000001443398], mpid_vc.c[81]
Unexpected output in pscancel: [0] 40 at [0x00000000014432c8], mpid_vc.c[81]
Unexpected output in pscancel: [0] 8 at [0x00000000014430b8], local_proc.c[91]
Unexpected output in pscancel: [0] 8 at [0x0000000001443008], local_proc.c[90]
Unexpected output in pscancel: [0] 40 at [0x0000000001442e88], mpid_vc.c[81]
Unexpected output in cancelrecv: In direct memory block for handle type REQUEST, 1 handles are still allocated
Comment 6 Justin Bronder (RETIRED) gentoo-dev 2011-04-13 03:29:36 UTC
Created attachment 269751 [details]
1.3.2_p1 test failures
Comment 7 Justin Bronder (RETIRED) gentoo-dev 2011-04-13 03:30:00 UTC
Created attachment 269753 [details]
1.4_rc2 test failures
Comment 8 Pavan Balaji 2011-04-13 04:25:09 UTC
Several of the failures in 1.3.2p1 are fixed in 1.4rc2, so I'm only going to address the one failure type you are seeing in 1.4rc2: the errors you are seeing in scancel and pscancel are resource leaks that show up in debug builds. These are known issues for a few of the tests (scancel, pscancel, cancelrecv and badport). You can disable them if you like, but these errors are mostly harmless.
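
If those tests are to be disabled, a hedged sketch reusing the same testlist
pruning approach as above, guarded so it only applies to debug builds; the
"debug" USE flag name and the testlist paths are assumptions:

# in src_prepare(); only debug builds trip these known resource-leak reports
if use debug; then
    local t
    for t in scancel pscancel cancelrecv badport; do
        find test/mpi -name testlist -exec sed -i -e "/^${t}\b/d" {} + || die
    done
fi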
Comment 9 Justin Bronder (RETIRED) gentoo-dev 2011-05-04 03:31:33 UTC
+*mpich2-1.4_rc2 (04 May 2011)
+
+  04 May 2011; Justin Bronder <jsbronder@gentoo.org> +mpich2-1.4_rc2.ebuild:
+  Version bump (#362655). Use system hwloc. Use hydra instead of mpd for pm
+  (#145367). Disable more tests as recommended by upstream.
Comment 10 Pavan Balaji 2011-05-04 03:41:44 UTC
For the debug build, I suggest configuring with --enable-g=dbg. Just using --enable-g is equivalent to --enable-g=all, which adds several internal checks that are mostly meant for developers.
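
A hedged sketch of how this suggestion could look in src_configure(), assuming
the ebuild gates debug builds behind a "debug" USE flag (the flag name and the
omitted surrounding options are assumptions):

src_configure() {
    # --enable-g=dbg adds debug information without the extra internal
    # consistency checks that a bare --enable-g (i.e. --enable-g=all)
    # turns on; those checks are mostly meant for MPICH2 developers.
    econf \
        $(use debug && echo "--enable-g=dbg")
        # ... remaining configure options unchanged ...
}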

Btw, where can I view the latest version of the ebuild? The above attachment for 1.3.2p1 does not include the patches to disable tests.
Comment 11 Justin Bronder (RETIRED) gentoo-dev 2011-05-04 03:57:04 UTC
(In reply to comment #10)
> For the debug build, I suggest configuring with --enable-g=dbg. Just using
> --enable-g is equivalent to --enable-g=all which adds several internal checks
> that are mostly meant for developers.

I'll push that change shortly.

> 
> Btw, where can I view the latest version of the ebuild? The above attachment
> for 1.3.2p1 does not include the patches to disable tests.

http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/sys-cluster/mpich2/

I didn't push the 1.3.2_p1 bump, but reviewing the release notes now, that may have been a mistake, as I might be misunderstanding the relationship between the 1.3.x and 1.4.x series.
Comment 12 Pavan Balaji 2011-05-04 04:03:43 UTC
If you use --enable-g=dbg, you don't have to disable the scancel, pscancel and cancelrecv tests.

Also, regarding the Fortran configure options: --enable-f77 is not required (it's on by default), and --enable-fc is not required either; it only needs to be disabled for g77. I think the ebuild uploaded by Dries already had this.

Finally, the "if use mpi-threads" part is redundant; only the else branch is required (a sketch of the simplified logic follows below). I think Dries' ebuild also covered this.
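
A sketch of the trimmed-down configure logic described in this comment; the
"fortran" and "mpi-threads" USE flag names follow the existing ebuild, while
the option values shown (--disable-f77/--disable-fc, --enable-threads=single)
are assumptions rather than a verified diff:

# inside src_configure(); tc-getFC comes from toolchain-funcs.eclass
local myconf=""
if use fortran; then
    # f77 and fc are enabled by default; only g77 (no Fortran 90
    # support) needs the fc bindings disabled.
    [[ $(tc-getFC) == *g77* ]] && myconf+=" --disable-fc"
else
    myconf+=" --disable-f77 --disable-fc"
fi
# Only the non-threaded (else) branch needs an explicit option.
use mpi-threads || myconf+=" --enable-threads=single"
econf ${myconf}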
Comment 13 Dries Kimpe 2011-05-04 17:29:50 UTC
Also, I'm still seeing some of the I/O tests fail. Could this be due to the emerge FEATURES (e.g. sandbox, userpriv, ...)? Which FEATURES are you using for the test?
Comment 14 Pavan Balaji 2011-05-04 18:23:29 UTC
Created attachment 272113 [details, diff]
Cleanup patch for unnecessary options in the ebuild
Comment 15 Pavan Balaji 2011-05-04 18:24:21 UTC
Please consider the attached patch to the ebuild.
Comment 16 Dries Kimpe 2011-05-04 19:56:15 UTC
It seems I can reproduce the async I/O test problem outside of emerge, so it must be my system (kernel/glibc) or an mpich bug. In any case, it is not related to the ebuild.

In case other people encounter the same problem:
the mpich ticket is https://trac.mcs.anl.gov/projects/mpich2/ticket/1483
Comment 17 Justin Bronder (RETIRED) gentoo-dev 2011-05-05 19:57:31 UTC
(In reply to comment #14)
> Created attachment 272113 [details, diff]
> Cleanup patch for unnecessary options in the ebuild

I've re-enabled the tests as suggested by the patch. As far as the fix for #293665 goes, it is still very necessary for Gentoo packaging. Most of the configure options I've left unchanged, mainly because I'm a fan of the verbosity they provide, especially as the defaults are not immediately obvious from 'configure --help'.

(In reply to comment #13)
> Also, I'm still seeing some of the I/O tests fail. Could this be due to the
> emerge features? (i.e. sandbox/userpriv/...)? Which features are you using for
> the test?

Everything passes for me unless I enable romio; with it enabled I still get the timeout problems with the I/O tests. I think the failing tests should probably be tracked in a separate bug, though.

FEATURES="assume-digests binpkg-logs ccache distlocks fixlafiles fixpackages news parallel-fetch protect-owned sandbox sfperms sign splitdebug strict unknown-features-warn unmerge-logs unmerge-orphans userfetch"
Comment 18 Dries Kimpe 2011-05-05 20:08:53 UTC
I agree with leaving the failing I/O tests for another bug.

It's actually very interesting, since until now I was assuming these I/O tests were only failing on my system. Other people seem unable to reproduce these timeouts.

Just to be sure: could you confirm that you're seeing async and some of the ixxxx tests hanging (in the test/mpi/io directory)?

If so, what you and I have in common is Gentoo, and probably a very recent glibc (2.13-r2 on my system) and kernel (2.6.38.2).
Comment 19 Justin Bronder (RETIRED) gentoo-dev 2011-05-05 20:49:05 UTC
(In reply to comment #18)
> I agree with leaving the failing I/O tests for another bug.
> 
> It's actually very interesting, since until now, I was assuming these I/O tests
>  were only failing on my system. Other people seem unable to reproduce these
> timeouts.
> 
> Just to be sure: Could you confirm you're seeing async and some of the ixxxx
> tests hanging? (test/mpi/io directory)
> 
> If so, the thing you and I have in common is gentoo and probably a very recent
> glibc (2.13-r2 for my system) and kernel (2.6.38.2).

The very same tests, yes, but only with romio support. I'm on glibc-2.11.3 and kernel 2.6.38-gentoo-1. I'll move my replies to the upstream bug.
Comment 20 Pavan Balaji 2011-05-10 06:21:12 UTC
(In reply to comment #17)
> As far as the fix for
> #293665 goes, it is still very necessary for Gentoo packaging.

What is this used for? The conf files that the script updates are currently unused, so maybe you meant to replace these variables in mpicc, etc., instead?