Bug 757804

Summary: virtual/mpi: status of the MPI overlay and the GSOC 2017 project final output - fixing up the MPI subsystem in gentoo
Product: Gentoo Linux
Reporter: Aisha Tammy <gentoo>
Component: Current packages
Assignee: Andrew Savchenko <bircoph>
Status: CONFIRMED
Resolution: ---
Severity: normal
CC: cluster, lssndrbarbieri
Priority: Normal
Version: unspecified
Hardware: All
OS: Linux
Whiteboard:
Package list:
Runtime testing required: ---

Description Aisha Tammy 2020-11-30 19:16:14 UTC
The GSOC 2017 project was trying to get multiple MPI packages to work together in a coherent manner.
https://wiki.gentoo.org/wiki/Google_Summer_of_Code/2017/Projects

Was there any usable output from that endeavor?

I see the MPI Overlay wiki page - https://wiki.gentoo.org/wiki/Google_Summer_of_Code/2017/Projects
and the github repository - https://github.com/gilroy/gentoo-mpi

Neither of them has been updated for a long time.

Currently the cluster team seems a bit dead and the MPI subsystem in gentoo seems bad, to say the least.

zlogene has kept it afloat (kudos!), maybe we can try to get it semi-sane again.

The first step would be to identify a starting point.

So the questions we start with: Was the GSOC project method the right idea? Do we want to try to extend that, or learn from the experience and brainstorm a different technique?
Comment 1 Aisha Tammy 2020-11-30 19:17:33 UTC
Assigning bircoph as they were a mentor for the GSOC project (the other mentor, jsbronder, has retired).

Also adding cluster team.
Comment 2 Andrew Savchenko gentoo-dev 2020-12-01 00:38:04 UTC
Hi Aisha!

Unfortunately, due to various reasons the student was not able to complete the GSoC 2017 Gentoo MPI project, so we have little usable output from it.

The idea itself is usable, but it is not the only possible way. Another idea is to use a run-time switch like the one implemented for BLAS/LAPACK in another GSoC project. But the latter idea is usable only if we can identify a common subset between the various MPI APIs, and I'm afraid that's unlikely: they are way too different.
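For reference, the BLAS/LAPACK run-time switch is exposed as an eselect module that repoints the generic library at the chosen provider, so consumers do not need rebuilding. Switching looks roughly like this (the provider name is just an example):

    # list the installed BLAS providers and pick one at run time
    eselect blas list
    eselect blas set openblas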

There was also an idea discussed during FOSDEM dinners (iirc with Haubi) that different prefixes may be used for different MPI implementations. But I never pondered that idea in more detail.

The problem with the original idea is that packages using MPI will have to do many modifications in order to accommodate changes. But I see no way to implement it transparently.
Comment 3 Aisha Tammy 2020-12-01 02:28:50 UTC
Thanks a lot for responding so quickly!!

> Another idea is to use a run-time switch like the one implemented for
> BLAS/LAPACK in another GSoC project. But the latter idea is usable only if
> we can identify a common subset between the various MPI APIs, and I'm
> afraid that's unlikely: they are way too different.

I agree, this is not going to be possible. I sent an email to OpenMPI yesterday and they said that even within OpenMPI itself, packages would need to be rebuilt when changing major versions of OpenMPI.
Doing this across different implementations is just impossible.


> There was also an idea discussed during FOSDEM dinners (iirc with Haubi) that 
> different prefixes may be used for different MPI implementations. But I never 
> pondered that idea in more detail.

This would (possibly) allow us to have multiple MPI implementations, but it doesn't solve the problem of having a package built against multiple MPI implementations. That would require installing *that* consumer package in a prefix as well and then switching it at runtime.

> The problem with the original idea is that packages using MPI will have to
> do many modifications in order to accommodate changes. But I see no way 
> to implement it transparently.

I agree, this sounds like some weird combination of environment modules (like Lmod or tcl modules), where we need slotting for each MPI consumer package, and if we eselect switch OpenMPI to MPICH then all packages should also be shifted. Basically, create a mini gentoo-prefix for each MPI implementation and then activate on demand. 

That sounds crazy complex and very prone to breaking...

I am very unfamiliar with what the original idea was trying to do to address this issue. I could not find a lot of documentation apart from the wiki page and the github code. Given that the code is really large, getting a readable and understandable outline here would be helpful. :D
Comment 4 Andrew Savchenko gentoo-dev 2020-12-05 13:48:09 UTC
(In reply to Aisha Tammy from comment #3)
> > There was also an idea discussed during FOSDEM dinners (iirc with Haubi) that 
> > different prefixes may be used for different MPI implementations. But I never 
> pondered that idea in more detail.
> 
> This would (possibly) allow us to have multiple MPI implementations, but
> it doesn't solve the problem of having a package built against multiple
> MPI implementations. That would require installing *that* consumer package
> in a prefix as well and then switching it at runtime.

The only way to handle support for multiple MPI implementations is to build a package (or the MPI-dependent part of it) multiple times, once with each desired MPI. The basic idea is no different from how Gentoo handles multiple python or ruby versions.
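For comparison, this is the knob a python package uses to declare which implementations it must be built for (real python-r1 usage; a hypothetical MPI analogue would look the same, just with MPI implementations instead of interpreters):

    # the package is built once per enabled implementation in this list
    PYTHON_COMPAT=( python3_7 python3_8 )
    inherit python-r1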
 
> > The problem with the original idea is that packages using MPI will have to
> > do many modifications in order to accommodate changes. But I see no way 
> > to implement it transparently.
> 
> I agree, this sounds like some weird combination of environment modules
> (like Lmod or tcl modules), where we need slotting for each MPI consumer
> package, and if we eselect switch OpenMPI to MPICH then all packages should
> also be shifted. Basically, create a mini gentoo-prefix for each MPI
> implementation and then activate on demand. 
> 
> That sounds crazy complex and very prone to breaking...
> 
> I am very unfamiliar with what the original idea was trying to do to address
> this issue. I could not find a lot of documentation apart from the wiki page
> and the github code. Given that the code is really large, getting a readable
> and understandable outline here would be helpful. :D

The original idea was layered (each next layer is more complex), so let's start from the core functionality:

1. Support of different MPI implementations (e.g. openmpi and mpich).

The package code will be rebuilt for each MPI implementation and helper eclasses will provide a convenient way to do this. Basically the idea is the same as for python: if one needs both python3.7 and python3.8 support, the code is rebuilt for each version and the results are placed in different paths.

1.1. mpi-providers.eclass provides functions to build and install the various MPI implementations in the way we want: so that their paths for binaries, headers, docs and all other files can coexist and follow a common pattern that can be switched at run-time.

1.2. mpi-select.eclass provides a way to select MPI implementations for packages *using* MPI directly or indirectly.

Each supported MPI implementation is selectable via MPI_COMPAT.
For each selected implementation we:
1.2.1. Switch the environment to that implementation; this means that the application can find the desired MPI implementation and all software linked with it via the available environment variables, so no extra configuration of the package is required.
1.2.2. Build the package for the selected implementation.
1.2.3. Install its files in MPI-dependent dirs, e.g. /usr/lib/mpi/$implementation/{bin,lib,etc}.
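As a rough sketch, a consumer ebuild under this scheme could look like the fragment below. MPI_COMPAT and mpi-select.eclass are the names from the draft above, but the mpi_foreach_impl helper is purely a hypothetical name, not anything that exists in ::gentoo:

    # hypothetical MPI-consumer ebuild fragment
    MPI_COMPAT=( openmpi mpich )
    inherit mpi-select

    src_compile() {
        # hypothetical helper: repeats the build once per implementation
        # listed in MPI_COMPAT, with the environment switched as in 1.2.1
        mpi_foreach_impl default
    }

    src_install() {
        # hypothetical helper: installs each build into
        # /usr/lib/mpi/<implementation>/{bin,lib,...} as in 1.2.3
        mpi_foreach_impl default
    }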

1.3. The idea to use modules comes from the wish to give unprivileged users the ability to easily switch their custom environments (and they may have many!) to any required MPI implementation. This solution was targeted at HPC clusters after all, so users will have a lot of custom MPI software with different and conflicting requirements.

This is not the only way to do this, eselect could be used as well. But eselect modules are hard to implement in a way that works for unprivileged users and even harder in a way that allows users to have multiple different environments active at the same time (e.g. in different shells for different tasks).

So modules was selected for this job. As a bonus we have a tool that many HPC users already know.
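For an unprivileged user the switching then looks like the usual environment-modules workflow, e.g. (module names are only illustrative):

    # per shell, no root needed; different shells can use different MPIs
    module avail mpi
    module load mpi/openmpi
    mpicc hello.c -o hello
    mpirun -np 4 ./hello
    module switch mpi/openmpi mpi/mpich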

2. After implementing paragraph 1 above we can make things more complicated in several directions:
2.1. Support more MPI implementations.
2.2. Support more granularity in MPI switching.
2.3. Support more granularity in package building.

2.1. Support more MPI implementations.

Add not just mpich and openmpi, but also mpich2, mvapich2, intel mpi and others if necessary.

2.2. Support more granularity in MPI switching.

2.2.1. Support multiple versions of the same implementation.

In par.1 we just supported different MPI implementations. So let's now support different versions of the same implementation.

This is actually a much harder task when it comes to the details. We can't just enlist all versions as is done for python, because there are too many versions and users may require very specific minor versions, e.g. 4.0.2 and 4.0.3. I had such cases on a real Gentoo HPC cluster when I was working in HPC. The versions were different, but the problem was the same.

In order to implement this, much more complicated MPI_COMPAT handling is required, probably with some range specifiers like (=openmpi-4.0.2, >=openmpi-3.1.0 <=openmpi-3.1.2).
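Purely as an illustration, such range-aware MPI_COMPAT handling might end up looking something like this (invented syntax, nothing of the sort exists yet):

    # hypothetical: an exact pin, a bounded range and a plain implementation
    MPI_COMPAT=( "=openmpi-4.0.2" ">=openmpi-3.1.0 <=openmpi-3.1.2" mpich )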

2.2.2. Support multiple instances of the very same version of the same implementation.

MPI can be configured with hundreds of flags. Some configurations are not compatible or have a large performance impact on some tasks. I have a real use case where two different users requested the very same version configured differently and in incompatible ways (so it is not possible to build openmpi in a way that satisfies both at the same time).

This is the hardest job of all. Basically we need to be able to handle multiple builds of the same PVR with different USE flags. Portage is just not designed for this. I see no good solution here. Possible ideas here:
- create a fake mpi ${PN} for different sets of USE flags;
- create fake slots;
- use different prefixes.

2.3. Support more granularity in package building.

Often only a small portion of a large package uses MPI: e.g. some library or binary. If we are talking about something huge like ROOT, it would be really cool if we could rebuild only the MPI-dependent parts for each MPI implementation instead of rebuilding the whole package N times.

Such a job can be facilitated by mpi-select.eclass helper functions. Some automation may also help to reduce the work for maintainers of MPI packages.
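In an ebuild that could look roughly like the sketch below; the helper and the subdirectory are hypothetical, the point is only that the per-implementation loop wraps a small part of the build instead of the whole package:

    src_compile() {
        # build the non-MPI bulk of the package once
        default

        # rebuild only the MPI-dependent component per implementation
        # (hypothetical helper and path, for illustration only)
        mpi_foreach_impl emake -C src/mpi
    }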

3. Port MPI packages.

After all this, or at least par. 1, we need to port the actual MPI software to the new approach. This will take a lot of time as well.

*****************************************

So, that was my whole idea. As you can see, it is very complicated and goes far beyond 3 months of GSoC. It was my hope that during GSoC 2017 part 1 described above would be implemented at least for openmpi and mpich, and that at least HPL (as a perfect example and testing tool) would be adapted to the new infrastructure. But even this was not possible to implement within 3 months with a capable student, who was nonetheless not yet at a professional level in this field. (GSoC students are not required to be professionals; the program is here to help them learn to become one.)
Comment 5 Aisha Tammy 2020-12-05 17:37:07 UTC
(Will give a larger response for the whole comment in a while)

> So modules was selected for this job. As a bonus we have a tool that many HPC users already know.

AH HA!!!
Nice, this was the key part I was failing to see, about how you were managing the environment!!

As an awesome coincidence, we have modules and Lmod both currently available (up to date and all tests passing) in the ::science overlay!

So there is hope yet :)

Maybe as a starting point we can move modules and Lmod to ::gentoo first.
Having environment management available in tree sounds like an awesome tool to have.

Does this tiny first step sound good?
Comment 6 Andrew Savchenko gentoo-dev 2020-12-05 21:55:13 UTC
(In reply to Aisha Tammy from comment #5)
> As an awesome coincidence, we have modules and Lmod both currently available
> (up to date and all tests passing) in the ::science overlay!

Well, this is not entirely a coincidence: if you look at the git log, I supported modules back in 2017 during GSoC and improved the package based on the testing it received during GSoC.
 
> So there is hope yet :)
> 
> Maybe as a starting point we can move modules and Lmod to ::gentoo first.
> Having environment management available in tree sounds like an awesome tool
> to have.
> 
> This tiny first step sounds good?

Yes, sounds nice.
Comment 7 Aisha Tammy 2020-12-05 22:24:40 UTC
> The only way to handle support for multiple MPI implementations is to build
> a package (or the MPI-dependent part of it) multiple times, once with each
> desired MPI. The basic idea is no different from how Gentoo handles multiple
> python or ruby versions.

Do you want to create a wrapper like https://github.com/mgorny/python-exec ?

I am not sure if I am too keen on the idea just yet.

> The original idea was layered (each next layer is more complex), so let's
> start from the core functionality:
> 
> 1. Support of different MPI implementations (e.g. openmpi and mpich).
> 
> The package code will be rebuilt for each MPI implementation and helper
> eclasses will provide a convenient way to do this. Basically the idea is
> the same as for python: if one needs both python3.7 and python3.8 support,
> the code is rebuilt for each version and the results are placed in
> different paths.
> 
> 1.1. mpi-providers.eclass provides functions to build and install the
> various MPI implementations in the way we want: so that their paths for
> binaries, headers, docs and all other files can coexist and follow a common
> pattern that can be switched at run-time.
> 
> 1.2. mpi-select.eclass provides a way to select MPI implementations for
> packages *using* MPI directly or indirectly.
> 
> Each supported MPI implementation is selectable via MPI_COMPAT.
> For each selected implementation we:
> 1.2.1. Switch the environment to that implementation; this means that the
> application can find the desired MPI implementation and all software linked
> with it via the available environment variables, so no extra configuration
> of the package is required.
> 1.2.2. Build the package for the selected implementation.
> 1.2.3. Install its files in MPI-dependent dirs, e.g.
> /usr/lib/mpi/$implementation/{bin,lib,etc}.
> 
> 1.3. The idea to use modules comes from the wish to give unprivileged users
> the ability to easily switch their custom environments (and they may have
> many!) to any required MPI implementation. This solution was targeted at HPC
> clusters after all, so users will have a lot of custom MPI software with
> different and conflicting requirements.
> 
> This is not the only way to do this, eselect could be used as well. But
> eselect modules are hard to implement in a way that works for unprivileged
> users and even harder in a way that allows users to have multiple different
> environments active at the same time (e.g. in different shells for
> different tasks).
> 
> So modules was selected for this job. As a bonus we have a tool that many
> HPC users already know.

I do prefer the modules implementation. But how does this provide a default implementation? Won't users need to activate at least one of them before anything can be used?

If we use modules, does the wrapper (like python-exec) still need to be created? I think we only need one of the two, right?

> 2. After implementing paragraph 1 above we can make things more complicated
> in several directions:
> 2.1. Support more MPI implementations.
> 2.2. Support more granularity in MPI switching.
> 2.3. Support more granularity in package building.
> 
> 2.1. Support more MPI implementations.
> 
> Add not just mpich and openmpi, but also mpich2, mvapich2, intel mpi and
> others if necessary.

Yes, this is a good idea. I have an Intel MPI PR open to bring it into the tree (although with only rudimentary functionality) -
https://github.com/gentoo/gentoo/pull/18452


> 
> 2.2. Support more granularity in MPI switching.
> 
> 2.2.1. Support multiple versions of the same implementation.
> 
> In par.1 we just supported different MPI implementations. So let's now
> support different versions of the same implementation.
> 
> This is actually a much harder task when it comes to the details. We can't
> just enlist all versions as is done for python, because there are too many
> versions and users may require very specific minor versions, e.g. 4.0.2 and
> 4.0.3. I had such cases on a real Gentoo HPC cluster when I was working in
> HPC. The versions were different, but the problem was the same.
> 
> In order to implement this, much more complicated MPI_COMPAT handling is
> required, probably with some range specifiers like (=openmpi-4.0.2,
> >=openmpi-3.1.0 <=openmpi-3.1.2).
> 
> 2.2.2. Support multiple instances of the very same version of the same
> implementation.
> 
> MPI can be configured with hundreds of flags. Some configurations are not
> compatible or have a large performance impact on some tasks. I have a real
> use case where two different users requested the very same version
> configured differently and in incompatible ways (so it is not possible to
> build openmpi in a way that satisfies both at the same time).
> 
> This is the hardest job of all. Basically we need to be able to handle
> multiple builds of the same PVR with different USE flags. Portage is just
> not designed for this. I see no good solution here. Possible ideas here:
> - create a fake mpi ${PN} for different sets of USE flags;
> - create fake slots;
> - use different prefixes.
> 

I am quite against this, at least in the way it has been proposed to be solved.

This is quickly devolving into full-fledged module provisioning for packages, which Portage is not designed for.

Instead, if we want to do this, we can use a different environment modules provider: Lmod.
Lmod supports hierarchical modules - https://lmod.readthedocs.io/en/latest/080_hierarchy.html
Lmod is also backwards compatible with the TCL/TK modules - https://lmod.readthedocs.io/en/latest/045_transition.html

> - Lmod reads modulefiles written in TCL. There is typically no need to translate 
> modulefiles written in TCL into Lua. Lmod does this for you automatically.
> - Some users can run Lmod while others use the old environment module system.
> - However, no user can run both at the same time in the same shell.

Lmod also has built-in conversion for existing modulefiles - https://lmod.readthedocs.io/en/latest/073_tmod_to_lmod.html
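The hierarchy feature maps nicely onto the 2.2.x cases above: loading an MPI module extends MODULEPATH, so only software built against that exact MPI becomes visible. Roughly (paths and module names are illustrative):

    # loading the MPI prepends its own modulefile tree to MODULEPATH,
    # e.g. /usr/share/lmod/modulefiles/mpi/openmpi/4.0.2
    module load openmpi/4.0.2
    module avail    # MPI-dependent packages built for openmpi-4.0.2 are now visible
    module load hpl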

> 2.3. Support more granularity in package building.
> 
> Often only a small portion of a large package uses MPI: e.g. some library
> or binary. If we are talking about something huge like ROOT, it would be
> really cool if we could rebuild only the MPI-dependent parts for each MPI
> implementation instead of rebuilding the whole package N times.
> 
> Such a job can be facilitated by mpi-select.eclass helper functions. Some
> automation may also help to reduce the work for maintainers of MPI packages.
> 
> 3. Port MPI packages.
> 
> After all this, or at least par. 1, we need to port the actual MPI software
> to the new approach. This will take a lot of time as well.
> 
> *****************************************
> 
> So, that was my whole idea. As you can see, it is very complicated and goes
> far beyond 3 months of GSoC. It was my hope that during GSoC 2017 part 1
> described above would be implemented at least for openmpi and mpich, and
> that at least HPL (as a perfect example and testing tool) would be adapted
> to the new infrastructure. But even this was not possible to implement
> within 3 months with a capable student, who was nonetheless not yet at a
> professional level in this field. (GSoC students are not required to be
> professionals; the program is here to help them learn to become one.)

This is a staggering amount of work!

No one has even managed to try to flesh this out, and even professional distributions like RHEL do not have even the remotest support for these kinds of configs.
If we want to make it possible, we really need to make sure to do this very, very carefully.
I suggest we start one thing at a time; doing all of the above together is nigh impossible.

Initial steps:

- import modules
- import Lmod (needs to be ported to the newer Lua eclasses in the future, but currently we can import it fine) - upstream is willing to accept the PR for supporting proper Lua versioning - https://github.com/TACC/Lmod/issues/481
- import Intel MPI - PR - https://github.com/gentoo/gentoo/pull/18452

I have asked the current maintainer for modules in ::science to make a PR for moving it to ::gentoo, hopefully they will have time to do it sometime soon :)

I will make the PR for Lmod (and its dependencies).
Comment 8 Aisha Tammy 2020-12-05 22:26:41 UTC
(In reply to Andrew Savchenko from comment #6)
> (In reply to Aisha Tammy from comment #5)
> > As an awesome coincidence, we have modules and Lmod both currently available
> > (up to date and all tests passing) in the ::science overlay!
> 
> Well, this is not entirely a coincidence: if you look at the git log, I
> supported modules back in 2017 during GSoC and improved the package based
> on the testing it received during GSoC.

Ah, I need to look at git logs more :P
Comment 9 Aisha Tammy 2020-12-06 02:39:15 UTC
here is the PR for adding Lmod - https://github.com/gentoo/gentoo/pull/18523

All tests are passing for both packages.

Currently, they are using the standard Lua build system, which will need to be moved to the newer Lua eclasses, but that's for another time.