| Summary: | sci-libs/vtk-9.2.5: computes absurdly large build-time RAM requirement | | |
|---|---|---|---|
| Product: | Gentoo Linux | Reporter: | Will Simoneau <bugzilla> |
| Component: | Current packages | Assignee: | Gentoo Science Related Packages <sci> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | negril.nx+gentoo, proxy-maint, waebbl-gentoo |
| Priority: | Normal | Keywords: | PullRequest |
| Version: | unspecified | | |
| Hardware: | All | | |
| OS: | Linux | | |
| See Also: | https://github.com/gentoo/gentoo/pull/31487 | | |
| Whiteboard: | | | |
| Package list: | | Runtime testing required: | --- |
Description (Will Simoneau, 2023-03-14 19:51:18 UTC):
> Multiplying the memory requirement by the number of build jobs is complete nonsense. The minimum amount of memory required for the build to succeed is in general *NOT* a linear function of the number of parallel build jobs.
It's a fair rule of thumb to say each job may take up to ~2GB. It's conservative, and yes, it doesn't work so well with large numbers of cores or large amounts of RAM.

It could indeed be tweaked, but it's not a totally unreasonable starting point.
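To make the scale of the problem concrete, here is a minimal sketch (not the actual check-reqs or ebuild code) of the linear jobs-times-2G estimate described above; the -j52 job count is the one reported later in this thread:

```bash
#!/bin/bash
# Illustrative sketch only, not the real check-reqs/ebuild logic:
# a straight jobs-times-2G estimate blows up on machines with many cores.
jobs=52        # e.g. MAKEOPTS="-j52", as used later in this thread
per_job_gb=2   # the ~2G-per-job rule of thumb
echo "estimated build RAM: $(( jobs * per_job_gb ))G"   # prints 104G
```

With that many jobs the estimate lands far above what the build actually needs, which is what triggers the check-reqs failure described in this bug.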
(In reply to Will Simoneau from comment #0)
> IMO it might be reasonable to just change the minimum RAM check to trigger a
> warning instead of an error.

The eclass always triggers an error if the amount is exceeded. It does, however, have a user-settable flag to issue only a warning instead of failing. You might give this a try. I'd be happy if you could report any findings so we can improve the way we calculate the requirements.

Like Sam said, it's a conservative approach to estimating the needed amount of RAM. Individual nvcc tasks require up to 7G of RAM. Some files require less, but most of the files I watched were close to this amount.

I have a machine with only one CPU (8 cores, 16 threads) and 32G RAM (+32G swap) and ran into an OOM-like scenario. It wasn't a kernel OOM, but the machine became so slow that there was virtually no responsiveness left. After some time in this state I hard-reset the machine to get it back to working. The cuda USE flag had been masked for a considerable time because of major issues in the last few versions, and was only re-enabled with v9.2, IIRC.

(In reply to Bernd from comment #2)
> (In reply to Will Simoneau from comment #0)
> > IMO it might be reasonable to just change the minimum RAM check to trigger a
> > warning instead of an error.
>
> The eclass always triggers an error if the amount is exceeded. It does,
> however, have a user-settable flag to issue only a warning instead of
> failing. You might give this a try.

Thanks for the tip - I wasn't aware that I could just set CHECKREQS_DONOTHING=1 to get the behavior I wanted.

> I'd be happy if you could report any findings so we can improve the way we
> calculate the requirements.

FWIW, peak memory usage during a build of sci-libs/vtk-9.2.5[cuda] with:

dev-util/nvidia-cuda-toolkit-11.8.0-r3
sys-devel/gcc-11.3.1_p20230120-r1
sys-devel/binutils-2.39-r4
MAKEOPTS="-j52 -l128"
VTK_CUDA_ARCH=pascal

... seems to have been only ~9.1G. I did see one nvcc process peak at ~7GB RSS and a few others at ~4.5GB RSS, but I didn't see multiple large-RSS jobs running simultaneously. (Which of course might come down to pure luck / the arbitrary execution order of the individual compile jobs.)

> I have a machine with only one CPU (8 cores, 16 threads) and 32G RAM (+32G
> swap) and ran into an OOM-like scenario. It wasn't a kernel OOM, but the
> machine became so slow that there was virtually no responsiveness left.
> After some time in this state I hard-reset the machine to get it back to
> working.

Yeah, IME Linux does a poor job of handling parallel build jobs pushing the system deep into swap. In theory that can be dealt with by appropriate use of cgroups, though I personally consider that approach far too complicated to bother with. Instead I just configure machines that have >=32GB RAM or so without swap. Can't softlock due to a swap-storm if there isn't any swap to begin with ;-) I find it much nicer overall to just let the OOM-killer step in when necessary.

I just tested without swap with -j8 and it finished without hitting OOM, but at some points there was only ~1-2G of free RAM left. My approach now would be to require 4*7G if more than 4 jobs are set in MAKEOPTS/NINJAOPTS, and jobs*7G otherwise, so at most 28G of RAM would be required.
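A minimal sketch of that proposed cap, assuming the job count comes from MAKEOPTS; this is not the exact logic that later landed in the ebuild, and the values are hard-coded here for illustration:

```bash
#!/bin/bash
# Sketch of the proposed cap: at most 4 parallel nvcc calls at ~7G each,
# so the computed requirement never exceeds 28G.
jobs=8                                   # e.g. from MAKEOPTS="-j8"
per_nvcc_gb=7                            # observed peak RSS of one nvcc call
capped_jobs=$(( jobs > 4 ? 4 : jobs ))   # cap at 4 parallel nvcc calls
echo "required RAM: $(( capped_jobs * per_nvcc_gb ))G"   # prints 28G for -j8
```

If such a check still trips on a particular machine, the thread above notes that setting CHECKREQS_DONOTHING=1 makes the check-reqs failure non-fatal (warning only).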
The bug has been referenced in the following commit(s):
https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=1590d50aab5daa504272ee6104c5b06e3d5d037b

commit 1590d50aab5daa504272ee6104c5b06e3d5d037b
Author:     Paul Zander <negril.nx+gentoo@gmail.com>
AuthorDate: 2023-06-16 16:32:01 +0000
Commit:     Sam James <sam@gentoo.org>
CommitDate: 2023-06-28 21:09:20 +0000

    sci-libs/vtk: reduce required memory for cuda compilation

    Prior logic assumes infinite parallel nvcc calls, while real-life testing
    shows a max of 4. This adds crude logic to require no more memory than
    needed for 4 parallel calls.

    Bug: https://bugs.gentoo.org/901241
    Signed-off-by: Paul Zander <negril.nx+gentoo@gmail.com>
    Signed-off-by: Sam James <sam@gentoo.org>

 sci-libs/vtk/vtk-9.2.5.ebuild | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Thanks a lot for the patch and the update to 9.2.6. I sadly didn't have any time during the last few weeks to look after my Gentoo ebuilds at all and am thankful you took care of it.

(In reply to Bernd from comment #6)
> Thanks a lot for the patch and the update to 9.2.6. I sadly didn't have any
> time during the last few weeks to look after my Gentoo ebuilds at all and am
> thankful you took care of it.

No worries Bernd, just glad you're OK. I was wondering about sending an email. Hope to speak soon.