Since Jobs were introduced in portage, it is nice to see up to X jobs running at a time. What I noticed is that there is still only one fetch going on in the background.

Reproducible: Always

Steps to Reproduce:
1. It starts with 4 jobs, downloads the first, then the second, etc.
2. Let's say the first was a small package of ~500kb and is already done, and the third is a larger one, etc.
3. At some stage only 1 job is running because the other jobs have to wait for it to be downloaded.

Actual Results:
As an example: I can build PHP in ~2min on my machine, but it takes me ~5-6min to download it (ok, my connection is not very good), and this stalls every following job that could have been done in the meantime.

Expected Results:
It would be nice if every job triggered its own fetch in parallel.
If you have a slow connection, adding more fetch jobs to be processed at once will NOT help anything, it will just stall in a different way.
But if one package has ~50mb to download and it stalls a 500kb package, I think it would be an improvement. How many jobs you want to run can be set in make.conf anyway.
(In reply to comment #0)
> 1. It starts with 4 jobs, downloads the first, then the second etc.

It will actually download all 4 in parallel. The relevant code is here:

http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=ef58bc7573ddce5e3a5466eea50160b81de8edf4

When downloading in parallel, each fetcher's output goes to the corresponding build log (and that part of the build log is discarded if the fetch is successful).

> 2. lets say the first was a small package of ~500kb and is already done and
> the third is a larger one etc.
> 3. at some stage only 1 job is running because other jobs have to wait for
> it to be downloaded

Something like this could happen if all the other jobs depend on the one that's currently being fetched and logged in /var/log/emerge-fetch.log. In order to fix this, we'd have to create separate logs for each fetcher.

(In reply to comment #1)
> If you have a slow connection, adding more fetch jobs to be processed at
> once will NOT help anything, it will just stall in a different way.

We can add a --fetch-jobs=N option so that people can tune the number of concurrent fetch jobs for their connection speed.
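Purely as an illustration of how a --fetch-jobs=N knob could bound concurrency, here is a minimal asyncio sketch; the names (fetch_all, download) are hypothetical stand-ins, not portage API:

```python
import asyncio

# Hypothetical sketch of a --fetch-jobs=N limit: a semaphore caps how
# many downloads run at once. download() stands in for the real fetch
# logic; it is not portage code.
async def fetch_all(uris, fetch_jobs=4):
    sem = asyncio.Semaphore(fetch_jobs)
    fetched = []

    async def download(uri):
        async with sem:  # at most fetch_jobs concurrent downloads
            await asyncio.sleep(0)  # placeholder for the actual transfer
            fetched.append(uri)

    await asyncio.gather(*(download(u) for u in uris))
    return fetched
```

With fetch_jobs=1 this degenerates to the current serial behavior, so the same code path could serve both the default and a user-tuned value.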
One example:

Calculating dependencies... done!
>>> Verifying ebuild manifests
>>> Starting parallel fetch
>>> Emerging (1 of 13) sys-kernel/linux-firmware-20120708
>>> Emerging (2 of 13) media-libs/libpng-1.5.12
>>> Jobs: 0 of 13 complete, 1 running Load avg: 1.07, 0.86, 0.88

Since linux-firmware is 15M to download, it stalls the other jobs?
(In reply to comment #4)
> One example:
> Calculating dependencies... done!
> >>> Verifying ebuild manifests
> >>> Starting parallel fetch
> >>> Emerging (1 of 13) sys-kernel/linux-firmware-20120708
> >>> Emerging (2 of 13) media-libs/libpng-1.5.12
> >>> Jobs: 0 of 13 complete, 1 running Load avg: 1.07, 0.86, 0.88
>
> since linux-firmware is 15M to download, it stalls the other jobs?

Well, you could be looking at a case of bug 403895 there, which is fixed in portage-2.1.11.x.
slashbeast: I remember we talked about this a few months ago, is there another bug for this or is it just this one?
I remember discussing it, I think in regard to golang projects' dependencies taking ages to fetch back when GODEP was a thing, but I cannot find a bug that I've opened for it, so perhaps I never created one. This seems valid though.
Also, can we rename parallel-fetch to background-fetch? OP makes a good point that anyone seeing this option is going to think it relates to jobs.

I now have multiple packages with >100 dependencies to download (blame go, rust, and node stuff), and most are only a few hundred KB; each one takes a few seconds to communicate with the mirrors. It adds up to many minutes.

A true parallel fetch with multiple fetch jobs at a time would greatly reduce this. dev-vcs/repo seems to default to 4 for fetching git repos (github doesn't seem to like it when going much higher, but 4 has been bulletproof). For HTTPS mirror fetching we could probably safely go even higher...
(In reply to Joe Kappus from comment #8)
> Also, can we rename parallel-fetch to background-fetch? OP makes a good
> point that anyone seeing this option is going to think it relates to jobs.

Yeah, or maybe background-prefetch (the internals refer to the corresponding fetchers as prefetchers). It feels kind of crazy to rename it after it has existed for nearly 20 years now, so maybe we should just update the documentation to compare/contrast with the sort of parallel fetch that can happen with emerge --jobs.

> I now have multiple packages with >100 dependencies to download (blame go,
> rust, node stuff) and most are only a few hundred KB, each one takes a few
> seconds to communicate with the mirrors. It adds up to many minutes.
>
> A true parallel fetch with multiple fetch jobs at a time would greatly
> reduce this. dev-vcs/repo seems to default to 4 for fetching git repos
> (github doesn't seem to like it when going much higher, but 4 has been
> bulletproof). HTTPS mirror fetching we could probably safely go even
> higher...

I'm thinking about how we could handle the logging here. I suppose in this case we could simply send the fetch output to /dev/null (that's what parallel fetch originally did in https://gitweb.gentoo.org/proj/portage.git/commit/?id=0e5af163b1fe7cb5ec9101930ce0905713ed775b), then retry serially with logging for anything that failed.
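The quiet-pass-then-serial-retry idea can be sketched as follows; all names here (fetch_quiet, fetch_logged, quiet_then_serial) are hypothetical, and the fetchers are stubs rather than real downloads:

```python
import asyncio

# Sketch of the logging strategy above: first fetch everything
# concurrently with output discarded, then retry any failures one at
# a time with full logging. All names are illustrative.
def fetch_quiet(uri):
    # stand-in fetcher whose output would go to /dev/null;
    # pretend URIs containing "bad" fail
    return "bad" not in uri

def fetch_logged(uri, log):
    # serial retry path: this time we keep a usable log
    log.append("retrying %s" % uri)
    return fetch_quiet(uri)

async def quiet_then_serial(uris, jobs=4):
    sem = asyncio.Semaphore(jobs)
    failed = []

    async def worker(uri):
        async with sem:
            if not await asyncio.to_thread(fetch_quiet, uri):
                failed.append(uri)

    await asyncio.gather(*(worker(u) for u in uris))

    log = []
    for uri in failed:  # serial, logged retries
        fetch_logged(uri, log)
    return failed, log
```

The appeal of this shape is that the common case (everything fetches cleanly) never pays any logging cost, while failures still end up with a readable, non-interleaved log.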
For the prefetcher jobs, we can create and lock the same directories we would use for the corresponding builds, and log to $T/build.log as usual. We do something similar for pkg_pretend, and do not delete the log if it fails.
In https://github.com/gentoo/portage/pull/1361, related to bug 936273, I've converted the fetch function into an async_fetch coroutine function that could be modified to support concurrent fetching of multiple files. However, given the large number of files we could be dealing with here, it would be helpful to have a fetch job server similar to the cpu job server proposed in bug 692576. Note that for fetch jobs there is no equivalent to what --load-average provides for cpu jobs. Also, it would obviously be nice to avoid opening too many concurrent connections to the same server, though the fetch function already rotates through mirrors to balance load.
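To make the two limits concrete, here is a small sketch combining a global fetch-job cap with a per-host cap so no single mirror sees too many concurrent connections. The FetchLimiter class and its defaults are invented for illustration, not portage internals:

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

# Illustrative sketch (not portage API): a global fetch-job cap plus
# a per-host cap, so the total concurrency is bounded and no single
# mirror gets hammered with connections.
class FetchLimiter:
    def __init__(self, total_jobs=8, per_host=2):
        self.total = asyncio.Semaphore(total_jobs)
        self.hosts = defaultdict(lambda: asyncio.Semaphore(per_host))

    async def fetch(self, uri):
        host = urlparse(uri).netloc
        # hold both the global slot and this host's slot for the
        # duration of the transfer
        async with self.total, self.hosts[host]:
            await asyncio.sleep(0)  # placeholder for the real download
            return host
```

A real job server would share the global semaphore across emerge processes (e.g. via a pipe-based token protocol like make's jobserver), but the per-host bookkeeping could stay local to each fetcher as shown.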