Bug 425682 - True parallel fetch for each job running (current parallel-fetch is asynchronous)
Status: CONFIRMED
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Enhancement/Feature Requests
Hardware: All Linux
Importance: Normal enhancement
Assignee: Portage team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 377365
Reported: 2012-07-10 15:15 UTC by Marcus Becker
Modified: 2024-08-01 19:51 UTC (History)
CC List: 4 users

See Also:
Package list:
Runtime testing required: ---


Description Marcus Becker 2012-07-10 15:15:39 UTC
Since jobs were introduced in Portage, it is nice to see up to X jobs running at a time. However, I noticed that there still seems to be only one fetch going on in the background.

Reproducible: Always

Steps to Reproduce:
1. It starts with 4 jobs, downloads the first, then the second, etc.
2. Let's say the first was a small package of ~500 kB and is already done, and the third is a larger one, etc.
3. At some stage only 1 job is running because the other jobs have to wait for their downloads.
Actual Results:  
As an example: I can build PHP in ~2 min on my machine, but it takes me ~5-6 min to download (OK, my connection is not very good). This stalls every following job that could have been done in the meantime.

Expected Results:  
It would be nice if every job triggered its own fetch in parallel.
Comment 1 Jeremy Olexa (darkside) (RETIRED) archtester gentoo-dev Security 2012-07-10 15:54:44 UTC
If you have a slow connection, adding more fetch jobs to be processed at once will NOT help anything; it will just stall in a different way.
Comment 2 Marcus Becker 2012-07-10 16:11:56 UTC
But if one package has ~50 MB to download and it stalls a 500 kB package, I think it would be an improvement. How many jobs you want to run can be set in make.conf anyway.
Comment 3 Zac Medico gentoo-dev 2012-07-10 20:28:32 UTC
(In reply to comment #0)
> 1. It starts with 4 jobs, downloads the first, then the second etc.

It will actually download all 4 in parallel. The relevant code is here:

http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=ef58bc7573ddce5e3a5466eea50160b81de8edf4

When downloading in parallel, each fetcher's output goes to the corresponding build log (and that part of the build log is discarded if the fetch is successful).

> 2. lets say the first was a small package of ~500kb and is already done and
> the third is a larger one etc.
> 3. at some stage only 1 job is running because other jobs have to wait for
> it to be downloaded

Something like this could happen if all other jobs depend on the one that's currently being fetched/logged in /var/log/emerge-fetch.log. In order to fix this, we'd have to create separate logs for each fetcher.

(In reply to comment #1)
> If you have a slow connection, adding more fetch jobs to be processed at
> once will NOT help anything, it will just stall in a different way.

We can add a --fetch-jobs=N option so that people can tune the number of concurrent fetch jobs for their connection speed.
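
A minimal sketch of how such a limit might behave, not Portage code. The --fetch-jobs flag is only proposed in this bug and does not exist in emerge; parse_fetch_jobs, fetch_all, PeakCounter and the file names are hypothetical, and fake_fetch sleeps instead of downloading anything.

```python
import argparse
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def parse_fetch_jobs(argv):
    """Parse a hypothetical --fetch-jobs=N option (not an existing emerge flag)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--fetch-jobs", type=int, default=1, metavar="N")
    args = parser.parse_args(argv)
    if args.fetch_jobs < 1:
        parser.error("--fetch-jobs must be at least 1")
    return args.fetch_jobs


def fetch_all(names, fetch_jobs, fetch_one):
    """Run fetch_one(name) for every name, with at most fetch_jobs at a time."""
    with ThreadPoolExecutor(max_workers=fetch_jobs) as pool:
        # map() returns results in the same order as the input names.
        return list(pool.map(fetch_one, names))


class PeakCounter:
    """Records the highest number of simultaneously running fetches."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def fake_fetch(self, name):
        # Stand-in for a real download: sleep instead of touching the network.
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.02)
        with self._lock:
            self.active -= 1
        return name


if __name__ == "__main__":
    jobs = parse_fetch_jobs(["--fetch-jobs=2"])
    counter = PeakCounter()
    fetch_all(["a.tar.gz", "b.tar.gz", "c.tar.gz", "d.tar.gz"], jobs, counter.fake_fetch)
    print(f"fetch jobs allowed: {jobs}, peak concurrency seen: {counter.peak}")
```

The worker pool size is the only knob here, so someone on a slow link could set N to 1 and get the serial behaviour back.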
Comment 4 Marcus Becker 2012-07-14 12:09:21 UTC
One example:
Calculating dependencies... done!
>>> Verifying ebuild manifests
>>> Starting parallel fetch
>>> Emerging (1 of 13) sys-kernel/linux-firmware-20120708
>>> Emerging (2 of 13) media-libs/libpng-1.5.12
>>> Jobs: 0 of 13 complete, 1 running               Load avg: 1.07, 0.86, 0.88

Since linux-firmware is a 15M download, does it stall the other jobs?
Comment 5 Zac Medico gentoo-dev 2012-07-14 20:37:13 UTC
(In reply to comment #4)
> One example:
> Calculating dependencies... done!
> >>> Verifying ebuild manifests
> >>> Starting parallel fetch
> >>> Emerging (1 of 13) sys-kernel/linux-firmware-20120708
> >>> Emerging (2 of 13) media-libs/libpng-1.5.12
> >>> Jobs: 0 of 13 complete, 1 running               Load avg: 1.07, 0.86, 0.88
> 
> since linux-firmware is 15M to download, it stalls the other jobs?

Well, you could be looking at a case of bug 403895 there, which is fixed in portage-2.1.11.x.
Comment 6 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2023-05-26 03:44:02 UTC
slashbeast: I remember we talked about this a few months ago, is there another bug for this or is it just this one?
Comment 7 Piotr Karbowski (RETIRED) gentoo-dev 2023-05-26 16:37:26 UTC
I remember discussing it, I think in regard to golang projects' dependencies taking ages to fetch when GODEP was a thing, but I cannot find a bug that I opened for it, so perhaps I never created one. This seems valid though.
Comment 8 Joe Kappus 2024-02-02 21:05:51 UTC
Also, can we rename parallel-fetch to background-fetch? OP makes a good point that anyone seeing this option is going to think it relates to jobs.

I now have multiple packages with >100 dependencies to download (blame go, rust, node stuff). Most are only a few hundred KB, but each one takes a few seconds to communicate with the mirrors. It adds up to many minutes.

A true parallel fetch with multiple fetch jobs at a time would greatly reduce this. dev-vcs/repo seems to default to 4 for fetching git repos (github doesn't seem to like it when going much higher, but 4 has been bulletproof). For HTTPS mirror fetching we could probably safely go even higher...
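
A rough back-of-the-envelope sketch of the effect described above. The numbers (100 files, 3 seconds of per-file overhead) are illustrative assumptions, not measurements, and the model only holds when per-file latency dominates and bandwidth is not the bottleneck.

```python
import math


def total_seconds(files, seconds_per_file, fetch_jobs):
    """Estimated wall time when per-file latency dominates and bandwidth does not."""
    # Files are fetched in rounds of fetch_jobs at a time.
    rounds = math.ceil(files / fetch_jobs)
    return rounds * seconds_per_file


if __name__ == "__main__":
    for jobs in (1, 4, 8):
        print(f"{jobs} fetch job(s): about {total_seconds(100, 3, jobs)} s")
```

Under these assumptions a serial fetch takes about 300 seconds, while four concurrent fetch jobs bring it down to about 75 seconds.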
Comment 9 Zac Medico gentoo-dev 2024-02-02 23:10:10 UTC
(In reply to Joe Kappus from comment #8)
> Also, can we rename parallel-fetch to background-fetch? OP makes a good
> point that anyone seeing this option is going to think it relates to jobs.

Yeah, or maybe background-prefetch (internals refer to the corresponding fetchers as prefetchers).

It feels kind of crazy to rename it after it has existed for nearly 20 years now, so maybe we should just update the documentation to compare/contrast with the sort of parallel fetch that can happen with emerge --jobs.

> I now have multiple packages with >100 dependencies to download (blame go,
> rust, node stuff) and most are only a few hundred KB, each one takes a few
> seconds to communicate with the mirrors. It adds up to many minutes.
> 
> A true parallel fetch with multiple fetch jobs at a time would greatly
> reduce this. dev-vcs/repo seems to default to 4 for fetching git repos
> (github doesn't seem to like it when going much higher, but 4 has been
> bulletproof). HTTPS mirror fetching we could probably safely go even
> higher...

I'm thinking about how we could handle the logging here. I suppose in this case we could simply send the fetch output to /dev/null (that's what parallel fetch originally did in https://gitweb.gentoo.org/proj/portage.git/commit/?id=0e5af163b1fe7cb5ec9101930ce0905713ed775b), then retry serially with logging for anything that failed.
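
A minimal sketch of that two-pass idea, not Portage code. fetch_quiet_then_retry and the fetch_one(name, log) callable are hypothetical; discarding output through a no-op function stands in for redirecting the fetcher to /dev/null.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("fetch-retry-sketch")


def fetch_quiet_then_retry(names, fetch_one, fetch_jobs=4):
    """First pass: concurrent, output discarded. Second pass: serial, logged.

    fetch_one(name, log) is a hypothetical callable that returns True on
    success and writes any progress output through log(message).
    """
    names = list(names)

    def discard(_message):
        # Plays the role of redirecting fetcher output to /dev/null.
        pass

    def quiet(name):
        try:
            return fetch_one(name, discard)
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=fetch_jobs) as pool:
        results = list(pool.map(quiet, names))

    failed = [name for name, ok in zip(names, results) if not ok]

    still_failed = []
    for name in failed:
        # One at a time, so log lines from different fetches cannot interleave.
        logger.info("retrying %s", name)
        try:
            ok = fetch_one(name, logger.info)
        except Exception:
            logger.exception("fetch of %s raised", name)
            ok = False
        if not ok:
            still_failed.append(name)
    return still_failed
```

Only files that failed the quiet pass pay the cost of a second, logged attempt, so the common case produces no log output at all.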
Comment 10 Zac Medico gentoo-dev 2024-07-31 14:29:25 UTC
For the prefetcher jobs, we can create and lock the same directories we would use for the corresponding builds, and log to $T/build.log as usual. We do something similar for pkg_pretend, and do not delete it if it fails.
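
A minimal sketch of that approach, assuming a Unix system. It is not Portage's locking code: prefetch_with_build_log, the ".lock" file name, the directory layout, and the fetch_one(log_file) callable are all hypothetical, and removing the directory after a successful fetch is an assumption made for the sketch.

```python
import fcntl
import os
import shutil
import tempfile


def prefetch_with_build_log(build_dir, fetch_one):
    """Lock build_dir, log the fetch to build_dir/temp/build.log, keep it on failure.

    fetch_one(log_file) is a hypothetical callable returning True on success.
    """
    temp_dir = os.path.join(build_dir, "temp")  # stands in for $T
    os.makedirs(temp_dir, exist_ok=True)
    lock_path = build_dir + ".lock"
    with open(lock_path, "w") as lock_file:
        # Exclusive advisory lock: another holder of the same lock would block here.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            log_path = os.path.join(temp_dir, "build.log")
            with open(log_path, "a") as log_file:
                ok = fetch_one(log_file)
            if ok:
                # Assumption: nothing is worth keeping after a successful fetch.
                shutil.rmtree(build_dir)
            return ok
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


if __name__ == "__main__":
    root = tempfile.mkdtemp()

    def failing_fetch(log_file):
        log_file.write("simulated fetch failure\n")
        return False

    build_dir = os.path.join(root, "example-1.0")
    prefetch_with_build_log(build_dir, failing_fetch)
    print("log kept at", os.path.join(build_dir, "temp", "build.log"))
```

Because each prefetcher writes to its own build.log, concurrent fetches never share a log file.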
Comment 11 Zac Medico gentoo-dev 2024-08-01 19:51:06 UTC
In https://github.com/gentoo/portage/pull/1361 related to bug 936273, I've converted the fetch function into an async_fetch coroutine function that could be modified to support concurrent fetch of multiple files. However, given the large number of files we could be dealing with here, it would be helpful to have a fetch job server similar to the CPU job server proposed in bug 692576. Note that for fetch jobs there is no equivalent to what --load-average provides for CPU jobs.

Also, obviously it would be nice to avoid opening too many concurrent connections to the same server, though the fetch function already rotates through mirrors to balance load.
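
A minimal sketch of bounding concurrent fetches both globally and per server with asyncio semaphores. It is not the job server proposed above and does not use Portage's async_fetch, whose signature is not shown in this bug; fetch_many, the fetch_one coroutine, the limits, and the example.* mirror URLs are hypothetical.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


async def fetch_many(urls, fetch_one, max_jobs=8, max_per_host=2):
    """Await fetch_one(url) for every URL under a global and a per-host cap."""
    total = asyncio.Semaphore(max_jobs)
    per_host = defaultdict(lambda: asyncio.Semaphore(max_per_host))

    async def limited(url):
        host = urlsplit(url).hostname
        # Take the per-host slot first, so tasks queued behind a busy host
        # do not occupy global slots while they wait.
        async with per_host[host], total:
            return await fetch_one(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(limited(url) for url in urls))


async def demo():
    active = defaultdict(int)
    peak = defaultdict(int)

    async def fake_fetch(url):
        host = urlsplit(url).hostname
        active[host] += 1
        peak[host] = max(peak[host], active[host])
        await asyncio.sleep(0.01)  # stand-in for network I/O
        active[host] -= 1
        return url

    urls = [f"https://mirror{i % 2}.example/distfiles/file{i}.tar.gz" for i in range(10)]
    results = await fetch_many(urls, fake_fetch, max_jobs=3, max_per_host=2)
    return results == urls, dict(peak)


if __name__ == "__main__":
    ok, peak = asyncio.run(demo())
    print("all fetched in order:", ok, "peak connections per host:", peak)
```

The global cap plays the role that a fetch job server would play across processes, while the per-host cap keeps any single mirror from seeing too many simultaneous connections.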