Since Jobs were introduced in portage, it is nice to see up to X jobs running at a time. What I noticed is that there is still only one fetch going on in the background.

Reproducible: Always

Steps to Reproduce:
1. It starts with 4 jobs, downloads the first, then the second, etc.
2. Let's say the first was a small package of ~500kb and is already done, and the third is a larger one, etc.
3. At some stage only 1 job is running because the other jobs have to wait for it to be downloaded.

Actual Results:
As an example: I can build PHP in ~2min on my machine, but it takes me ~5-6min to download it (ok, my connection is not very good), and this stalls every following job that could have been done in the meantime.

Expected Results:
It would be nice if every job triggered its own fetch in parallel.
If you have a slow connection, adding more fetch jobs to be processed at once will NOT help anything, it will just stall in a different way.
But if one package has ~50mb to download and it stalls a 500kb package, I think it would be an improvement. How many jobs you want to run can be set in make.conf anyway.
(In reply to comment #0)
> 1. It starts with 4 jobs, downloads the first, then the second etc.

It will actually download all 4 in parallel. The relevant code is here:

http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=ef58bc7573ddce5e3a5466eea50160b81de8edf4

When downloading in parallel, each fetcher's output goes to the corresponding build log (and that part of the build log is discarded if the fetch is successful).

> 2. lets say the first was a small package of ~500kb and is already done and
> the third is a larger one etc.
> 3. at some stage only 1 job is running because other jobs have to wait for
> it to be downloaded

Something like this could happen if all the other jobs depend on the one that's currently being fetched and logged in /var/log/emerge-fetch.log. In order to fix this, we'd have to create separate logs for each fetcher.

(In reply to comment #1)
> If you have a slow connection, adding more fetch jobs to be processed at
> once will NOT help anything, it will just stall in a different way.

We can add a --fetch-jobs=N option so that people can tune the number of concurrent fetch jobs for their connection speed.
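Purely as an illustration of how a --fetch-jobs=N knob could bound concurrency, here is a minimal asyncio sketch; the names (fetch_all, download) are hypothetical stand-ins, not portage API:

```python
import asyncio

# Hypothetical sketch of a --fetch-jobs=N limit: a semaphore caps how
# many downloads run at once. download() stands in for the real fetch
# logic; it is not portage code.
async def fetch_all(uris, fetch_jobs=4):
    sem = asyncio.Semaphore(fetch_jobs)
    fetched = []

    async def download(uri):
        async with sem:  # at most fetch_jobs concurrent downloads
            await asyncio.sleep(0)  # placeholder for the actual transfer
            fetched.append(uri)

    await asyncio.gather(*(download(u) for u in uris))
    return fetched
```

With fetch_jobs=1 this degenerates to the current serial behavior, so the same code path could serve both the default and a user-tuned value.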
One example:

Calculating dependencies... done!
>>> Verifying ebuild manifests
>>> Starting parallel fetch
>>> Emerging (1 of 13) sys-kernel/linux-firmware-20120708
>>> Emerging (2 of 13) media-libs/libpng-1.5.12
>>> Jobs: 0 of 13 complete, 1 running Load avg: 1.07, 0.86, 0.88

Since linux-firmware is 15M to download, it stalls the other jobs?
(In reply to comment #4)
> One example:
> Calculating dependencies... done!
> >>> Verifying ebuild manifests
> >>> Starting parallel fetch
> >>> Emerging (1 of 13) sys-kernel/linux-firmware-20120708
> >>> Emerging (2 of 13) media-libs/libpng-1.5.12
> >>> Jobs: 0 of 13 complete, 1 running Load avg: 1.07, 0.86, 0.88
>
> since linux-firmware is 15M to download, it stalls the other jobs?

Well, you could be looking at a case of bug 403895 there, which is fixed in portage-2.1.11.x.
slashbeast: I remember we talked about this a few months ago, is there another bug for this or is it just this one?
I remember discussing it, I think in regard to golang projects' dependencies taking ages to fetch back when GODEP was a thing, but I cannot find a bug that I've opened for it, so perhaps I never created one. This seems valid though.
Also, can we rename parallel-fetch to background-fetch? OP makes a good point that anyone seeing this option is going to think it relates to jobs.

I now have multiple packages with >100 dependencies to download (blame go, rust, and node stuff), and most are only a few hundred KB; each one takes a few seconds to communicate with the mirrors. It adds up to many minutes.

A true parallel fetch with multiple fetch jobs at a time would greatly reduce this. dev-vcs/repo seems to default to 4 for fetching git repos (github doesn't seem to like it when going much higher, but 4 has been bulletproof). For HTTPS mirror fetching we could probably safely go even higher...
(In reply to Joe Kappus from comment #8)
> Also, can we rename parallel-fetch to background-fetch? OP makes a good
> point that anyone seeing this option is going to think it relates to jobs.

Yeah, or maybe background-prefetch (the internals refer to the corresponding fetchers as prefetchers). It feels kind of crazy to rename it after it has existed for nearly 20 years now, so maybe we should just update the documentation to compare/contrast with the sort of parallel fetch that can happen with emerge --jobs.

> I now have multiple packages with >100 dependencies to download (blame go,
> rust, node stuff) and most are only a few hundred KB, each one takes a few
> seconds to communicate with the mirrors. It adds up to many minutes.
>
> A true parallel fetch with multiple fetch jobs at a time would greatly
> reduce this. dev-vcs/repo seems to default to 4 for fetching git repos
> (github doesn't seem to like it when going much higher, but 4 has been
> bulletproof). HTTPS mirror fetching we could probably safely go even
> higher...

I'm thinking about how we could handle the logging here. I suppose in this case we could simply send the fetch output to /dev/null (that's what parallel fetch originally did in https://gitweb.gentoo.org/proj/portage.git/commit/?id=0e5af163b1fe7cb5ec9101930ce0905713ed775b), then retry serially with logging for anything that failed.
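The quiet-pass-then-serial-retry idea can be sketched as follows; all names here (fetch_quiet, fetch_logged, quiet_then_serial) are hypothetical, and the fetchers are stubs rather than real downloads:

```python
import asyncio

# Sketch of the logging strategy above: first fetch everything
# concurrently with output discarded, then retry any failures one at
# a time with full logging. All names are illustrative.
def fetch_quiet(uri):
    # stand-in fetcher whose output would go to /dev/null;
    # pretend URIs containing "bad" fail
    return "bad" not in uri

def fetch_logged(uri, log):
    # serial retry path: this time we keep a usable log
    log.append("retrying %s" % uri)
    return fetch_quiet(uri)

async def quiet_then_serial(uris, jobs=4):
    sem = asyncio.Semaphore(jobs)
    failed = []

    async def worker(uri):
        async with sem:
            if not await asyncio.to_thread(fetch_quiet, uri):
                failed.append(uri)

    await asyncio.gather(*(worker(u) for u in uris))

    log = []
    for uri in failed:  # serial, logged retries
        fetch_logged(uri, log)
    return failed, log
```

The appeal of this shape is that the common case (everything fetches cleanly) never pays any logging cost, while failures still end up with a readable, non-interleaved log.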
For the prefetcher jobs, we can create and lock the same directories we would use for the corresponding builds, and log to $T/build.log as usual. We do something similar for pkg_pretend, and do not delete the log if it fails.
In https://github.com/gentoo/portage/pull/1361, related to bug 936273, I've converted the fetch function into an async_fetch coroutine function that could be modified to support concurrent fetching of multiple files. However, given the large number of files we could be dealing with here, it would be helpful to have a fetch job server similar to the cpu job server proposed in bug 692576. Note that for fetch jobs there is no equivalent to what --load-average provides for cpu jobs. Also, it would obviously be nice to avoid opening too many concurrent connections to the same server, though the fetch function already rotates through mirrors to balance load.
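To make the two limits concrete, here is a small sketch combining a global fetch-job cap with a per-host cap so no single mirror sees too many concurrent connections. The FetchLimiter class and its defaults are invented for illustration, not portage internals:

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

# Illustrative sketch (not portage API): a global fetch-job cap plus
# a per-host cap, so the total concurrency is bounded and no single
# mirror gets hammered with connections.
class FetchLimiter:
    def __init__(self, total_jobs=8, per_host=2):
        self.total = asyncio.Semaphore(total_jobs)
        self.hosts = defaultdict(lambda: asyncio.Semaphore(per_host))

    async def fetch(self, uri):
        host = urlparse(uri).netloc
        # hold both the global slot and this host's slot for the
        # duration of the transfer
        async with self.total, self.hosts[host]:
            await asyncio.sleep(0)  # placeholder for the real download
            return host
```

A real job server would share the global semaphore across emerge processes (e.g. via a pipe-based token protocol like make's jobserver), but the per-host bookkeeping could stay local to each fetcher as shown.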