606880 – sys-apps/portage: support for downloading multiple distfiles in one fetch command invocation

Status:	UNCONFIRMED

Alias:	None

Product:	Portage Development
Classification:	Unclassified
Component:	Enhancement/Feature Requests
Hardware:	AMD64 Linux

Importance:	Normal enhancement
Assignee:	Portage team

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	377365

Reported:	2017-01-23 08:09 UTC by Christopher Head
Modified:	2017-09-03 19:53 UTC
CC List:	1 user

See Also:
Package list:
Runtime testing required:	---


Description Christopher Head 2017-01-23 08:09:06 UTC

At present, Portage invokes $FETCHCOMMAND once for each distfile it needs to download. For the most part, this is no big deal. However, for something like Texlive, where a handful of ebuilds results in downloading roughly 2,200 fairly small distfiles, I think better performance could be had if the fetch command could be invoked once and passed many URLs, letting it amortize DNS resolution and TCP connection setup and slow start costs across many distfiles (especially considering that, for most users, every single distfile will come from the same server).
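To make the idea concrete, here is a minimal Python sketch of a fetch helper that takes many URLs in one invocation and reuses one HTTP connection per host. It is not Portage code and not an existing FETCHCOMMAND; the names `group_by_host` and `fetch_all` are made up for illustration, and it skips retries, redirects, resume, mirror fallback and checksum verification.

```python
import http.client
import os
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Group URL paths by (scheme, host) so each group can share a connection."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        groups[(parts.scheme, parts.netloc)].append(parts.path or "/")
    return dict(groups)


def fetch_all(urls, distdir="."):
    """Fetch every URL, opening one keep-alive connection per host.

    Hypothetical helper: one DNS lookup and one TCP (or TLS) handshake
    per host instead of one per distfile.
    """
    for (scheme, host), paths in group_by_host(urls).items():
        conn_cls = (http.client.HTTPSConnection if scheme == "https"
                    else http.client.HTTPConnection)
        conn = conn_cls(host, timeout=30)
        try:
            for path in paths:
                conn.request("GET", path)
                resp = conn.getresponse()
                # The body must be read fully before the connection is reused.
                body = resp.read()
                if resp.status == 200:
                    name = path.rsplit("/", 1)[-1]
                    with open(os.path.join(distdir, name), "wb") as out:
                        out.write(body)
        finally:
            conn.close()
```

With roughly 2,200 distfiles on one mirror, `fetch_all` would be called once with the whole URL list, and all requests would go over the same connection as long as the server keeps it open.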

Reproducible: Always

Comment 1 Bruno Henc 2017-07-18 01:38:36 UTC

While this would be a nice feature to have, if you're even remotely proficient at shell programming you should be able to write a script that downloads, in parallel, the distfiles for all the ebuilds you want.

Without any benchmarks backing this up, I really don't think the overhead is that great. For the most part, you're likely to saturate the network connection with a single fetch command.

I'm going to also have to include a quote here:

The Eight Fallacies of Distributed Computing

Essentially everyone, when they first build a distributed application, makes the following eight assumptions. All prove to be false in the long run and all cause big trouble and painful learning experiences. — Peter Deutsch

#0 The network is reliable
#1 Latency is zero
#2 Bandwidth is infinite
#3 The network is secure
#4 Topology doesn’t change
#5 There is one administrator
#6 Transport cost is zero
#7 The network is homogeneous

So I'm arguing that fallacy #2 applies in most cases: bandwidth is not infinite, and you're far more likely to saturate the network connection with a single fetch command. Remember, just because you're on a 1Gbps uplink doesn't mean the Gentoo mirror is, and I'm pretty sure it won't allow you to saturate its whole available bandwidth.

And as a consequence of the above bandwidth restriction, you're worse off if you start reusing DNS resolution: if the Gentoo mirror gets saturated, you'll get worse performance than if you run the fetch command one by one.

#4 will also bite you: if you issue the fetch command once, and halfway through handling all the passed URLs a shark starts nibbling on an undersea cable, you're effectively generating not one, but a dozen Error 404s.

Can you guarantee that all the switches connecting the client to the distfile server will perform the same when you throw 1 connection vs. 1100 connections at them? Slow start might be there for a reason. And just because one connection got routed one way doesn't mean it will be routed the same way next time.

Having mirrored the whole portage tree with the following script (written by me), I see no reason why one couldn't do the same for the texlive ebuilds if you're in a hurry.
https://github.com/antematherian/portage-mirror-distfiles-script

I see no reason why one would complicate the portage codebase even more if the same job can be done by a few lines of code. The sentiment behind the bug report, however, is valid: if you're in a hurry and have the bandwidth, you should probably fetch in parallel. But then again, if you have that kind of bandwidth, maybe creating a mirror and using a network filesystem would be a good idea.

Anyway, just my 2 cents.

Comment 2 Robin Johnson archtester 2017-09-03 00:39:11 UTC

@bruno:
The OP said nothing about doing it in parallel. They only wanted to reduce overhead duplication. Even if the mirror was very close by, doing exec() 2200 times and opening a new connection each time is a lot more work than a single exec with pipelined keep-alive HTTP.

The problem that I do see is that portage already makes a lot of decisions, such as trying multiple locations for a given file, and that logic is important.

I do support handing the fetch information down to another program. This was the subject of a GSoC project that I mentored several years ago, but the approach taken there was that FETCHCOMMAND was hooked up to a hook command that passed the work to a daemon and blocked until the daemon completed the fetch.

The best option would be output like:
for each distfile:
  (destination filename), (list of one or more URLs to try)
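
As an illustration of that output, here is a small Python sketch of one possible encoding: one line per distfile, with the destination filename first and the candidate URLs after it, separated by spaces. The exact format is an assumption made for this example (the comment above does not specify one), and it only works if filenames and URLs contain no whitespace. The function names are hypothetical.

```python
def format_fetch_list(distfiles):
    """Serialize {filename: [url, ...]} as one line per distfile.

    Assumed line format: "<filename> <url1> [<url2> ...]".
    """
    lines = []
    for filename, urls in distfiles.items():
        if not urls:
            raise ValueError(f"no URLs for {filename}")
        lines.append(" ".join([filename, *urls]))
    return "\n".join(lines) + "\n"


def parse_fetch_list(text):
    """Parse the format above back into {filename: [url, ...]}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        filename, *urls = line.split()
        result[filename] = urls
    return result
```

A fetch program reading this list could then apply its own connection reuse while still honouring the URL order that portage chose for each file.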

The second-best option would be for portage to generate the first-attempt filename+URL pairs for each distfile, and then make a second pass over whatever failed.
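
A rough Python sketch of that pass-based approach follows. `fetch_batch` is a hypothetical callable standing in for a single fetch command invocation: it receives a list of (filename, URL) pairs and returns the set of filenames it fetched successfully. The sketch repeats passes until every file has succeeded or run out of URLs, which goes slightly beyond the two passes described above.

```python
def fetch_in_passes(distfiles, fetch_batch):
    """Fetch distfiles in batched passes, one URL attempt per file per pass.

    distfiles:   {filename: [url, ...]} in preference order.
    fetch_batch: hypothetical callable; takes [(filename, url), ...] and
                 returns the set of filenames fetched successfully.
    Returns the set of filenames that could not be fetched from any URL.
    """
    pending = {name: list(urls) for name, urls in distfiles.items()}
    failed = set()
    while pending:
        batch = []
        for name in list(pending):
            if pending[name]:
                # Next untried URL for this file.
                batch.append((name, pending[name].pop(0)))
            else:
                # Every URL has been tried without success.
                failed.add(name)
                del pending[name]
        if not batch:
            break
        done = fetch_batch(batch)
        for name, _url in batch:
            if name in done:
                del pending[name]
    return failed
```

In the common case where the first mirror has everything, this results in exactly one invocation of the fetch command.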

Comment 3 Christopher Head 2017-09-03 19:43:54 UTC

Right, as Robin suggested, this was never about parallelizing fetch. To answer your points, Bruno:

I generally saturate my local connection with one fetch, so parallelization isn’t particularly helpful. That’s not what I wanted anyway. Realistically, I’m not going to make a dent in the mirror’s bandwidth, so reusing resolution won’t hurt my performance and downloading everything from one mirror is just fine. Anyway, since Portage AFAICT just grabs every file from the first entry in $GENTOO_MIRRORS and only uses the others as backups, it wouldn’t make any difference.

I guess a reasonable solution might be a fetch tool that dæmonizes itself in order to hang onto a keepalive connection, which wouldn’t need implementing inside Portage itself.