Bug 833567 - [Future EAPI] src_fetch_extra phase that runs after src_unpack
Summary: [Future EAPI] src_fetch_extra phase that runs after src_unpack
Status: CONFIRMED
Alias: None
Product: Gentoo Hosted Projects
Classification: Unclassified
Component: PMS/EAPI
Hardware: All All
Importance: Normal normal
Assignee: PMS/EAPI
URL: https://archives.gentoo.org/gentoo-pr...
Whiteboard:
Keywords:
Depends on:
Blocks: future-eapi
Reported: 2022-02-17 20:57 UTC by Zac Medico
Modified: 2023-02-17 12:00 UTC

See Also:
Package list:
Runtime testing required: ---


Description Zac Medico gentoo-dev 2022-02-17 20:57:48 UTC
For golang ebuilds (perhaps java and rust as well) it would be useful to have a src_fetch_extra phase that runs after src_unpack, so that the ebuild can fetch dependencies which may be so numerous that they would bloat the Manifest too much if fetched via SRC_URI. It would be the responsibility of the ebuild to ensure that files are fetched and verified via a secure mechanism.
Comment 1 Ulrich Müller gentoo-dev 2022-02-17 21:11:46 UTC

*** This bug has been marked as a duplicate of bug 481434 ***
Comment 2 Zac Medico gentoo-dev 2022-02-17 21:31:00 UTC
This is different from bug 481434, because src_fetch_extra needs to run *after* src_unpack has unpacked the regular sources. This means that the sources fetched via SRC_URI can provide verification data for use in src_fetch_extra.
Comment 3 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2022-02-17 21:45:59 UTC
This would effectively make them live ebuilds, and therefore lose keywords.  I don't see how that would really be different than the current approach of using src_unpack() to do the extra fetching.
Comment 4 Ulrich Müller gentoo-dev 2022-02-17 21:46:27 UTC
(In reply to Zac Medico from comment #2)

*shrug* It is largely the same problem (namely, an extra phase for fetching sources for a live ebuild), and in bug 481434 we already have some discussion about it.
Comment 5 Zac Medico gentoo-dev 2022-02-17 21:57:02 UTC
(In reply to Michał Górny from comment #3)
> This would effectively make them live ebuilds, and therefore lose keywords.

Not necessarily, in cases where src_fetch_extra provides reproducible results (verified by go.sum for example).
 
> I don't see how that would really be different than the current approach of
> using src_unpack() to do the extra fetching.

True, it is very close.

(In reply to Ulrich Müller from comment #4)
> (In reply to Zac Medico from comment #2)
> 
> *shrug* It is largely the same problem (namely, an extra phase for fetching
> sources for a live ebuild), and in bug 481434 we already have some
> discussion about it.

The problem that I'm trying to solve is distinctly different from a live ebuild though, since src_fetch_extra aims for reproducible results.
Comment 6 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2022-02-17 22:37:23 UTC
(In reply to Zac Medico from comment #5)
> (In reply to Michał Górny from comment #3)
> > This would effectively make them live ebuilds, and therefore lose keywords.
> 
> Not necessarily, in cases where src_fetch_extra provides reproducible
> results (verified by go.sum for example).

"Non-reproducible results" are only one of the many problems with live ebuilds.  But even if it was the only one, how would you ensure that all ebuilds are using it correctly?
Comment 7 Zac Medico gentoo-dev 2022-02-17 23:18:37 UTC
(In reply to Michał Górny from comment #6)
> "Non-reproducible results" are only one of the many problems with live
> ebuilds.  But even if it was the only one, how would you ensure that all
> ebuilds are using it correctly?

It would require careful review. However, the benefit is that it makes it practical for us to package a class of software that we would otherwise be practically unable to package.
Comment 8 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2022-02-18 08:26:38 UTC
How would you ensure that the review happens?  Who would be responsible for it?  What would be the criteria for accepting/rejecting it?

What about users on a data plan who have "free" local mirror, yet a random Go package eats all of their data plan?
Comment 9 Zac Medico gentoo-dev 2022-02-18 18:33:53 UTC
(In reply to Michał Górny from comment #8)
> How would you ensure that the review happens?  Who would be responsible for
> it?  What would be the criteria for accepting/rejecting it?

The ebuild maintainer would be responsible. They can request review on the gentoo-dev mailing list. The criteria for accepting would be that the results are reproducible. If the results are not reproducible then the ebuild should use PROPERTIES=live instead.

> What about users on a data plan who have "free" local mirror, yet a random
> Go package eats all of their data plan?

We can use a PROPERTIES="fetch-extra" value to tag these ebuilds, and users can use ACCEPT_PROPERTIES="-fetch-extra" to mask these ebuilds.
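
To illustrate the masking idea, here is a minimal Python sketch of how a PROPERTIES token could be checked against ACCEPT_PROPERTIES. It is a simplified model and not Portage's actual implementation; the function name `is_accepted` is made up for this example, and "fetch-extra" is only the value proposed in this comment, not an existing property. It assumes ACCEPT_PROPERTIES is an incremental list where "*" accepts everything and "-token" removes a token.

```python
def is_accepted(ebuild_properties, accept_properties):
    """Return True if every PROPERTIES token is accepted.

    Simplified model (an assumption, not Portage code): tokens are
    processed left to right, "*" accepts everything, "-*" resets the
    list, and "-token" rejects a single token.
    """
    accept_all = False
    allowed = set()
    denied = set()
    for token in accept_properties.split():
        if token == "*":
            accept_all = True
            denied.clear()
        elif token == "-*":
            accept_all = False
            allowed.clear()
            denied.clear()
        elif token.startswith("-"):
            allowed.discard(token[1:])
            denied.add(token[1:])
        else:
            allowed.add(token)
            denied.discard(token)

    for prop in ebuild_properties.split():
        if prop in denied:
            return False
        if not accept_all and prop not in allowed:
            return False
    return True


if __name__ == "__main__":
    # A user who accepts everything except the hypothetical fetch-extra.
    user_setting = "* -fetch-extra"
    print(is_accepted("fetch-extra", user_setting))
    print(is_accepted("live", user_setting))
```

With the setting "* -fetch-extra", an ebuild tagged with the hypothetical "fetch-extra" value would be masked for that user, while other ebuilds would be unaffected.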
Comment 10 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2022-02-18 18:56:26 UTC
(In reply to Zac Medico from comment #9)
> (In reply to Michał Górny from comment #8)
> > How would you ensure that the review happens?  Who would be responsible for
> > it?  What would be the criteria for accepting/rejecting it?
> 
> The ebuild maintainer would be responsible. They can request review on the
> gentoo-dev mailing list. The criteria for accepting would be that the
> results are reproducible. If the results are not reproducible then the
> ebuild should use PROPERTIES=live instead.
> 
> > What about users on a data plan who have "free" local mirror, yet a random
> > Go package eats all of their data plan?
> 
> We can use a PROPERTIES="fetch-extra" value to tag these ebuilds, and users
> can use ACCEPT_PROPERTIES="-fetch-extra" to mask these ebuilds.

While I don't love us being in this situation, this is what I was thinking about last night, and I don't think it's crazy, even if not ideal.
Comment 11 William Hubbs gentoo-dev 2022-02-18 19:03:03 UTC
I think it is not quite correct to call these live ebuilds.

I can't speak for other languages, but I can say that for go, the only
time these ebuilds would be live is if "direct" appears in the value of
the GOPROXY environment variable [1], and that can be controlled or
checked for in the eclass.

[1] https://go.dev/ref/mod#environment-variables
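
A check like the one described above could be sketched as follows. This is a hedged Python illustration only (an eclass would do this in bash), and the function name `goproxy_allows_direct` is made up. It assumes GOPROXY is a list of proxy URLs or the keywords "direct" and "off", separated by commas or pipes, as described in the Go modules reference.

```python
import re


def goproxy_allows_direct(goproxy):
    """Return True if the keyword "direct" is one of the GOPROXY entries.

    Entries are separated by "," or "|". Only an entry that is exactly
    "direct" counts; a URL that merely contains the word does not.
    """
    entries = [entry.strip() for entry in re.split(r"[,|]", goproxy)]
    return "direct" in entries


if __name__ == "__main__":
    for value in ("https://proxy.golang.org,direct", "https://proxy.golang.org"):
        print(value, "->", goproxy_allows_direct(value))
```

An eclass could either refuse to build when such a check returns true, or force a GOPROXY value without "direct" before calling the Go tooling.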
Comment 12 Ulrich Müller gentoo-dev 2022-02-18 20:41:46 UTC
They would access the network and fetch files during the build phase, which is the definition of a live ebuild.

Which means that they cannot have any keywords.
Comment 13 Ulrich Müller gentoo-dev 2022-02-18 20:45:30 UTC
(In reply to Zac Medico from comment #9)
> We can use a PROPERTIES="fetch-extra" value to tag these ebuilds, and users
> can use ACCEPT_PROPERTIES="-fetch-extra" to mask these ebuilds.

But that would be double-masking them?
Comment 14 Patrick McLean gentoo-dev 2022-02-18 21:22:19 UTC
(In reply to Ulrich Müller from comment #12)
> They would access the network and fetch files during the build phase, which
> is the definition of a live ebuild.
> 
> Which means that they cannot have any keywords.

So are you suggesting that we should have no keywords for Kubernetes or Docker? It seems to me that would not be very friendly to our users. I think adding PROPERTIES="fetch-extra", and allowing users to mask on that (though probably not by default) would be the best approach.

I don't think we should mask a huge number of popular packages by default because of some philosophical objections to how the upstream ecosystem works.

To be clear: I agree that this is a terrible situation, and I truly despise how these upstream ecosystems have decided to do things. However, I would like to think we accept that this is the world we live in, and as such we have to make things that work within this world, and do not make things harder than necessary for our users.
Comment 15 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2022-02-18 22:05:14 UTC
I'd really appreciate it if you avoided trying to take users hostage in order to push your own ideas forward.  You know as well as we do that this is not the only possible solution, and repackaging is the solution that works now and isn't as bad as what you're proposing.

There are numerous issues with the proposed solution, and the way it's proposed means that these issues are going to catch unaware users in the face (or in the pocket).  Really, pushing a news item saying "sorry, we're introducing a new horrible solution that defeats the principle Gentoo followed for years. if you disagree, do this and you'll lose a bunch of packages because we won't support sane packaging anymore" is not a solution.

However hard you try to turn it around, an ebuild that circumvents the standard fetching procedures and does random network activity behind the user's back is a live ebuild and always will be a live ebuild.  Therefore, it can't have keywords and this proposal adds nothing to what we already have as src_unpack() w/ PROPERTIES=live.  That said, it's even worse than src_fetch() that at least improves how live ebuilds are fetched.

So I think we should reject this outright as 1) it doesn't solve the problem at hand, 2) it provides no advantage over how src_unpack() works today.  I don't see any compelling advantage to arbitrarily split src_unpack() in two when it's going to be used by ~112 live packages, and I'm pretty sure it's going to make no real difference to end users whether it's just src_unpack() or src_unpack() followed by src_fetch_extra() (the order vs naming is also confusing BTW).

Do you really need me to repeat why giving ebuilds random Internet access is wrong?  At least some of the arguments can be found in devmanual [1].  Off the top of my head:

1. This is hard to get right, and to review properly.  When done wrong, it is a guaranteed security issue.  When done right, it still increases the attack surface by adding additional local verification methods.

2. The users have no way of knowing up front how much data is going to be fetched.  The best we can do is say "these packages will probably fetch lots of data.  They may eat your data plan or fill up your disk".

3. They need custom support for local mirrors and FETCHCOMMAND, and then users have to do custom setup to get their local mirrors.  Most likely separately for every user (you can't assume it's just going to be Go).

4. They need custom support for caching, and users will probably need to manually clean all the extra caches.  If they don't, they're going to fetch a lot more data than necessary.

5. How is this going to work with unstable Internet connections?  Resuming/retrying distfile fetching is something that is easily done.  Again, the ebuilds will have to reinvent that or people will have their builds failing, then they will have to start over and fail again...

6. What if the remote server is temporarily unavailable or goes completely dead?  Without Gentoo and local mirrors users won't be able to install that stuff anymore.

7. What about users with a data plan?  We're literally talking about ebuilds that're going to fetch some unknown amounts of data without any way for the users to know up front.

8. Finally, this is a huge privacy problem.  Again, we're talking of allowing ebuilds to access the Internet in arbitrary ways, with no real control of what is sent or where.

Yes, all these issues apply to live ebuilds already.  These are part of the reason why they don't have keywords, i.e. don't permit users to install them by default.  This is also the reply to "why ebuilds pinned to EGIT_COMMIT are live ebuilds too".

[1] https://devmanual.gentoo.org/ebuild-writing/functions/src_test/#tests-that-require-network-or-service-access
Comment 16 William Hubbs gentoo-dev 2022-02-19 01:06:48 UTC
@chutzpah:
Actually there is nothing to do for kubernetes or docker with regard to
this issue. They both vendor their dependencies, and they are fine as
long as they do that.

The difference in the go world is with software like cosign or spire,
which do not vendor their dependencies.

Currently, I have cosign packaged using a vendor tarball,  and spire
packaged the way the eclass supports it. Look at the difference between
the ebuilds and manifests of these two packages.

@ulm:
My understanding of live ebuilds is that they fetch from version control systems.
In the Go world, as I cited above, whether or not this happens with
dependencies can be controlled with an environment variable which I could
force or check in the eclass. The Go proxies, like https://proxy.golang.org,
are not version control systems. You download immutable artifacts from them, so
they are more like mirrors than anything.

@mgorny:
Actually, for Go packages anyway, you are wrong about the current way of
doing this. Look at go-module.eclass.  We don't repackage anything. We
add the contents of go.sum to SRC_URI when upstream doesn't
vendor their dependencies. This allows the dependencies to be mirrored
on the Gentoo mirrors.
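
As a rough illustration of what turning go.sum into fetchable URLs involves, here is a hedged Python sketch. It is not the eclass code (the eclass is bash and also handles distfile naming); the function names are made up, and the proxy URL layout and the "!" escaping of uppercase letters are taken from the Go module proxy protocol.

```python
def escape_path(path):
    """Escape uppercase letters as "!" + lowercase, as the proxy protocol does."""
    return "".join("!" + c.lower() if c.isupper() else c for c in path)


def go_sum_to_urls(go_sum_text, proxy="https://proxy.golang.org"):
    """Map each go.sum line to the proxy URL of the file it describes.

    A line "module version h1:..." refers to the module zip, and a line
    "module version/go.mod h1:..." refers to the module's go.mod file.
    """
    urls = []
    for line in go_sum_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        module, version, _checksum = fields
        if version.endswith("/go.mod"):
            version = version[: -len("/go.mod")]
            suffix = ".mod"
        else:
            suffix = ".zip"
        urls.append(
            f"{proxy}/{escape_path(module)}/@v/{escape_path(version)}{suffix}"
        )
    return urls


if __name__ == "__main__":
    # The checksums below are placeholders, not real go.sum hashes.
    sample = (
        "github.com/BurntSushi/toml v0.3.1 h1:PLACEHOLDER=\n"
        "github.com/BurntSushi/toml v0.3.1/go.mod h1:PLACEHOLDER=\n"
    )
    for url in go_sum_to_urls(sample):
        print(url)
```

Every go.sum line becomes one URL, so a project with a few thousand go.sum lines ends up with a few thousand SRC_URI and Manifest entries.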

One downside is this causes the multi-thousand line ebuilds and manifests
you see in the tree. 
Take a look at net-vpn/tailscale or app-misc/spire for examples of how
big these are. There's nothing I can do about this.

The more pressing concern is that ebuilds like this can cause SRC_URI to
have so much data that portage crashes.

The other way I've thought about doing this is, for example, what I did
with app-containers/cosign. I created a vendor tarball, which is very
simple to do with the go tooling. To make this the official way, I would
have to change the eclass slightly, but vendor tarballs are concerning to
the infra team because of the amount of duplicate data that will be in the
vendor tarballs from different packages or different versions of
packages.

Currently there are three choices in the go world when an upstream author
doesn't vendor their dependencies.

1) use EGO_SUM and hope it works and doesn't crash portage.
2) use a vendor tarball (this option must be used if 1) blows up even
  though it is unsupported and concerning to the infra team)
3) ask upstream to start vendoring  (I have seen upstreams actively
  delete the vendor folder, so I suspect this would be a hard sell, so I haven't
  bothered).

  The reason src_fetch_extra is being proposed is to allow a pm friendly
  way of handling this kind of scenario. In the go world I could just use
  the third party tooling to verify the dependencies.

I'm not particularly a fan of what these upstreams are doing, I'm just
attempting to be pragmatic. Something needs to happen, because as said above,
this is the world we live in now. Rust does a
similar thing with crates, and nodejs also does online downloading of
dependencies when it builds.

You can say that we have followed x policy for years and I agree;
however, there are times when that policy has to change, and I see this
as one of those times. The current ways of handling this DO NOT WORK
WELL, and that is why this bug is open in the first place.
Comment 17 Ulrich Müller gentoo-dev 2022-02-19 16:14:46 UTC
How do other distros handle this situation? For example, what does Debian do for its source packages?
Comment 18 William Hubbs gentoo-dev 2022-02-20 19:18:06 UTC
The only thing I've found so far is archlinux's method of handling this
type of package (their cosign build script),
and they don't appear to do anything special to handle the dependencies.
They just do a normal upstream go build.

I will keep looking for any info from other distros.

https://github.com/archlinux/svntogit-community/tree/packages/cosign/trunk
Comment 19 Zac Medico gentoo-dev 2022-02-21 01:01:05 UTC
Here's an arch linux example for nerdctl. It calls `go mod download` in the prepare function:


https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=nerdctl
Comment 20 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2022-02-21 22:59:00 UTC
(In reply to William Hubbs from comment #16)
> You can say that we have followed x policy for years and I agree;
> however, there are times when that policy has to change, and I see this
> as one of those times. The current ways of handling this DO NOT WORK
> WELL, and that is why this bug is open in the first place.

This policy has very good reasons for its existence, and you haven't answered any of my concerns.  Your proposed solution is just horrible.  Shouting doesn't make it any better.
Comment 21 12101111 2022-02-28 03:16:22 UTC
(In reply to Ulrich Müller from comment #17)
> How do other distros handle this situation? For example, what does Debian do
> for its source packages?

Debian does package golang and rust source code

https://packages.debian.org/sid/all/golang-golang-x-mod-dev/filelist
https://packages.debian.org/sid/amd64/librust-adler32-dev/filelist

They use a tool called dh-make-golang[1] to generate the build scripts for golang libraries and binaries. They have a policy on how to package golang libraries[2][3].

[1] https://github.com/Debian/dh-make-golang
[2] https://go-team.pages.debian.net/packaging.html#_using_dh_make_golang
[3] https://people.debian.org/~stapelberg/2015/07/27/dh-make-golang.html

They have a similar policy and tool for rust.

https://wiki.debian.org/Teams/RustPackaging/Policy
https://crates.io/crates/debcargo

See also: https://wiki.gentoo.org/wiki/Project:Rust/rust-dev
Comment 22 William Hubbs gentoo-dev 2022-03-04 04:42:16 UTC
I looked at Debian's system, and it looks like they are still using
GOPATH, which is being replaced by go modules, so I'm not sure how
relevant their info is.

WRT the rust-dev page you linked, I'll respond with an article from
LWN about packaging go libraries separately.

https://lwn.net/Articles/835599/

I will look into it further, but initially I don't see anything from
Debian that will help with this.
Comment 23 James Le Cuirot gentoo-dev 2022-06-04 20:18:28 UTC
I hate to say it, but I'm with mgorny on this one. If repackaging can be done cheaply, this doesn't buy us anything. Presumably it's not difficult to script this up, then it only costs us disk space and bandwidth.

It won't help Java in the same way that it helps Rust and Go either. With the latter two, you generally fetch all the dependencies in source form and build them all into one final binary. With Maven, Gradle, or similar for Java, you fetch all the dependencies in binary form and they're either combined into one big jar or kept as separate jars. You can force it to fetch them in source form, but you'll end up downloading and building half the Internet, because there are no optional build time dependencies in Java. I've mentioned this plenty of times before, but I'm just pointing out that this feature won't make it any better.
Comment 24 amano.kenji 2023-01-26 12:11:55 UTC
I'm plagued by go-module.eclass because it forces me to find a host for dependency tarballs.

Whenever I upload one big-ass dependency tarball to git LFS, I have to re-upload everything else. This is not sustainable at all.

This is why I stopped making ebuilds for go programs. Every time I update a dependency tarball, I have to wipe out and re-upload everything in order to prevent the git hosting provider from banning my free account. Otherwise, the leftover git LFS files would grow too big for a free account.

The only sustainable solution was to stop packaging go programs.
Comment 25 amano.kenji 2023-01-26 12:29:47 UTC
Never mind.

I just read https://wiki.gentoo.org/wiki/Writing_go_Ebuilds which says

> For those who have access to a git forge such as GitLab, Gitea, GitHub, … create an empty repository, add a new tag for each new version (named ${P}) and upload the tarballs to these ”releases“. Make sure that the host of the forge allows this usage.

I just logged into codeberg.org and verified that I could create an empty repository for each go ebuild, create a tag for each version of ebuild, and attach a dependency tarball to each tag. This is far more sustainable than git LFS which I don't like at this point.
Comment 26 amano.kenji 2023-01-27 02:57:56 UTC
Why don't we handle EGO_SUM in a file outside ebuild?

Manifest may become bloated, but portage should be able to handle a bigger manifest since it's written in python. Python can handle it.

If bash is the bottleneck, prevent it from handling EGO_SUM.
Comment 27 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2023-01-27 09:01:49 UTC
(In reply to amano.kenji from comment #26)
> Why don't we handle EGO_SUM in a file outside ebuild?
> 
> Manifest may become bloated, but portage should be able to handle a bigger
> manifest since it's written in python. Python can handle it.
> 
> If bash is the bottleneck, prevent it from handling EGO_SUM.

I see we have a volunteer to make a good proposal, prepare the spec update and patches for all package managers.
Comment 28 amano.kenji 2023-01-27 10:56:49 UTC
My understanding of this issue is not deep.

The impression I got is that EGO_SUM should not be in ebuild for bash or some other reason.

If we only need to put EGO_SUM in a different file than ebuild, then go-module.eclass can read external files, or go-module.eclass can call a helper to generate SRC_URI dynamically from the external file.

Did I misunderstand the issue?
Comment 29 amano.kenji 2023-01-28 11:17:20 UTC
I think there is another way without either EGO_SUM or dependency tarballs.

A go-module.eclass helper can be written in a sane programming language like raku.

In my experience, the raku package system integrates very well with POSIX operating system packages. Janet language is also good, but its packaging system is not very well thought out. It makes sense to use the programming language with the most advanced packaging system for packaging helpers.

The helper would parse go.sum and generate SRC_URI dynamically. Thus, EGO_SUM is not necessary. It can also parse go.sum and extract the dependencies. Thus, go-module.eclass doesn't have to deal with extracting dependencies into vendor directory.

go-module.eclass would only deal with setting vendor directory and calling the helper.

This still leads to bigger manifests, but it avoids complexity in go-module.eclass.

Raku has grammar which is better than regex for parsing.
Comment 30 amano.kenji 2023-01-28 11:19:05 UTC
My overlay has raku.eclass which requires a modified ebuild for dev-lang/raku.

My overlay bootstrapped raku ecosystem.
Comment 31 William Hubbs gentoo-dev 2023-02-17 02:15:29 UTC
As I said on the other bug, that doesn't really solve the problem.
The problem isn't just Manifests, it is the size of SRC_URI.
SRC_URI can have too many entries which would cause the size of ${A} to
be too big to be an environment variable.
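
To make the size concern concrete, here is a rough Python sketch that estimates how large ${A} would be as a single environment string. The names and the limit are assumptions for illustration: 128 KiB is the per-string limit on Linux with 4 KiB pages (MAX_ARG_STRLEN), other platforms differ, and the file names in the example are invented.

```python
# Assumption: per-string limit on Linux with 4 KiB pages (MAX_ARG_STRLEN).
MAX_ENV_STRING_BYTES = 128 * 1024


def a_variable_size(distfiles):
    """Size in bytes of the string "A=<files separated by spaces>" plus NUL."""
    value = " ".join(distfiles)
    return len("A=") + len(value.encode("utf-8")) + 1


def fits_in_environment(distfiles, limit=MAX_ENV_STRING_BYTES):
    """Return True if ${A} would fit into a single environment string."""
    return a_variable_size(distfiles) <= limit


if __name__ == "__main__":
    # Invented example: 5000 distfiles with 60-character names.
    many_files = ["x" * 60 for _ in range(5000)]
    print(a_variable_size(many_files), fits_in_environment(many_files))
```

Moving the list out of the ebuild does not change this, because ${A} still has to hold every distfile name.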
Comment 32 amano.kenji 2023-02-17 12:00:55 UTC
GNU Guix developers are trying to add package parameters which are essentially USE flags.

Perhaps, GNU Guix may turn out to be a good alternative to gentoo in a few decades.

Since GNU Guix packages are written in guile scheme, they don't have to worry about the limitation of environment variable size.