Bug 881037 - github.com on-the-fly automatic artifacts / archives not guaranteed to be deterministic, shouldn't be used
Status: UNCONFIRMED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: Gentoo Quality Assurance Team
URL:
Whiteboard:
Keywords:
Depends on: 881053 881055
Blocks:
Reported: 2022-11-11 22:43 UTC by cJ
Modified: 2023-01-30 21:43 UTC

See Also:
Package list:
Runtime testing required: ---


Description cJ 2022-11-11 22:43:48 UTC
I'm creating this issue here as I didn't see another one I could link to.

GitHub provides source archives and many repositories use them, but their users often don't know that these archives are generated on the fly, so the same URL may yield different files over time.
In our case, this shows up as size or checksum mismatches.

Over time, I have hit such issues repeatedly. They don't show up with the main Gentoo repository, because the Gentoo mirrors are tried first, but they do happen with overlays.

But I'm not alone:

- https://github.com/keybase/client/issues/10800#issuecomment-375831096 quotes a reply from support@github.com:

> Currently, we don't make any guarantees about the byte-for-byte equivalence of any tarball which is generated on the fly.

> If a team wants to produce a stable tarball, they will have to create it themselves and put it as a download in the releases.

> We realize that this approach can be confusing since we put links to the on-the-fly tarballs right where the user-provided ones would exist. Our team is aware of this and will keep it in mind for future iterations of the feature, though we can't make any promises of specific changes.

and a blurb from a GitHub'er:

> These are not planned changes but rather they come about from updating
> the software involved in creating them. The main purpose of the auto-
> generated archives are for someone to download the source from the
> website if they don't want to bother with downloading the repository.

> It is not meant to be reliable or a way to distribute software releases
> and nothing in the software stack is made to try to produce consistent
> archives. This is no different from creating a tarball locally and
> trying verify it with the hash of the tarball someone created on their
> own machine.

> The only way to get a known-good checksum for a tarball is to have
> upstream (or the packagers) prepare the release and upload the tarball
> alongside its checksum. This is true regardless of GitHub. There is a
> feature on the site where maintainers can upload their own assets for a
> release though clearly not too many people actually use it.

- https://github.com/easybuilders/easybuild-easyconfigs/issues/5151

- https://github.com/pfsense/FreeBSD-ports/commit/3691e1aae77dc8d6b3c65fa597ffc833cd5e2973
- https://github.com/comby-tools/comby/issues/328 <- occurrence from 2021-12

To be safe, every upstream maintainer should create release artifacts manually and upload them to GitHub.
There is documentation on how to generate reproducible tarballs at
 https://reproducible-builds.org/docs/archives/
Another way is to download the automatic artifacts and re-upload them through the GitHub API, e.g. as I did in:
 https://github.com/neuropoly/distriploy/blob/820db8e43f16f9e74360e464f25bba777d8b4a68/distriploy/release_github.py#L43
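
For illustration, here is a minimal Python sketch of what "reproducible tarball" means in practice. It is not taken from the linked documentation or script; the function name `make_reproducible_tarball` and its parameters are made up for this example. It fixes the usual sources of variation (file order, timestamps, ownership, the gzip header timestamp). The compressed bytes can still differ between zlib versions, so this only shows the idea.

```python
import gzip
import io
import os
import tarfile


def make_reproducible_tarball(src_dir: str, prefix: str, mtime: int = 0) -> bytes:
    """Pack the regular files under src_dir into tar.gz bytes that do not
    depend on timestamps, ownership, or directory listing order."""

    def normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
        # Overwrite every field that varies between machines or runs.
        info.mtime = mtime
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        # Keep only the "executable or not" distinction of the permissions.
        info.mode = 0o755 if info.mode & 0o100 else 0o644
        return info

    # Collect files first so they can be added in sorted order.
    # Empty directories are skipped in this sketch.
    paths = []
    for root, _dirs, files in os.walk(src_dir):
        paths.extend(os.path.join(root, name) for name in files)

    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for path in sorted(paths):
            relative = os.path.relpath(path, src_dir).replace(os.sep, "/")
            tar.add(path, arcname=f"{prefix}/{relative}", recursive=False, filter=normalize)

    # An empty filename and a fixed mtime keep the gzip header constant.
    gz_buffer = io.BytesIO()
    with gzip.GzipFile(filename="", fileobj=gz_buffer, mode="wb", mtime=mtime) as gz:
        gz.write(tar_buffer.getvalue())
    return gz_buffer.getvalue()
```

Calling it twice on the same tree, even after touching the files, should return identical bytes, which is the property a checksum in a Manifest relies on.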

The potentially affected packages are those whose SRC_URI matches one of these globs:
- github.com*archive/refs/tags/v${PV}.tar.gz
- github.com*archive/v${PV}.tar.gz
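
A rough Python sketch of such a check, applied to already-expanded URIs, could look like the following. The function name and the example URLs are hypothetical. It assumes that paths of the form `/<owner>/<repo>/archive/...` are the on-the-fly archives, while `/<owner>/<repo>/releases/download/...` are assets uploaded by upstream.

```python
from urllib.parse import urlparse


def is_auto_generated_archive(uri: str) -> bool:
    """Return True if uri looks like a github.com on-the-fly source archive."""
    parsed = urlparse(uri)
    if parsed.netloc.lower() != "github.com":
        return False
    parts = parsed.path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/archive/[refs/tags/]<ref>.tar.gz
    return (
        len(parts) >= 4
        and parts[2] == "archive"
        and parts[-1].endswith(".tar.gz")
    )


# Hypothetical URIs, for illustration only.
print(is_auto_generated_archive(
    "https://github.com/example/project/archive/refs/tags/v1.2.3.tar.gz"))
print(is_auto_generated_archive(
    "https://github.com/example/project/releases/download/v1.2.3/project-1.2.3.tar.gz"))
```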

In some cases, upstream has published release tarballs manually, but the Gentoo ebuilds still refer to the automatic artifact, e.g. app-admin/doctl.

Of course, another way, which would be less work for distros and less wasted storage for github, would be for them to find a deterministic way to generate these archives...
Comment 1 Ionen Wolkens gentoo-dev 2022-11-12 01:20:15 UTC
Even if it's not guaranteed, this has hardly ever been an issue, and (as you say) we do have mirrors that copy the file and keep that frozen-in-time copy, so there is little sense in replicating what the mirrors do by mirroring it manually first. We also already prioritize proper release tarballs when they exist (unless they have a problem, like missing files we need).

We don't have control over what overlays do, so I'm unsure what you want us to do here?
Comment 2 Ionen Wolkens gentoo-dev 2022-11-12 01:22:45 UTC
(In reply to Ionen Wolkens from comment #1)
> And we already prioritize proper release tarballs when they exist
> (unless they have a problem, like missing files we need).
On that note, feel free to file bugs if an ebuild should use one but doesn't.
Comment 3 cJ 2022-11-12 07:21:47 UTC
ionen: yeah, this is rather low priority for ::gentoo thanks to the mirrors, but as you say, ebuilds could be scanned for existing stable release artifacts that give reproducible downloads (sam mentioned https://github.com/pkgcore/pkgcheck/issues/473).

Here's my quick-and-dirty script that recomputes the Manifests, along with the (partial) results of an overnight run on ::gentoo:
 https://gist.github.com/zougloub/7fdea04c66e856fcac1000c398d795e1
At least with this, overlays can be scanned.
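
The core of such a scan can be sketched in a few lines of Python. This is not the linked gist, only an illustration. It assumes Manifest entries of the form `DIST <name> <size> BLAKE2B <hex> SHA512 <hex>` and that BLAKE2B means the 512-bit variant. The function names are hypothetical, and fetching the distfile is left out.

```python
import hashlib

# Manifest hash names mapped to hashlib constructors (assumed mapping).
HASHERS = {"BLAKE2B": hashlib.blake2b, "SHA512": hashlib.sha512}


def parse_dist_line(line: str):
    """Split a Manifest DIST line into (name, size, {hash_name: hexdigest})."""
    fields = line.split()
    if len(fields) < 5 or fields[0] != "DIST":
        raise ValueError(f"not a DIST entry: {line!r}")
    hashes = dict(zip(fields[3::2], fields[4::2]))
    return fields[1], int(fields[2]), hashes


def find_mismatches(data: bytes, size: int, hashes: dict) -> list:
    """Return the names of the checks that a freshly fetched distfile fails."""
    mismatches = []
    if len(data) != size:
        mismatches.append("size")
    for name, expected in hashes.items():
        hasher = HASHERS.get(name)
        # Unknown hash types are skipped rather than reported.
        if hasher is not None and hasher(data).hexdigest() != expected:
            mismatches.append(name)
    return mismatches
```

An empty list from `find_mismatches` means the re-fetched file still matches the Manifest entry. Anything else identifies a distfile worth a closer look.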
Comment 4 cJ 2022-11-12 07:32:27 UTC
Manually filed a case where an alternate download source could be used. This could be automated for others.
Comment 5 cJ 2022-11-12 11:50:19 UTC
Work-in-progress pertaining to this issue here:
 https://gitlab.com/cJ/gentoo-bug-881037-github-reproducible-downloads

Some of the checks could probably land into pkgcheck.
Comment 6 cJ 2022-11-12 23:49:45 UTC
Filed an issue with github, because it would be so much more elegant if they were the ones to fix the problem.
Comment 7 Eli Schwartz gentoo-dev 2022-11-13 00:24:31 UTC
I discussed this a bit at https://lists.reproducible-builds.org/pipermail/rb-general/2021-October/002422.html

The tl;dr is that this is not actually a real issue to my reasonably-certain knowledge, although I'd be interested in seeing credible proof consisting of before-and-after tarballs.


Github auto-generated tarballs are not "guaranteed" by github, because they are the result of running the git-archive program which github doesn't personally guarantee. Luckily, it doesn't matter because that's all on the git project.

...

As far as I can tell, the discussion here is basically about theoreticals? Lots of links to issues from 5+ years ago.

I'm only aware of a couple realistic sources of non-reproducible behavior, assuming you don't use a truly ancient version of git to generate them.
- unreproducible gzip (busybox gzip was "recently" fixed to be reproducible)
- renaming the github repository such as to capitalize or lowercase the repo name, as that is embedded in the base filepath
- obviously, re-tagging
- gitattributes export-subst can embed information from the git repository, and depending on the information and how you define it, that can be non-reproducible (for example, abbreviated commit hashes can grow longer as the repo grows, some methods of embedding the author/committer will respect a mailmap file)


Case 1 is solved, cases 2 and 3 are actually legitimate cases of upstream modification, and case 4 is a bug in upstream's export-subst handling.
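
To tell these cases apart when given a before-and-after pair, one could compare the two tarballs layer by layer. The Python sketch below is only an illustration under that assumption. The function names and category labels are made up here, and a renamed top-level directory (case 2) is reported as a content change.

```python
import gzip
import hashlib
import io
import tarfile


def member_digests(tar_bytes: bytes) -> dict:
    """Map each member name to a digest of its type, link target, and contents."""
    digests = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            h = hashlib.sha256()
            h.update(member.type + member.linkname.encode())
            if member.isfile():
                h.update(tar.extractfile(member).read())
            digests[member.name] = h.hexdigest()
    return digests


def classify_difference(old_tgz: bytes, new_tgz: bytes) -> str:
    """Say at which layer two .tar.gz files start to differ."""
    if old_tgz == new_tgz:
        return "identical"
    old_tar, new_tar = gzip.decompress(old_tgz), gzip.decompress(new_tgz)
    if old_tar == new_tar:
        return "compression-only"      # e.g. a different gzip implementation
    if member_digests(old_tar) == member_digests(new_tar):
        return "tar-metadata-only"     # e.g. timestamps or header layout
    return "content-changed"           # e.g. re-tagging, rename, export-subst
```

Only the last category means the source itself changed. The first two would be evidence of the archive generation being unstable.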

Are there other real issues which aren't about a commit from git.git dating back to 2013?


...

Granted, if upstream themselves *provide* hand-generated dist tarballs this is superior, for several reasons that mostly don't have to do with reproducibility -- they can use better-than-gzip compression, they can have generated files included, they can have non-useful files *excluded* -- but if upstream doesn't provide them and maybe doesn't see much point because they don't use autotools, why is that a distro problem?
Comment 8 cJ 2022-11-13 08:34:57 UTC
I'm working on some changes there:
 https://github.com/gentoo/gentoo/pull/28247
where there's a bunch of minor improvements to supply chain security (feel free to review).

I'll soon (as soon as I'm done re-fetching everything) include a commit with updated checksums corresponding to the new auto-generated archives that have changed, which also shouldn't hurt.

I think the argument that GitHub doesn't control git overlooks that they could serve the archives with something other than git-archive (and maybe they do), or that they could contribute to an open source project (git, libgit2, whatever) to make its export more deterministic.
Anyway, I have also filed a request with github.
Comment 9 cJ 2022-11-13 08:40:19 UTC
To answer your question clearly, Eli: yes, I found occurrences of changed archives after 2020. I will soon report a convincing list, because there are a bunch in ::gentoo.
Comment 10 Larry the Git Cow gentoo-dev 2022-11-14 03:46:15 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=80cc6b358ee5d7fe7a791dcd80c9297fe6a42fc9

commit 80cc6b358ee5d7fe7a791dcd80c9297fe6a42fc9
Author:     Jérôme Carretero <cJ@zougloub.eu>
AuthorDate: 2022-11-12 20:31:00 +0000
Commit:     Michał Górny <mgorny@gentoo.org>
CommitDate: 2022-11-14 03:41:03 +0000

    dev-python/pyproject-metadata: canonicalize SRC_URI
    
    Signed-off-by: Jérôme Carretero <cJ-gentoo@zougloub.eu>
    Bug: https://bugs.gentoo.org/881037
    Signed-off-by: Michał Górny <mgorny@gentoo.org>

 dev-python/pyproject-metadata/pyproject-metadata-0.5.0.ebuild | 2 +-
 dev-python/pyproject-metadata/pyproject-metadata-0.6.1.ebuild | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
Comment 11 Mike Gilbert gentoo-dev 2022-11-22 17:14:30 UTC
Giving this to QA; I can't see any other team in Gentoo taking action on this.
Comment 12 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2022-11-23 08:44:23 UTC
Honestly, I don't recall the last time I saw a checksum mismatch due to GitHub archives being unstable.  All I've seen are checksum mismatches due to upstream retagging, and that's what we want to catch.