I'm creating this issue here as I didn't see another one I could link to.

GitHub provides source archives and many repositories use them, but their maintainers often don't know that these archives are generated on the fly, so the downloaded files may differ from one download to the next. In our case, this shows up as size or checksum mismatches.

Over time, I have repeatedly hit such issues; while they don't happen in the Portage tree because the Gentoo mirrors are hit first, they do happen in overlays.

But I'm not alone:

- https://github.com/keybase/client/issues/10800#issuecomment-375831096 mentions support@github.com:

  > Currently, we don't make any guarantees about the byte-for-byte equivalence of any tarball which is generated on the fly.
  > If a team wants to produce a stable tarball, they will have to create it themselves and put it as a download in the releases.
  > We realize that this approach can be confusing since we put links to the on-the-fly tarballs right where the user-provided ones would exist. Our team is aware of this and will keep it in mind for future iterations of the feature, though we can't make any promises of specific changes.

  and a blurb from a GitHub'er:

  > These are not planned changes but rather they come about from updating
  > the software involved in creating them. The main purpose of the
  > auto-generated archives are for someone to download the source from the
  > website if they don't want to bother with downloading the repository.
  > It is not meant to be reliable or a way to distribute software releases
  > and nothing in the software stack is made to try to produce consistent
  > archives. This is no different from creating a tarball locally and
  > trying verify it with the hash of the tarball someone created on their
  > own machine.
  > The only way to get a known-good checksum for a tarball is to have
  > upstream (or the packagers) prepare the release and upload the tarball
  > alongside its checksum. This is true regardless of GitHub. There is a
  > feature on the site where maintainers can upload their own assets for a
  > release though clearly not too many people actually use it.

- https://github.com/easybuilders/easybuild-easyconfigs/issues/5151
- https://github.com/pfsense/FreeBSD-ports/commit/3691e1aae77dc8d6b3c65fa597ffc833cd5e2973
- https://github.com/comby-tools/comby/issues/328 <- occurrence from 2021-12

In order to be safe, every upstream maintainer should manually create artifacts and upload them to GitHub as release assets.

There's documentation on how to generate reproducible tarballs at https://reproducible-builds.org/docs/archives/ and another way is to download the auto-generated artifacts and then re-upload them using the GitHub API, e.g. as I did in: https://github.com/neuropoly/distriploy/blob/820db8e43f16f9e74360e464f25bba777d8b4a68/distriploy/release_github.py#L43

The potentially affected packages are those whose SRC_URI matches these globs:

- github.com*archive/refs/tags/v${PV}.tar.gz
- github.com*archive/v${PV}.tar.gz

In some cases, releases have been done manually by upstream, but the Gentoo ebuilds refer to an automatic artifact, e.g. app-admin/doctl.

Of course, another way, which would be less work for distros and less wasted storage for GitHub, would be for them to find a deterministic way to generate these archives...
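The two globs above can be turned into a quick check on a list of download URLs. The following is a minimal Python sketch, not the script used for this report: the function name `is_autogenerated_archive` and the sample URLs are hypothetical, `${PV}` is approximated by a `*` wildcard, and a leading `*` is added so the patterns also match the URL scheme.

```python
from fnmatch import fnmatchcase

# The two globs from the report. Assumptions of this sketch: ${PV} is
# replaced by "*" and a leading "*" is added to absorb "https://".
AUTO_ARCHIVE_GLOBS = (
    "*github.com*archive/refs/tags/v*.tar.gz",
    "*github.com*archive/v*.tar.gz",
)


def is_autogenerated_archive(uri: str) -> bool:
    """Return True if uri looks like an on-the-fly GitHub tag archive."""
    return any(fnmatchcase(uri, pattern) for pattern in AUTO_ARCHIVE_GLOBS)


# Hypothetical URLs, for illustration only.
samples = [
    "https://github.com/example/project/archive/refs/tags/v1.2.3.tar.gz",
    "https://github.com/example/project/archive/v1.2.3.tar.gz",
    "https://github.com/example/project/releases/download/v1.2.3/project-1.2.3.tar.gz",
]
for uri in samples:
    print(is_autogenerated_archive(uri), uri)
```

Since `*` in `fnmatch` also matches `/`, this is deliberately broad; a real check would substitute the ebuild's actual version instead of a wildcard.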
Even if not guaranteed, this has hardly ever been an issue and (as you say) we do have mirrors that are copying and keeping that frozen-in-time copy (there is little sense in replicating what mirrors do by mirroring it manually first). And we already prioritize proper release tarballs when they exist (unless they have a problem, like missing files we need). We don't have control over what overlays do, so I'm unsure what you want us to do here?
(In reply to Ionen Wolkens from comment #1)
> And we already prioritize proper release tarballs when they exist
> (unless they have a problem, like missing files we need).

On that note, feel free to file bugs if an ebuild should use one but doesn't.
ionen: yeah, this is rather low priority for ::gentoo due to mirrors, but as you say, ebuilds could be scanned for cases where a stable release artifact giving reproducible downloads already exists (sam mentioned https://github.com/pkgcore/pkgcheck/issues/473).

Here's my dirty script that re-computes manifests, along with overnight (partial) results on ::gentoo: https://gist.github.com/zougloub/7fdea04c66e856fcac1000c398d795e1

At least with this, overlays can be scanned.
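To illustrate what re-computing a manifest entry involves, here is a small Python sketch. It is not the script from the gist; the helper names `dist_entry` and `matches_manifest` and the file name are hypothetical. It assumes the usual Manifest `DIST` line layout (file name, size, then BLAKE2B and SHA512 digests).

```python
import hashlib
import tempfile
from pathlib import Path


def dist_entry(path: Path) -> str:
    """Build a Manifest-style DIST line (size, BLAKE2B, SHA512) for a file."""
    blake2b = hashlib.blake2b()  # default 64-byte digest
    sha512 = hashlib.sha512()
    size = 0
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large distfiles are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            size += len(chunk)
            blake2b.update(chunk)
            sha512.update(chunk)
    return (
        f"DIST {path.name} {size} "
        f"BLAKE2B {blake2b.hexdigest()} SHA512 {sha512.hexdigest()}"
    )


def matches_manifest(path: Path, recorded_line: str) -> bool:
    """Compare a freshly computed DIST line with the recorded one."""
    return dist_entry(path).split() == recorded_line.split()


# Simulate a distfile whose bytes change between two downloads.
with tempfile.TemporaryDirectory() as tmp:
    distfile = Path(tmp) / "example-1.0.tar.gz"
    distfile.write_bytes(b"first download")
    recorded = dist_entry(distfile)
    distfile.write_bytes(b"second download, regenerated")
    print("unchanged" if matches_manifest(distfile, recorded) else "MISMATCH")
```

A scan over an overlay would re-fetch each distfile from upstream, bypassing the mirrors, and apply this comparison to the recorded line.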
Manually filed a case where an alternate download source could be used. This could be automated for others.
Work in progress pertaining to this issue is here: https://gitlab.com/cJ/gentoo-bug-881037-github-reproducible-downloads

Some of the checks could probably land in pkgcheck.
Filed an issue with github, because it would be so much more elegant if they were the ones to fix the problem.
I discussed this a bit at https://lists.reproducible-builds.org/pipermail/rb-general/2021-October/002422.html

The tl;dr is that this is not actually a real issue to my reasonably-certain knowledge, although I'd be interested in seeing credible proof consisting of before-and-after tarballs.

GitHub auto-generated tarballs are not "guaranteed" by GitHub, because they are the result of running the git-archive program, which GitHub doesn't personally guarantee. Luckily, it doesn't matter because that's all on the git project.

...

As far as I can tell, the discussion here is basically about theoreticals? Lots of links to issues from 5+ years ago.

I'm only aware of a couple of realistic sources of non-reproducible behavior, assuming you don't use a truly ancient version of git to generate them.

- unreproducible gzip (busybox gzip was "recently" fixed to be reproducible)
- renaming the GitHub repository such as to capitalize or lowercase the repo name, as that is embedded in the base filepath
- obviously, re-tagging
- gitattributes export-subst can embed information from the git repository, and depending on the information and how you define it, that can be non-reproducible (for example, abbreviated commit hashes can grow longer as the repo grows, and some methods of embedding the author/committer will respect a mailmap file)

Case 1 is solved, cases 2 and 3 are actually legitimate cases of upstream modification, and case 4 is a bug in upstream's export-subst handling.

Are there other real issues which aren't about a commit from git.git dating back to 2013?

...

Granted, if upstream themselves *provide* hand-generated dist tarballs this is superior, for several reasons that mostly don't have to do with reproducibility -- they can use better-than-gzip compression, they can have generated files included, they can have non-useful files *excluded* -- but if upstream doesn't provide them and maybe doesn't see much point because they don't use autotools, why is that a distro problem?
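When collecting before-and-after tarballs, it helps to separate a compression-level difference (the gzip case above) from a real content change (such as re-tagging). The Python sketch below does this by comparing the decompressed streams. The helper names are hypothetical, and the differing gzip header timestamp is only a stand-in for compressor variation in general, not a claim about what GitHub's software does.

```python
import gzip
import hashlib


def payload_digest(gz_bytes: bytes) -> str:
    """SHA-256 of the decompressed stream, ignoring gzip framing."""
    return hashlib.sha256(gzip.decompress(gz_bytes)).hexdigest()


def classify(old: bytes, new: bytes) -> str:
    """Say whether two .tar.gz blobs differ, and at which layer."""
    if old == new:
        return "identical"
    if payload_digest(old) == payload_digest(new):
        return "compression-only difference"
    return "content changed"


# Stand-in for a tar stream; a real test would use downloaded archives.
tar_stream = b"stand-in for a tar stream\n" * 64
first = gzip.compress(tar_stream, mtime=1_600_000_000)
second = gzip.compress(tar_stream, mtime=1_700_000_000)
retagged = gzip.compress(tar_stream + b"extra", mtime=1_600_000_000)

print(classify(first, first))
print(classify(first, second))
print(classify(first, retagged))
```

A "compression-only difference" still breaks a checksum recorded over the compressed file, which is why the distinction matters when deciding whether upstream changed anything.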
I'm working on some changes there: https://github.com/gentoo/gentoo/pull/28247 where there are a bunch of minor improvements to supply chain security (feel free to review).

Soon (as soon as I'm done re-fetching everything) I'll include a commit with updated checksums for the auto-generated archives that have changed, which also shouldn't hurt.

I think that saying GitHub doesn't control git overlooks that they could have used something other than git-archive to serve the archives (and maybe they do), or that they could contribute to an open source project (git, libgit2, whatever) to make their export more deterministic.

Anyway, I have also filed a request with GitHub.
To clearly answer your question, Eli: yes, I found occurrences of archives that changed after 2020... I will soon report a convincing list, because there are a bunch in ::gentoo.
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=80cc6b358ee5d7fe7a791dcd80c9297fe6a42fc9

commit 80cc6b358ee5d7fe7a791dcd80c9297fe6a42fc9
Author:     Jérôme Carretero <cJ@zougloub.eu>
AuthorDate: 2022-11-12 20:31:00 +0000
Commit:     Michał Górny <mgorny@gentoo.org>
CommitDate: 2022-11-14 03:41:03 +0000

    dev-python/pyproject-metadata: canonicalize SRC_URI

    Signed-off-by: Jérôme Carretero <cJ-gentoo@zougloub.eu>
    Bug: https://bugs.gentoo.org/881037
    Signed-off-by: Michał Górny <mgorny@gentoo.org>

 dev-python/pyproject-metadata/pyproject-metadata-0.5.0.ebuild | 2 +-
 dev-python/pyproject-metadata/pyproject-metadata-0.6.1.ebuild | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
Giving this to QA; I can't see any other team in Gentoo taking action on this.
Honestly, I don't recall the last time I saw a checksum mismatch due to GitHub archives being unstable. All that I've seen is checksum mismatches due to upstream retagging, and that's what we want to catch.