Hi all, Hopefully this is being posted in the right place. Apologies in advance, as this is where I was recommended to file the bug per NeddySeagoon's instructions. I've been wanting to set up some old computers running Gentoo as it was when the machines would have been new, as a historical timepiece of Linux "of the era". Unfortunately there does not seem to be any sort of "archive" of old Gentoo distfiles. NeddySeagoon has some on his "bloodnoc" server, and it is possible to somewhat reconstruct a distfiles mirror package by package by using a git snapshot of portage and importing tarballs found on 3rd-party servers via Google searches, but nothing complete exists (especially with regard to patches). I've reached out a few times to Robbat, who is said to have old LTO tape archives going back to around 2003, and offered to dump and host them myself, independent of the project if desired, but have received no response. Is there any sort of official 'backup' of distfiles of years gone past? Gentoo as a distribution is over 20 years old; the further away we get, the harder it becomes to recover that sort of history. Distributions such as Ubuntu, Debian, Fedora and others do seem to provide their own "archive" mirrors, though I have not conducted a definitive survey of other distributions to compare with (their being specific releases instead of a rolling-release model would, I assume, make it a much smaller archival challenge).
I have encountered problems where older systems, sometimes only a few years old, that need incremental updates to modernize are unable to locate distfiles. A huge portion of portage just cannot resolve dependencies and outright breaks if it deviates from today by more than ~6 months sometimes. It's very frustrating, and even more frustrating when one ebuild fails because of one missing distfile. I agree, and it would be great to have a mirror that actually has all the old distfiles.
Yeah, I know robbat2's definitely said he wants to do this. FWIW, for the time being (I know some/all of you may be aware of these already), these links may be useful:
- http://bloodnoc.org/~roy/old_Gentoo/ (inc. http://bloodnoc.org/~roy/old_Gentoo/DISTFILES.year/)
- http://bloodnoc.org/~roy/olde-distfiles/
- https://mirror.reenigne.net/
- https://dev.gentoo.org/~sam/distfiles/ (not that many, but got some of slyfox's old ones)
- https://dev.gentoo.org/~vapier/dist/
I've checked out Neddy's ~roy mirror (and even offered to duplicate it "across the pond" to save him some bandwidth) and it does have quite a few distfiles, though very few of the original stage tarballs. Has anyone been able to contact Robbat2 about the tape backups they've mentioned having? I am still happy to pay to have them shipped out to be dumped and mirrored if it is too time consuming for them. My fear, as always, is that the longer we wait, the more likely things are to be lost to bitrot.
(In reply to intelminer from comment #3) > I've checked out Neddy's ~roy mirror (and even offered to duplicate it > "across the pond" to save him some bandwidth) and it does have quite a few > distfiles. Though very little of the original stage tarballs > > Has anyone been able to contact Robbat2 about the tape backups they've > mentioned having? > > I am still happy to pay to have them shipped out to be dumped and mirrored > if it is too time consuming for them. My fear as always is the longer we > wait the more likely things are to be lost to bitrot I think my answer elsewhere got lost. The tapes ALSO have my own data on them, so I won't be shipping them somewhere else. I'm also preparing to move house, and hopefully be able to spin up the tape drive after that, to export all of the data (the tape aging is my concern there). Separately, are you aware of the Software Heritage Project and their objective to archive every open source distfile ever produced? https://docs.softwareheritage.org/ What would be useful as an effort you could undertake, is building a comprehensive list of every distfile ever referenced in Gentoo, along with their known checksums in all possible hashes, and then tracking their present state. Most notably, it would let me know what distfiles are high-value when I do restore my tape archives.
> What would be useful as an effort you could undertake, is building a > comprehensive list of every distfile ever referenced in Gentoo, along with > their known checksums in all possible hashes, and then tracking their > present state. > > Most notably, it would let me know what distfiles are high-value when I do > restore my tape archives. As I understand it, it's possible to pull every file referenced via a git clone of https://anongit.gentoo.org/git/repo/gentoo/historical.git followed by git checkout `git rev-list -n 1 --first-parent --before="YYYY-MM-DD" master`. That should provide a complete portage snapshot for any desired day, which should include all hashes/checksums. Correlating what distfiles already exist is much more difficult. I doubt any that still exist in portage are less than a few versions old by now (I don't know the date ranges on your tape backups). As mentioned above by Sam, a few users have taken to mirroring some of their old distfiles. A lot of more popular packages like GNU programs, KDE or the Linux kernel still have their entire histories listed online as well. It's entirely possible to "reconstruct" a system package by package that way (albeit entirely by hand, manually editing every desired package to remove patches): https://i.redd.it/0novnyhuz8271.png One major blocker that Neddy has noted on his Wiki page about it, and that I've also experienced, is getting patches for those packages. Even if a package may be recoverable "manually" by googling for its tarball, we don't have the Gentoo-specific patches anymore, many of which aren't listed, but simply point to the "patches" folder. It might be worth simply dumping all of it and then sorting through things after the fact?
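For anyone who wants to script the snapshot-by-date step described above, here is a minimal sketch that wraps the same git commands from Python. It assumes a local clone of the historical repository already exists; `repo_dir`, `rev_list_args` and `checkout_snapshot` are illustrative names, not part of any Gentoo tooling.

```python
import subprocess


def rev_list_args(date, branch="master"):
    """Build the `git rev-list` call that selects the newest first-parent
    commit on `branch` made before `date` (format YYYY-MM-DD)."""
    return ["git", "rev-list", "-n", "1", "--first-parent",
            f"--before={date}", branch]


def checkout_snapshot(repo_dir, date, branch="master"):
    """Check out the tree as it stood on `date` inside an existing clone.

    `repo_dir` is a hypothetical path to a local clone of the historical
    repository. Returns the commit id that was checked out.
    """
    commit = subprocess.run(
        rev_list_args(date, branch), cwd=repo_dir,
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    if not commit:
        # rev-list prints nothing when no commit predates the given day.
        raise ValueError(f"no commit on {branch} before {date}")
    subprocess.run(["git", "checkout", commit], cwd=repo_dir, check=True)
    return commit
```

Calling `checkout_snapshot("/path/to/historical", "2008-01-01")` would leave the working tree in a detached-HEAD state at that day's snapshot, which is usually what you want for reading old Manifests.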
(In reply to Robin Johnson from comment #4) > What would be useful as an effort you could undertake, is building a > comprehensive list of every distfile ever referenced in Gentoo, along with > their known checksums in all possible hashes, and then tracking their > present state. I could easily do that for the historical gentoo-x86 CVS repo. On first glance, I see about 300000 unique distfiles (from Manifest DIST entries and files/digest-*). Any preference for the format? Would a spreadsheet be o.k., with filename, size, MD5, RMD160, ... as its columns?
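A rough sketch of how such a list could be assembled, in case it helps settle the format question. It assumes Manifest lines of the form `DIST <filename> <size> <HASH> <value> ...` and old digest lines of the form `<HASH> <value> <filename> <size>`; the function names and the column order are my own choices for illustration.

```python
import csv
import io


def parse_dist_line(line):
    """Parse a Manifest-style line: DIST <filename> <size> <HASH> <value> ...

    Returns (filename, size, {hash_name: value}), or None for other lines.
    """
    fields = line.split()
    if len(fields) < 3 or fields[0] != "DIST":
        return None
    rest = fields[3:]
    return fields[1], int(fields[2]), dict(zip(rest[0::2], rest[1::2]))


def parse_digest_line(line):
    """Parse an old digest-* line: <HASH> <value> <filename> <size>."""
    fields = line.split()
    if len(fields) != 4:
        return None
    name, value, filename, size = fields
    return filename, int(size), {name: value}


def merge_entries(entries):
    """Combine the known hashes of entries sharing (filename, size).

    Simplification: two different files with the same name and size would be
    merged here; a real index would need to detect conflicting hash values.
    """
    merged = {}
    for filename, size, hashes in entries:
        merged.setdefault((filename, size), {}).update(hashes)
    return merged


def to_csv(merged, columns=("MD5", "RMD160", "SHA1", "SHA256", "SHA512", "BLAKE2B")):
    """Render the merged index as CSV text, one row per (filename, size)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(("filename", "size") + tuple(columns))
    for (filename, size), hashes in sorted(merged.items()):
        writer.writerow([filename, size] + [hashes.get(c, "") for c in columns])
    return out.getvalue()
```

Keeping one column per hash type makes it easy to see which files are known only by an old hash such as MD5 and which also have a newer one.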
(In reply to intelminer from comment #5) > One major blocker that Neddy has noted on his Wiki page about it and that > I've also experienced is getting patches for those packages. Even if a > package may be recoverable "manually" by googling for its tarball, we don't > have the Gentoo specific patches anymore. Many of which aren't listed, but > simply point to the "patches" folder. > We may be able to pull out at least some of these if we have them archived (we keep old developers' devspaces for a bit). Obviously patches in the tree (CVS or git) are there forever, the issue is stuff which got shoved on mirror://gentoo/ or similar.
Hi team, Just checking in to see if the relevant parties have made any progress :) @Robbat mentioned they were moving house at the time and were intending to spin up the old tapes after. Hopefully that was successful (those tapes are understandably quite old and may have deteriorated). @Ulrich said it'd be possible to pull all the relevant distfile data out of the gentoo-x86 repo (I'm not sure if that includes other architectures, since Gentoo is source-based, i.e. it's just an ~ARCH keyword). @Sam also mentioned that some of the patch files may be floating around on old archived accounts. I'm not sure if they're the same patches that were added to the production repo or not, but they're definitely worth investigating.
(In reply to Nathan Shearer from comment #1) > I have encountered problems where older systems, sometimes only a few years > old, need incremental updates to modernize are unable to locate distfiles. I'm currently finding myself in this exact situation. I'm trying to find missing distfiles, but it doesn't go well for me. What's especially frustrating is that distfiles disappear from dev.gentoo.org, which is supposed to be a "stable and reliable infrastructure" for distfiles, per https://devmanual.gentoo.org/general-concepts/mirrors/index.html . (For example, currently I can't merge x11-terms/rxvt-unicode-9.22-r7 because of this.)
I have a load of distfiles accumulated over the last 4-5 years for all the software I frequently use, including rxvt-unicode. I need to do some drive cleanup soon so I'm wondering where would be a good place to put them.
(In reply to Pavel Goran from comment #9) > What's especially frustrating is that distfiles disappear from > dev.gentoo.org, which is supposed to be a "stable and reliable > infrastructure" for distfiles, per > https://devmanual.gentoo.org/general-concepts/mirrors/index.html . (For > example, currently I can't merge x11-terms/rxvt-unicode-9.22-r7 because of > this.) Here's a copy of the files for you: https://dev.gentoo.org/~robbat2/distfiles/rxvt-unicode-9.22.tar.bz2 https://dev.gentoo.org/~robbat2/distfiles/rxvt-unicode-9.22_24-bit-color_cpixl-20201108.patch.xz But I'm wondering why you specifically need x11-terms/rxvt-unicode-9.22-r7.
To the root of requirements discussion for the historical archive: for the August 2015 - November 2023 distfiles, there's ~4.82TiB of storage potentially online. I haven't weeded out distfiles with fetch/mirroring restrictions, but that will shrink it due to the massive distfiles used by commercial software. - 427467 unique SHA512 hashes - 4033 collisions of filenames w/ different SHA512 (upstream changed the file) What's the layout of this potential repo going to be? Because of the collision problem it cannot be just the filename hash that's used for live distfiles.
How many collisions are there if you only count those that collide within the same calendar month? It'd be most convenient to be able to point GENTOO_MIRRORS at your collection within the same date range, similar to how the Arch Linux Archive[1] works. [1]: https://wiki.archlinux.org/title/Arch_Linux_Archive#/packages
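The collision question can be answered mechanically once there is an index of (filename, SHA512, date seen). A minimal sketch, assuming the records are available in that shape (the record layout and the function name are assumptions for illustration):

```python
from collections import defaultdict


def filename_collisions(records, by_month=False):
    """Find filenames that were seen with more than one SHA512.

    records: iterable of (filename, sha512, seen), with seen as 'YYYY-MM-DD'.
    With by_month=True, only hashes seen within the same calendar month
    count as colliding, and the keys become ('YYYY-MM', filename).
    """
    hashes = defaultdict(set)
    for filename, sha512, seen in records:
        key = (seen[:7], filename) if by_month else filename
        hashes[key].add(sha512)
    return {key: sorted(found) for key, found in hashes.items() if len(found) > 1}
```

Comparing the two counts would show how many collisions a per-month (or per-day) directory layout still has to cope with.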
For the time being, not a long-term solution, I've put my files here: https://archive.org/download/gentoo-distfiles_202311
(In reply to Esteve Varela Colominas from comment #13) > How many collisions are there if you only count those that collide within > the same calendar month? It'd be most convenient to be able to point > GENTOO_MIRRORS at your collection within the same date range, similar to how > the Arch Linux Archive[1] works. > > [1]: https://wiki.archlinux.org/title/Arch_Linux_Archive#/packages To try and flesh out a design based on what I think you're suggesting here:
distfiles-by-date/YYYYMMDD/$FILENAME
distfiles-by-date/YYYYMMDD/$NAMEHASH_PREFIX/$FILENAME
that are symlinks to an actual file:
distfiles-by-content-hash/$hh/$hhhh/$FILENAME
Representing approximately what the distfiles structure looked like at a given date. This structure would be good enough if a conflicted distfile existed for at least 1 day with a given hash.
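To make the proposed layout concrete, here is a small sketch that stores a file once by content hash and links it into a dated view. It assumes `$hh` and `$hhhh` mean the first 2 and first 4 hex digits of the file's SHA512, which is my reading rather than something stated above, and it only creates the plain `distfiles-by-date/YYYYMMDD/$FILENAME` link, not the `$NAMEHASH_PREFIX` variant.

```python
import hashlib
import os


def content_path(root, filename, data):
    """Path under distfiles-by-content-hash/ for these bytes.

    Assumption: $hh/$hhhh are the first 2 and 4 hex digits of the SHA512.
    """
    digest = hashlib.sha512(data).hexdigest()
    return os.path.join(root, "distfiles-by-content-hash",
                        digest[:2], digest[:4], filename)


def store(root, date, filename, data):
    """Store `data` once by content hash and symlink it into the dated view.

    Limitation of this sketch: two different files with the same name whose
    hashes share the first 4 hex digits would map to the same path.
    """
    target = content_path(root, filename, data)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    if not os.path.exists(target):
        with open(target, "wb") as fh:
            fh.write(data)
    day_dir = os.path.join(root, "distfiles-by-date", date)
    os.makedirs(day_dir, exist_ok=True)
    link = os.path.join(day_dir, filename)
    if os.path.lexists(link):
        os.remove(link)
    # A relative link keeps the tree relocatable (e.g. when rsynced elsewhere).
    os.symlink(os.path.relpath(target, day_dir), link)
    return link
```

Storing the same bytes under two dates yields two symlinks pointing at one real file, which is where the deduplication comes from.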
I second the suggestion for simply having *everything* in a single distfiles folder in the same way that Arch Linux does. It may make it more cumbersome for a browser to render such an enormous list of files, but it would make it significantly easier to just point an old system at it. That aside, I assume your own archives go back to August of 2015 @robbat? Gentoo's history spans almost a quarter of a century at this point if we include Enoch's 1999 release. The further back we go, the more difficult it's likely going to be to find distfiles or ISOs, especially for non-x86/AMD64 architectures.
(In reply to Cursed Silicon from comment #16) > I second the suggestion for simply having *everything* in a single distfiles > folder in the same way that Arch Linux does. It may make it more cumbersome > for a browser to render such an enormous list of files, but it would make it > significantly easier to just point an old system at it You'd probably prefer to point the example system to ..../distfiles-by-date/YYYYMMDD/ corresponding to a point-in-time you were trying to load. I can imagine also supplying a .../distfiles-last-seen/ that symlinks to the last version of a given distfile that was seen; it would be a massive directory, but it would work for most cases as well. What I'm trying to do mentally is map that symlink model into an S3-API [not AWS-specific] model (specifically w/ x-amz-website-redirect-location), so that we can have a clean two-way archive<->host mapping. S3 pricing is good for a backup copy of this data, but the bandwidth is much too expensive for the long-term pricing side. Building out the service would likely require the S3 side, and then the host side. The cost implications of storing redirects inside S3 worry me a little: the per-day directories, filled with symlinks, come to 51k objects/day in S3, or $7.50/mo in request fees just for adding a prefix each day. > That aside. I assume your own archives go back to August of 2015 @robbat? Per earlier in the thread, my tapes cover much of the early history, back to 2003. > Gentoo's history spans almost a quarter of a century at this point if we > include Enochs 1999 release. The further back we go the more difficult it's > likely going to be to find distfiles or ISO's, especially for non-x86/AMD64 > architectures
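The request-fee figure can be reproduced with simple arithmetic. The sketch below assumes a price of $0.005 per 1,000 write requests and a 30-day month; both are assumptions for illustration, and actual pricing depends on the provider and region.

```python
def monthly_request_cost(objects_per_day, price_per_1000_requests, days_per_month=30):
    """Request fees for writing one redirect object per distfile per day."""
    requests = objects_per_day * days_per_month
    return requests / 1000 * price_per_1000_requests


# 51k redirect objects per day at the assumed $0.005 per 1,000 requests
# comes to about $7.65 per month, the same ballpark as the figure above.
cost = monthly_request_cost(51_000, 0.005)
```

This covers only the request fees for creating the daily prefix; storage and bandwidth are separate line items.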
(In reply to Robin Johnson from comment #11) > Here's a copy of the files for you: > https://dev.gentoo.org/~robbat2/distfiles/rxvt-unicode-9.22.tar.bz2 > https://dev.gentoo.org/~robbat2/distfiles/rxvt-unicode-9.22_24-bit- > color_cpixl-20201108.patch.xz I already found a solution for my situation with rxvt-unicode by the time you replied, but thank you anyway! > But i'm wondering why you specifically need x11-terms/rxvt-unicode-9.22-r7 I was performing a gradual upgrade of my Gentoo system which wasn't updated for a few years, and 9.22-r7 happened to be the version of rxvt-unicode in one of the portage tree commits that I used during this process.
> The cost implications of storing inside S3 redirects worries me a little: > the per-day directories, filled with symlinks: 51k objects/day in S3, > $7.50/mo in request fees just for adding a prefix each day. Do we have an approximation of the size of a 2003-Present(?) Distfiles mirror? It may be cheaper (and easier) to have it hosted on an 'old fashioned' web server with some big spinning rust (think OVH or the like)
(In reply to Cursed Silicon from comment #19) > Do we have an approximation of the size of a 2003-Present(?) Distfiles > mirror? It may be cheaper (and easier) to have it hosted on an 'old > fashioned' web server with some big spinning rust (think OVH or the like) I indexed all the historical manifest hashes as part of an effort to collect all the distfiles. It looks like the whole collection is roughly 4.88 TiB as of sometime in the past couple weeks. That does include distfiles that are restricted (although I can't say what portion).
(In reply to Daniel M. Weeks from comment #20) > (In reply to Cursed Silicon from comment #19) > > Do we have an approximation of the size of a 2003-Present(?) Distfiles > > mirror? It may be cheaper (and easier) to have it hosted on an 'old > > fashioned' web server with some big spinning rust (think OVH or the like) > > I indexed all the historical manifest hashes as part of an effort to collect > all the distfiles. It looks like the whole collection is roughly 4.88 TiB as > of sometime in the past couple weeks. That does include distfiles that are > restricted (although I can't say what portion). That seems small. My prior distfile statistics say Aug 2015-Nov 2023 is 4.82TB. I'd put ballpark total size closer to 8TB, but I think there might be a lot of dedup, where we don't have a set of hashes right now to prove the old file is the same (e.g. only have MD5 hash and a much newer hash, no overlap of hash history). Did you use the conversion of the historical gentoo-x86 repo (not the current Git repo), and did it correctly have the missing category (one of the conversions missed an entire category because of historical CVS weirdness)? I agree that a primary copy on Hetzner etc. is sounding like the best deal, and just archive to S3.
Hi all, I've been slowly working through updating an old Gentoo install that hadn't been touched since early 2016, and missing distfiles have been one of the biggest challenges I've faced. There are a few other things that have also likely made this difficult:
- stale https certificates, which required a custom FETCHCOMMAND with wget --no-check-certificate
- the reorganized distfiles layout that older Gentoo does not know about; I'm assuming that some of the files required are probably in existing mirrors but can't be found under their old names / original locations
There is a decent set of historical distfiles at https://www.jabawok.net/gentoo that has really helped me, but I can imagine this being much more difficult for older systems. I just found the rxvt unicode distfile here https://bugs.gentoo.org/834712#c11 which will allow me to keep going for the moment.
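On the reorganized layout: as far as I understand GLEP 75, current mirrors place each distfile in a subdirectory named after the first two hex digits of the BLAKE2B hash of the file name (not its content). The sketch below is written under that assumption; the mirror URL in the usage note is a placeholder and the function names are illustrative.

```python
import hashlib


def hashed_mirror_path(filename, hex_chars=2):
    """Return '<prefix>/<filename>', where prefix is the leading hex digits
    of the BLAKE2B hash of the file *name* (assumed GLEP 75 style layout)."""
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return f"{digest[:hex_chars]}/{filename}"


def candidate_urls(mirror, filename):
    """URLs to try on a mirror: hashed layout first, then the old flat one."""
    base = mirror.rstrip("/") + "/distfiles"
    return [f"{base}/{hashed_mirror_path(filename)}", f"{base}/{filename}"]
```

For example, `candidate_urls("https://mirror.example.org/gentoo", "rxvt-unicode-9.22.tar.bz2")` gives the two locations an old system could try by hand when its own portage only knows the flat layout.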
(In reply to Derek Scherger from comment #22) > Hi all, > [snip] > > There is a decent set of historical distfiles at > https://www.jabawok.net/gentoo that has really helped me but I can imagine > this being much more difficult for older systems. > > I just found the rxvt unicode distfile here > https://bugs.gentoo.org/834712#c11 which will allow me to keep going for the > moment. Feel free to add http://bloodnoc.org/~roy/olde-distfiles/ to your GENTOO_MIRRORS. It works for http and https which avoids the certificates problem. https://wiki.gentoo.org/wiki/User:NeddySeagoon/HOWTO_Update_Old_Gentoo may be useful too.
(In reply to Robin Johnson from comment #21) > (In reply to Daniel M. Weeks from comment #20) > > (In reply to Cursed Silicon from comment #19) > > > Do we have an approximation of the size of a 2003-Present(?) Distfiles > > > mirror? It may be cheaper (and easier) to have it hosted on an 'old > > > fashioned' web server with some big spinning rust (think OVH or the like) > > > > I indexed all the historical manifest hashes as part of an effort to collect > > all the distfiles. It looks like the whole collection is roughly 4.88 TiB as > > of sometime in the past couple weeks. That does include distfiles that are > > restricted (although I can't say what portion). > > That seems small. My prior distfile statistics say Aug 2015-Nov 2023 is > 4.82TB. > I'd put ballpark total size closer to 8TB, but I think there might be a lot > of dedup, where we don't have an set of hashes right now to prove the old > file is the same (e.g. only have MD5 hash and a much newer hash, no overlap > of hash history). > > Did you use the conversion of the historical gentoo-x86 repo (not the > current Git repo), and did it correctly have the missing category (one of > the conversions missed an entire category because of historical CVS > weirdness). > > > I agree that a primary copy on Hetzner etc is sounding like the best deal; > and just archive to S3. I am correlating the hashes. I thought I had pulled in gentoo historical but there must have been some issue along the way. I ran through gentoo-historical-2 and it picked up a lot of "new" things. New collection total is ~6.2 TiB.
Brief update here: The drives are on order, and should arrive at OSUSL late this week or early next week. We should try to get some content-addressing symlink-layout script that we can hand out to data hoarders we want to ingest; specifically so that we can identify what unique content they each have. This will matter most for the largest archives, where it would be ideal to avoid duplicate downloads when most of the archive overlaps.
>We should try to get some content-addressing symlink-layout script that we can hand out to data hoarders Is the current (un-deduped) expectation still around 8TB? I'd like to get some drives ordered and ready on my own side
There were some delays getting the drives live, but they are now installed and fully working as of 2024/04/11. In terms of scouting out the content, and getting much closer to an estimate:
repo R1: "modern" git.gentoo.org/repo/gentoo.git
repo R2: "historical" git.gentoo.org/archive/repo/gentoo-2.git
This is the best conversion of historical CVS. Notably it dealt with some painful cases that the previous conversion did not, like the package dev-backup/Attic: "Attic" is a reserved word in CVS.
Here are the patterns I've checked so far:
$R1:$CAT/$PN/Manifest* -- lookup for compression!
$R2:$CAT/$PN/Manifest* -- lookup for compression!
$R2:$CAT/$PN/files/digest-$PN-$PF
$R2:$CAT/$PN/files/digest
$R2:$CAT/$PN/digest-$PN-$PF - rare, from a bug that omitted /files/ in the generated path; hopefully also duplicates.
The above form ~885k unique "DIST" lines, including converting the older digest format. Those lines represent 702377 unique filenames of distfiles. I haven't tried to join this working set down to unique hashes yet. De-duping the above based on *filename* only, that's 6.7TiB/7.4TB of content, including everything that was restricted and might never have been on the web (e.g. distfiles only available on CDROMs, like games or commercial packages).
Not included:
$R2:$PN/files/digest - extremely rare, likely pre-CVS
$R2:$PN/digest - extremely rare, likely pre-CVS AND bug
Everything for Gentoo Prefix
So, related question: what filesystem setup are we going to use to handle 1M files AND 50M-200M symlinks? Let the bikeshed debate begin; I think XFS vs ZFS vs btrfs to start with.
Back-of-napkin estimates: ext4 @ 4K inodes would need 190GiB for 50M symlinks.
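The napkin estimate is easy to re-run for other symlink counts. This sketch simply multiplies the number of symlinks by an assumed per-symlink footprint (4096 bytes here, matching the estimate above); real usage depends on the filesystem and on how it stores short link targets.

```python
def symlink_overhead_gib(symlink_count, bytes_per_symlink=4096):
    """Space consumed if every symlink occupies `bytes_per_symlink` bytes."""
    return symlink_count * bytes_per_symlink / 2**30


low = symlink_overhead_gib(50_000_000)    # about 190.7 GiB
high = symlink_overhead_gib(200_000_000)  # about 762.9 GiB
```

At the upper end of the 50M-200M range the symlinks alone would approach 10% of the roughly 7 TiB of content, so the per-symlink footprint is worth weighing in the filesystem choice.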
I'll have a wee nibble > So related question: what filesystem setup are we doing to use to handle 1M > files AND 50M-200M symlinks. Let the bikeshed debate begin; I think XFS vs > ZFS vs btrfs to start with. > ZFS is not in the kernel - does it pass the Social Contract test? As Gentoo will not 'depend' on these archives, in my opinion, probably yes. BTRFS is still new. We see bugs on the forums still. On that basis, XFS is the least worst, but I know nothing of the technical merits of any of them, and the above are more points to ponder than reasons for recommendation. > back of napkin estimates: > ext4 @ 4K inodes would need 190GiB for 50M symlinks. With ext4 at 1k block size, it's 'only' 25G. Putting a 1k block size fs on top of a device with a 4k physical block size is, in general, a horrible thing to do, but it works, and for something that will be read mostly, do extra writes matter? Does the extra 175G matter in 7TB of data, as it's small in comparison to the total content? From an architectural perspective, is this to be one huge filesystem or several smaller filesystems, possibly with different filesystem types? More questions than answers.
While I really applaud this effort, I would like to remind you that we're still waiting for proper developer hosting for distfiles (bug #176186). At this point, we're irrevocably discarding (or losing) old, custom-made distfiles, because of limited space on woodpecker. If our goal is to save distfiles, wouldn't it be better to solve that problem first?