The content-hash distfiles layout will be similar to filename-hash, except that paths will be derived entirely from a file content hash.
What's the advantage?
The content-hash layout has a few advantages over the filename-hash layout:

1) Since the file path is independent of the file name, file name collisions cannot occur. This makes the content-hash layout suitable for storing multiple types of files (not only Gentoo distfiles). For example, it can store distfiles for multiple Linux distros within the same tree, with automatic deduplication based on content digest. This layout can be used to store and distribute practically anything (binary packages, for example).

2) It allows multiple revisions of the same distfile name. An existing distfile can be updated, and a user who still has an older copy of an ebuild repository (or an overlay) can still fetch the desired revision of the distfile as long as it has not been purged from the mirror.

3) File integrity data is integrated into the layout itself, making it very simple to verify the integrity of any file that it contains. The only tool required is an implementation of the chosen hash algorithm.
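To illustrate point 3, here is a minimal sketch of such a check. It assumes that the last path component of a stored file is the lowercase hex digest of its content and that the mirror uses SHA512; the function name and defaults are hypothetical and are not part of Portage.

```python
import hashlib
import os


def verify_content_hash_file(path, algo="sha512", chunk_size=1 << 20):
    """Return True if the file's content digest equals its own file name.

    Assumption: the file is stored under its lowercase hex digest, as
    the final path element of a content-hash style tree.
    """
    hasher = hashlib.new(algo)
    with open(path, "rb") as f:
        # Read in chunks so large distfiles do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == os.path.basename(path).lower()
```

No Manifest or other external metadata is consulted: the name of the file is the expected digest.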
Patch posted for review:

https://archives.gentoo.org/gentoo-portage-dev/message/a9f4cb06587083c90605b0b93a25ed02
https://github.com/gentoo/portage/pull/671
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=a4f06ab3cf7339100b2af2146ae90cbba8bac371

commit a4f06ab3cf7339100b2af2146ae90cbba8bac371
Author:     Daniel Robbins <drobbins@funtoo.org>
AuthorDate: 2021-02-20 23:11:46 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2021-02-22 11:48:41 +0000

    Add content-hash distfiles layout (bug 756778)

    The content-hash layout is identical to the filename-hash layout,
    except for these three differences:

    1) A content digest is used instead of a filename digest.

    2) The final element of the path returned from the get_path method
       corresponds to the complete content digest. The path is a function
       of the content digest alone.

    3) Because the path is a function of content digest alone, the
       get_filenames implementation cannot derive distfile names from
       paths, so it instead yields DistfileName instances whose names are
       equal to content digest values. The DistfileName documentation
       discusses resulting implications.

    Motivations to use the content-hash layout instead of the
    filename-hash layout may include:

    1) Since the file path is independent of the file name, file name
       collisions cannot occur. This makes the content-hash layout
       suitable for storage of multiple types of files (not only Gentoo
       distfiles). For example, it can be used to store distfiles for
       multiple Linux distros within the same tree, with automatic
       deduplication based on content digest. This layout can be used to
       store and distribute practically anything (including binary
       packages for example).

    2) Allows multiple revisions for the same distfile name. An existing
       distfile can be updated, and if a user still has an older copy of
       an ebuild repository (or an overlay), then a user can successfully
       fetch a desired revision of the distfile as long as it has not
       been purged from the mirror.

    3) File integrity data is integrated into the layout itself, making
       it very simple to verify the integrity of any file that it
       contains. The only tool required is an implementation of the
       chosen hash algorithm.

    Bug: https://bugs.gentoo.org/756778
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 lib/portage/package/ebuild/fetch.py | 97 ++++++++++++++++++++++++++++++++++
 lib/portage/tests/ebuild/test_fetch.py | 36 +++++++++++++
 2 files changed, 133 insertions(+)

https://gitweb.gentoo.org/proj/portage.git/commit/?id=b9ef191c74982b0e8d837aa7dd256dc3c52f7d2c

commit b9ef191c74982b0e8d837aa7dd256dc3c52f7d2c
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2021-02-20 23:11:46 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2021-02-22 11:48:41 +0000

    MirrorLayoutConfig: content digest support (bug 756778)

    In order to support mirror layouts that use content digests, extend
    MirrorLayoutConfig validate_structure and get_best_supported_layout
    methods to support an optional filename parameter of type
    DistfileName which includes a digests attribute. Use the new
    parameter to account for availability of specific distfile content
    digests when validating and selecting mirror layouts which require
    those digests.

    The DistfileName type represents a distfile name and associated
    content digests, used by MirrorLayoutConfig and related layout
    implementations.

    The path of a distfile within a layout must be dependent on nothing
    more than the distfile name and its associated content digests. For
    filename-hash layout, path is dependent on distfile name alone, and
    the get_filenames implementation yields strings corresponding to
    distfile names. For content-hash layout, path is dependent on
    content digest alone, and the get_filenames implementation yields
    DistfileName instances whose names are equal to content digest
    values. The content-hash layout simply lacks the filename-hash
    layout's innate ability to translate a distfile path to a distfile
    name, and instead carries an innate ability to translate a distfile
    path to a content digest.

    In order to prepare for a migration from filename-hash to
    content-hash layout, all consumers of the layout get_filenames
    method need to be updated to work with content digests as a
    substitute for distfile names. For example, in order to prepare
    emirrordist for content-hash, a key-value store needs to be added as
    a means to associate distfile names with content digest values
    yielded by the content-hash get_filenames implementation.

    Bug: https://bugs.gentoo.org/756778
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 lib/portage/package/ebuild/fetch.py | 98 ++++++++++++++++++++++++++++++----
 lib/portage/tests/ebuild/test_fetch.py | 33 +++++++++---
 2 files changed, 114 insertions(+), 17 deletions(-)
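For readers following along, the path rule described in these commit messages can be sketched as follows. This is not the code from fetch.py. It assumes that directory cutoffs are given in bits (multiples of 4, one hex character per 4 bits) as in the filename-hash layout, and the function names and the default cutoffs are hypothetical.

```python
def content_hash_path(digest_hex, cutoffs=(8, 8, 8)):
    """Build a relative path from a hex content digest.

    Assumption: each cutoff consumes the next cutoff/4 hex characters of
    the digest to form one directory level, and the final path element
    is the complete digest.
    """
    remaining = digest_hex
    parts = []
    for bits in cutoffs:
        if bits % 4:
            raise ValueError("cutoff must be a multiple of 4 bits")
        n = bits // 4
        parts.append(remaining[:n])
        remaining = remaining[n:]
    parts.append(digest_hex)
    return "/".join(parts)


def digest_from_path(path):
    """Recover the content digest from a path; the distfile name is not recoverable."""
    return path.rsplit("/", 1)[-1]
```

Because the distfile name never enters the computation, walking the tree can only yield digests, which is why consumers of get_filenames need a separate mapping back to distfile names.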
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=6b98e103aa15da331d647a8f65a45bb3bb4e3197

commit 6b98e103aa15da331d647a8f65a45bb3bb4e3197
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2021-02-22 13:46:04 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2021-02-22 13:54:29 +0000

    sys-apps/portage: Bump to version 3.0.15

     #715112 default enable FEATURES=binpkg-multi-instance
     #756778 content-hash distfiles layout
     #766459 emirrordist: prevent distfiles_db _pkg_str pickle problems
     #766767 emaint --fix merges: add -y, --yes option
     #766773 emerge: disable --autounmask-license by default
     #767913 portage.getpid: call os.getpid() lazily
     #770712 PopenProcess: use call_soon for _async_waitpid in _start
     #771549 prevent USE="${USE} ..." misbehavior

    Bug: https://bugs.gentoo.org/766117
    Bug: https://bugs.gentoo.org/715112
    Bug: https://bugs.gentoo.org/756778
    Bug: https://bugs.gentoo.org/766459
    Bug: https://bugs.gentoo.org/766767
    Bug: https://bugs.gentoo.org/766773
    Bug: https://bugs.gentoo.org/767913
    Bug: https://bugs.gentoo.org/770712
    Bug: https://bugs.gentoo.org/771549
    Package-Manager: Portage-3.0.15, Repoman-3.0.2
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 sys-apps/portage/Manifest | 1 +
 sys-apps/portage/portage-3.0.15.ebuild | 268 +++++++++++++++++++++++++++++++++
 2 files changed, 269 insertions(+)
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=4845fbacbe1021be6de8a4ea5f8e21be9c0ac6e0

commit 4845fbacbe1021be6de8a4ea5f8e21be9c0ac6e0
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2021-02-24 21:01:49 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2021-02-24 21:05:36 +0000

    FetchIterator: pass DistfileName type as FetchTask filename

    Bug: https://bugs.gentoo.org/756778
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 lib/portage/_emirrordist/FetchIterator.py | 3 ++-
 lib/portage/tests/ebuild/test_fetch.py | 18 +++++++++---------
 2 files changed, 11 insertions(+), 10 deletions(-)
This patch adds content-hash support to emirrordist:

https://archives.gentoo.org/gentoo-portage-dev/message/1f5b9202d5ff284902c15d3cc2700b1e
https://github.com/gentoo/portage/pull/676
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=fd04c5fb1619f86381b5d5e6ff66b20fa3967c43

commit fd04c5fb1619f86381b5d5e6ff66b20fa3967c43
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2021-02-24 19:56:38 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2021-02-27 07:43:23 +0000

    emirrordist: add --content-db option required for content-hash layout (bug 756778)

    Add a --content-db option which is required for the content-hash
    layout because its file listings return content digests instead of
    distfile names.

    The content db serves to translate content digests to distfile
    names, and distfile names to content digests. All keys have one or
    more prefixes separated by colons. For a digest key, the first
    prefix is "digest" and the second prefix is the hash algorithm name.
    For a filename key, the prefix is "filename".

    The value associated with a digest key is a set of file names. The
    value associated with a filename key is a set of content revisions.
    Each content revision is expressed as a dictionary of digests which
    is suitable for construction of a DistfileName instance.

    A given content digest will translate to multiple distfile names if
    multiple associations have been created via the content db add
    method. The relationship between a content digest and a distfile
    name is similar to the relationship between an inode and a hardlink.

    Bug: https://bugs.gentoo.org/756778
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 lib/portage/_emirrordist/Config.py | 6 +
 lib/portage/_emirrordist/ContentDB.py | 196 +++++++++++++++++++++++++++
 lib/portage/_emirrordist/DeletionIterator.py | 25 +++-
 lib/portage/_emirrordist/DeletionTask.py | 8 ++
 lib/portage/_emirrordist/FetchTask.py | 5 +-
 lib/portage/_emirrordist/main.py | 15 +-
 lib/portage/package/ebuild/fetch.py | 8 +-
 lib/portage/tests/ebuild/test_fetch.py | 148 ++++++++++++++++++++
 man/emirrordist.1 | 6 +-
 9 files changed, 407 insertions(+), 10 deletions(-)
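The key scheme described in this commit message can be illustrated with a small in-memory model. This is only a sketch and not the ContentDB.py implementation: the class name, the method names, the plain dict backing store, and the exact key spelling (for example the casing of the algorithm name) are assumptions. A list of dictionaries stands in for the "set of content revisions", since Python dictionaries are not hashable.

```python
class ContentIndex:
    """Hypothetical in-memory stand-in for the content db described above."""

    def __init__(self):
        # Maps "digest:<algo>:<hex>" -> set of distfile names, and
        # "filename:<name>" -> list of content revisions (digest dicts).
        self._store = {}

    def add(self, filename, digests):
        """Associate a distfile name with one content revision."""
        revisions = self._store.setdefault("filename:" + filename, [])
        if digests not in revisions:
            revisions.append(dict(digests))
        for algo, value in digests.items():
            key = "digest:{}:{}".format(algo, value)
            self._store.setdefault(key, set()).add(filename)

    def filenames_for(self, algo, value):
        """Translate a content digest to every distfile name linked to it."""
        return set(self._store.get("digest:{}:{}".format(algo, value), ()))

    def revisions_for(self, filename):
        """Translate a distfile name to its known content revisions."""
        return [dict(r) for r in self._store.get("filename:" + filename, ())]
```

As with hardlinks and an inode, adding the same digest under two distfile names makes both names resolve to one stored object, and one name can accumulate several revisions over time.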