Bug 756778 - sys-apps/portage: add content-hash distfiles layout
Summary: sys-apps/portage: add content-hash distfiles layout
Status: RESOLVED FIXED
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core
Hardware: All All
Importance: Normal enhancement
Assignee: Portage team
Keywords: InVCS
Depends on: 766117
Blocks: 377365 534528
Reported: 2020-11-26 04:32 UTC by Zac Medico
Modified: 2021-03-31 20:52 UTC

Description Zac Medico gentoo-dev 2020-11-26 04:32:17 UTC
The content-hash distfiles layout will be similar to filename-hash, except that paths will be derived entirely from a file content hash.
Comment 1 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2020-11-26 08:01:48 UTC
What's the advantage?
Comment 2 Zac Medico gentoo-dev 2020-11-26 20:21:13 UTC
The content-hash layout has a few advantages over the filename-hash layout:

1) Since the file path is independent of the file name, file name collisions cannot occur. This makes the content-hash layout suitable for storing multiple types of files (not only Gentoo distfiles). For example, it can be used to store distfiles for multiple Linux distros within the same tree, with automatic deduplication based on content digest. This layout can be used to store and distribute practically anything (including binary packages, for example).

2) It allows multiple revisions of the same distfile name. An existing distfile can be updated, and a user who still has an older copy of an ebuild repository (or an overlay) can successfully fetch the desired revision of the distfile as long as it has not been purged from the mirror.

3) File integrity data is integrated into the layout itself, making it very simple to verify the integrity of any file that it contains. The only tool required is an implementation of the chosen hash algorithm.
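
To illustrate point 3, here is a minimal sketch of verifying a file using nothing but its own path. It assumes the final path component is the lowercase hex digest of the file content and that the layout uses SHA512; the function name verify_by_path and the default algorithm are illustrative assumptions, not part of Portage.

```python
import hashlib
import os


def verify_by_path(path, algo="sha512", chunk_size=65536):
    """Return True if the file's content digest equals its final path component.

    Hypothetical helper: assumes the last path element is the lowercase
    hex digest of the file content, computed with `algo`.
    """
    hasher = hashlib.new(algo)
    with open(path, "rb") as f:
        # Read in chunks so large distfiles do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == os.path.basename(path)
```

No Manifest or other side-channel metadata is consulted here, which is the property the layout is meant to provide.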
Comment 4 Larry the Git Cow gentoo-dev 2021-02-22 12:18:40 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=a4f06ab3cf7339100b2af2146ae90cbba8bac371

commit a4f06ab3cf7339100b2af2146ae90cbba8bac371
Author:     Daniel Robbins <drobbins@funtoo.org>
AuthorDate: 2021-02-20 23:11:46 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2021-02-22 11:48:41 +0000

    Add content-hash distfiles layout (bug 756778)
    
    The content-hash layout is identical to the filename-hash layout,
    except for these three differences:
    
    1) A content digest is used instead of a filename digest.
    
    2) The final element of the path returned from the get_path method
    corresponds to the complete content digest. The path is a function
    of the content digest alone.
    
    3) Because the path is a function of content digest alone, the
    get_filenames implementation cannot derive distfiles names from
    paths, so it instead yields DistfileName instances whose names are
    equal to content digest values. The DistfileName documentation
    discusses resulting implications.
    
    Motivations to use the content-hash layout instead of the
    filename-hash layout may include:
    
    1) Since the file path is independent of the file name, file
    name collisions cannot occur. This makes the content-hash
    layout suitable for storage of multiple types of files (not
    only gentoo distfiles). For example, it can be used to store
    distfiles for multiple linux distros within the same tree,
    with automatic deduplication based on content digest. This
    layout can be used to store and distribute practically anything
    (including binary packages for example).
    
    2) Allows multiple revisions for the same distfiles name. An
    existing distfile can be updated, and if a user still has an
    older copy of an ebuild repository (or an overlay), then a user
    can successfully fetch a desired revision of the distfile as
    long as it has not been purged from the mirror.
    
    3) File integrity data is integrated into the layout itself,
    making it very simple to verify the integrity of any file that
    it contains. The only tool required is an implementation of
    the chosen hash algorithm.
    
    Bug: https://bugs.gentoo.org/756778
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 lib/portage/package/ebuild/fetch.py    | 97 ++++++++++++++++++++++++++++++++++
 lib/portage/tests/ebuild/test_fetch.py | 36 +++++++++++++
 2 files changed, 133 insertions(+)
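
For readers unfamiliar with the layout, the path derivation described in the commit message above can be sketched roughly as follows. This is not the code from fetch.py: the function name content_hash_path is hypothetical, and the cutoff handling (cutoffs given in bits, each a multiple of 4, consumed from the front of the hex digest) is an assumption modeled on the filename-hash layout.

```python
def content_hash_path(digest_hex, cutoffs=(8, 8, 8)):
    """Build a relative path from a content digest alone.

    Hypothetical sketch: each cutoff is a number of bits taken from the
    front of the hex digest to form one directory level, and the final
    path element is the complete digest.
    """
    parts = []
    remaining = digest_hex
    for bits in cutoffs:
        if bits % 4 != 0:
            raise ValueError("cutoff must be a multiple of 4 bits")
        nchars = bits // 4  # one hex character encodes 4 bits
        parts.append(remaining[:nchars])
        remaining = remaining[nchars:]
    # The last element is the full digest, not the unconsumed remainder.
    parts.append(digest_hex)
    return "/".join(parts)
```

Under these assumptions, a digest beginning with abcdef and cutoffs of 8:8:8 maps to ab/cd/ef/ followed by the complete digest. The distfile name never enters the computation.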

https://gitweb.gentoo.org/proj/portage.git/commit/?id=b9ef191c74982b0e8d837aa7dd256dc3c52f7d2c

commit b9ef191c74982b0e8d837aa7dd256dc3c52f7d2c
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2021-02-20 23:11:46 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2021-02-22 11:48:41 +0000

    MirrorLayoutConfig: content digest support (bug 756778)
    
    In order to support mirror layouts that use content
    digests, extend MirrorLayoutConfig validate_structure and
    get_best_supported_layout methods to support an optional
    filename parameter of type DistfileName which includes a digests
    attribute. Use the new parameter to account for availability
    of specific distfile content digests when validating and selecting
    mirror layouts which require those digests.
    
    The DistfileName type represents a distfile name and associated
    content digests, used by MirrorLayoutConfig and related layout
    implementations.
    
    The path of a distfile within a layout must be dependent on
    nothing more than the distfile name and its associated content
    digests. For filename-hash layout, path is dependent on distfile
    name alone, and the get_filenames implementation yields strings
    corresponding to distfile names. For content-hash layout, path is
    dependent on content digest alone, and the get_filenames
    implementation yields DistfileName instances whose names are equal
    to content digest values. The content-hash layout simply lacks
    the filename-hash layout's innate ability to translate a distfile
    path to a distfile name, and instead carries an innate ability
    to translate a distfile path to a content digest.
    
    In order to prepare for a migration from filename-hash to
    content-hash layout, all consumers of the layout get_filenames
    method need to be updated to work with content digests as a
    substitute for distfile names. For example, in order to prepare
    emirrordist for content-hash, a key-value store needs to be
    added as a means to associate distfile names with content
    digest values yielded by the content-hash get_filenames
    implementation.
    
    Bug: https://bugs.gentoo.org/756778
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 lib/portage/package/ebuild/fetch.py    | 98 ++++++++++++++++++++++++++++++----
 lib/portage/tests/ebuild/test_fetch.py | 33 +++++++++---
 2 files changed, 114 insertions(+), 17 deletions(-)
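
As a rough illustration of the idea in this commit message, the sketch below models a distfile name that carries content digests, and a layout selection that skips layouts whose required digest is unavailable. The names DistfileNameSketch and pick_layout, and the (layout name, required algorithm) tuples, are hypothetical simplifications; the real DistfileName and MirrorLayoutConfig in fetch.py may differ in detail.

```python
class DistfileNameSketch(str):
    """A string that also carries content digests (hypothetical sketch)."""

    def __new__(cls, name, digests=None):
        # str is immutable, so the text value is set in __new__.
        return super().__new__(cls, name)

    def __init__(self, name, digests=None):
        # Mapping of hash algorithm name to hex digest; empty if unknown.
        self.digests = dict(digests or {})


def pick_layout(layouts, filename):
    """Return the first usable layout name, or None.

    `layouts` is a preference-ordered list of (layout_name, required_algo)
    tuples, where required_algo is None for layouts that only need the
    distfile name (an assumed, simplified representation).
    """
    digests = getattr(filename, "digests", {})
    for layout_name, required_algo in layouts:
        if required_algo is None or required_algo in digests:
            return layout_name
    return None
```

A plain string has no digests attribute, so in this sketch it can only ever select a layout that depends on the name alone, which mirrors why the filename parameter was made optional.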
Comment 5 Larry the Git Cow gentoo-dev 2021-02-22 13:54:39 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=6b98e103aa15da331d647a8f65a45bb3bb4e3197

commit 6b98e103aa15da331d647a8f65a45bb3bb4e3197
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2021-02-22 13:46:04 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2021-02-22 13:54:29 +0000

    sys-apps/portage: Bump to version 3.0.15
    
     #715112 default enable FEATURES=binpkg-multi-instance
     #756778 content-hash distfiles layout
     #766459 emirrordist: prevent distfiles_db _pkg_str pickle problems
     #766767 emaint --fix merges: add -y, --yes option
     #766773 emerge: disable --autounmask-license by default
     #767913 portage.getpid: call os.getpid() lazily
     #770712 PopenProcess: use call_soon for _async_waitpid in _start
     #771549 prevent USE="${USE} ..." misbehavior
    
    Bug: https://bugs.gentoo.org/766117
    Bug: https://bugs.gentoo.org/715112
    Bug: https://bugs.gentoo.org/756778
    Bug: https://bugs.gentoo.org/766459
    Bug: https://bugs.gentoo.org/766767
    Bug: https://bugs.gentoo.org/766773
    Bug: https://bugs.gentoo.org/767913
    Bug: https://bugs.gentoo.org/770712
    Bug: https://bugs.gentoo.org/771549
    Package-Manager: Portage-3.0.15, Repoman-3.0.2
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 sys-apps/portage/Manifest              |   1 +
 sys-apps/portage/portage-3.0.15.ebuild | 268 +++++++++++++++++++++++++++++++++
 2 files changed, 269 insertions(+)
Comment 6 Larry the Git Cow gentoo-dev 2021-02-24 21:07:11 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=4845fbacbe1021be6de8a4ea5f8e21be9c0ac6e0

commit 4845fbacbe1021be6de8a4ea5f8e21be9c0ac6e0
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2021-02-24 21:01:49 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2021-02-24 21:05:36 +0000

    FetchIterator: pass DistfileName type as FetchTask filename
    
    Bug: https://bugs.gentoo.org/756778
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 lib/portage/_emirrordist/FetchIterator.py |  3 ++-
 lib/portage/tests/ebuild/test_fetch.py    | 18 +++++++++---------
 2 files changed, 11 insertions(+), 10 deletions(-)
Comment 7 Zac Medico gentoo-dev 2021-02-25 01:31:21 UTC
This patch adds content-hash support to emirrordist:

https://archives.gentoo.org/gentoo-portage-dev/message/1f5b9202d5ff284902c15d3cc2700b1e
https://github.com/gentoo/portage/pull/676
Comment 8 Larry the Git Cow gentoo-dev 2021-02-27 07:52:49 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=fd04c5fb1619f86381b5d5e6ff66b20fa3967c43

commit fd04c5fb1619f86381b5d5e6ff66b20fa3967c43
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2021-02-24 19:56:38 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2021-02-27 07:43:23 +0000

    emirrordist: add --content-db option required for content-hash layout (bug 756778)
    
    Add a --content-db option which is required for the content-hash
    layout because its file listings return content digests instead of
    distfile names.
    
    The content db serves to translate content digests to distfiles
    names, and distfiles names to content digests. All keys have one or
    more prefixes separated by colons. For a digest key, the first
    prefix is "digest" and the second prefix is the hash algorithm name.
    For a filename key, the prefix is "filename".
    
    The value associated with a digest key is a set of file names. The
    value associated with a distfile key is a set of content revisions.
    Each content revision is expressed as a dictionary of digests which
    is suitable for construction of a DistfileName instance.
    
    A given content digest will translate to multiple distfile names if
    multiple associations have been created via the content db add
    method. The relationship between a content digest and a distfile
    name is similar to the relationship between an inode and a hardlink.
    
    Bug: https://bugs.gentoo.org/756778
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 lib/portage/_emirrordist/Config.py           |   6 +
 lib/portage/_emirrordist/ContentDB.py        | 196 +++++++++++++++++++++++++++
 lib/portage/_emirrordist/DeletionIterator.py |  25 +++-
 lib/portage/_emirrordist/DeletionTask.py     |   8 ++
 lib/portage/_emirrordist/FetchTask.py        |   5 +-
 lib/portage/_emirrordist/main.py             |  15 +-
 lib/portage/package/ebuild/fetch.py          |   8 +-
 lib/portage/tests/ebuild/test_fetch.py       | 148 ++++++++++++++++++++
 man/emirrordist.1                            |   6 +-
 9 files changed, 407 insertions(+), 10 deletions(-)
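
The key scheme described in this commit message can be pictured with a small in-memory sketch. It is not the ContentDB.py implementation: the class name ContentDBSketch, its method names other than add, and the use of a plain dict as the backing store are assumptions made for illustration, and no normalization of algorithm names or digest case is attempted.

```python
class ContentDBSketch:
    """In-memory model of a two-way digest/filename mapping (hypothetical)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _digest_key(algo, value):
        # Digest keys: "digest" prefix, then the hash algorithm name.
        return "digest:{}:{}".format(algo, value)

    @staticmethod
    def _filename_key(name):
        # Filename keys: single "filename" prefix.
        return "filename:{}".format(name)

    def add(self, name, digests):
        """Associate a distfile name with one content revision."""
        revisions = self._store.setdefault(self._filename_key(name), [])
        if digests not in revisions:
            revisions.append(dict(digests))
        for algo, value in digests.items():
            names = self._store.setdefault(self._digest_key(algo, value), set())
            names.add(name)

    def names_for(self, algo, value):
        """Return the set of distfile names known for a content digest."""
        return set(self._store.get(self._digest_key(algo, value), ()))

    def revisions_for(self, name):
        """Return the content revisions (digest dicts) known for a name."""
        return [dict(r) for r in self._store.get(self._filename_key(name), [])]
```

Adding two names with the same digests makes both resolve from one digest key, which is the inode-and-hardlink relationship the commit message describes; adding one name with different digests records a second content revision.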