761688 – In /etc/portage/repos.conf.d/ support a reference-repository option that accepts a list of filesystem paths

Status:	UNCONFIRMED

Alias:	None

Product:	Portage Development
Classification:	Unclassified
Component:	Enhancement/Feature Requests
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Portage team

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	240187

Reported:	2020-12-25 20:40 UTC by Michael Jones
Modified:	2021-02-24 22:22 UTC
CC List:	3 users

See Also:
Package list:
Runtime testing required:	---

Description Michael Jones 2020-12-25 20:40:23 UTC

Git natively supports the concept of a reference repository, that has a collection of git object chunks which can be used to seamlessly avoid internet access when the needed chunk is available in the reference repository.

This is useful not only for initial cloning, but also for syncing, assuming the reference repository is synced periodically.
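
As a rough illustration of the git side of this request, the sketch below builds a `git clone` command line that borrows objects from one or more local reference repositories. The `--reference` and `--reference-if-able` flags are real git options; the function name, the URL, and the paths are placeholders invented for this example, and this is not how Portage itself constructs the command.

```python
from typing import List, Sequence


def build_clone_command(
    url: str,
    dest: str,
    references: Sequence[str],
    required: bool = False,
) -> List[str]:
    """Return an argv for `git clone` that uses local reference repositories.

    With required=False, `--reference-if-able` is used, so git only warns
    and carries on if a reference path is missing or unusable.
    With required=True, `--reference` is used and such a path is an error.
    """
    flag = "--reference" if required else "--reference-if-able"
    command = ["git", "clone"]
    for path in references:
        command += [flag, path]
    command += [url, dest]
    return command


# Hypothetical values: neither the URL nor the paths come from the report.
example = build_clone_command(
    "https://example.invalid/gentoo.git",
    "/var/db/repos/gentoo",
    ["/srv/reference/gentoo.git"],
)
```

The list can be handed to `subprocess.run(example, check=True)` on a machine where the placeholder values have been replaced with real ones.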

Use case:

A build system that creates dd-able images by bootstrapping a Gentoo chroot and running emerge inside that chroot would clone the Gentoo Portage repository according to the settings in the chroot's repos.conf.d/ directory.

With support for reference repositories, the initial repository clone would potentially involve no internet downloads for object chunks, only internet access to determine the latest git commit and metadata.

For a build machine that is creating a new build several times per day, this can represent several hundred megabytes of data usage saved.

For rsync repositories, this flag translates to either --copy-dest or --link-dest (depending on specific implementation, or possibly other flags), which provides similar bandwidth savings.
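
For the rsync case, a minimal sketch of how a list of reference paths might map onto rsync flags is shown below. `--copy-dest` and `--link-dest` are real rsync options; the function name, the `hard_link` switch, and the reduced option set are assumptions made for illustration, and Portage's real rsync invocation carries many more options than this.

```python
from typing import List, Sequence


def build_rsync_command(
    source: str,
    dest: str,
    references: Sequence[str],
    hard_link: bool = False,
) -> List[str]:
    """Return an rsync argv that reuses unchanged files from reference trees.

    --copy-dest copies matching files from the reference directory locally;
    --link-dest hard-links them instead, so it needs the same filesystem.
    Reference paths should be absolute: rsync resolves relative ones
    against the destination directory.
    """
    flag = "--link-dest" if hard_link else "--copy-dest"
    command = ["rsync", "--archive", "--delete"]
    command += [f"{flag}={path}" for path in references]
    # A trailing slash on the source syncs its contents, not the directory.
    command += [source.rstrip("/") + "/", dest]
    return command
```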

Reproducible: Always

Comment 1 Michael Jones 2020-12-25 20:49:37 UTC

This functionality would be enhanced by the addition of a pre-sync hook that can optionally synchronize the reference repositories before the actual repository is initially cloned or synced.

For the build machine example, this is useful because it keeps the reference repository up to date, so the initial repository clone in the chroot environment can skip downloading any chunks from the internet.
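
A pre-sync hook of this kind could look roughly like the sketch below. It is only an illustration: the function names and the injectable `run` callable are hypothetical and not part of Portage, and it assumes the reference repositories are git repositories with a configured remote.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[List[str]], int]


def default_runner(argv: List[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.call(argv)


def refresh_references(
    references: Sequence[str],
    run: Runner = default_runner,
) -> List[str]:
    """Fetch into each reference repository before the real sync starts.

    Returns the paths whose refresh failed. A failure is not fatal here:
    a stale reference only means more objects come over the network.
    """
    failed = []
    for path in references:
        status = run(["git", "-C", path, "fetch", "--quiet"])
        if status != 0:
            failed.append(path)
    return failed
```

Passing the runner as an argument keeps the hook testable without touching the network, and returning failures instead of raising lets the caller decide whether a stale reference should block the sync.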

Comment 2 Kent Fredric (IRC: kent\n) (RETIRED) gentoo-dev 2020-12-26 09:15:01 UTC

When I used this functionality directly, I ran into problems intermixing sync/gentoo.git and gentoo.git.

Even though they share 50% history, "something" goes wrong on a nearly daily basis. I can only assume that some problem in the "common ancestor" logic gets confused during fetch, and this degrades into a "no common ancestor" situation that attempts to fetch all objects from scratch, which turns out to be more problematic than simply keeping the repositories distinct.

Once I abandoned this approach, my regular problems with git bailing due to fetching >1G of data and getting the data stream corrupted in the process (due to it taking so long), simply went away.

It's still a very useful tool to have in your arsenal, with the caveat that it doesn't work unless all clones are logically parents/children of each other.

I've used it quite successfully in my rust testing toolchain, where it eliminated huge swathes of network *and* disk IO, while also making the "cloned" repository free from interlock problems when multiple workers are concurrently trying to "update" their own clone. (Multiple calls to git fetch on the same repo are very problematic and fail repeatedly; disk-bound IO races aren't fun ever.)

I think the "discipline" required is that when A is ref-cloned to B, A is never updated while B exists.

This gives you CoW semantics, allowing updates to B without requiring them to be synced back to A, and then B can be disposed of, and A updated on its own.

(You can still update A when B exists, but doing so creates a time risk if the repos diverge too far.)
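
One way to check that discipline before touching A is to ask whether B still borrows from it. Git records borrowed object stores in the clone's `objects/info/alternates` file, one object directory per line. The sketch below reads that file; the helper names are made up for this example, and it assumes ordinary repositories where `.git` is a directory (or bare repositories), not worktrees or submodules where `.git` is a file.

```python
import os
from typing import List


def _objects_dir(repo_path: str) -> str:
    """Return the objects directory of a normal or bare repository."""
    dot_git = os.path.join(repo_path, ".git")
    git_dir = dot_git if os.path.isdir(dot_git) else repo_path
    return os.path.join(git_dir, "objects")


def list_alternates(repo_path: str) -> List[str]:
    """Return the resolved object directories that repo_path borrows from."""
    objects = _objects_dir(repo_path)
    alternates_file = os.path.join(objects, "info", "alternates")
    try:
        with open(alternates_file, encoding="utf-8") as handle:
            lines = [line.strip() for line in handle]
    except FileNotFoundError:
        return []  # no alternates file: the repository borrows nothing
    entries = [line for line in lines if line and not line.startswith("#")]
    # Relative entries are interpreted relative to the objects directory.
    return [os.path.realpath(os.path.join(objects, entry)) for entry in entries]


def borrows_from(clone_path: str, reference_path: str) -> bool:
    """True if clone_path lists reference_path's object store as an alternate."""
    target = os.path.realpath(_objects_dir(reference_path))
    return target in list_alternates(clone_path)
```

A wrapper around the update of A could call `borrows_from(B, A)` for each known clone and postpone the update while any of them returns true.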

Comment 3 Zac Medico gentoo-dev 2020-12-26 23:15:40 UTC

(In reply to Michael Jones from comment #0)
> Git natively supports the concept of a reference repository, that has a
> collection of git object chunks which can be used to seamlessly avoid
> internet access when the needed chunk is available in the reference
> repository.

Can you use the sync-git-clone-extra-opts option for this?

> This is useful not only for initial cloning, but also for syncing, assuming
> the reference repository is synced periodically.

Since the reference repository is sticky, a change to sync-git-clone-extra-opts should be enough, so there would be no need to modify sync-git-pull-extra-opts.
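
To make that suggestion concrete, the sketch below renders a repos.conf section that passes reference paths through sync-git-clone-extra-opts. The option names `location`, `sync-type`, `sync-uri`, and `sync-git-clone-extra-opts` are existing Portage settings; the function name, the URL, and the paths are placeholders, and the shell quoting assumes the option value is split into words shell-style before it reaches `git clone`.

```python
import configparser
import io
import shlex
from typing import Sequence


def render_repo_section(
    name: str,
    location: str,
    sync_uri: str,
    references: Sequence[str],
) -> str:
    """Render one repos.conf section that clones with reference repositories."""
    extra_opts = " ".join(
        "--reference-if-able " + shlex.quote(path) for path in references
    )
    # interpolation=None so a literal "%" in a path is written unchanged.
    parser = configparser.ConfigParser(interpolation=None)
    parser[name] = {
        "location": location,
        "sync-type": "git",
        "sync-uri": sync_uri,
        "sync-git-clone-extra-opts": extra_opts,
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Hypothetical values for illustration only.
section_text = render_repo_section(
    "gentoo",
    "/var/db/repos/gentoo",
    "https://example.invalid/gentoo.git",
    ["/srv/reference/gentoo.git"],
)
```

The resulting text could be written to a file under /etc/portage/repos.conf.d/ inside the chroot. Only the clone option is set because, once cloned, the repository keeps pointing at its reference on later pulls.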