Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 552814 - sys-apps/portage: support shallow git pull by setting sync-depth = 1 in repos.conf
Summary: sys-apps/portage: support shallow git pull by setting sync-depth = 1 in repos...
Status: RESOLVED FIXED
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core - External Interaction (show other bugs)
Hardware: All Linux
: Normal enhancement
Assignee: Portage team
URL:
Whiteboard:
Keywords: InVCS
: 568890 (view as bug list)
Depends on: 568934 599008
Blocks: 659322
  Show dependency tree
 
Reported: 2015-06-22 09:11 UTC by Massimo Burcheri
Modified: 2023-12-24 17:53 UTC (History)
12 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments
refresh index before git reset (0001-GitSync.update-refresh-index-before-git-reset-bug-55.patch,1.17 KB, patch)
2016-07-17 04:58 UTC, Zac Medico
Details | Diff
refresh index before git reset (0001-GitSync.update-refresh-index-before-git-reset-bug-55.patch,1.18 KB, patch)
2016-07-17 05:11 UTC, Zac Medico
Details | Diff
always pass -q to git-update-index (0001-git-always-pass-q-to-git-update-index.patch,945 bytes, patch)
2016-11-02 17:13 UTC, Mike Gilbert
Details | Diff
always pass -q to git-update-index (0001-git-always-pass-q-to-git-update-index.patch,1.06 KB, patch)
2016-11-02 17:18 UTC, Mike Gilbert
Details | Diff
always pass -q to git-update-index (0001-sync-always-pass-q-to-git-update-index.patch,1.21 KB, patch)
2016-11-02 17:28 UTC, Mike Gilbert
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Massimo Burcheri 2015-06-22 09:11:21 UTC
Similar to bug 500358 for layman....

Initially the sync-type git is creating a shallow clone with clone --depth=1, however the following sync to not.
--depth=1 for both would be appropriate for the tree.

Reproducible: Always
Comment 2 Massimo Burcheri 2015-06-23 06:28:36 UTC
I'm on sys-apps/portage-2.2.20:
Doing the first clone I have 1 history item. Later after emerge --sync I have several other history items. Doing a git pull --depth=1 again on $PORTDIR I have again 1 single history item. So I wonder if emerge --sync does use --depth=1.

The output says:
# emerge --sync
>>> Syncing repository 'gentoo' into '/usr/portage'...
/usr/bin/git pull
Already up-to-date.
Comment 3 Zac Medico gentoo-dev 2015-06-23 22:18:34 UTC
It seems that it only passes the git --depth option for clone operations. It needs to be fixed to do it for pull operations too.
Comment 4 Sebastian Pipping gentoo-dev 2016-01-03 16:38:32 UTC
Please note the problems with "git pull --depth 1" mentioned at
https://bugs.gentoo.org/show_bug.cgi?id=568890#c1

I am happy if we close either bug as a duplicate.
Comment 5 Brian Dolbec (RETIRED) gentoo-dev 2016-01-03 16:58:58 UTC
*** Bug 568890 has been marked as a duplicate of this bug. ***
Comment 6 Massimo Burcheri 2016-01-22 07:22:24 UTC
As bug 568890 has been marked as a duplicate, moving discussion here...

https://bugs.gentoo.org/show_bug.cgi?id=568890#c1
Zac, what you noticed on that comment, I almost every time encounter on layman:
https://bugs.gentoo.org/show_bug.cgi?id=500358#c3

As of today both portage and layman are syncing by git, so what is the way to do the sync/pull, to get both finally fixed?
Syncing portage I have not seen this issue yet.
Comment 7 Devan Franchini (RETIRED) gentoo-dev 2016-07-10 23:04:47 UTC

*** This bug has been marked as a duplicate of bug 500358 ***
Comment 8 Devan Franchini (RETIRED) gentoo-dev 2016-07-10 23:05:34 UTC
Reverting back on that, my apologies.
Comment 9 Zac Medico gentoo-dev 2016-07-10 23:17:13 UTC
(In reply to Massimo Burcheri from comment #6)
> As of today both portage and layman are syncing by git, so what is the way
> to do the sync/pull, to get both finally fixed?

I think it's reasonable to have them do `git fetch --depth=1` followed by `git reset --hard origin/master`, as discussed in bug 568890.

> Syncing portage I have not seen this issue yet.

That's because portage doesn't try to do a pull with --depth=1, because the default merge strategy will inevitably fail. We can make it do the `git fetch --depth=1 && git reset --hard origin/master` thing as soon as we merge the patch from bug 568934.
Comment 10 Zac Medico gentoo-dev 2016-07-13 11:36:48 UTC
The patch for bug 568934 is in the master branch now.
Comment 12 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-07-14 12:46:48 UTC
Are you sure that this works as expected? Back when I fiddled with those things, 'git fetch --depth=1' meant *refetching all the objects* as if you were doing 'git clone --depth=1'. Including those objects you already have. In other words, much, much more than regular 'git fetch' that only fetched new objects.
Comment 13 Zac Medico gentoo-dev 2016-07-14 15:19:54 UTC
Just now, I've tested fetch with and without --depth=1, on a repository with full history. I used a btrfs snapshot to ensure that the initial state was identical for both fetches:

Test #1:

$ git fetch origin --depth=1
remote: Counting objects: 492, done.
remote: Compressing objects: 100% (110/110), done.
remote: Total 492 (delta 436), reused 423 (delta 381)
Receiving objects: 100% (492/492), 158.36 KiB | 0 bytes/s, done.
Resolving deltas: 100% (436/436), completed with 426 local objects.
From git+ssh://git.gentoo.org/repo/gentoo
 + 415df80...0004f2c master     -> origin/master  (forced update)

Test #2 (initial state identical to test #1):

$ git fetch origin
remote: Counting objects: 595, done.
remote: Compressing objects: 100% (206/206), done.
remote: Total 595 (delta 505), reused 434 (delta 388)
Receiving objects: 100% (595/595), 193.06 KiB | 0 bytes/s, done.
Resolving deltas: 100% (505/505), completed with 427 local objects.
From git+ssh://git.gentoo.org/repo/gentoo
   415df80..0004f2c  master     -> origin/master

So, apparently --depth=1 reduced the total number of objects, but objects were still reused.

$ git --version
git version 2.7.4
Comment 14 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-07-14 15:24:59 UTC
Could you also test https:// and git:// while you're at it? It's quite possible they fixed it in a newer release. It's been a while since I've been working on git-2.
Comment 15 Zac Medico gentoo-dev 2016-07-14 15:46:50 UTC
Same results with https:// and git:// protocols from anongit.gentoo.org:

$ git fetch origin --depth=1
remote: Counting objects: 497, done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 497 (delta 438), reused 423 (delta 381)
Receiving objects: 100% (497/497), 162.98 KiB | 0 bytes/s, done.
Resolving deltas: 100% (438/438), completed with 427 local objects.
From https://anongit.gentoo.org/git/repo/gentoo
 + 415df80...40e4e30 master     -> origin/master  (forced update)

$ git fetch origin 
remote: Counting objects: 611, done.
remote: Compressing objects: 100% (222/222), done.
remote: Total 611 (delta 515), reused 434 (delta 388)
Receiving objects: 100% (611/611), 200.12 KiB | 0 bytes/s, done.
Resolving deltas: 100% (515/515), completed with 428 local objects.
From https://anongit.gentoo.org/git/repo/gentoo
   415df80..40e4e30  master     -> origin/master

$ git fetch origin --depth=1
remote: Counting objects: 497, done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 497 (delta 438), reused 423 (delta 381)
Receiving objects: 100% (497/497), 162.98 KiB | 165.00 KiB/s, done.
Resolving deltas: 100% (438/438), completed with 427 local objects.
From git://anongit.gentoo.org/repo/gentoo
 + 415df80...40e4e30 master     -> origin/master  (forced update)

$ git fetch origin
remote: Counting objects: 611, done.
remote: Compressing objects: 100% (222/222), done.
remote: Total 611 (delta 515), reused 434 (delta 388)
Receiving objects: 100% (611/611), 200.12 KiB | 190.00 KiB/s, done.
Resolving deltas: 100% (515/515), completed with 428 local objects.
From git://anongit.gentoo.org/repo/gentoo
   415df80..40e4e30  master     -> origin/master

$ git --version
git version 2.7.4
Comment 17 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-07-16 11:54:59 UTC
Please revert this. You've caused a major performance regression and making a mess of the underlying file system. Normal 'git pull' used to check out only differening files. Now you're forcing re-checkout of every single file on every sync.
Comment 18 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-07-16 12:00:23 UTC
Not to mention it bumps mtimes of all files and causes all updates to be reapplied after every sync.
Comment 19 Zac Medico gentoo-dev 2016-07-16 12:25:16 UTC
(In reply to Michał Górny from comment #17)
> Please revert this. 

Maybe we should, but let's document the reasons well.

> You've caused a major performance regression and making
> a mess of the underlying file system.

What version of git are you using? Not sure what you mean by "mess" here. What type of file system are you using? Is it re-writing all of the files? Do the inode numbers change in the process? How about timestamps?

> Normal 'git pull' used to check out
> only differening files. Now you're forcing re-checkout of every single file
> on every sync.

I have observed cases where `git reset --hard` leaves the existing inode in place if it has not changed. We should document what causes which version of git to re-checkout the file (and with what file system if that matters).

(In reply to Michał Górny from comment #18)
> Not to mention it bumps mtimes of all files and causes all updates to be
> reapplied after every sync.

Okay, I suppose we could store md5 instead of mtime in the /var/cache/edb/mtimedb.
Comment 20 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-07-16 16:17:38 UTC
(In reply to Zac Medico from comment #19)
> (In reply to Michał Górny from comment #17)
> > Please revert this. 
> 
> Maybe we should, but let's document the reasons well.
> 
> > You've caused a major performance regression and making
> > a mess of the underlying file system.
> 
> What version of git are you using? Not sure what you mean by "mess" here.

2.9.1

> What type of file system are you using? Is it re-writing all of the files?
> Do the inode numbers change in the process? How about timestamps?

fuse.unionfs. Yes, it is re-writing all the files which is causing them to re-appear on the top filesystem. Which means it's terribly slow, and even device numbers change, and I have to repack them back into squashfs.

if you really insist on doing something insane, use at least 'git checkout' and not go straight for 'git reset --hard'. Maybe that would work saner.
Comment 21 Zac Medico gentoo-dev 2016-07-16 20:34:51 UTC
(In reply to Michał Górny from comment #20)
> fuse.unionfs. Yes, it is re-writing all the files which is causing them to
> re-appear on the top filesystem. Which means it's terribly slow, and even
> device numbers change, and I have to repack them back into squashfs.

Okay, I figured that you must be using something exotic like that. We can accommodate fuse.unionfs, but we need to recognize that fuse.unionfs users are probably a minority and that they may need to deviate from the default sync settings if there's difficulty getting it to behave well.

> if you really insist on doing something insane, use at least 'git checkout'
> and not go straight for 'git reset --hard'. Maybe that would work saner.

That's worth a try. I'll look into creating a patch for that.
Comment 22 Zac Medico gentoo-dev 2016-07-17 00:06:49 UTC
(In reply to Zac Medico from comment #21)
> > if you really insist on doing something insane, use at least 'git checkout'
> > and not go straight for 'git reset --hard'. Maybe that would work saner.
> 
> That's worth a try. I'll look into creating a patch for that.

I've tested using `git reset --soft origin/master && git checkout f` with unionfs-fuse-1.0 and git-2.7.4, but all of the files are still being re-written in the write layer. We might want to try overlayfs to see if git behaves any better with it.

Anyway, for now, I think we should advise unionfs-fuse users to set sync-depth = 0 in repos.conf.
Comment 23 Zac Medico gentoo-dev 2016-07-17 02:18:39 UTC
(In reply to Zac Medico from comment #22)
> (In reply to Zac Medico from comment #21)
> > > if you really insist on doing something insane, use at least 'git checkout'
> > > and not go straight for 'git reset --hard'. Maybe that would work saner.
> > 
> > That's worth a try. I'll look into creating a patch for that.
> 
> I've tested using `git reset --soft origin/master && git checkout f` with
> unionfs-fuse-1.0 and git-2.7.4, but all of the files are still being
> re-written in the write layer.

Actually, only the changed files are being re-written, but I get the same result with `git reset --hard origin/master`. So I'm getting good results either way. This is with btrfs for the RO layer and tmpfs for the RW layer. I wonder if your problem is a mismatch timestamp resolution between your RO layer and RW layer, since squashfs only has 1s timestamp resolution.
Comment 24 Zac Medico gentoo-dev 2016-07-17 04:40:29 UTC
For git reset --hard, I think that it only re-writes files when they differ from it entry in the index. There's some good explanations about how files are compared with the index here:

https://github.com/git/git/blob/master/Documentation/technical/racy-git.txt

Note the file will be recognized as "different" if its inode number is different from that recorded in the index, and obviously something like unionfs-fuse can easily make your inode numbers differ from those in the index.

It's possible that running `git reset-index` prior to `git reset --hard` will mitigate the issues with unionfs-fuse. I'll create a patch for you to test.
Comment 25 Zac Medico gentoo-dev 2016-07-17 04:58:01 UTC
Created attachment 440904 [details, diff]
refresh index before git reset

Hopefully this solves the unionfs-fuse issues.
Comment 26 Zac Medico gentoo-dev 2016-07-17 05:03:42 UTC
(In reply to Zac Medico from comment #25)
> Created attachment 440904 [details, diff] [details, diff]
> refresh index before git reset
> 
> Hopefully this solves the unionfs-fuse issues.

Maybe we should change --refresh to --really-refresh.
Comment 27 Zac Medico gentoo-dev 2016-07-17 05:11:55 UTC
Created attachment 440906 [details, diff]
refresh index before git reset

Updated to use the --really-refresh option.
Comment 28 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-07-17 05:24:44 UTC
Yes, add more hackery to the wrong concept to hopefully make it less wrong, and advise users not to use the defaults. That's the very Portage way.

So what's the problem with just checking out, *without* any resetting? Since no resetting should be necessary if you don't touch the checkout.
Comment 29 Zac Medico gentoo-dev 2016-07-17 05:28:08 UTC
(In reply to Michał Górny from comment #28)
> Yes, add more hackery to the wrong concept to hopefully make it less wrong,
> and advise users not to use the defaults. That's the very Portage way.
> 
> So what's the problem with just checking out, *without* any resetting? Since
> no resetting should be necessary if you don't touch the checkout.

If you don't like my patch, please provide your own.
Comment 30 Zac Medico gentoo-dev 2016-07-17 06:13:13 UTC
In speaking with Michał Górny on irc, he made it clear that he would prefer that sync-depth setting only affect the initial clone operation.

If that's how he want it, then I suggest that we rename it to clone-depth.
Comment 31 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-07-17 11:09:53 UTC
No, you didn't get my point at all. I don't mind --depth=1 as long as it is done properly, i.e. without resorting to ugly hacks that break more than fix.

Because this whole bug is in essence the typical result of Gentoo hackery. 'I wanted to do X but it doesn't seem to work -- so I'll try to run some random git commands until I get a result that seems to make X work, then I'll commit them without even caring to understand what they do a few hours later without even testing them more widely'.

What you caused is a major change in behavior that goes completely unexpected to our users:

1. All local commits are discarded without warning (previously the default was to attempt a merge),

2. All local non-committed changes are discarded without warning (previously git either did not care or complained that they collide with the update),

3. There is a major performance loss. Previously, git had to only care about changed files. Now it has to recheck all files in the repository, and -- as proven -- in some cases rewrites them all.

4. This is a minor problem but the repository is now full of dangling objects and the history is so broken you can't do pretty much anything with it without unshallowing the repository (i.e. fetching the few hundred megs of remaining data). Previously, you could cleanly work with the commits since first shallow clone.

And yes, I will try to find a way to get this to work. But it will take some time since I need to undo all the damage caused by last sync.
Comment 32 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-07-17 12:08:41 UTC
Performance comparison, for two week period from 2016-07-03 20:24:02 UTC to 2016-07-17 10:43:03 UTC.

- git pull (fast-forward):

  1m6.876s ; 6087 files created in rw snapshot (counting only visible files);

- git reset --hard origin/master

  2m8.060s ; 140302 files created in rw snapshot (likewise);

- git checkout origin/master

  1m3.047s ; 6087 files (i.e. the same as with 'git pull');

- git pull --depth=1 --rebase

  causes unionfs-fuse to segv but supposedly slower since it needs to construct and apply a patch.

So far I haven't found anything faster and more correct than 'git checkout origin/master; git branch -D master; git checkout -b master'.
Comment 33 Zac Medico gentoo-dev 2016-07-17 16:49:11 UTC
(In reply to Michał Górny from comment #31)
> Because this whole bug is in essence the typical result of Gentoo hackery.
> 'I wanted to do X but it doesn't seem to work -- so I'll try to run some
> random git commands until I get a result that seems to make X work, then
> I'll commit them without even caring to understand what they do a few hours
> later without even testing them more widely'.

Sorry, I can't test every filesystem that exists in the world. It works fine for me on btrfs. How was I supposed to know that it wouldn't work well on every filesystem?

> What you caused is a major change in behavior that goes completely
> unexpected to our users:
> 
> 1. All local commits are discarded without warning (previously the default
> was to attempt a merge),
> 
> 2. All local non-committed changes are discarded without warning (previously
> git either did not care or complained that they collide with the update),

Well, I think it's a pretty good assumption that sync-depth = 1 means that people expect essentially the same behavior as rsync, which means that local changes are simply discarded.

(In reply to Michał Górny from comment #32)
> So far I haven't found anything faster and more correct than 'git checkout
> origin/master; git branch -D master; git checkout -b master'.

I've tested that locally on btrfs, and it works for me. It test by running `stat skel.ebuild` before and after, and it keeps the same inode number.
Comment 34 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-07-18 13:09:01 UTC
I think we should go for 'git reset --merge origin/master'. I no longer have the issues I originally had with it, so they were probably due to a dirty copy. It has the behavior closest to 'git pull', including appropriate tree status checks.
Comment 35 Zac Medico gentoo-dev 2016-07-18 16:35:40 UTC
Comment on attachment 440906 [details, diff]
refresh index before git reset

(In reply to Michał Górny from comment #34)
> I think we should go for 'git reset --merge origin/master'. I no longer have
> the issues I originally had with it, so they were probably due to a dirty
> copy. It has the behavior closest to 'git pull', including appropriate tree
> status checks.

Thanks for looking into this. I've gone ahead and pushed your suggested change:

https://gitweb.gentoo.org/proj/portage.git/commit/?id=55aef9bf297ef8cbf29921acb454449d01313818
Comment 36 Jason A. Donenfeld gentoo-dev 2016-09-20 19:29:05 UTC
Looks like this landed in the latest portage release. In case anybody else is startled by it, to get the old behavior back, add "sync-depth = 0" to your repos.conf file.
Comment 37 Martin Väth 2016-10-15 10:20:01 UTC
I had tried the

git reset --merge origin/master

strategy in portage-postsyncd-mv, but today it errored out on the
https://anongit.gentoo.org/git/data/glsa.git
repository, although I certainly never touched any file in it locally:

error: Entry 'glsa-201610-02.xml' not uptodate. Cannot merge.

It seems that the only reliable way to honour upstream rebasing (or whatever change it was which caused the above issue) seems to be

git reset --hard origin/master

I think it is quite reasonable to distinguish by an option whether the user wants to discard local changes or whether he wants to keep local changes as far as possible: In the first case, the right way is probably git reset --hard, because in the other case, no matter which strategy is chosen, some problems might arise.
Note that even strategies like -Xtheirs have their issues, because they do not remove deleted files reliably. Other strategies are discussed in
http://stackoverflow.com/questions/4911794/git-command-for-making-one-branch-like-another/4912267
Comment 38 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-10-15 17:53:55 UTC
(In reply to Martin Väth from comment #37)
> I had tried the
> 
> git reset --merge origin/master
> 
> strategy in portage-postsyncd-mv, but today it errored out on the
> https://anongit.gentoo.org/git/data/glsa.git
> repository, although I certainly never touched any file in it locally:
> 
> error: Entry 'glsa-201610-02.xml' not uptodate. Cannot merge.
> 
> It seems that the only reliable way to honour upstream rebasing (or whatever
> change it was which caused the above issue) seems to be

It's your filesystem being unhappy with git. If you run 'git status' first to have git re-evaluate all the local files, it should be happy to proceed. I have to do this all the time because of overlayfs.
Comment 39 Martin Väth 2016-10-15 19:59:05 UTC
(In reply to Michał Górny from comment #38)
> 
> It's your filesystem being unhappy with git. If you run 'git status' first
> to have git re-evaluate all the local files, it should be happy to proceed.
> I have to do this all the time because of overlayfs.

The original filesystem was indeed overlayfs (+squashfs/ext4).
Therefore, I still had a copy of the original repository. I have now retried (by archiving this repository and unpacking) on various filesystems:
ext4, tmpfs, and again on overlayfs (squashfs with writable part on ext4):

In all 3 cases, there was the same error, but "git status" indeed solved the issue - thanks for the hint.

Perhaps this should also be done by portage?

At least portage-2.3.2 does not seem to contain a "git status" call in the git module. Although this call takes about 5 seconds for the gentoo repository and writes about 12 MB (measured with overlayfs), I think not erroring out is more important than a few seconds more for syncing and 12 MB...
Comment 40 Zac Medico gentoo-dev 2016-10-15 20:07:58 UTC
(In reply to Martin Väth from comment #39)
> Therefore, I still had a copy of the original repository. I have now retried
> (by archiving this repository and unpacking) on various filesystems:
> ext4, tmpfs, and again on overlayfs (squashfs with writable part on ext4):
> 
> In all 3 cases, there was the same error, but "git status" indeed solved the
> issue - thanks for the hint.

When you copy the files to a new filesystem, you need to run "git status" at least once in order to update the index ("git update-index --refresh" might be an alternative).

> Perhaps this should also be done by portage?

Unless there's a bug in git, it should not be needed under normal operating conditions (with overlayfs being an exceptional case). We can add a repos.conf setting to force the index to be refreshed.
Comment 41 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-10-15 20:29:19 UTC
It would be nice to actually talk this over with git upstream. However, I never got around to trying to figure this out in more detail, and I don't have the time to work on that right now.
Comment 42 Martin Väth 2016-10-16 06:09:53 UTC
The issue is really mysterious:

> ("git update-index --refresh" might be an alternative).

I wanted to try whether this or --really-refresh (as you had mentioned earlier) would be necessary, and found that all of a sudden nothing of these are needed: I can unpack the "broken" archive on every filesystem and I can just use fetch+reset --merge without error

It seems that the only commit in the remote repository was the adding of an unrelated file (glsa-201610-10.xml)
Comment 43 Martin Väth 2016-10-16 06:43:57 UTC
(In reply to Zac Medico from comment #40)
> git update-index --refresh

Just for the record: According to the git sources, "git status" is doing

git update-index --refresh -q --unmerged

Both of these options are perhaps necessary to avoid problems in certain cases.
Comment 44 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-10-16 11:13:03 UTC
It may have something to do with filesystem cache. I have that problem every reboot, so I'll check that for you when I get home.
Comment 45 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-10-16 11:59:17 UTC
Plain 'git update-index --refresh' works fine for me.
Comment 46 Justus Ranvier 2016-10-22 20:56:21 UTC
This change has completely broken Portage handling git repos for me.

Even when I specify "sync-depth = 0" in repos.conf files, Portage *still* checks out shallow clones, even if I already had a full copy of the repo.

Which versions of Portage do I need to mask in order to avoid the new behaviour?
Comment 47 Zac Medico gentoo-dev 2016-10-22 21:54:52 UTC
(In reply to Justus Ranvier from comment #46)
> This change has completely broken Portage handling git repos for me.
> 
> Even when I specify "sync-depth = 0" in repos.conf files, Portage *still*
> checks out shallow clones, even if I already had a full copy of the repo.

Please file a new bug and attach your repos.conf file.

> Which versions of Portage do I need to mask in order to avoid the new
> behaviour?

It's portage-2.3.1 and later.
Comment 48 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2016-10-30 22:00:39 UTC
commit f77fcd6b0b4ebb49ca62f5767cd5c931127c3dbb
Author: Michał Górny <mgorny@gentoo.org>
Date:   Sun Oct 30 20:14:11 2016

    [sync] Run `git update-index --refresh` when doing shallow pulls
    
    Run `git update-index --refresh` to force proper index recheck before
    running `git reset --merge` on a shallow pull. This fixes syncing on
    some filesystem configurations including overlayfs on squashfs.
    
    Reviewed-by: Zac Medico <zmedico@gentoo.org>
Comment 49 Coacher 2016-11-02 00:11:54 UTC
(In reply to Michał Górny from comment #48)
> commit f77fcd6b0b4ebb49ca62f5767cd5c931127c3dbb

I work with overlays (science, sage-on-gentoo) that don't regularly update their profiles/use.local.desc files, but still these files are present in repos.
I regenerate the files manually via a simple egencache wrapper from repo.postsync.d. Everything worked smoothly, but now portage refuses to sync these overlays because it detects local changes due to regenerated profiles/use.local.desc files.

Can smth be done about this case, please?
I believe my wish to have an up-to-date profiles/use.local.desc is sane and simple.
Comment 50 Zac Medico gentoo-dev 2016-11-02 00:23:53 UTC
(In reply to Coacher from comment #49)
> Can smth be done about this case, please?
> I believe my wish to have an up-to-date profiles/use.local.desc is sane and
> simple.

Maybe it helps if you ask the overlay add /profiles/use.local.desc to .gitignore.
Comment 51 Oleh 2016-11-02 05:13:25 UTC
Can we get all of this git experiments back to portage-2.2.2x because it worked nicely. Users do not really care about whether its 'git pull' or 'git fetch' as long their trees synced. It;s not ok distribution wise to make such breaking changes without understanding what users need from "sync". Proper way to do this is to create separate branch like 
git-sync-revise. Ask developers and interested users to test this to death. Then merge to master. It is not tested enough to be in any of portage-2.3.x releases.
Comment 52 Martin Väth 2016-11-02 08:07:01 UTC
(In reply to Coacher from comment #49)
> I regenerate the files manually via a simple egencache wrapper from
> repo.postsync.d. Everything worked smoothly, but now portage refuses to sync
> these overlays because it detects local changes due to regenerated
> profiles/use.local.desc files.

This issue is not related to this bug.

Anyway, as Zac already suggested, the correct solution would be if upstream provides a matching .gitignore which contains profiles/use.local.desc and which should also contain metadata/md5-cache unless upstream provides an up-to-date cache, so that you can also update this cache with egencache (I recommend this if you use eix).

For overlays in which upstream fails to do this, you can do:

1. Provide a .gitignore containing these entries as well as .gitignore.

2. If upstream provides a .gitignore, but without these entries, you can edit your local gitconfig (either in the corresponding .git or perhaps even system wide if it matches to all your overlays) to contain
[core]
	excludesfile = /some/path/to/gitignore
and create a corresponding gitignore file with the required entries.
Comment 53 Martin Väth 2016-11-02 08:24:17 UTC
(In reply to Oleg from comment #51)
> Can we get all of this git experiments back to portage-2.2.2x because it
> worked nicely.

It didn't work, because users wanting a shallow clone would not get it.

> Proper way to do this is to create separate branch like git-sync-revise.
> Ask developers and interested users to test this to death.

That's what the testing version of portage is for, isn't it?
The "missing" git update-index was observed only recently, because apparently it only hits users with non-standard setups in certain exceptional situations, and thus its detection could easily be longer than any testing phase. But also this problem was fixed quickly soon after its detection.
So IMHO, the upgrade policy was completely correct.
Comment 54 Coacher 2016-11-02 10:58:06 UTC
(In reply to Martin Väth from comment #52)
> This issue is not related to this bug.
It is. Before changes related to this bug were introduced, portage behaved nicely.

> Anyway, as Zac already suggested, the correct solution would be if upstream
> provides a matching .gitignore which contains profiles/use.local.desc ...
Unless this solution becomes a policy and this policy is enforced, there will be overlays that serve outdated use.local.desc file without a matching entry in .gitignore.

> For overlays in which upstream fails to do this, you can do:
> ...
Thank you for the suggestion, but this is an ugly workaround. Of course I'll forget to update system-wide git configuration when I add a new overlay. Also why portage should require extra configuration of a third-party tool to function normally?
Comment 55 Martin Väth 2016-11-02 14:34:19 UTC
(In reply to Coacher from comment #54)
> (In reply to Martin Väth from comment #52)
> > This issue is not related to this bug.
> It is. Before changes related to this bug were introduced, portage behaved
> nicely.

I had this problem much earlier. It is very likely just accidental that the change of portage came with an outdated use.local.desc in your corresponding overlay.

> Unless this solution becomes a policy and this policy is enforced

You cannot enforce any policy on overlays.

> will be overlays that serve outdated use.local.desc file without a matching
> entry in .gitignore.

If they have an entry in .gitignore they actually should not serve any use.local.desc at all.

> Thank you for the suggestion, but this is an ugly workaround.

I know, but I think it is the only thing which can be done about this issue.

> Of course I'll forget to update system-wide git configuration
> when I add a new overlay.

If it is system-wide (in /etc/gitconfig) you don't have to update it. But this means that you will have to use egencache even for overlays which would provide a working use.local.desc or even an md5-cache.

> Also why portage should require extra configuration of a third-party tool to
> function normally?

The problem is broken upstream of your overlay: Shipping an outdated use.local.desc is simply broken.
Things become even more involved: If upstream uses a different VCS than git, the .gitignore trick cannot be used. For example, I don't know similar workarounds for bzr or mercurial overlays: If upstream does not include correct use.local.desc and md5-cache, you are probably lost...
Comment 56 Coacher 2016-11-02 15:10:45 UTC
(In reply to Martin Väth from comment #55)
> I had this problem much earlier. It is very likely just accidental that the
> change of portage came with an outdated use.local.desc in your corresponding
> overlay.
No. Outdated files were there since portage got repos.conf support, i.e. for months. I use plain ext4 though.

> You cannot enforce any policy on overlays.
Exactly my point.

> If it is system-wide (in /etc/gitconfig) you don't have to update it.
Say I have just added overlay foo in repos.conf.
How my system-wide git config/gitignore file for overlays will know about it automatically?

> The problem is broken upstream of your overlay: Shipping an outdated
> use.local.desc is simply broken.
I know it's the root cause. Still overlay maintainers prefer to serve an outdated use.local.desc instead of not serving it at all and I cannot do anything about it. They even acknowledge it's outdated and manually update it from time to time.
Comment 57 Martin Väth 2016-11-02 16:34:44 UTC
(In reply to Coacher from comment #56)
> No. Outdated files were there since portage got repos.conf support, i.e. for
> months.

"git pull" (which portage used before) will always complain about locally modified files. Something else must have changed if the file was really locally modified and git had not complained. Perhaps the distributed .gitignore had been changed...

> I use plain ext4 though

I am very sure that this is unrelated here.

> Say I have just added overlay foo in repos.conf.
> How my system-wide git config/gitignore file for overlays will know about it
> automatically?

It is system wide, i.e. it will apply to _every_ repository (of every of user).
The problem is more the opposite: It will then apply even to repositories which are not related to gentoo at all...

> overlay maintainers prefer to serve an outdated use.local.desc [...]
> and I cannot do anything about it.

Broken upstream is not anything which can be solved technically. It is the same if any other upstream serves some broken git repository which needs some local patching: You have to apply the patches manually or write some tricky code to apply them automatically. But that tricky code in turn might always break...
Comment 58 Mike Gilbert gentoo-dev 2016-11-02 17:10:25 UTC
Coacher's problem seems to stem from portage making the following call:

git update-index --refresh

Testing with a dummy repo:

 % emaint sync -r dummy
>>> Syncing repository 'dummy' into '/home/floppym/repos/dummy'...
/usr/bin/git fetch origin --depth 1
remote: Total 0 (delta 0), reused 0 (delta 0)
profiles/use.local.desc: needs update
!!! git pull error in /home/floppym/repos/dummy

Action: sync for repo: dummy, returned code = 1


Reading the manpage for the git-update-index command:

       -q
           Quiet. If --refresh finds that the index needs an update, the
           default behavior is to error out. This option makes git
           update-index continue anyway.


If I modify the git sync module to always pass -q to git-update-index, syncing a repository with local changes to use.local.desc succeeds.

 % emaint sync -r dummy
>>> Syncing repository 'dummy' into '/home/floppym/repos/dummy'...
/usr/bin/git fetch origin --depth 1
remote: Total 0 (delta 0), reused 0 (delta 0)
=== Sync completed for dummy

Action: sync for repo: dummy, returned code = 0
Comment 59 Mike Gilbert gentoo-dev 2016-11-02 17:13:10 UTC
Created attachment 452192 [details, diff]
always pass -q to git-update-index

Patch for my previous comment.
Comment 60 Mike Gilbert gentoo-dev 2016-11-02 17:18:43 UTC
Created attachment 452194 [details, diff]
always pass -q to git-update-index

Cleaner version of previous patch.
Comment 61 Mike Gilbert gentoo-dev 2016-11-02 17:28:27 UTC
Created attachment 452196 [details, diff]
always pass -q to git-update-index

Better commit message.
Comment 62 Martin Väth 2016-11-02 20:28:34 UTC
(In reply to Mike Gilbert from comment #58)
> Coacher's problem seems to stem from portage making the following call:
> 
> git update-index --refresh

That's interesting, I wasn't considering this in my reply to Coacher:
It means that Coacher's remark that "it worked before" did not mean "before the patch from this bug" (as I falsely assumed for my reply) but actually:

"_after_ the patch from comment #9 but before the very latest patch"

Indeed, _as a side effect_ the patch from comment #9 solves Coacher's problem (at least with git reset --hard; I am not sure concerning git reset --merge: Maybe in that case, some history will successively grow with each sync?).

This indicates that perhaps the new merge strategy should be always used, i.e. even if unlimited depth is desired.

> always pass -q to git-update-index

I would like to remind once more that besides -q also --unmerged should perhaps be passed.
Comment 63 Mike Gilbert gentoo-dev 2016-11-02 21:02:33 UTC
There is no patch directly associated with comment 9. Typo?
Comment 64 Coacher 2016-11-02 21:16:35 UTC
I replied initially to a specific comment that referred to a specific commit.
So 'after' and 'before' refers to that specific commit as well.
Comment 65 Martin Väth 2016-11-03 07:09:55 UTC
(In reply to Mike Gilbert from comment #63)
> There is no patch directly associated with comment 9. Typo?

Not a typo. I meant the change described there:
"git pull" being replaced by "git fetch [...] && git reset --hard" which meanwhile is in portage in the form "git fetch [...] && git reset --merge".
As mentioned, I do not understand the meaning of the latter well enough to judge whether it still "solves" Coacher's problem or whether it just "shifts" it by letting some git history grow with each syncing.
Comment 66 Zac Medico gentoo-dev 2016-11-03 20:12:09 UTC
Fixed in the master branch to use git update-index -q --unmerged --refresh:

https://gitweb.gentoo.org/proj/portage.git/commit/?id=d075422a8902617833ec945d94beb0bb334d44c9
Comment 67 Coacher 2016-11-04 14:35:17 UTC
(In reply to Zac Medico from comment #66)
> Fixed in the master branch to use git update-index -q --unmerged --refresh:
> 
> https://gitweb.gentoo.org/proj/portage.git/commit/
> ?id=d075422a8902617833ec945d94beb0bb334d44c9
Thank you. This commit resolves problems I've described above.
Comment 68 Zac Medico gentoo-dev 2016-11-06 19:15:50 UTC
Given the performance issues introduced by `git update-index` and `git prune` (see bug 599008), I think it's time to revert all changes related to this bug. We can consider adding optional support for shallow fetch, but shallow fetch doesn't seem to be a practical default at this time.
Comment 69 Zac Medico gentoo-dev 2016-11-06 19:41:49 UTC
Here is my proposal to revert all changes related to this bug:

https://archives.gentoo.org/gentoo-portage-dev/message/e0314d5c748ec4098605c20d9b42b2a9
Comment 70 Oleh 2016-11-07 07:18:50 UTC
(In reply to Zac Medico from comment #69)
> Here is my proposal to revert all changes related to this bug:
> 
> https://archives.gentoo.org/gentoo-portage-dev/message/
> e0314d5c748ec4098605c20d9b42b2a9

Thx, this is sane approach. There is no evident gain in this changes but known (and perhaps unknown) regressions. It really make sense to do initial repo clone (called by emerge --sync) with --depth=1. When it's done, emerge --sync should do `git pull` so that is correctly getting diffs. I'd say after revert, stabilization of 2.3.3 (when tagged) version make sense.
Comment 71 Zac Medico gentoo-dev 2016-11-07 22:20:33 UTC
Reverted for now:

https://gitweb.gentoo.org/proj/portage.git/commit/?id=ab840ac982d3c8b676b89f6bedd14e85dd06870f

In the next iteration, I think we're going to want separate sync-depth and clone-depth settings, or something like that.
Comment 72 Andrew Savchenko gentoo-dev 2017-02-17 20:51:16 UTC
(In reply to Oleg from comment #70)

> Thx, this is sane approach. There is no evident gain in this changes

The gain is very evident: shallow git clone consumes less disk space. Then it depends on setup: on some this difference is sufficient, on others it can be ignored. But with time number of commits will grow and full clone will be more and more expensive, so in the long run this problem must be solved.
Comment 73 Zac Medico gentoo-dev 2017-02-18 03:07:29 UTC
(In reply to Zac Medico from comment #71)
> In the next iteration, I think we're going to want separate sync-depth and
> clone-depth settings, or something like that.

This patch renames sync-depth to clone-depth, and shows a warning message if the deprecated sync-depth option is used:

https://archives.gentoo.org/gentoo-portage-dev/message/768eab99bffa6c2615a0dbf60109ee6c
https://github.com/gentoo/portage/pull/118
Comment 74 Zac Medico gentoo-dev 2017-02-22 09:09:10 UTC
(In reply to Zac Medico from comment #73)
> (In reply to Zac Medico from comment #71)
> > In the next iteration, I think we're going to want separate sync-depth and
> > clone-depth settings, or something like that.
> 
> This patch renames sync-depth to clone-depth, and shows a warning message if
> the deprecated sync-depth option is used:
> 
> https://archives.gentoo.org/gentoo-portage-dev/message/
> 768eab99bffa6c2615a0dbf60109ee6c
> https://github.com/gentoo/portage/pull/118

That's in the master branch now:

https://gitweb.gentoo.org/proj/portage.git/commit/?id=b3f6297a791a748a4bca370283705576568a20c2
Comment 75 Zac Medico gentoo-dev 2018-07-09 15:34:16 UTC
I have some ideas for the next iteration of shallow pull support:

* Add support for a "sync-git-update-index = yes" setting in repos.conf, which people can enable when necessary (like for overlayfs).

* Trigger periodic calls to git prune (for bug 599008), based on the timestamp of .git/FETCH_HEAD. We can add a repos.conf setting to control the time interval between prune calls, for example "sync-git-prune-interval-days = 7" would prune every 7 days, and "sync-git-prune-interval-days = 0.5" would prune every 12 hours. Maybe we should support syntax like "sync-git-prune-interval = 7d" or "sync-git-prune-interval = 12h". I really don't know how often we would have to prune if someone syncs at a high frequency like every 30 minutes.
Comment 76 Zac Medico gentoo-dev 2018-07-09 17:02:28 UTC
(In reply to Zac Medico from comment #75)
> * Trigger periodic calls to git prune (for bug 599008), based on the
> timestamp of .git/FETCH_HEAD.

On second thought, the timestamp of .git/FETCH_HEAD is not very useful since it changes with each fetch even when nothing has changed. So, I think we need to store a timestamp of the last prune somewhere, so that we know when to trigger the next prune. Along with the timestamp of the last prune, we could store the HEAD commit hash and avoid pruning again for the same commit hash.
Comment 77 Zac Medico gentoo-dev 2018-07-09 18:18:52 UTC
If we have a state file, we can track the number of times that the HEAD has changed since the last prune. I'm thinking that we could have one json file per repository that's automatically loaded and passed into the sync module, then automatically saved afterwards if the sync module makes any changes.
Comment 78 Martin Väth 2018-07-10 05:06:19 UTC
Git can do more than just pruning, and usually I call this sequence on all git repositories somewhat regularly:

git prune
git repack -a -d
git reflog expire --expire=now --all
git gc --prune=all --aggressive
git repack -a -d
git prune

(yes, the repetition is intentional).

However, for the git repository (even of full depth) some of this breaks the repository occasionally for reasons I do not remember. Maybe it was git prune.

Anyway, if you use a timer anyway, it might be appropriate to do the working parts of this sequence somewhat regularly.

Moreover, in the moment fetching from the source repository is not supported, and one has to use postsync.d hoooks like here https://github.com/vaeth/portage-postsyncd-mv to fetch additional data like news, glsa, xml-schema, dtd, projects.xml. Perhaps also management of developer keys and verification of latest commits with these keys should be done here?
Perhaps this functionality should eventually become part of portage itself?

Note that all of these tasks usually should be done much less frequently than syncing. The above-mentioned scripts use their own timestamp-files, but an independent file containing all timestamps in an extendible format might perhaps be even better.
Comment 79 Zac Medico gentoo-dev 2018-07-10 06:29:27 UTC
(In reply to Martin Väth from comment #78)
> Git can do more than just pruning, and usually I call this sequence on all
> git repositories somewhat regularly:
> 
> git prune
> git repack -a -d
> git reflog expire --expire=now --all
> git gc --prune=all --aggressive
> git repack -a -d
> git prune
> 
> (yes, the repetition is intentional).

Wow, thanks for that impressive list of commands. After reading the man pages for those commands, it seems that the simplest command for automated maintenance is this:

   git gc --auto

> However, for the git repository (even of full depth) some of this breaks the
> repository occasionally for reasons I do not remember. Maybe it was git
> prune.
>
> Anyway, if you use a timer anyway, it might be appropriate to do the working
> parts of this sequence somewhat regularly.

If possible I'd like to use `git gc --auto` by default, and maybe have an option for the user to specify a maintenance script in repos.conf. Maybe something like this for defaults:

sync-git-gc-cmd = git gc --auto

> Moreover, in the moment fetching from the source repository is not
> supported, and one has to use postsync.d hoooks like here
> https://github.com/vaeth/portage-postsyncd-mv to fetch additional data like
> news, glsa, xml-schema, dtd, projects.xml. Perhaps also management of
> developer keys and verification of latest commits with these keys should be
> done here?
> Perhaps this functionality should eventually become part of portage itself?

I'd prefer not to bundle something that complex with portage, so a custom sync module seems like a good way to go. For now, I'm mainly interested in targeting the gentoo git mirror as described here:

https://wiki.gentoo.org/wiki/Portage_Security#git-mirror_repo

> Note that all of these tasks usually should be done much less frequently
> than syncing. The above-mentioned scripts use their own timestamp-files, but
> an independent file containing all timestamps in an extendible format might
> perhaps be even better.

I wonder if `git gc --auto` is suitable to run for every sync, since it does nothing unless the the gc.auto (default 6700) or gc.autoPackLimit (default 50) threshold is exceeded. That would eliminate the need for the sync module to store state of its own.
Comment 80 Martin Väth 2018-07-10 07:10:21 UTC
> For now, I'm mainly interested in targeting the gentoo git mirror

I know; I was mainly posting this as a hint to keep a possible timestamp file open for future extensions.

The problem with the gentoo git mirror is that it has a quickly growing history for metadata/md5-cache changes (with every eclass change) which nobody needs on his harddisk.
Comment 82 Larry the Git Cow gentoo-dev 2018-07-11 06:19:21 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=903c4b1a67689c4b8cc59113a56d58575cf7db8e

commit 903c4b1a67689c4b8cc59113a56d58575cf7db8e
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2018-07-10 07:03:35 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2018-07-11 06:10:41 +0000

    GitSync: support sync-depth (bug 552814)
    
    Support sync-depth for shallow sync, using git reset --merge just
    like in the earlier implementation that was reverted in commit
    ab840ac982d3c8b676b89f6bedd14e85dd06870f. Also, run git gc --auto
    in the foreground, in order to trigger periodic housekeeping and
    hopefully avoid errors from automatic git gc calls as reported in
    bug 599008.
    
    The default sync-depth is unlimited, which means that default
    behavior remains unchanged (unlike the previous implementation that
    was reverted).
    
    Bug: https://bugs.gentoo.org/552814
    Bug: https://bugs.gentoo.org/599008

 man/portage.5                       |  3 ++-
 pym/portage/repository/config.py    |  4 ----
 pym/portage/sync/modules/git/git.py | 26 +++++++++++++++++++++++++-
 3 files changed, 27 insertions(+), 6 deletions(-)
Comment 83 Coacher 2018-07-11 19:22:53 UTC
Do already existing fully cloned trees need to be re-pulled or portage will take care of it automatically during next sync?
Comment 84 Zac Medico gentoo-dev 2018-07-11 19:28:05 UTC
(In reply to Coacher from comment #83)
> Do already existing fully cloned trees need to be re-pulled or portage will
> take care of it automatically during next sync?

It's automatic. If you have any local branches that you don't need then you should remove them so that the objects that they reference can be garbage collected.
Comment 85 Alex Efros 2018-07-12 04:28:11 UTC
Is there any comparison rsync VS git about difference in sync speed and used disk space for common use case (like ext4 without overlayfs/unionfs)?
Comment 86 Nikos Chantziaras 2018-07-12 05:54:46 UTC
(In reply to Coacher from comment #83)
> Do already existing fully cloned trees need to be re-pulled or portage will
> take care of it automatically during next sync?

Nothing changes by default. You need to use "sync-depth = 1" in your repos.conf. The size of the portage repo also didn't change for me. I had to delete it and then re-sync, which cut its size in half.
Comment 87 Nico Baggus 2018-07-12 20:36:06 UTC
pity that portage claims it needs to be clone-depth in stead of sync-depth...

/usr/lib64/python2.7/site-packages/portage/repository/config.py:182: UserWarning: repos.conf: sync-depth is deprecated, use clone-depth instead
  warnings.warn(_("repos.conf: sync-depth is deprecated,"


kind of misses the issue in the news item....

[3] https://bugs.gentoo.org/552814 sys-apps/portage: support shallow
    git pull by setting sync-depth = 1 in repos.conf
Comment 88 Zac Medico gentoo-dev 2018-07-12 20:44:32 UTC
(In reply to Nico Baggus from comment #87)
> pity that portage claims it needs to be clone-depth in stead of sync-depth...
> 
> /usr/lib64/python2.7/site-packages/portage/repository/config.py:182:
> UserWarning: repos.conf: sync-depth is deprecated, use clone-depth instead
>   warnings.warn(_("repos.conf: sync-depth is deprecated,"
> 
> 
> kind of misses the issue in the news item....
> 
> [3] https://bugs.gentoo.org/552814 sys-apps/portage: support shallow
>     git pull by setting sync-depth = 1 in repos.conf

That UserWarning means you need to upgrade to sys-apps/portage-2.3.42 first.