I run emerge --sync once a day, and I am probably not the only one. My guess is that out of the 60,000 or so files, about 1% change every day. Would it make sense to have emerge do the following?

1. After a user runs emerge (i.e. at download time), save a date/time stamp and compute an MD5 of all the files in the emerge directory. Call this the snapshot stamp.
2. At emerge --sync, transmit the snapshot stamp to the Gentoo server.
3. The server can now determine, probably fairly easily, which files have been updated since, and tell the rsync recipient that only a subset is being rsynced. (This may or may not work with rsync; I am not a programmer.)
4. If the MD5 does not match any snapshot stored on the server, fall back to the original full rsync method we are using now.

This need not be foolproof, since we have the full rsync as a fallback whenever the MD5 stamps do not match. Any commit by a packager could trigger an MD5 recalculation on all snapshots with timestamps kept on the server, marking each snapshot as needing to enter the user update queue. After 3 days or so, the snapshots and timestamps can be deleted. Similarly, every 7 days or so, we could have the user's emerge force a standard full rsync just to make sure everything is in order.

Just a suggestion...

Reproducible: Always
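Step 1 of the proposal (hashing the whole tree into one stamp) can be sketched as follows. This is only an illustration of the idea, not anything portage actually does; the function name and the choice of walking the tree in sorted order (so the digest is deterministic) are my own assumptions.

```python
import hashlib
import os

def snapshot_stamp(portdir):
    """Compute one MD5 digest over every file under portdir.

    Hypothetical sketch of the proposed 'snapshot stamp': directories
    and files are visited in sorted order so two identical trees always
    produce the same digest. Relative paths are mixed into the hash so
    renames change the stamp, not just content edits.
    """
    digest = hashlib.md5()
    for root, dirs, files in os.walk(portdir):
        dirs.sort()
        for name in sorted(files):
            path = os.path.join(root, name)
            digest.update(os.path.relpath(path, portdir).encode())
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    digest.update(chunk)
    return digest.hexdigest()
```

Note this has exactly the weakness pointed out in the reply below the original report: any local modification to any ebuild changes the stamp, so the server would see such a tree as unknown and have to fall back to a full rsync.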
We already rely on timestamp files within the tree to do what you're suggesting; the difference is that your approach would require pushing timestamps into each category (fex) to chunk up the syncing, and would be complicated by the fact that a $PORTDIR/dev-util change requires syncing both that dir and $PORTDIR/metadata/cache/dev-util.

Additional issue: the md5 of each chunk of the user's tree may not be accurate; the user may have modified an ebuild within that cat (fex), which means you cannot trust the md5 and need to regenerate it every run, which is what rsync does. The saving in what you're proposing is that the checksum information isn't transmitted, lowering the ~2.4MB overhead of a full tree rsync.

Personally, I don't think this is the route to go. What you're after is effectively versioning the tree: knowing that it was at release x, and that to get to the current release z you need to pull the x->y and y->z deltas and apply them. This is what emerge-delta-webrsync does, the difference being that emerge-delta-webrsync doesn't make assumptions about the user's tree being unmodified; it relies on tarsync (or rsync in the worst case) to ensure the user's tree is a copy of the targeted snapshot.

So... dunno. Chunking up the rsync'ing, potentially per category, has the added disadvantage of jacking up the # of connections per sync attempt. Currently it's 2; say 50% of the categories have some form of change in them; with per-cat md5-checked rsync'ing, you're looking at (140 cats currently) 1 + (2 * (140 * .5)) = 141: 1 connection for the md5 info, plus 2 per changed cat for $PORTDIR/$CATEGORY and $PORTDIR/metadata/$CATEGORY. This also ignores any form of syncing required for other directories: eclass/profiles/metadata, fex.
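The connection-count arithmetic in the previous comment can be checked directly. A toy calculation only; the 140-category count and 50% change rate are the figures quoted in that comment, and the function name is made up for illustration:

```python
def sync_connections(categories=140, changed_fraction=0.5):
    """Connections needed for a per-category md5-checked rsync:
    1 connection to fetch the md5 info, plus 2 per changed category
    (one for $PORTDIR/$CATEGORY, one for its metadata counterpart)."""
    changed = int(categories * changed_fraction)
    return 1 + 2 * changed

print(sync_connections())  # 141, versus the 2 connections a plain full-tree rsync uses today
```

With half of 140 categories changed this yields 1 + 2*70 = 141 connections per sync attempt, which is the cost being objected to.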
Offhand, I'd rather see an approach where emerge-delta-webrsync uses uncompressed zip files, with portage running directly off the zip file. This has the added bonus of being easier to deal with for delta generation/reconstruction, and is a more foolproof way of ensuring that the user doesn't screw around with the 'versioning': it's a bit harder to do without knowing the effects of the action, compared to just modifying a file in the tree. Plus, generating a delta for a single file is easier and allows for greater optimization of the patch.

Meanwhile, cc'ing infra since Lance asked me for a bug of this sort, and I never quite got around to it ;)
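The "running directly off the zip file" idea can be illustrated with Python's stdlib zipfile module. The snapshot filename and tree layout below are hypothetical; the point is only that an uncompressed (ZIP_STORED) archive lets individual files be read in place, without unpacking 60,000 files onto the filesystem:

```python
import os
import tempfile
import zipfile

# Hypothetical snapshot path, created here just for the demo.
snap = os.path.join(tempfile.mkdtemp(), "portage-snapshot.zip")

# Build an uncompressed archive (ZIP_STORED = no compression), which is
# what makes byte-level delta generation against it straightforward.
with zipfile.ZipFile(snap, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("dev-util/foo/foo-1.0.ebuild", "EAPI=0\n")

# Read a single ebuild directly out of the archive, no extraction step.
with zipfile.ZipFile(snap) as zf:
    ebuild_text = zf.read("dev-util/foo/foo-1.0.ebuild")
```

Since stored members sit at fixed offsets, a binary delta between two such snapshots only has to encode the changed members, which is the optimization hinted at above.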
So what's to happen with this bug then?
Per ferringb on IRC: there are a bunch of ways this can be done, and only a few that work. There are a lot of server-side issues to it as well, if I recall, so a good method must be chosen carefully. This, however, is not it.