I run emerge --sync once a day, and I am probably not the only one. My guess is
out of the 60,000 files or so, about 1% may be changing every day. Would it
make sense to have emerge do the following?
after the user runs emerge --sync, i.e. at download time, save a
time/day stamp and create an MD5 of all the files in my emerge
directory. call this a snapshot stamp.
at the next emerge --sync, transmit the snapshot stamp to the Gentoo server.
the Gentoo server can now determine---probably fairly easily---
which files have been updated since then. if so, tell the rsync
recipient that only a subset is being rsynced. (well, this may or
may not work with rsync; I am not a programmer.)
if the MD5 does not equal any emerge --sync site snapshot stored
on the server, then fall back to the original full rsync protocol
method we are using now.
This need not be foolproof, since we have the full-rsync fallback if the MD5
stamps do not match. Any commit by a packager could trigger an MD5
recalculation on all snapshots with timestamps hanging around at the server,
marking each affected snapshot as needing to go into the user update queue.
After 3 days or so, we can delete the snapshot + timestamps. Similarly, every 7
days or so, we could have the user's emerge force a normal full rsync just to
make sure everything is in order.
just a suggestion...
We already rely on timestamp files within the tree to do what you're suggesting;
the difference is that your approach would require pushing timestamps into each
category (fex) to chunk up the syncing, and would be complicated by the fact
that a $PORTDIR/dev-util change requires both that dir and the matching
$PORTDIR/metadata dir to be synced.
An additional issue: the md5 of each chunk of the user's tree may not be
accurate; the user may have modified an ebuild within that cat (fex), which
means you cannot trust a stored md5 and need to regenerate it on every run,
which is what rsync already does. The saving in what you're proposing is that
the checksum information isn't transmitted, lowering the ~2.4MB overhead of a
full tree rsync.
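The comparison step, and its failure mode, can be sketched in a few lines. The function name and dict shapes are my own; the point is that a locally modified category is indistinguishable from an upstream change, which is why the local side must recompute rather than reuse a stored stamp:

```python
def categories_to_sync(server_md5s, local_md5s):
    """Return categories whose locally *recomputed* checksum differs
    from the server's. A category the user edited shows up here exactly
    like one that changed upstream -- there is no way to tell them
    apart from the checksums alone."""
    return sorted(cat for cat, md5 in server_md5s.items()
                  if local_md5s.get(cat) != md5)
```

A category present on the server but absent (or modified) locally is always selected for syncing, which is the safe direction to fail in.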
Personally, I don't think this is the route to go; what you're after is
effectively versioning the tree: knowing that it was at release x, and that to
get to the current release z you need to pull the x->y and y->z deltas and
apply them. This is what emerge-delta-webrsync does, the difference being that
emerge-delta-webrsync doesn't make assumptions about the user's tree being
unmodified; it relies on tarsync (or rsync in the worst case) to ensure the
user's tree is a copy of the targeted snapshot.
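The x->y, y->z delta-chain idea above can be sketched as a tiny planner. This is an illustrative toy, not emerge-delta-webrsync's actual code; the function name and the (src, dst) pair representation are assumptions:

```python
def plan_delta_chain(have, want, available):
    """Pick the sequence of deltas taking release `have` to `want`.

    `available` is a list of (src, dst) release pairs, e.g.
    [('x', 'y'), ('y', 'z')]. Returns the ordered chain, or None if
    there's a gap -- the case where you'd fall back to a full fetch.
    """
    step = {src: dst for src, dst in available}
    chain = []
    cur = have
    while cur != want:
        if cur not in step:
            return None        # missing delta: fall back to full snapshot
        chain.append((cur, step[cur]))
        cur = step[cur]
    return chain
```

Note the None case mirrors the thread's recurring theme: every incremental scheme needs a full-sync fallback for when the chain can't be walked.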
So... dunno. Chunking up the rsync'ing (fex) per category has the added
disadvantage of jacking up the number of connections per sync attempt.
Currently it's 2; say 50% of the categories have some form of change in them;
with per-cat + md5-check rsync'ing, you're looking at (140 cats currently)
1 + (2 * (140 * .5)) = 141: 1 for the md5 info, 2 per changed cat for
$PORTDIR/$CATEGORY and $PORTDIR/metadata/$CATEGORY. This also ignores any
syncing required for other directories, eclass/profiles/metadata fex.
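The back-of-the-envelope figure above works out as follows; the function and its defaults simply encode the thread's stated assumptions (140 categories, half changed, two connections per changed category plus one for the checksum manifest):

```python
def sync_connections(categories=140, changed_fraction=0.5):
    """Estimate rsync connections for per-category syncing:
    1 for the md5 manifest, then 2 per changed category
    ($PORTDIR/$CATEGORY and $PORTDIR/metadata/$CATEGORY)."""
    changed = int(categories * changed_fraction)
    return 1 + 2 * changed
```

That yields 141 connections per sync versus today's 2, which is the core of the objection.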
Offhand, I'd rather see emerge-delta-webrsync use uncompressed zip files, with
portage running directly off the zip file; this has the added bonus of being
easier to deal with for delta generation/reconstruction, and being a somewhat
more foolproof way of ensuring that the user doesn't screw around with the
'versioning'; it's a bit harder to do without knowing the effects of the
action, compared to just modifying a file in the tree. Plus, generating a
delta for a single file is easier and allows for greater optimization of the patch.
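The uncompressed-zip idea can be demonstrated with the standard library: members stored with ZIP_STORED sit in the archive as plain bytes, so individual files can be read in place without an extraction step. The helper names here are my own; this is a sketch of the access pattern, not anything portage implements:

```python
import io
import zipfile

def build_tree_zip(files):
    """Pack a {path: bytes} mapping into an uncompressed (STORED) zip,
    returned as an in-memory buffer."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w', compression=zipfile.ZIP_STORED) as zf:
        for path, data in sorted(files.items()):
            zf.writestr(path, data)
    buf.seek(0)
    return buf

def read_member(zip_buf, path):
    """Read one file straight out of the archive -- the 'running
    directly off the zip file' access pattern."""
    with zipfile.ZipFile(zip_buf) as zf:
        return zf.read(path)
```

Because each stored member is a contiguous uncompressed run of bytes, a binary delta for one file maps onto a delta of one region of the archive, which is the generation/reconstruction win mentioned above.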
Meanwhile, cc'ing infra since Lance asked me for a bug of this sort, and I never
quite got around to it ;)
So what's to happen with this bug then?
Per ferringb on IRC, there are a bunch of ways this can be done, and only a few
that work. There are a lot of server-side issues as well, if I recall, so a
good method must be chosen carefully. This, however, is not it.