Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 81425 - Offering deltas for distfiles to minimize bandwidth/dl cost for users
Summary: Offering deltas for distfiles to minimize bandwidth/dl cost for users
Status: RESOLVED LATER
Alias: None
Product: Gentoo Infrastructure
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All All
: High normal (vote)
Assignee: Gentoo Infrastructure
URL: http://glep.gentoo.org/glep-0025.html
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-02-09 15:48 UTC by Brian Harring (RETIRED)
Modified: 2005-10-01 01:41 UTC (History)
8 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Brian Harring (RETIRED) gentoo-dev 2005-02-09 15:48:16 UTC
So... starting up the glep25 (distfiles diffing thing again), and have some figures in hand this time around.

Refresher in the meantime.
What I'm proposing is the addition of binary patches to our mirroring infrastructure (whether the existing distfiles mirrors, or a new class of mirrors).  Obviously it requires more space (current figure is 3.2gb +/-300mb, explained below), but for our mirrors it opens up the possibility of substantial decreased bandwidth.

For example.  Using linux releases 2.6.8, 2.6.9, and 2.6.10; individually, each release weighs in at 36mb +/-.5mb.
The delta (patch) for 2.6.8 -> 2.6.9 is 1.7mb
the delta for 2.6.9 -> 2.6.10 is 2.2mb.

Say a user has the 2.6.8 release on their harddrive, and wishes to merge a 2.6.10 release- they fetch only 3.9mb, instead of 36mb (89% saving).
If they already have 2.6.9 on their hd, they only must fetch 2.2mb instead of 36mb (94% saving).

Obviously, for dial up/low bandwidth users this is a godsend, and is a rather nice decrease in bandwidth usage for our mirrors.  In cases where the the changes are truly minor in a release, the savings are quite large.
fex- 10.53mb for kdelibs-3.1.4, 10.54mb for kdelibs-3.1.5.  The patch is 75.6kb, a saving of >99% in bandwidth usage for that version bump.  Another example is linux v2.6.8 -> v2.6.8.1; patch is 90kb, full source is 35mb, >99.7% saving.

So yeah... obviously I'm pointing out cases where diffing truly is massively 
more efficient.  If the source gets radically changed each time between releases, diff'ing those files won't be as effective.  Diff'ing against zip files, not really doable (need to reverse the compression to diff), same for rpm.  Diffing against jar files doesn't do incredibly well either, nor does diffing against 'true' binaries (there are differs that can handle binaries a bit better however).  Basically, the place where doing patches truly shines is against source releases (the majority of our distfiles).

I've gone through, and generated diffs for the entire tree (snapshot as of 02/06/05), and have stats should people desire.  I wrote out a set of scripts to walk the tree, and try to identify file 'lines', eg MPlayer-0.92.tar.bz2 and MPlayer-1.0_pre4.tar.bz2 are of the same 'line', and patches should be generated between them.  The script isn't perfect- a final solution would require some manual work by devs to ensure daft diffs aren't generated.  That, however is ancillary atm- main issue of this bug is to foster discussion of the space requirement for the patches.

The current total is 3,214 mb for patches.  I'd add a fudge factor of +/-300mb mainly since to get this figure, an automated solution involving a fair amount of fuzzness was required- eg, it didn't always get file lines correct, and I had to re do them by hand.  So, there is a human element in that figure- ~9000 patches generated, fair amount of room for me to miss bad lines (fex, having a linux v2.4.25 -> 2.6.10 *and* 2.4.22 -> 2.6.10 set of patches).  A final solution, as stated would require a bit of handling of it by devs (akin to ensuring the src_uri is valid, frankly).

So the question is, would A) mirror space be available, B) would this require creation of a new class of mirrors, C) would our volunteer mirrors (whether existing, or a new class) be interested in footing the storage cost (space) in this?

The benefit to our low-speed users is obviously, and there still is a benefit for people with faster bandwidth- unless you're getting 1mb/s, pulling down >15% of the target file's size saves a good chunk of time.

There are a few variables associated with this- people who wipe their distfiles clean daily, would obviously not be taking advantage of the patches, so they would be sucking down the full file.
Frankly, there isn't much that can be done about those users.  For the users who use their own distfiles cleansing script (cleanse, not wipe), a smarter version can be written that would be aware of patching support, and cleanse differently (eg, nuke all older versions, but ensure at least one version of a file line exists).  Additionally, to -really- go to town, and try to maximize the bandwidth savings, older patches for files no longer on our mirrors would be advisable.
The latter suggestion is debatable, and frankly a bit hard to nail down w/out actually implementing this- reasoon being, we don't truly know how often people update.  Those who stay on the cutting edge (weekly updates we'll say), would wind up primarily as patch fetchers for existing files.  Those that update once every 2-3 months, well, the sum of required patches to fetch may not warrant maintaining patches that long (say a cut off point being if the patchs required dl total is 80% of the target).

Thoughts/flames?
Comment 1 Brian Harring (RETIRED) gentoo-dev 2005-02-13 11:27:53 UTC
Snippet from an irc conversation re: bandwidth savings distfile diffing provides- figures btw, are accurate (eg, I have the patches, and the stats for the # of unique downloads for portage in the first week carpaski should be able to provide you)...

13:18 <@ferringb> the -r15 release in the first week or so was 35000 unique downloads, around 277kb each.
13:18 <@ferringb> thats over 9.695gb in bandwidth.
13:18 < kusznir> Yea...mirrors are wonderful...... :)
13:19 <@ferringb> say half the people were upgrading from a previous version- the -r14 -> -r15 delta is 2kb compressed, say oldest version they were upgrading from was -r10, in which case the total delta fetched is 28kb.
13:20 <@ferringb> so 15 as a round figure... (277*17500)+(15*17500) ~= 5.11 gb.
13:20 <@ferringb> that's a 47% saving in bandwidth, at the cost of under 5% extra storage...
13:21 <@ferringb> that ^^^ is  nuts.  mirrors are wonderful, but they're voluntering their bandwidth too...
Comment 2 Brian Harring (RETIRED) gentoo-dev 2005-02-27 19:36:39 UTC
So... it was mentioned when I poked about this last week in the infra irc channel that experimental space would be possible/viable- cshields, iirc.

If so, 'k... what's needed to move forward space wise?
If that's not accurate, what's needed to move forward to discern if space is available, and/or where?

If at all possible, I'd prefer the discussion on this bug- easier to track what's going on, rather then digging through my (potentially) faulty memory.
Comment 3 Lance Albertson (RETIRED) gentoo-dev 2005-03-05 07:00:05 UTC
Once we nail down some stuff with distfiles, space shouldn't be a problem if its going to be in the 2-3G range. I've been working with carpaski to get a better distfile file checking/cleaning script and so far it'll free up about 14G of space for our mirrors. I've been waiting for that to get implemented and tested before we open up space for this. I'm hoping in the next two to three weeks that script will be nailed and I can start implementing it. 

As for a box for generating diffs.. I think I have a few that we can use. Remind me again how this process will work from our point of view for the scripts? Will we need a dedicated box that generates these diffs then sends it to the staging mirror? or could we just do this on the staging mirror itself? How much I/O will this require? 

One last thing, where did you want to put this stuff in our directory heirachy on the mirrors?

I'm sorry I didn't respond to this bug earlier, I kept meaning to do it, just hadn't gotten around to it. Hopefully this will help.
Comment 4 Brian Harring (RETIRED) gentoo-dev 2005-03-05 11:07:33 UTC
Wouldn'y worry about the delay, and while I may have a loud bark, it's just a bark :)

If you need a hand with the script (whether second set of eyeballs on it, or getting it finished) poke me.  Such a script is pretty easy as long as a tree is available.

Out of order response, /patches comes to mind, although it doesn't matter to me.

For diff generation, in rough order of the process, scan the tree for new additions, examine a persistant db (fs db, sub 5mb), do some voodoo to determine what to difference (generation of job queue), then start differencing.  Compress patches, send them off staging, update a patch list(s), push patch list(s) out to the rsync tree.

So, a fairly up to date tree is required.  In terms of differencing, it's heavily IO bound- so if the box has io-sensitive services, those services will feel it when the scripts start splitting patches.  If the box has a large amount of ram, the IO issue can be sidestepped by automatically differencing in memory- figure version1 + version2 + 120mb roughly.  Even doing version1 + 120mb will greatly reduce the IO load- seeks in version2 will be sequential the majority of the time.

Regarding running it on the staging mirror, it really depends on what the mirror can take.  Differencing (unless version1 is in memory) is basically a buttload of random seek/reads with occasional sequential access in version1, with 99% sequential for version2.  If the box doesn't ensure fair IO usage, then the differencer will probably become a hog and screw with the other services.

Elaborating on the ability to have a file pushed into the tree- either a single file with all patches listed, or multiple files distributed throughout the tree.  Original proposal was for N files distributed throughout the tree, although that isn't a hard req. (the info needs to go out with the tree, the question of multiple files or a single file is essentially just a simple portage implemenation issue).

A note on starting this beast up- to get this going, it's going to require generation of a *lot* of patches.  With my p4 2.4ghz, running strictly from disk, it took 3 days.  Not the fastest IO subsystem though.  This initial collection of patches is needed mainly since the distfile diff setup must provide patches for old versions up to new versions- if gaps exist in going from v1 to v2 (fex), any correct portage implementation will detect this, and correctly go the route of pulling a full distfile.

Sidenote, it would be useful to be able to check up on the generation script- either logs sent out, or (down the line) ability to view the persistant db mentioned above- basically a way to check up on the automatic generation and file lineage identification.
Comment 5 Brian Harring (RETIRED) gentoo-dev 2005-03-05 11:34:55 UTC
Addendum, if the file-lineage script (determine what to diff) *screws* up and false positives something, it would be needed to modify it's 'hints', so it doesn't do that again.  Related to it, ability to nuke a patch, or have it known this patch can be ixnayed now would be good.

Aside from that, there would need to be some long term determination of what patches are no longer worth keeping around- this preferably, should be log based although that may be tricky (this is down the line, say 3+ months past initial implementation).
Comment 6 Patrick Lauer gentoo-dev 2005-03-17 09:55:31 UTC
Brian,

as we really like this idea, we'd like to get a prototype working.
Ask me or dertobi123 for server ressources.

Comment 7 Lance Albertson (RETIRED) gentoo-dev 2005-03-17 10:36:52 UTC
Brian already has an infra box to do diff development on, but thanks anyways. 
Comment 8 Brian Harring (RETIRED) gentoo-dev 2005-04-10 06:14:51 UTC
Still need to iron out the mirror space issue for distfile patches btw...
aside from that, would like to sometime this week make snapshot delta generation active if you're game.
Comment 9 Lance Albertson (RETIRED) gentoo-dev 2005-08-15 18:16:50 UTC
How are we doing on this Brian? Can we move towards possibly putting this stuff
on the mirrors? Just curious how this project is doing.
Comment 10 Corey Shields 2005-09-30 22:16:53 UTC
We can put this on OSL's mirrors for a testbed (create a seperate tree for now, 
if it ends up working and production-worthy, then merge the diff tree into the 
main mirrors tree). 
 
Let me know where to sync it from... 
 
-C 
Comment 11 Brian Harring (RETIRED) gentoo-dev 2005-10-01 01:41:16 UTC
Merde.

Rather then a "yeah, I'm ...sort of working on it right now", being realistic,
marking this as later atm unless someone else steps in; it's still on my plate,
but I'm busy with the rewrite and have jack all time to screw with the
auto-identification algorithm for determining what patches to generate right now.

I'll update as I get time; the last algo mostly worked, but had notable failings.
If interested, give a scream to me and I'll throw the source your way- that
chunk of this needs to be nailed down to something fairly automatic.

The portage integration isn't going to happen till post rewrite anyways, since
it's a matter of priorities (at least from my standpoint).