Summary: | Offering deltas for distfiles to minimize bandwidth/dl cost for users | ||
---|---|---|---|
Product: | Gentoo Infrastructure | Reporter: | Brian Harring (RETIRED) <ferringb> |
Component: | Other | Assignee: | Gentoo Infrastructure <infra-bugs> |
Status: | RESOLVED LATER | ||
Severity: | normal | CC: | axxo, dertobi123, f5d8fd51ed1e804c9e8d0357e8614e0493b06e96, luckyduck, m.debruijne, patrick, plate, tove |
Priority: | High | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | All | ||
URL: | http://glep.gentoo.org/glep-0025.html | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- |
Description
Brian Harring (RETIRED)
2005-02-09 15:48:16 UTC
Snippet from an IRC conversation re: the bandwidth savings distfile diffing provides. The figures, btw, are accurate (e.g., I have the patches, and carpaski should be able to provide you the stats for the # of unique downloads for portage in the first week)...

13:18 <@ferringb> the -r15 release in the first week or so was 35000 unique downloads, around 277kb each.
13:18 <@ferringb> thats over 9.695gb in bandwidth.
13:18 < kusznir> Yea...mirrors are wonderful...... :)
13:19 <@ferringb> say half the people were upgrading from a previous version- the -r14 -> -r15 delta is 2kb compressed, say oldest version they were upgrading from was -r10, in which case the total delta fetched is 28kb.
13:20 <@ferringb> so 15 as a round figure... (277*17500)+(15*17500) ~= 5.11 gb.
13:20 <@ferringb> that's a 47% saving in bandwidth, at the cost of under 5% extra storage...
13:21 <@ferringb> that ^^^ is nuts. mirrors are wonderful, but they're voluntering their bandwidth too...

So... it was mentioned when I poked about this last week in the infra IRC channel that experimental space would be possible/viable- cshields, iirc. If so, 'k... what's needed to move forward space-wise? If that's not accurate, what's needed to move forward to discern if space is available, and/or where? If at all possible, I'd prefer the discussion on this bug- easier to track what's going on, rather than digging through my (potentially) faulty memory.

Once we nail down some stuff with distfiles, space shouldn't be a problem if it's going to be in the 2-3G range. I've been working with carpaski to get a better distfile checking/cleaning script, and so far it'll free up about 14G of space for our mirrors. I've been waiting for that to get implemented and tested before we open up space for this. I'm hoping that in the next two to three weeks that script will be nailed and I can start implementing it. As for a box for generating diffs.. I think I have a few that we can use.
Remind me again how this process will work from our point of view for the scripts? Will we need a dedicated box that generates these diffs and then sends them to the staging mirror? Or could we just do this on the staging mirror itself? How much I/O will this require? One last thing: where did you want to put this stuff in our directory hierarchy on the mirrors? I'm sorry I didn't respond to this bug earlier, I kept meaning to do it, just hadn't gotten around to it. Hopefully this will help.

Wouldn't worry about the delay, and while I may have a loud bark, it's just a bark :) If you need a hand with the script (whether a second set of eyeballs on it, or getting it finished) poke me. Such a script is pretty easy as long as a tree is available.

Out of order response: /patches comes to mind, although it doesn't matter to me.

For diff generation, in rough order of the process: scan the tree for new additions, examine a persistent db (fs db, sub 5mb), do some voodoo to determine what to difference (generation of the job queue), then start differencing. Compress patches, send them off staging, update the patch list(s), push the patch list(s) out to the rsync tree. So, a fairly up to date tree is required.

In terms of differencing, it's heavily IO bound- so if the box has io-sensitive services, those services will feel it when the scripts start splitting patches. If the box has a large amount of ram, the IO issue can be sidestepped by automatically differencing in memory- figure version1 + version2 + 120mb roughly. Even doing version1 + 120mb will greatly reduce the IO load- seeks in version2 will be sequential the majority of the time.

Regarding running it on the staging mirror, it really depends on what the mirror can take. Differencing (unless version1 is in memory) is basically a buttload of random seeks/reads with occasional sequential access in version1, with 99% sequential access for version2.
If the box doesn't ensure fair IO usage, then the differencer will probably become a hog and screw with the other services.

Elaborating on the ability to have a file pushed into the tree- either a single file with all patches listed, or multiple files distributed throughout the tree. The original proposal was for N files distributed throughout the tree, although that isn't a hard req. (The info needs to go out with the tree; the question of multiple files or a single file is essentially just a simple portage implementation issue.)

A note on starting this beast up- to get this going, it's going to require generation of a *lot* of patches. With my p4 2.4ghz, running strictly from disk, it took 3 days. Not the fastest IO subsystem though. This initial collection of patches is needed mainly since the distfile diff setup must provide patches for old versions up to new versions- if gaps exist in going from v1 to v2 (for example), any correct portage implementation will detect this, and correctly go the route of pulling a full distfile.

Sidenote, it would be useful to be able to check up on the generation script- either logs sent out, or (down the line) the ability to view the persistent db mentioned above- basically a way to check up on the automatic generation and file lineage identification.

Addendum: if the file-lineage script (determine what to diff) *screws* up and false positives something, its 'hints' would need to be modified so it doesn't do that again. Related to that, the ability to nuke a patch, or have it known that this patch can be ixnayed now, would be good. Aside from that, there would need to be some long term determination of what patches are no longer worth keeping around- this, preferably, should be log based, although that may be tricky (this is down the line, say 3+ months past initial implementation).

Brian, as we really like this idea, we'd like to get a prototype working. Ask me or dertobi123 for server resources.
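
The gap rule described above (use patches only when every step from the installed version to the wanted one is available, otherwise pull the full distfile) can be sketched roughly as follows. This is only an illustration under assumptions: the function name `plan_fetch`, the ordered `versions` list, and the set-of-pairs patch list are hypothetical and are not the format GLEP 25 or portage actually uses.

```python
from typing import List, Optional, Set, Tuple

Step = Tuple[str, str]  # (old_version, new_version); hypothetical representation


def plan_fetch(versions: List[str], patches: Set[Step],
               have: Optional[str], want: str) -> Tuple[str, List[Step]]:
    """Decide between fetching a chain of patches and the full distfile.

    versions: known versions, oldest first (assumed ordering).
    patches:  the (old, new) steps for which a patch exists.
    have:     locally available version, or None if nothing is on disk.
    want:     version being requested.
    """
    if have is None or have not in versions or want not in versions:
        return ("full", [])
    start, end = versions.index(have), versions.index(want)
    if start > end:
        # No downgrade patches in this sketch.
        return ("full", [])
    steps: List[Step] = []
    for old, new in zip(versions[start:end], versions[start + 1:end + 1]):
        if (old, new) not in patches:
            # Gap in the chain: fall back to the full distfile.
            return ("full", [])
        steps.append((old, new))
    return ("patches", steps)


versions = ["r10", "r11", "r12", "r13", "r14", "r15"]
all_steps = set(zip(versions, versions[1:]))
gappy_steps = all_steps - {("r12", "r13")}

print(plan_fetch(versions, all_steps, "r14", "r15"))
print(plan_fetch(versions, gappy_steps, "r10", "r15"))
```

With a complete chain the first call returns the single r14 -> r15 step; with the r12 -> r13 patch missing, the second call falls back to the full download, which is why the initial bulk generation of patches matters.
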
Brian already has an infra box to do diff development on, but thanks anyways.

Still need to iron out the mirror space issue for distfile patches btw... aside from that, I'd like to make snapshot delta generation active sometime this week if you're game.

How are we doing on this, Brian? Can we move towards possibly putting this stuff on the mirrors? Just curious how this project is doing.

We can put this on OSL's mirrors for a testbed (create a separate tree for now; if it ends up working and production-worthy, then merge the diff tree into the main mirrors tree). Let me know where to sync it from... -C

Merde. Rather than a "yeah, I'm ...sort of working on it right now", being realistic, marking this as later atm unless someone else steps in; it's still on my plate, but I'm busy with the rewrite and have jack all time to screw with the auto-identification algorithm for determining what patches to generate right now. I'll update as I get time; the last algo mostly worked, but had notable failings. If interested, give me a scream and I'll throw the source your way- that chunk of this needs to be nailed down to something fairly automatic. The portage integration isn't going to happen till post rewrite anyways, since it's a matter of priorities (at least from my standpoint). |
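
For reference, the bandwidth estimate from the IRC snippet in the description can be re-derived with a short calculation. This is a sketch using only the figures quoted there (35000 downloads, 277kb per full distfile, half of the users upgrading, 15kb average delta); the function name `bandwidth_gb` is made up for illustration, and it assumes the same decimal units as the log (1 gb = 1,000,000 kb).

```python
def bandwidth_gb(downloads: int, full_kb: float,
                 upgrader_fraction: float, avg_delta_kb: float):
    """Return (baseline_gb, with_deltas_gb) for one release.

    Upgraders are charged only the average delta size, since they already
    hold an older version locally. This mirrors the arithmetic in the IRC
    log and is not a measurement.
    """
    upgraders = int(downloads * upgrader_fraction)
    fresh = downloads - upgraders
    baseline_kb = downloads * full_kb
    with_deltas_kb = fresh * full_kb + upgraders * avg_delta_kb
    return baseline_kb / 1e6, with_deltas_kb / 1e6


baseline, with_deltas = bandwidth_gb(35000, 277, 0.5, 15)
saving = 1 - with_deltas / baseline
print(f"baseline: {baseline:.3f} gb, with deltas: {with_deltas:.2f} gb, "
      f"saving: {saving:.0%}")
```

The result lines up with the roughly 9.695 gb, 5.11 gb, and 47% figures quoted in the log; changing `upgrader_fraction` or `avg_delta_kb` shows how sensitive the saving is to those two guesses.
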