Summary: | emerge metadata is slow: "cpickle" cache problem?
---|---
Product: | Portage Development
Component: | Core
Reporter: | Georgi Georgiev <chutz+bugs.gentoo.org>
Assignee: | Portage team <dev-portage>
Status: | RESOLVED FIXED
Severity: | normal
Priority: | High
Version: | 2.1
Hardware: | All
OS: | Linux
CC: | aaron123456789, betelgeuse, bugs+gentoo, bugs, crusaderky, dan, frank_meier, jrmalaq, kalium, mal, marian, mlspamcb, orionbelt2, radek, rdalek1967, sirjoltalot
Whiteboard: |
Package list: |
Runtime testing required: | ---
Bug Depends on: |
Bug Blocks: | 115839
Description
Georgi Georgiev, 2005-10-07 10:58:05 UTC
The eclass cache does a full write every time around for an update. I had a patch that sort of fixed this, although the results weren't as impressive as I'd like. Basically, you adjust the syncing rate of the eclass cache: sync to disk once every N updates rather than on every eclass update. The downside is that a traceback during metadata transfer would invalidate everything since the last eclass cache sync. Additionally, this isn't a new issue; it has existed for quite some time. All the delayed syncing does is cut down on I/O and a bit of processor usage. On slow systems it is an improvement, but I don't think it will give as major a gain as people expect. Additionally, the cache rewrite already addresses the separated eclass/ebuild caching (two files) issue, since eclass data sits in the repository cache. Am I to understand that this whole issue will be solved by that cache rewrite?

Anyway, just to illustrate the problem with cPickle, I did the following:

```
cd /usr/lib/portage/pym
cp portage_db_flat.py portage_db_flatany.py
```

Then fix portage_db_flatany.py to actually work as an alternative to cpickle, anydbm, etc.:

```diff
--- portage_db_flat.py	2005-10-08 01:55:36.000000000 +0900
+++ portage_db_flatany.py	2005-10-08 03:57:25.000000000 +0900
@@ -6,6 +6,7 @@
 import types
 import os
 import stat
+import cPickle

 import portage_db_template

@@ -55,6 +56,11 @@
 		# nuke the newlines right off the batt.
+		data = myf.read()
+		myval = cPickle.loads(data)
+		myf.close()
+		return myval
+
 		data = myf.read().splitlines()
 		mdict = {}

@@ -86,12 +92,11 @@
 		update_fp = self.fullpath + ".update." + str(os.getpid()) + "." + key
 		myf = open(update_fp,"w")
-		myf.writelines( [ val[x] +"\n" for x in self.dbkeys] )
+		myf.writelines( cPickle.dumps(val,cPickle.HIGHEST_PROTOCOL) )
 		myf.close()
 		os.chown(update_fp, self.uid, self.gid)
 		os.chmod(update_fp, 0664)
-		os.utime(update_fp, (-1,long(val["_mtime_"])))
 		os.rename(update_fp, self.fullpath+key)

 	def del_key(self,key):
```

The patch is left as short as possible, for illustration purposes. Now perform a few tests with either of:

```
eclass_cache.dbmodule = "portage_db_cpickle.database"
eclass_cache.dbmodule = "portage_db_flatany.database"
```

in /etc/portage/modules, running `rm -rf /var/cache/edb/dep/usr/portage` before every `time emerge metadata`.

Results (5 tries each, all within 1 second of each other):

```
portage_db_cpickle: user ~1m18s
portage_db_flatany: user ~0m11s
portage_db_cpickle: real between 1m40s and 2m22s
portage_db_flatany: real between 22s and 1m50s
```

I should mention that plenty of disk activity is observed with flatany. And for completeness:

```
$ du -hsc *-eclass/ | tail -n1
60M     total
$ du -hsc *-eclass.cpickle | tail -n1
2.6M    total
```

Still, it is obvious that on my machine cPickle cannot theoretically go under 1m18s, while flatany can and did. So, is this enough of an illustration of how non-optimal cPickle is, at least in the case of the -eclass cache? I'll wait for the new portage with the new caching.

---

cPickle isn't the issue as much as the abuse of it. See my comment above about the fact that we're forcing a rewrite every time; your reshuffling of the code demonstrates it: you're pickling just the new additions to the eclass cache, not the whole eclass cache.

---

Stupid question, but has anyone got a reason the cache rewrite can't be backported to stable?

---

It requires adjusting portdbapi.aux_get calls, breaks any custom caches people have, won't play nicely with anyone who is sharing their local cache with others via NFS (a bad idea anyway), and finally requires tweaking the config code... but it's also what's going to be used long term, and it nukes the cPickling issue. jason, thoughts?

---

A backport of the 3x cache to stable is available (and needs testing) via http://dev.gentoo.org/~ferringb/potage/2.0/3.0-cache-backport.patch

---

I respectfully disagree with you Brian. Removing the sync calls completely [i.e. letting .close, either when it finishes or when you hit ctrl-c, do the sync once] and resetting the variable that sync checks to see if a sync is needed speeds it up no end. Who cares if it dies and you lose data? That'd be a bug that needs fixing, not a reason to write bad code. It's a cache, not vital data - proven by the fact that, to test whether removing the syncs sped it up, I've been deleting the files. What's the point in syncing on every write something that you can delete and have recreated anyway? Besides, if it dies, then the metadata for every category/ebuild after the one it crashed on would need to be redone - same as if you just hit ctrl-c rather than wait for the thing that's trying too hard to delete and write the same data pointlessly hundreds of times. In short, if it crashes, the previous syncs didn't help.

I can't see it being just slow systems either, because it's painfully slow even though the biggest eclass file [kde-base, the one that everyone gets stuck at 50% on] is only about 699k - which is absolute peanuts to write to an old disk. So it seems to have nothing to do with a slow computer; whatever makes it slow must be either crud that Python's cPickle does to write that data, or the fact that the data is deleted and written out multiple times. Unless an AMD64 3000 with SATA 7200rpm disks counts as slow? I know work is being done on backport patches et al., but calling sync every time is too ugly to live anyway and deserves to die on principle.

Perhaps I'm confused? I thought 2.0.53_rc7 had the cache speed-up in. I upgraded to that because of this slow cache around 50%.
But that made no difference. After looking at why, it seems to be because of the unnecessary syncs, afaict [and I've since found this bug]. So, if 2.0.53_rc7 is supposed to speed up the cache, then it doesn't - at least for me - because what makes it slow is still in there. If it isn't, then fair enough, some better changes are on the way perhaps, but removing the sync calls here has made the so-called slow cache update go away - i.e. afaict, that is what made it slow - so please consider removing the syncs in the interim; it's a tiny patch and a huge speed-up afaict.

To illustrate, remove the kde-base cpickle file [the rest of the metadata is up to date]:

```
michael ~ # rm /var/cache/edb/dep/usr/portage/kde-base-eclass.cpickle
michael ~ # time emerge --metadata
skipping sync
>>> Updating Portage cache:    100%

real    5m46.717s
user    5m25.056s
sys     0m21.217s
```

Over 5 minutes of peeing into the wind, burning CPU, not I/O - note, it sits around 50%-53% of the update the whole time.

Next, remove the file again, don't bother updating metadata, and see how long emerge -pvuD takes:

```
michael ~ # rm /var/cache/edb/dep/usr/portage/kde-base-eclass.cpickle
michael ~ # time emerge -pvuD world
These are the packages that I would merge, in order:

Calculating world dependencies ...done!
[ebuild     U ] kde-base/kde-i18n-3.4.3 [3.4.1] -arts -debug +kdeenablefinal -xinerama 0 kB
...snipped other kde-3.4.3 updates...

real    0m57.776s
user    0m40.743s
sys     0m15.489s
```

It's slow, but it takes 5x less than creating the cache. Now remove the explicit sync calls from portage.py:

```
michael ~ # #patched portage.py with vi here
michael ~ # rm /var/cache/edb/dep/usr/portage/kde-base-eclass.cpickle
michael ~ # time emerge --metadata
skipping sync
>>> Updating Portage cache:    100%

real    0m20.935s
user    0m16.873s
sys     0m3.688s
```

Yabba dabba doo, 1500% quicker.

```
michael ~ # time emerge -pvuD world
These are the packages that I would merge, in order:

Calculating world dependencies ...done!
[ebuild     U ] kde-base/kde-i18n-3.4.3 [3.4.1] -arts -debug +kdeenablefinal -xinerama 0 kB
...snipped other kde-3.4.3 updates...

real    0m5.534s
user    0m4.816s
sys     0m0.496s
```

...and the cache is worth using.

---

> Perhaps I'm confused? I thought 2.0.53_rc7 had the cache speed up in? I upgraded
> to that because of this slow cache around 50%. But that made no difference.
> After looking at why, it seems because the unnecessary syncs afaict [and I've
> since found this bug]
>
> So, if 2.0.53_rc7 is supposed to speed up the cache, then it doesn't - at least
> for me - because what makes it slow is still in there.

No, there is a patch that must be applied to rc7.

> If it isn't, then fair enough, some better changes are on the way perhaps, but
> removing the sync calls here has made the so-called slow update cache go away -
> i.e afaict, that is what made it slow, so please consider removing the syncs in
> the interim, it's a tiny patch and a huge speed up afaict.

Have you done any comparisons between rc7 and -r3? The % done has been a best guess at best, and at worst it does this - sitting around 50% for most of the time instead of taking a smooth curve.

sys-apps/portage-2.0.51.22-r3:

```
localhost edb # time emerge --metadata
>>> /var/cache/edb/dep doesn't exist, creating it...
skipping sync
>>> Updating Portage cache:    100%
 * IMPORTANT: 2 config files in /etc need updating.
 * Type emerge --help config to learn how to update config files.
 * An update to portage is available. It is _highly_ recommended
 * that you update portage now, before any other packages are updated.
 * Please do so and then update ALL of your configuration files.

real    9m2.288s
user    6m39.624s
sys     0m42.414s
```

sys-apps/portage-2.0.51.22:

```
localhost portage # rm -rf /var/cache/edb/dep/
localhost portage # time emerge --metadata
>>> /var/cache/edb/dep doesn't exist, creating it...
skipping sync
>>> Updating Portage cache:    100%
 * IMPORTANT: 2 config files in /etc need updating.
 * Type emerge --help config to learn how to update config files.
 * An update to portage is available. It is _highly_ recommended
 * that you update portage now, before any other packages are updated.
 * Please do so and then update ALL of your configuration files.

real    8m49.860s
user    6m36.501s
sys     0m43.019s
```

13 seconds or so of difference. There is a new cache rewrite; it will be in the next minor version. If you don't like it, you can apply the patch for it against rc7, or write your own patch for the slowness.

---

> Have you done any comparisons from rc7 and -r3? The % done has been a best
> guess at best and at worst it does this, sitting around 50% for most of the time
> instead of taking a smooth curve.
? When I did the demo I only deleted one file - and, as I said, it hangs around 50%-53% doing kde-base.cpickle. Doing strace -p <pid> at 50%-53% shows kde-base.cpickle being written [as the first comment in the bug report states]. I'm pretty sure the percentage stalls there because of that file being synced. So yeah, the curve isn't smooth, but that's because it sits doing kde-base.cpickle, and because the other cpickle files are a lot smaller by comparison.
The comparison in my first comment was between 2.0.51.22-r3 and _rc7. I wouldn't have noticed 13 seconds, though - I was looking for something dramatic - which I assume is what you say is coming soon. Hooray :)

Or at least I think hooray. But your end comments confuse me. What's there for me to like or dislike? Either the new 2500+ lines of code in the patch linked from this bug report give the same or a similarly dramatic speed-up as the few lines of changes I described, or they don't.

If they don't, then they don't fix the problem in this bug report - unless, as you say, we all write our own portage patches. Are you saying the new changes aren't compatible with the change I described? If they are, then is doing both going to speed it up too much? :)
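The two strategies being argued over here can be sketched in isolation (plain Python 3, with hypothetical `FlushyCache` / `LazyCache` names - these are not the actual portage classes). One flushes the whole pickle to disk after every single update; the other buffers in memory and writes once on close. Both end up with identical on-disk contents and differ only in how many flushes, and full rewrites, happen along the way:

```python
import os
import pickle
import tempfile

class FlushyCache:
    """Rewrites and fsyncs the whole pickle after every single update."""
    def __init__(self, path):
        self.path, self.data, self.flushes = path, {}, 0
    def __setitem__(self, key, value):
        self.data[key] = value
        with open(self.path, "wb") as f:
            pickle.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())   # forced out to disk on every update
        self.flushes += 1
    def close(self):
        pass

class LazyCache(FlushyCache):
    """Buffers updates in memory; one write + fsync when closed."""
    def __setitem__(self, key, value):
        self.data[key] = value     # no disk traffic here
    def close(self):
        with open(self.path, "wb") as f:
            pickle.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())
        self.flushes = 1

with tempfile.TemporaryDirectory() as tmp:
    results = {}
    for cls in (FlushyCache, LazyCache):
        cache = cls(os.path.join(tmp, cls.__name__ + ".pickle"))
        for i in range(100):
            cache["entry-%d" % i] = "metadata"
        cache.close()
        with open(cache.path, "rb") as f:
            results[cls.__name__] = (cache.flushes, pickle.load(f))

print(results["FlushyCache"][0], results["LazyCache"][0])  # 100 vs 1
assert results["FlushyCache"][1] == results["LazyCache"][1]
```

The trade-off named in the thread is visible here: if the process dies before close(), LazyCache has nothing on disk at all, while FlushyCache is current up to the last completed update.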
Clarification of the slowdown, and why I stated "cpickle isn't the issue, the abuse of it is". As y'all have caught on here, eclass db updates (portage_db_cpickle) are triggered for every new consumer of an eclass. The current code syncs every time, meaning it's essentially quadratic: for N eclass consumers, N full rewrites of contents that themselves grow with N.

The cache rewrite (the '2500 lines') removes the eclass db. Having it was a bad design anyway; what the rewrite does is centralize eclass data per cpv instead. The slowness (if any) is limited to the backend, rather than an additional, crappy, portage-mandated db. So with the rewrite you've got a 35%-65% improvement: 65% on fresh writes, 35% on scanning for staleness and doing no updates.

Why not just drop the sync calls? Well, frankly, that's dumb. If you drop the sync calls, a failed metadata run (a segfault, for example) means any ebuild that uses an eclass now has its cache invalidated. My cache rewrite doesn't suffer this: eclass data is stored per ebuild (which is good for remote trees/cache dbs also). So we're not removing the syncing. The cache rewrite allows non-autocommit cache backends (SQL-based, for example) to queue up updates, but that's a totally different scenario from disabling all syncing during metadata transfer, which is what you're advocating, and the effects of a segfault are massively different.

Meanwhile, for any tests stated here showing massive gains: you're getting kernel caching. It's a hell of a lot quicker reading from RAM buffers than from disk, so that testing is *not* accurate and should be ignored. Either flush the buffers (hard), reboot (easy), or unmount the partitions holding /var/cache/edb/dep and $PORTDIR, and retry the testing. This is why some people see <60s runs and others see the full run time.
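The crash behaviour of the per-ebuild layout can be sketched with the usual write-temp-then-rename pattern, which the original patch in this bug already uses (`update_fp` plus `os.rename`). Each record becomes its own file, renamed into place atomically, so a crash mid-run loses at most the record currently being written. A minimal sketch (the `store_record` helper and record contents are hypothetical, not the actual rewrite code):

```python
import os
import tempfile

def store_record(cache_dir, name, text):
    """Write one cache record atomically: temp file, fsync, then rename.

    A crash before os.rename leaves at most a stray temp file; every
    previously renamed record is still intact and valid.
    """
    tmp_path = os.path.join(cache_dir, ".update.%d.%s" % (os.getpid(), name))
    with open(tmp_path, "w") as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp_path, os.path.join(cache_dir, name))

with tempfile.TemporaryDirectory() as cache_dir:
    records = {"pkg-a": "DEPEND=x", "pkg-b": "RDEPEND=y", "pkg-c": "IUSE=z"}
    for name, text in records.items():
        if name == "pkg-c":
            break          # simulate the process dying partway through
        store_record(cache_dir, name, text)

    # Records completed before the "crash" survive, complete and readable.
    survivors = sorted(p for p in os.listdir(cache_dir) if not p.startswith("."))
    print(survivors)
```

This is why, with per-ebuild records, a segfault invalidates only the in-flight entry, whereas a single shared eclass db that is only synced at the end loses everything written since the last sync.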
The cache rewrite stats above are (due to memory limitations) not affected by buffering, so the 35%-65% improvement *is* accurate, although the ceiling and floor will vary a bit depending on the hardware configuration (mainly I/O speed, although a massively slow processor will rear its head also). The rewrite is in .54, not .53*. If you're looking to test it, grab the patch from http://dev.gentoo.org/~ferringb/portage/2.0/3.0-cache-backport-experimental-7.patch

---

(In reply to comment #11)
> Why not just drop the sync calls? Well, frankly, that's dumb. If you drop the
> sync calls, a failed metadata run (a segfault, for example) means any ebuild
> that uses an eclass now has its cache invalidated. My cache rewrite doesn't
> suffer this: eclass data is stored per ebuild (which is good for remote
> trees/cache dbs also).

Bah, if portage segfaults or fails to update the cache in any way, then we have more to worry about than some invalidated cache that is going to be detected anyway.

---

(In reply to comment #12)
> (In reply to comment #11)
> > Why not just drop the sync calls? Well, frankly, that's dumb. If you drop the
> > sync calls, a failed metadata run (a segfault, for example) means any ebuild
> > that uses an eclass now has its cache invalidated. My cache rewrite doesn't
> > suffer this: eclass data is stored per ebuild (which is good for remote
> > trees/cache dbs also).
>
> Bah, if portage segfaults or fails to update the cache in any way, then we have
> more to worry about than some invalidated cache that is going to be detected
> anyway.

To reword what Brian was saying: the current method syncs eclass data and then reads it back in, only to write it out again. The new method has just one sync per file. In other words, syncing after each write versus doing it in a batch at the end doesn't really make any difference (due to kernel caching), so why take the unnecessary risk?

---

Released in 2.1_pre1.

*** Bug 119255 has been marked as a duplicate of this bug. ***

*** Bug 123540 has been marked as a duplicate of this bug. ***

---

How long before 2.1 is marked as stable?

---

(In reply to comment #17)
> How long before 2.1 is marked as stable?

Well, we are still adding features and changing code around in SVN, and after that at least 30 days of testing is required. I don't think anyone has any hard dates.

*** Bug 129566 has been marked as a duplicate of this bug. ***