
Bug 108412

Summary: emerge metadata is slow: "cpickle" cache problem?
Product: Portage Development
Component: Core
Status: RESOLVED FIXED
Severity: normal
Priority: High
Version: 2.1
Hardware: All
OS: Linux
Reporter: Georgi Georgiev <chutz+bugs.gentoo.org>
Assignee: Portage team <dev-portage>
CC: aaron123456789, betelgeuse, bugs+gentoo, bugs, crusaderky, dan, frank_meier, jrmalaq, kalium, mal, marian, mlspamcb, orionbelt2, radek, rdalek1967, sirjoltalot
Whiteboard:
Package list:
Runtime testing required: ---
Bug Depends on:
Bug Blocks: 115839

Description Georgi Georgiev 2005-10-07 10:58:05 UTC
As many people have realized, "Updating cache" has become slower and slower. On
one of my machines, when it gets stuck around 50%, I do the following.

# ps -a (to get the PID of the emerge process)
# ls -l /proc/$thepid/fd/
and I see that /var/cache/edb/dep/usr/portage/kde-base-eclass.cpickle is one of
the files opened by the emerge process.

I start repeatedly running "ls -l" on that file, and I see that while its size
grows, it is also at 0 bytes more often than not. I.e., the sizes that I am able
to observe are:

0 0 100K 0 0 0 0 120K 0 0 0 n>120K 0 0 0 m>n 0 0 0 ....

etc.
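
(For reference, the repeated size check can be scripted instead of done by hand -
a throwaway observation sketch, not part of portage; the path is the one from above:)

import os, time

path = "/var/cache/edb/dep/usr/portage/kde-base-eclass.cpickle"
while True:
    try:
        print os.stat(path).st_size
    except OSError:
        print "missing"          # tolerate the file being momentarily absent
    time.sleep(0.1)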

A conclusion that I make from this observation is that the cpickle is truncated
and recreated for every package in kde-base that gets its metadata cached. I
have also observed a similar behavior with "getbinpkg": the cpickle cache of the
remote binary packages is truncated and regenerated for every remote binary package.

These are all assumptions that I made based on the "truncate, grow, truncate,
grow" observation. They could be very wrong, but it certainly seems that
dropping cpickle and using a standard db may offer a significant improvement.
After all, one cannot append data to a dumped cpickle, and recreating
kde-base-eclass.cpickle a total of... 974 times, especially as it grows up to
450K, is far from optimal.
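
(Roughly, the behaviour being described would look like the following - a
hypothetical sketch with made-up package names and a stub metadata function, not
actual portage code:)

import cPickle

def compute_metadata(pkg):
    # stand-in for the real metadata regeneration work
    return {"DEPEND": "stub data for " + pkg}

packages = ["kde-base/pkg-%d" % i for i in range(974)]

cache = {}                                          # whole eclass cache kept in memory
for pkg in packages:
    cache[pkg] = compute_metadata(pkg)
    # a dumped pickle cannot be appended to, so every update rewrites the lot:
    f = open("kde-base-eclass.cpickle", "wb")       # truncates the file to 0 bytes
    cPickle.dump(cache, f, cPickle.HIGHEST_PROTOCOL)
    f.close()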

This is with 2.0.51.22-r3. I haven't tried the newer versions, but a quick check
in #gentoo-portage confirmed that developers are seeing this behavior with (I
assume) the more bleeding-edge versions.
Comment 1 Brian Harring (RETIRED) gentoo-dev 2005-10-07 11:30:08 UTC
The eclass cache gets a full write every time around for an update.

I had a patch that sort of fixed this, although the results of it weren't as
impressive as I'd like.

Basically... you adjust the syncing rate of the eclass cache.  Sync to disk once
every N times rather than on every eclass update.  The downside is that a traceback
during metadata transfer would result in everything since the last eclass cache
sync being invalidated.
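
(In outline, something like the following - a sketch with invented names, not the
actual patch:)

import cPickle

SYNC_EVERY = 50                       # hypothetical batching factor

class DelayedSyncCache:
    def __init__(self, path):
        self.path = path
        self.data = {}
        self.pending = 0
    def update(self, key, value):
        self.data[key] = value
        self.pending += 1
        if self.pending >= SYNC_EVERY:
            self.sync()
    def sync(self):
        f = open(self.path, "wb")     # still a full rewrite of the pickle,
        cPickle.dump(self.data, f, cPickle.HIGHEST_PROTOCOL)
        f.close()                     # but only once every SYNC_EVERY updates
        self.pending = 0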

Additionally, this isn't a new issue.  It's existed for quite some time; all the
delayed syncing does is cut down on IO and a bit of proc usage.  On slow systems
it's an improvement, but I don't think it's going to give as major a gain as
people expect.
Comment 2 Brian Harring (RETIRED) gentoo-dev 2005-10-07 11:40:07 UTC
Additionally... the cache rewrite already addresses the separated eclass/ebuild
caching (2 files) issue, since eclass data sits in the repository cache.
Comment 3 Georgi Georgiev 2005-10-07 12:21:02 UTC
Am I to understand that this whole issue will be solved by that cache rewrite?

Anyway, just to illustrate the problem with cpickle, I did the following.

cd /usr/lib/portage/pym
cp portage_db_flat.py portage_db_flatany.py
fix the db_flatany to actually work as an alternative to cpickle, anydbm, etc...

--- portage_db_flat.py  2005-10-08 01:55:36.000000000 +0900
+++ portage_db_flatany.py       2005-10-08 03:57:25.000000000 +0900
@@ -6,6 +6,7 @@
 import types
 import os
 import stat
+import cPickle
 
 import portage_db_template
 
@@ -55,6 +56,11 @@
 
 
                # nuke the newlines right off the batt.
+               data = myf.read()
+               myval = cPickle.loads(data)
+               myf.close()
+               return myval;
+
                data = myf.read().splitlines()
                mdict = {}
 
@@ -86,12 +92,11 @@
 
                update_fp = self.fullpath + ".update." + str(os.getpid()) + "." + key
                myf = open(update_fp,"w")
-               myf.writelines( [ val[x] +"\n" for x in self.dbkeys] )
+               myf.writelines( cPickle.dumps(val,cPickle.HIGHEST_PROTOCOL) )
                myf.close()
 
                os.chown(update_fp, self.uid, self.gid)
                os.chmod(update_fp, 0664)
-               os.utime(update_fp, (-1,long(val["_mtime_"])))
                os.rename(update_fp, self.fullpath+key)
 
        def del_key(self,key):

The patch is left as short as possible for illustration purposes.

Now perform a few tests with either of

eclass_cache.dbmodule  = "portage_db_cpickle.database"
eclass_cache.dbmodule  = "portage_db_flatany.database"

in /etc/portage/modules

rm -rf /var/cache/edb/dep/usr/portage before every "time emerge metadata"

Results:


portage_db_cpickle: ~ user 1m18s (5 tries, all within 1 second)
portage_db_flatany: ~ user 0m11s (5 tries, all within 1 second)
portage_db_cpickle: real between 1m40s and 2m22s.
portage_db_flatany: real between 22s and 1m50s.

I should mention that plenty of disk activity is observed with flatany.

And for completeness:

$ du -hsc *-eclass/ | tail -n1
60M     total
$ du -hsc *-eclass.cpickle | tail -n1
2.6M    total

Still, it is obvious that on my machine cpickle cannot, even theoretically, go
under 1m18s, while flatany can and did. So, is this enough of an illustration of
how non-optimal cpickle is, at least in the case of the -eclass cache?

I'll wait for the new portage with the new caching.
Comment 4 Brian Harring (RETIRED) gentoo-dev 2005-10-07 13:09:06 UTC
cpickle isn't the issue as much as the abuse of it.

See my comment above about the fact that we're forcing a rewrite every time; your
reshuffling of the code demonstrates it - you're cpickling just the new additions
to the eclass cache, not the whole eclass cache.

Stupid question, but anyone got a reason the cache rewrite can't be backported
to stable?
It requires adjusting portdbapi.aux_get calls, breaks any custom caches people
have, won't play nice with anyone who is sharing their local cache with others
via NFS (a bad idea anyway), and finally requires tweaking the config code... but
it's also what's going to be used long term, and it nukes the cpickling issue.

jason, thoughts?
Comment 5 Brian Harring (RETIRED) gentoo-dev 2005-10-11 11:06:32 UTC
backport of the 3x cache to stable is available (and needs testing) via
http://dev.gentoo.org/~ferringb/potage/2.0/3.0-cache-backport.patch
Comment 6 Toralf Förster gentoo-dev 2005-10-22 10:08:52 UTC
typo fixed : http://dev.gentoo.org/~ferringb/portage/2.0/3.0-cache-backport.patch
Comment 7 Michael 2005-11-26 02:08:25 UTC
I respectfully disagree with you, Brian. Removing the sync calls completely [i.e.
letting .close, either when it finishes or when you hit ctrl-c, do the sync once]
and resetting, inside the sync function, the variable that sync checks to see
whether a sync is needed speeds it up no end.
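
(In other words, something along these lines - a sketch of the suggestion with
invented names, not a real patch against portage.py:)

import cPickle

class SyncOnCloseCache:
    def __init__(self, path):
        self.path = path
        self.data = {}
        self.dirty = False
    def update(self, key, value):
        self.data[key] = value
        self.dirty = True                 # no per-write sync
    def sync(self):
        if self.dirty:
            f = open(self.path, "wb")
            cPickle.dump(self.data, f, cPickle.HIGHEST_PROTOCOL)
            f.close()
            self.dirty = False            # reset the flag that sync checks
    def close(self):                      # reached at the end of the run or on ctrl-c
        self.sync()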

Who cares if it dies and you lose data? That'd be a bug that needs fixing, not a
reason to write bad code. It's a cache, not vital data - proven by the fact that,
to test whether removing the syncs sped it up, I've been deleting the files -
what's the point of syncing on every write something that you can delete and have
recreated anyway?

Besides, if it dies, then the metadata for every category / ebuild after the one
it crashed on would need to be redone. Same as if you just hit ctrl-c rather than
wait for the thing that's trying too hard [deleting and writing the same data
pointlessly hundreds of times] - in short, if it crashes, the previous syncs
didn't help.

I can't see it being just slow systems either because it's painfully slow even
though the biggest eclass file [kde-base, the one that everyone gets stuck at
50% on] is only about 699k - which is absolute peanuts to write to an old disk.

Ergo it seems to have nothing to do with a slow computer - whatever makes it slow
must be either cruft Python's cpickle does to write that data, or the fact that
the data is deleted and written out multiple times.

Unless an AMD64 3000 with sata 7200 disks counts as slow?

I know work is being done with backport patches et al but calling sync every
time is too ugly to live anyway and deserves to die on principle.

Perhaps I'm confused? I thought 2.0.53_rc7 had the cache speed up in? I upgraded
to that because of this slow cache around 50%. But that made no difference. After
looking at why, it seems to be because of the unnecessary syncs afaict [and I've
since found this bug].

So, if 2.0.53_rc7 is supposed to speed up the cache, then it doesn't - at least
for me - because what makes it slow is still in there.

If it isn't, then fair enough, some better changes are on the way perhaps, but
removing the sync calls here has made the so-called slow update cache go away -
i.e. afaict that is what made it slow - so please consider removing the syncs in
the interim; it's a tiny patch and a huge speed-up afaict.
Comment 8 Michael 2005-11-26 03:18:36 UTC
To illustrate, remove the kde-base cpickle file [rest of metadata is up to date]
michael ~ # rm /var/cache/edb/dep/usr/portage/kde-base-eclass.cpickle
michael ~ # time emerge --metadata
skipping sync

>>> Updating Portage cache:  100%

real    5m46.717s
user    5m25.056s
sys     0m21.217s

Over 5 minutes of peeing into the wind - burning CPU, not I/O, note - all around
the 50%-53% mark of the update.

Next, remove the file again, don't bother updating metadata and see how long
emerge -pvuD takes
michael ~ # rm /var/cache/edb/dep/usr/portage/kde-base-eclass.cpickle
michael ~ # time emerge -pvuD world

These are the packages that I would merge, in order:

Calculating world dependencies ...done!
[ebuild     U ] kde-base/kde-i18n-3.4.3 [3.4.1] -arts -debug +kdeenablefinal
-xinerama 0 kB
...snipped other kde-3.4.3 updates....

real    0m57.776s
user    0m40.743s
sys     0m15.489s

It's slow, but it takes 5x less than creating the cache.

Now remove the explicit sync calls from portage.py
michael ~ # #patched portage.py with vi here
michael ~ # rm /var/cache/edb/dep/usr/portage/kde-base-eclass.cpickle
michael ~ # time emerge --metadata
skipping sync

>>> Updating Portage cache:  100%

real    0m20.935s
user    0m16.873s
sys     0m3.688s

Yabba dabba doo, 1500% quicker

michael ~ # time emerge -pvuD world

These are the packages that I would merge, in order:

Calculating world dependencies ...done!
[ebuild     U ] kde-base/kde-i18n-3.4.3 [3.4.1] -arts -debug +kdeenablefinal
-xinerama 0 kB
...snipped other kde-3.4.3 updates....

real    0m5.534s
user    0m4.816s
sys     0m0.496s

...and cache is worth using.                                                     
Comment 9 Alec Warner (RETIRED) archtester gentoo-dev Security 2005-11-26 08:19:30 UTC
> Perhaps I'm confused? I thought 2.0.53_rc7 had the cache speed up in? I upgraded
> to that because of this slow cache around 50%. But that made no difference. After
> looking at why, it seems to be because of the unnecessary syncs afaict [and I've
> since found this bug].
>
> So, if 2.0.53_rc7 is supposed to speed up the cache, then it doesn't - at least
> for me - because what makes it slow is still in there.

No, there is a patch that must be applied to rc7.

> If it isn't, then fair enough, some better changes are on the way perhaps, but
> removing the sync calls here has made the so-called slow update cache go away -
> i.e. afaict that is what made it slow - so please consider removing the syncs in
> the interim; it's a tiny patch and a huge speed-up afaict.

Have you done any comparisons between rc7 and -r3?  The % done has been a best
guess at best, and at worst it does this: sitting around 50% for most of the time
instead of following a smooth curve.

sys-apps/portage-2.0.51.22-r3
localhost edb # time emerge --metadata
>>> /var/cache/edb/dep doesn't exist, creating it...
skipping sync

>>> Updating Portage cache:  100%

 * IMPORTANT: 2 config files in /etc need updating.
 * Type emerge --help config to learn how to update config files.


 * An update to portage is available. It is _highly_ recommended
 * that you update portage now, before any other packages are updated.
 * Please do so and then update ALL of your configuration files.


real    9m2.288s
user    6m39.624s
sys     0m42.414s

sys-apps/portage-2.0.51.22
localhost portage # rm -rf /var/cache/edb/dep/
localhost portage # time emerge --metadata
>>> /var/cache/edb/dep doesn't exist, creating it...
skipping sync

>>> Updating Portage cache:  100%

 * IMPORTANT: 2 config files in /etc need updating.
 * Type emerge --help config to learn how to update config files.


 * An update to portage is available. It is _highly_ recommended
 * that you update portage now, before any other packages are updated.
 * Please do so and then update ALL of your configuration files.


real    8m49.860s
user    6m36.501s
sys     0m43.019s

13 seconds or so of difference.

There is a new cache rewrite; it will be in the next minor version. If you don't
like that, you can apply the patch for it against rc7 or write your own patch for
the slowness.
Comment 10 Michael 2005-11-27 02:42:53 UTC
>Have you done any comparisons from rc7 and -r3?  The % done has been a best 
>guess at best and at worst it does this, sitting around 50% for most of the time 
>instead of taking a smooth curve.

? When I did the demo I only deleted one file - and, as I said, it hangs
around 50%-53% doing kde-base.cpickle. Doing strace -p <pid> at 50-53% shows
kde-base.cpickle being written [as the first comment in the bug report states].

I'm pretty sure the %age is because of that file being synced. So yeah, the
curve isn't smooth, but that's because it sits doing kde-base.cpickle and
because the other cpickle stuff is a lot smaller by comparison.

The comparison in my first comment was between 2.0.51.22-r3 and _rc7. 13 seconds
I wouldn't have noticed though - I was looking for something dramatic - which I
assume is what you say is coming soon. Hooray :)

Or at least I think hooray. But your end comments confuse me. What's for me to
like or dislike? Either the new 2500+ lines of code, like the patch almost linked
to this bug report, have the same / similar dramatic speed-up that the few lines
of changes I described have, or they don't.

If it doesn't, then it doesn't fix the problem in this bug report - unless, as
you say, we all write our own portage patches. Are you saying the new changes
aren't compatible with the change I described? If they are, then is doing both
going to speed it up too much? :)
Comment 11 Brian Harring (RETIRED) gentoo-dev 2005-12-01 00:11:20 UTC
Clarification of the slowdown, and why I stated "cpickle isn't the issue, the
abuse of it is".  As y'all have caught on here, eclass db updates
(portage_db_cpickle) are triggered for every new consumer of that eclass.  Current
code syncs every time, meaning it's essentially N^3, for N classes: N writes of
the class contents (as N grows).

The cache rewrite, the '2500 lines', removes the eclass db.  It's a bad design to
have it anyway; what the rewrite does is centralize eclass data per cpv instead.
The slowness (if any) is limited to the backend, rather than a crappy
portage-mandated additional db.
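
(Schematically, the difference is something like this - a hypothetical sketch of
the two layouts, with made-up keys and values, not the rewrite's actual on-disk
format:)

# old scheme: one shared eclass db per repository, rewritten as a whole
eclass_db = {
    "kde":      ("/usr/portage/eclass", 1130000000),    # (location, mtime)
    "multilib": ("/usr/portage/eclass", 1129000000),
}

# rewrite: the eclass names/mtimes travel with each package's own cache entry,
# so losing one entry never invalidates every consumer of an eclass
cpv_cache = {}
cpv_cache["kde-base/kdelibs-3.4.3"] = {
    "DEPEND": "kde-base/arts ...",                      # ordinary metadata keys
    "_eclasses_": {"kde": 1130000000, "multilib": 1129000000},
}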

So... with the rewrite you've got a 35%-65% improvement: 65% on fresh writes, 35%
on scanning for staleness and doing no updates.

Why not just drop the sync calls?  Well, frankly that's dumb.  If you drop the
sync calls, a failed metadata run (a segfault, fex) means any ebuild that uses an
eclass now has its cache invalidated.  My cache rewrite doesn't suffer from this -
eclass data is stored per ebuild (which is good for remote trees/cache dbs also).

So... we're not removing the syncing.  The cache rewrite allows for
non-autocommit cache backends (sql based, fex) to queue up updates, but that's a
totally different scenario from the disabling of all syncing during metadata that
you're advocating, and the effects of a segfault are pretty massively different.
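
(A non-autocommit backend in that sense queues writes and flushes them in one go
on commit - a sketch, with everything here invented for illustration:)

class QueuedCache:
    def __init__(self, store):
        self.store = store                    # the real backend (sql table, dbm, ...)
        self.queue = []
    def __setitem__(self, cpv, metadata):
        self.queue.append((cpv, metadata))    # nothing hits the backend yet
    def commit(self):
        for cpv, metadata in self.queue:
            self.store[cpv] = metadata        # one batched flush
        self.queue = []

backend = {}                                  # stand-in for a real cache backend
cache = QueuedCache(backend)
cache["kde-base/kdelibs-3.4.3"] = {"DEPEND": "stub"}
cache.commit()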

Meanwhile, for any tests stated here showing massive gains: well, you're getting
kernel caching.  It's a helluva lot quicker reading from RAM buffers than from
disk, so the testing is *not* accurate and should be ignored.  Either flush the
buffers (hard), reboot (easy), or unmount the partitions holding
/var/cache/edb/dep and $PORTDIR, and retry the testing.  This is why some people
see sub-60-second runs, and others see the full run time.

The cache rewrite stats above are (due to memory limitations) not affected by
buffering, so the 35%-65% improvement *is* accurate, although the ceiling and
floor will vary a bit depending on the hardware configuration (mainly IO speed,
although a massively slow proc will rear its head also).

The rewrite is in .54, not .53*.  If you're looking to test it, grab the patch from
http://dev.gentoo.org/~ferringb/portage/2.0/3.0-cache-backport-experimental-7.patch
Comment 12 Georgi Georgiev 2005-12-01 03:25:53 UTC
(In reply to comment #11)
> Why not just drop the sync calls?  Well, frankly that's dumb.  You drop the sync
> calls, a failed metadata (segfault fex) run means any ebuild that uses an eclass
> now has it's cache invalidated.  My cache rewrite doesn't suffer this- eclass
> data is stored per ebuild (which is good for remote trees/cache dbs also).  

Bah, if portage segfaults or fails to update the cache in any way, then we have
more to worry about than some invalidated cache that is going to be detected
anyway.
Comment 13 Jason Stubbs (RETIRED) gentoo-dev 2005-12-01 04:32:30 UTC
(In reply to comment #12)
> (In reply to comment #11)
> > Why not just drop the sync calls?  Well, frankly that's dumb.  If you drop the
> > sync calls, a failed metadata run (a segfault, fex) means any ebuild that uses
> > an eclass now has its cache invalidated.  My cache rewrite doesn't suffer from
> > this - eclass data is stored per ebuild (which is good for remote trees/cache
> > dbs also).
>
> Bah, if portage segfaults or fails to update the cache in any way, then we have
> more to worry about than some invalidated cache that is going to be detected
> anyway.
 
To reword what Brian was saying, the current method syncs eclass data and then 
reads it back in only to write it out again. The new method only has one sync 
per file. In other words, syncing after each write or doing it in a batch at 
the end doesn't really make any difference (due to kernel caching) so why take 
the unnecessary risk? 
 
Comment 14 Jason Stubbs (RETIRED) gentoo-dev 2005-12-19 07:30:20 UTC
Released in 2.1_pre1.
Comment 15 Jakub Moc (RETIRED) gentoo-dev 2006-01-17 01:01:38 UTC
*** Bug 119255 has been marked as a duplicate of this bug. ***
Comment 16 Jakub Moc (RETIRED) gentoo-dev 2006-02-20 14:45:35 UTC
*** Bug 123540 has been marked as a duplicate of this bug. ***
Comment 17 Alex Buell 2006-03-10 02:30:15 UTC
How long before 2.1 is marked as stable?
Comment 18 Alec Warner (RETIRED) archtester gentoo-dev Security 2006-03-10 09:03:53 UTC
(In reply to comment #17)
> How long before 2.1 is marked as stable?
> 

Well, we are still adding features and changing code around in SVN, and after that at least 30 days of testing is required. I don't think anyone has any hard dates.
Comment 19 Jakub Moc (RETIRED) gentoo-dev 2006-04-11 02:09:14 UTC
*** Bug 129566 has been marked as a duplicate of this bug. ***