I have seen one or more other people hit either this or something like it, and today I hit it myself: the cache occasionally gets a completely bogus entry (in the main other case I remember, KEYWORDS was empty; in my case, most other entries are empty or incorrect). Killing the cache entry causes a correct one to be generated. Attaching the incorrect and correct cache entries. ferringb said he'd look into this later.

emerge info for the sake of completeness. The portage version there is not entirely accurate: only portage.py and ebuild.sh are from svn, the rest is from pre4-r1. The other case I remember was with pre4-r1.

Portage 2660-svn (default-linux/x86/2005.1, gcc-3.4.5, glibc-2.3.6-r2, 2.6.15-gentoo-r2-m1-restart i686)
=================================================================
System uname: 2.6.15-gentoo-r2-m1-restart i686 AMD Athlon(tm) XP 2600+
Gentoo Base System version 1.12.0_pre15
dev-lang/python:     2.4.2-r1
sys-apps/sandbox:    1.2.17
sys-devel/autoconf:  2.13, 2.59-r7
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r1
sys-devel/binutils:  2.16.1-r1
sys-devel/libtool:   1.5.22
virtual/os-headers:  2.6.11-r3
ACCEPT_KEYWORDS="x86 ~x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-ggdb -O2 -march=athlon-xp -pipe -fomit-frame-pointer -fweb -frename-registers"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3/share/config /usr/share/X11/xkb /usr/share/config /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/splash /etc/terminfo /etc/texmf/web2c /etc/env.d"
CXXFLAGS="-ggdb -O2 -march=athlon-xp -pipe -fomit-frame-pointer -fweb -frename-registers -fvisibility-inlines-hidden"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoaddcvs autoconfig candy collision-protect confcache cvs digest distlocks sandbox sfperms sign splitdebug strict stricter test userpriv usersandbox verify-rdepend"
GENTOO_MIRRORS="http://ftp.easynet.nl/mirror/gentoo/"
LANG="en_US.UTF-8"
PKGDIR="/var/chroot/var/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/var/cvsco/gentoo-x86"
PORTDIR_OVERLAY="/usr/local/portage/gnome /usr/local/portage/mzbot-overlay /usr/local/portage/private /usr/local/portage/bzr-overlay /usr/local/portage/twisted-divmod-overlay"
SYNC="cvs://marienz@cvs.gentoo.org:/var/cvsroot"
USE="x86 3dnow 3dnowex 3dnowext X a52 alsa ansi apache2 bonobo browserplugin bzip2 cairo canvas ccache cdr crypt dbus dri eds emacs evo evo2 faad fbcon firefox flac gif glibc-omitfp glitz glut gnome gpgme gstreamer gtk hal imap javascript jit jpeg kdeenablefinal kqemu libcaca mad maildir md5sum mikmod mmx mmx2 mmxext mng mono mozdevelop moznomail mozsvg mpeg nautilus ncurses nethack network new-login nfs nls nntp no-helpbrowser no-old-linux nonfsv4 nptl nptlonly nsplugin ogg oggvorbis opengl pam png preview-latex python qemu-fast quicktime readline real rtc sdl softmmu spell splash sse ssl startup-notification svg symlink tetex theora truetype truetype-fonts unicode userlocales vorbis win32codecs xinerama xml2 xscreensaver xv zlib elibc_glibc kernel_linux userland_GNU video_cards_ati video_cards_fglrx"
Unset: ASFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, LC_ALL, LDFLAGS, LINGUAS, MAKEOPTS
Created attachment 79022 [details] Bogus cache entry
Created attachment 79023 [details] sane cache entry
auxdbkey locking is most likely at fault; fix this by using fd_pipes to feed the keys back to the python side. Need to find out why locking is screwing up here also; it may not be the fault, but the auxdbkey pipe change mentioned above is required for systems where locking sucks.
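The fd_pipes idea above can be sketched roughly like this (a hypothetical illustration only, not portage's actual fd_pipes API; the child below is a stand-in for ebuild.sh): the parent hands the metadata-generating child a pipe fd and reads the keys back directly, so no shared temp file, and therefore no locking, is involved.

```python
import os
import subprocess
import sys

def read_keys_via_pipe():
    """Read KEY=value metadata lines back from a child over a pipe."""
    read_fd, write_fd = os.pipe()
    # Stand-in for ebuild.sh: a child that writes its keys to the given fd.
    child_code = (
        "import os, sys\n"
        "fd = int(sys.argv[1])\n"
        "os.write(fd, b'DEPEND=foo\\nKEYWORDS=x86\\n')\n"
        "os.close(fd)\n"
    )
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code, str(write_fd)],
        pass_fds=(write_fd,),  # keep the write end open in the child
    )
    os.close(write_fd)         # parent holds only the read end
    with os.fdopen(read_fd, "rb") as pipe:
        raw = pipe.read()      # EOF arrives once the child closes its end
    proc.wait()
    return dict(line.split("=", 1) for line in raw.decode().splitlines())

print(read_keys_via_pipe()["KEYWORDS"])  # -> x86
```

Since only the parent that spawned the child ever sees the data, two concurrent regenerations cannot trample each other's keys the way they can with a shared aux_db_key_temp file.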
Locally generated cache from ebuild.sh contains UNUSED_* keys. The lack of UNUSED_* keys in the bogus entry suggests that the keys (maybe not the corruption) originated in the pre-generated rsync cache. On a side note, the aux_db_key_temp locking seems to work fine (apparently not the cause of this corruption, at least).
(In reply to comment #4)
> Locally generated cache from ebuild.sh contains UNUSED_* keys. The lack of
> UNUSED_* keys in the bogus entry suggests that the keys (maybe not the
> corruption) originated in the pre-generated rsync cache.

Doubt that's the cause. If it were, the failures here would be *deterministic*, which they aren't. Thus, indicative that it's a race.

> On a side note, the aux_db_key_temp locking seems to work fine (apparently
> not the cause of this corruption, at least).

The fact that killing the entry (thus triggering another regeneration run) produces a correct result makes me think auxdb locking is *not* working for those afflicted.
opteron246 ~ # rm -rf /var/cache/edb/dep/*
opteron246 ~ # runit() { python -c 'import portage; pdb = portage.portdb; z = [pdb.aux_get(y, []) for y in portage.flatten([pdb.xmatch("match-all", x) for x in pdb.cp_all()])]' ; }
opteron246 ~ # runit & runit
[1] 8276
waiting for lock on /var/cache/edb/dep/aux_db_key_temp.portage_lockfile
Traceback (most recent call last):
  File "<string>", line 1, in ?
  File "/usr/lib/portage/pym/portage.py", line 4933, in aux_get
    mydata[auxdbkeys[x]] = mylines[x]
IndexError: list index out of range
aux_get(): (1) Error in app-accessibility/dasher-3.2.18 ebuild.
           Check for syntax error or corruption in the ebuild. (--debug)
Traceback (most recent call last):
  File "<string>", line 1, in ?
  File "/usr/lib/portage/pym/portage.py", line 4924, in aux_get
    raise KeyError
KeyError
[1]+  Exit 1                  runit
(In reply to comment #5)
> Thus, indicative that it's a race.

I fixed one race (committed in revision 2712) and Jason's test now works for me. It looks like there may be some more issues and/or room for improvement in portage_locks, so I'll be looking at it some more...
Created attachment 79816 [details] test case which detects bugs in portage_locks I've written a script which shows that portage_locks is still buggy.
@zac: str, not repr.

Aside from that, another race exists in unlock. The logic involved there looks pretty flawed; one race is unlocking and then removing the lockfile (thus invalidating the lock). Even if that's corrected by unlinking the file before unlocking, it still leaves a race in place: another locker can already have the fd open and gain a lock bound to the now-unlinked inode. So... are we really bound to hardlink locking? If we drop that, I've got code that covers the exclusive locking...
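The unlink race above can be demonstrated with a minimal sketch (plain flock on a scratch file, not portage's actual hardlink-based code): once the lockfile path is unlinked, a lock held on the old inode excludes nobody, because the next locker recreates the path as a fresh inode.

```python
import fcntl
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.portage_lockfile")

# Locker A opens the lockfile and takes an exclusive lock.
fd_a = os.open(path, os.O_CREAT | os.O_RDWR, 0o660)
fcntl.flock(fd_a, fcntl.LOCK_EX | fcntl.LOCK_NB)

# An unlocker removes the path while A still holds its lock.
os.unlink(path)

# Locker B recreates the path -- a *different* inode -- and its "exclusive"
# lock also succeeds immediately, so mutual exclusion is gone.
fd_b = os.open(path, os.O_CREAT | os.O_RDWR, 0o660)
fcntl.flock(fd_b, fcntl.LOCK_EX | fcntl.LOCK_NB)  # no EWOULDBLOCK raised

inode_a = os.fstat(fd_a).st_ino
inode_b = os.fstat(fd_b).st_ino
print("both exclusive locks held; same inode?", inode_a == inode_b)
```

If both flock calls raced on the same inode, the second LOCK_NB attempt would raise EWOULDBLOCK; here it succeeds because the two fds point at different inodes.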
(In reply to comment #9)
> Aside from that, another race exists in unlock.

I've committed a minimal fix for unlockfile() in revision 2714. The patch removes a myfd != None test because there are no cases in portage_locks where myfd will be assigned None. Testing shows that the locking seems to work much better now. The portage_locks code still leaves something to be desired, but at least this test case is working for now (and these fixes should cause no regressions).
(In reply to comment #10)
> removes a myfd != None test because there are no cases in portage_locks where
> myfd will be assigned None.

Well, that was a bad assumption (fixed in revision 2715), but I'm pretty confident there are no regressions now.
The race still exists: trace what lockfile and unlockfile do when wantnewlockfile=1 is enabled. During the unlock call it will still remove the lockfile after unlocking, thus (essentially) invalidating the lock. Removal of the lockfile isn't the issue; the fact that the locker doesn't protect itself is.
(In reply to comment #12)
> the locker doesn't protect itself

I accounted for that by checking os.fstat(myfd).st_nlink _after_ the lock has been acquired (in both lockfile and unlockfile). This ensures that a lock is held on a file that is actually linked to the filesystem and therefore visible to competing processes so that they can cooperate (a "good lock"). As long as processes cooperate and get a "good lock" before they do anything, it's safe for the process holding the "good lock" to do anything it wants. If a process were to unlink a file that another process had a "good lock" on, then it wouldn't be cooperating.
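The st_nlink check described above can be sketched like this (a hypothetical flock-based illustration, not the actual portage_locks code; the function name and retry limit are made up): after acquiring the lock, verify the file is still linked into the filesystem, and retry on the recreated path if not.

```python
import errno
import fcntl
import os
import tempfile

def acquire_good_lock(path, max_tries=100):
    """Retry until we hold a lock on a file that is still linked into
    the filesystem (st_nlink > 0), i.e. a lock competing processes
    can actually see and cooperate with."""
    for _ in range(max_tries):
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o660)
        fcntl.flock(fd, fcntl.LOCK_EX)
        if os.fstat(fd).st_nlink > 0:
            return fd  # a "good lock": the path still names this inode
        # The file was unlinked while we waited, so this lock is bound
        # to a dead inode; drop it and race for the recreated path.
        os.close(fd)
    raise OSError(errno.EAGAIN, "could not get a good lock on " + path)

# Usage sketch:
lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")
fd = acquire_good_lock(lock_path)
os.close(fd)
```

The same verification on the unlock side keeps an unlockfile() call from removing a lockfile that another process already holds a "good lock" on.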
Released in 2.1_pre5-r1.
*** Bug 98446 has been marked as a duplicate of this bug. ***