121819 – race issues with lockfile() resulting in metadata cache corruption

Bug 121819 - race issues with lockfile() resulting in metadata cache corruption

Summary: race issues with lockfile() resulting in metadata cache corruption

Status:	RESOLVED FIXED

Alias:	None

Product:	Portage Development
Classification:	Unclassified
Component:	Core (show other bugs)
Hardware:	All Linux

Importance:	High normal (vote)
Assignee:	Portage team

URL:
Whiteboard:
Keywords:	InVCS

Duplicates (1):	98446 (view as bug list)
Depends on:
Blocks:	115839
	Show dependency tree

Reported:	2006-02-06 05:40 UTC by Marien Zwart (RETIRED)
Modified:	2006-07-10 10:48 UTC (History)
CC List:	1 user (show)

See Also:
Package list:
Runtime testing required:	---

Attachments
Bogus cache entry (seahorse-0.8.txt,260 bytes, text/plain) 2006-02-06 05:41 UTC, Marien Zwart (RETIRED)	Details
sane cache entry (seahorse-0.8-new.txt,1.94 KB, text/plain) 2006-02-06 05:41 UTC, Marien Zwart (RETIRED)	Details
test case which detects bugs in portage_locks (lock_race_test.py,1.22 KB, text/plain) 2006-02-14 20:21 UTC, Zac Medico	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Marien Zwart (RETIRED) gentoo-dev

2006-02-06 05:40:16 UTC

I have seen one or more other people hit either this or something like it, and today I hit it myself: the cache occasionally gets a completely bogus entry in it (at least KEYWORDS empty for the main other case I remember, most other entries empty or incorrect for me). Killing the cache entry causes a correct one to be generated. Attaching the incorrect and correct cache entry. ferringb said he'd look into this later.

emerge info for the sake of completeness. The portage version there is not entirely accurate, only portage.py and ebuild.sh are from svn, the rest is from pre4-r1. The other case I remember was with pre4-r1.

Portage 2660-svn (default-linux/x86/2005.1, gcc-3.4.5, glibc-2.3.6-r2, 2.6.15-gentoo-r2-m1-restart i686)
=================================================================
System uname: 2.6.15-gentoo-r2-m1-restart i686 AMD Athlon(tm) XP 2600+
Gentoo Base System version 1.12.0_pre15
dev-lang/python:     2.4.2-r1
sys-apps/sandbox:    1.2.17
sys-devel/autoconf:  2.13, 2.59-r7
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r1
sys-devel/binutils:  2.16.1-r1
sys-devel/libtool:   1.5.22
virtual/os-headers:  2.6.11-r3
ACCEPT_KEYWORDS="x86 ~x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-ggdb -O2 -march=athlon-xp -pipe -fomit-frame-pointer -fweb -frename-registers"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3/share/config /usr/share/X11/xkb /usr/share/config /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/splash /etc/terminfo /etc/texmf/web2c /etc/env.d"
CXXFLAGS="-ggdb -O2 -march=athlon-xp -pipe -fomit-frame-pointer -fweb -frename-registers -fvisibility-inlines-hidden"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoaddcvs autoconfig candy collision-protect confcache cvs digest distlocks sandbox sfperms sign splitdebug strict stricter test userpriv usersandbox verify-rdepend"
GENTOO_MIRRORS="http://ftp.easynet.nl/mirror/gentoo/"
LANG="en_US.UTF-8"
PKGDIR="/var/chroot/var/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/var/cvsco/gentoo-x86"
PORTDIR_OVERLAY="/usr/local/portage/gnome /usr/local/portage/mzbot-overlay /usr/local/portage/private /usr/local/portage/bzr-overlay /usr/local/portage/twisted-divmod-overlay"
SYNC="cvs://marienz@cvs.gentoo.org:/var/cvsroot"
USE="x86 3dnow 3dnowex 3dnowext X a52 alsa ansi apache2 bonobo browserplugin bzip2 cairo canvas ccache cdr crypt dbus dri eds emacs evo evo2 faad fbcon firefox flac gif glibc-omitfp glitz glut gnome gpgme gstreamer gtk hal imap javascript jit jpeg kdeenablefinal kqemu libcaca mad maildir md5sum mikmod mmx mmx2 mmxext mng mono mozdevelop moznomail mozsvg mpeg nautilus ncurses nethack network new-login nfs nls nntp no-helpbrowser no-old-linux nonfsv4 nptl nptlonly nsplugin ogg oggvorbis opengl pam png preview-latex python qemu-fast quicktime readline real rtc sdl softmmu spell splash sse ssl startup-notification svg symlink tetex theora truetype truetype-fonts unicode userlocales vorbis win32codecs xinerama xml2 xscreensaver xv zlib elibc_glibc kernel_linux userland_GNU video_cards_ati video_cards_fglrx"
Unset:  ASFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, LC_ALL, LDFLAGS, LINGUAS, MAKEOPTS

Comment 1 Marien Zwart (RETIRED) gentoo-dev

2006-02-06 05:41:13 UTC

Created attachment 79022 [details]
Bogus cache entry

Comment 2 Marien Zwart (RETIRED) gentoo-dev

2006-02-06 05:41:41 UTC

Created attachment 79023 [details]
sane cache entry

Comment 3 Brian Harring (RETIRED) gentoo-dev

2006-02-06 05:46:17 UTC

auxdbkey locking most likely is at fault, fix this by using fd_pipes to feed the keys back to python side.

need to find out why locking is screwing up here also; it may not be the fault, but the auxdbkey pipe changed mentioned above is required for systems where locking sucks.

Comment 4 Zac Medico gentoo-dev

2006-02-13 16:53:40 UTC

Locally generated cache from ebuild.sh containts UNUSED_* keys.  The lack of UNUSED_* keys in the bogus entry suggests that the keys (maybe not the corruption) originated in the pre-generated rsync cache.  On a side note, the aux_db_key_temp locking seems to work fine (apparently not the cause of this corruption, at least).

Comment 5 Brian Harring (RETIRED) gentoo-dev

2006-02-13 22:18:52 UTC

(In reply to comment #4)
> Locally generated cache from ebuild.sh containts UNUSED_* keys.  The lack of
> UNUSED_* keys in the bogus entry suggests that the keys (maybe not the
> corruption) originated in the pre-generated rsync cache.

Doubt that's the cause.  If it were, there would be *deterministic* failings here, which there aren't.  Thus, indicative that it's a race.

> On a side note, the
> aux_db_key_temp locking seems to work fine (apparently not the cause of this
> corruption, at least).
The fact that killing the entry, thus triggering another run of it makes me think auxdb locking is *not* working for those afflicted.

Comment 6 Jason Stubbs (RETIRED) gentoo-dev

2006-02-14 03:41:40 UTC

opteron246 ~ # rm -rf /var/cache/edb/dep/*
opteron246 ~ # runit() { python -c 'import portage; pdb = portage.portdb; z = [pdb.aux_get(y, []) for y in portage.flatten([pdb.xmatch("match-all", x) for x in pdb.cp_all()])]' ; }
opteron246 ~ # runit & runit
[1] 8276
waiting for lock on /var/cache/edb/dep/aux_db_key_temp.portage_lockfile
Traceback (most recent call last):
  File "<string>", line 1, in ?
  File "/usr/lib/portage/pym/portage.py", line 4933, in aux_get
    mydata[auxdbkeys[x]] = mylines[x]
IndexError: list index out of range

aux_get(): (1) Error in app-accessibility/dasher-3.2.18 ebuild.
               Check for syntax error or corruption in the ebuild. (--debug)

Traceback (most recent call last):
  File "<string>", line 1, in ?
  File "/usr/lib/portage/pym/portage.py", line 4924, in aux_get
    raise KeyError
KeyError
[1]+  Exit 1                  runit

Comment 7 Zac Medico gentoo-dev

2006-02-14 14:43:38 UTC

(In reply to comment #5)
>  Thus, indicative that it's a race.

I fixed one race (committed in revision 2712) and Jason's test now works for me.  It looks like there may be some more issues and/or room for improvement in portage_locks, so I'll be looking at it some more...

Comment 8 Zac Medico gentoo-dev

2006-02-14 20:21:25 UTC

Created attachment 79816 [details]
test case which detects bugs in portage_locks

I've written a script which shows that portage_locks is still buggy.

Comment 9 Brian Harring (RETIRED) gentoo-dev

2006-02-15 01:50:32 UTC

@zac: str, not repr
Aside from that, another race exists in unlock.

The logic involved there looks pretty flawed; one race is unlocking then removing the lockfile (thus invalidating the lock); even if that's corrected via unlinking the file then unlocking, it leaves a race in place for a locker to have the fd open, gain the lock bound to the unlinked inode.

So... are we really bound to hardlink locking?  If we drop that, I've got code that covers the the exclussive locking...

Comment 10 Zac Medico gentoo-dev

2006-02-15 10:03:15 UTC

(In reply to comment #9)
> Aside from that, another race exists in unlock.

I've commited a minimal fix for unlockfile() in revision 2714.  The patch removes a myfd != None test because there are no cases in portage_locks where myfd will be assigned None.  Testing shows that the locking seems to work much better now.

The portage_locks code still leaves something to be desired, but at least this test case is working for now (and these fixes should cause no regressions).

Comment 11 Zac Medico gentoo-dev

2006-02-15 10:45:41 UTC

(In reply to comment #10)
> removes a myfd != None test because there are no cases in portage_locks where
> myfd will be assigned None.

Well, that was a bad assumption (fixed in revision 2715), but I'm pretty confident there are no regressions now.

Comment 12 Brian Harring (RETIRED) gentoo-dev

2006-02-15 15:53:06 UTC

race still exists- track what lockfile and unlockfile do when wantnewlockfile=1 is enabled; during the unlock call it'll still remove the lockfile after unlocking, thus invalidating the lock (essentially).

Removal of the lock isn't an issue, the fact the locker doesn't protect itself is.

Comment 13 Zac Medico gentoo-dev

2006-02-15 21:23:38 UTC

(In reply to comment #12)
> the locker doesn't protect itself

I accounted for that by checking os.fstat(myfd).st_nlink _after_ the lock has been acquired (in both lockfile and unlockfile).  This ensures that a lock is held on a file that is actually linked to the filesystem and therefore visible by competing processes so that they can cooperate (a "good lock").

As long as processes cooperate and get a "good lock" before they do anything, it's safe for the process holding the "good lock" to do anything it wants.  If a process were to unlink a file that another process had a "good lock" on, then it wouldn't be cooperating.

Comment 14 Zac Medico gentoo-dev

2006-02-26 14:58:42 UTC

Released in 2.1_pre5-r1.

Comment 15 Zac Medico gentoo-dev

2006-07-10 10:48:38 UTC

*** Bug 98446 has been marked as a duplicate of this bug. ***