It seems there is a race in glibc's semaphore implementation, specifically
around the interaction between sem_wait() and sem_post().
I'm not sure, since I'm not that handy with asm or the futex implementation, but the
attached program reproduces it reliably when run under valgrind, after a very
long time (it's much easier to reproduce with HT/SMP; I haven't reproduced it
on UP since I don't have a UP machine available for extended testing). I wrote
it after I got those messages while testing a library I wrote called
"libfilo", so it's not a theoretical weird case.
The test is quite simple: it has two threads, A and B. A creates a semaphore,
lets B know he can post on it, sem_wait()s on it, and then free()s it; while B
waits for A's notification and, when he receives it, sem_post()s on the
semaphore A has just created. They keep doing this forever, printing the
number of iterations periodically, just for show.
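For readers without the attachment, here is a minimal sketch of the loop described above. It is not the attached semtest.c: the names `waiter`, `poster` and `varsem` are borrowed from the gdb transcript later in this report, while `ready`, `rounds`, `posts_ok` and `run_rounds` are hypothetical, and the loop is bounded instead of running forever. The notification from A to B is assumed to be a second semaphore; the real test may use something else.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

/* Hypothetical shared state; the attached semtest.c may be organised differently. */
static sem_t *varsem;           /* semaphore created afresh by A on every round */
static sem_t ready;             /* A -> B notification: "varsem exists, post on it" */
static unsigned long rounds;    /* how many rounds both threads perform */
static unsigned long posts_ok;  /* successful sem_post() calls, written only by B */

/* Thread A: create, announce, wait, destroy, free. */
static void *waiter(void *unused)
{
    (void)unused;
    for (unsigned long i = 0; i < rounds; i++) {
        sem_t *s = malloc(sizeof *s);
        if (s == NULL || sem_init(s, 0, 0) != 0)
            abort();
        varsem = s;
        sem_post(&ready);   /* let B know it can post */
        sem_wait(s);        /* returns once B's post is visible */
        sem_destroy(s);
        free(s);            /* B may still be inside sem_post(s) at this point */
    }
    return NULL;
}

/* Thread B: wait for the notification, then post on A's semaphore. */
static void *poster(void *unused)
{
    (void)unused;
    for (unsigned long i = 0; i < rounds; i++) {
        sem_wait(&ready);
        if (sem_post(varsem) == 0)
            posts_ok++;
    }
    return NULL;
}

/* Runs n rounds and returns how many sem_post() calls reported success. */
unsigned long run_rounds(unsigned long n)
{
    pthread_t a, b;

    rounds = n;
    posts_ok = 0;
    if (sem_init(&ready, 0, 0) != 0)
        abort();
    if (pthread_create(&a, NULL, waiter, NULL) != 0)
        abort();
    if (pthread_create(&b, NULL, poster, NULL) != 0)
        abort();
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    sem_destroy(&ready);
    return posts_ok;
}
```

Calling `run_rounds(1000)` from a `main()` and running the result under valgrind is the general idea; a short bounded run like this is very unlikely to hit the window, which is why the original test loops indefinitely.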
When run under valgrind, after some time (on my P4 2.8GHz with HT it varies
between 30 mins and 2 hours) an error is shown, specifically "Syscall param
futex(futex) points to unaddressable byte(s)". The detailed valgrind output is
attached to avoid too much text here. After that, everything seems to keep on
working just fine.
My wild guess is that sem_post() in B wakes A and then performs the futex
system call. If A gets a chance to run before the syscall is performed (a very
small chance, since the wake-up and the futex call are probably close together), it
can free the semaphore (and the glibc internal futex that goes with it), and
then when B runs again and performs the syscall, it fails because the semaphore
he's doing the futex for has been free()d. But, as I said, it's just a guess.
The details on the machines and versions are below under "Additional
Please let me know if I can help you with anything else.
I'm reporting this here instead of glibc's bugzilla because I wanted to check
Thanks a lot,
Steps to Reproduce:
1. Download the "semtest.c" attachment
2. Compile with "gcc -Wall -g semtest.c -lpthread -o semtest" or similar
3. Run with "valgrind --tool=memcheck ./semtest", wait some time (can take a
long time, between 30m and 3h depending on your machine and luck)
4. See the valgrind warning
I've reproduced it several times on my box, a 2.8GHz P4 with HT and 512MB RAM.
It's running Gentoo with glibc 22.214.171.12440808-r1 using nptl, and gcc
126.96.36.19950130-r1. I don't have any fancy CFLAGS, just "-march=pentium4 -O3
-pipe". I've also got confirmation from a friend running Gentoo, same gcc, glibc
188.8.131.5241102-r1, on a dual Xeon.
Created attachment 59404
Created attachment 59405
I can't reproduce this anymore. Let me know if it's still an issue for you.
I can reproduce this on an x86_64 Opteron machine
(on RHEL5.3, not gentoo, but it's surely the same bug).
The most reliable way I've found
is to modify Alberto Bertogli's original test program slightly
(fill the deceased semaphore memory with 1's before freeing it,
and check the return value of sem_post),
and then carefully stop and resume the threads in gdb,
using the input shown below.
This is on RHEL5.3,
with /lib64/libpthread-2.5.so from rpm package glibc-2.5-34.el5_3.1,
and gcc from rpm package gcc-4.1.2-44.el5,
using gdb 7.1 or 7.2
(gdb 7.0.1 and earlier fail with a supposed syntax error
on the "b *(sem_post+4) thread 3").
% gdb ./semtest
# per http://sourceware.org/gdb/onlinedocs/gdb/Non_002dStop-Mode.html ...
# Enable the async interface.
set target-async 1
# If using the CLI, pagination breaks non-stop.
set pagination off
# Finally, turn it on!
set non-stop on
# thread 2 stops in waiter
# thread 3 stops in poster
b sem_wait thread 2
# thread 2 (waiter) stops at the beginning of sem_wait(varsem)
b *(sem_post+4) thread 3
# thread 3 (poster) stops at sem_post+4,
# after incrementing varsem->value (first 32-bit word)
# but before looking at varsem->nwaiters (second 32-bit word)
b free thread 2
# thread 2 (waiter) blasts through the sem_wait without blocking,
# calls sem_destroy(varsem),
# trashes the memory,
# and stops at the beginning of free
# thread 3 (poster) resumes in the middle of sem_post,
# looks at varsem->nwaiters and sees it's nonzero (trash)
# so it makes the FUTEX_WAKE syscall which returns EINVAL,
# the program exits with error message
# "sem_post() in poster: Invalid argument"
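The interleaving described in the comments above can also be replayed deterministically in a single thread, without gdb. The following is only a model, not glibc code: `struct model_sem` is a hypothetical layout inferred from the offsets used by sem_post in the transcript (value at offset 0, a flags word at offset 4, nwaiters at offset 8), and all function names are made up for illustration. It operates on a stack object rather than on freed memory so that the replay itself stays well-defined.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the semaphore layout; NOT glibc's real definition. */
struct model_sem {
    uint32_t value;         /* offset 0: the semaphore count */
    uint32_t private_flag;  /* offset 4: OR-ed into the futex operation */
    uint64_t nwaiters;      /* offset 8: number of blocked waiters */
};

/* First half of the post: increment the count. */
void post_increment(struct model_sem *s)
{
    s->value += 1;
}

/* Second half of the post: returns 1 if the poster would go on to issue
 * the FUTEX_WAKE syscall, 0 if it would return straight away. */
int post_would_wake(const struct model_sem *s)
{
    return s->nwaiters != 0;
}

/* Waiter fast path: take a token without blocking if one is available. */
int wait_nonblocking(struct model_sem *s)
{
    if (s->value == 0)
        return 0;
    s->value -= 1;
    return 1;
}

/* Poster completes before the waiter runs: nwaiters is still 0. */
int replay_uninterrupted(void)
{
    struct model_sem sem = {0, 0, 0};
    int wake;

    post_increment(&sem);
    wake = post_would_wake(&sem);
    (void)wait_nonblocking(&sem);
    return wake;
}

/* Poster is stopped between its two halves while the waiter runs. */
int replay_interleaving(void)
{
    struct model_sem sem = {0, 0, 0};

    post_increment(&sem);          /* thread 3 stops after the increment */
    if (!wait_nonblocking(&sem))   /* thread 2 passes sem_wait without blocking */
        return -1;
    memset(&sem, 1, sizeof sem);   /* thread 2 trashes the memory before free */
    return post_would_wake(&sem);  /* thread 3 resumes and reads trash */
}
```

In this model `replay_uninterrupted()` yields 0 and `replay_interleaving()` yields 1, that is, the poster only decides to make the futex call because it re-reads memory that the waiter was already entitled to destroy.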
Here is the actual gdb session transcript:
% gdb7.1 ./semtest
GNU gdb (GDB) 7.1
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
Reading symbols from /home/dhatch/tmp/semtest...done.
(gdb) # per http://sourceware.org/gdb/onlinedocs/gdb/Non_002dStop-Mode.html ...
(gdb) # Enable the async interface.
(gdb) set target-async 1
(gdb) # If using the CLI, pagination breaks non-stop.
(gdb) set pagination off
(gdb) # Finally, turn it on!
(gdb) set non-stop on
(gdb) b waiter
Breakpoint 1 at 0x400874: file semtest.c, line 25.
(gdb) b poster
Breakpoint 2 at 0x400990: file semtest.c, line 58.
Starting program: /home/dhatch/tmp/semtest
[Thread debugging using libthread_db enabled]
[New Thread 0x40800940 (LWP 11082)]
Breakpoint 1, waiter (unused=0x0) at semtest.c:25
25 unsigned int count = 0;
(gdb) [New Thread 0x41001940 (LWP 11083)]
Breakpoint 2, poster (unused=0x0) at semtest.c:58
58 unsigned int count = 0;
(gdb) t 2
[Switching to thread 2 (Thread 0x40800940 (LWP 11082))]#0 waiter (unused=0x0) at semtest.c:25
25 unsigned int count = 0;
(gdb) b sem_wait thread 2
Breakpoint 3 at 0x3f1040c670
Breakpoint 3, 0x0000003f1040c670 in sem_wait () from /lib64/libpthread.so.0
(gdb) t 3
[Switching to thread 3 (Thread 0x41001940 (LWP 11083))]#0 poster (unused=0x0) at semtest.c:58
58 unsigned int count = 0;
(gdb) b *(sem_post+4) thread 3
Breakpoint 4 at 0x3f1040c854
Breakpoint 4, 0x0000003f1040c854 in sem_post () from /lib64/libpthread.so.0
Dump of assembler code for function sem_post:
0x0000003f1040c850 <+0>: lock addl $0x1,(%rdi)
=> 0x0000003f1040c854 <+4>: cmpq $0x0,0x8(%rdi)
0x0000003f1040c859 <+9>: je 0x3f1040c874 <sem_post+36>
0x0000003f1040c85b <+11>: mov $0xca,%eax
0x0000003f1040c860 <+16>: mov $0x1,%esi
0x0000003f1040c865 <+21>: or 0x4(%rdi),%esi
0x0000003f1040c868 <+24>: mov $0x1,%edx
0x0000003f1040c86d <+29>: syscall
0x0000003f1040c86f <+31>: test %rax,%rax
0x0000003f1040c872 <+34>: js 0x3f1040c877 <sem_post+39>
0x0000003f1040c874 <+36>: xor %eax,%eax
0x0000003f1040c876 <+38>: retq
0x0000003f1040c877 <+39>: mov 0x209722(%rip),%rdx # 0x3f10615fa0
0x0000003f1040c87e <+46>: movl $0x16,%fs:(%rdx)
0x0000003f1040c885 <+53>: or $0xffffffffffffffff,%eax
0x0000003f1040c888 <+56>: retq
End of assembler dump.
(gdb) t 2
[Switching to thread 2 (Thread 0x40800940 (LWP 11082))]#0 0x0000003f1040c670 in sem_wait () from /lib64/libpthread.so.0
(gdb) b free thread 2
Breakpoint 5 at 0x400720
Breakpoint 5, 0x0000000000400720 in free@plt ()
(gdb) t 3
[Switching to thread 3 (Thread 0x41001940 (LWP 11083))]#0 0x0000003f1040c854 in sem_post () from /lib64/libpthread.so.0
sem_post() in poster: Invalid argument
[Thread 0x41001940 (LWP 11083) exited]
[Thread 0x40800940 (LWP 11082) exited]
Program exited with code 01.
I'll attach the modified semtest.c separately.
Created attachment 269761
the original test program with a couple of modifications:
fills the deceased semaphore memory with 1's before freeing it,
and checks the return value from sem_post().
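A minimal sketch of what those two modifications could look like, assuming they are factored into helpers. The function names `destroy_poison_free` and `checked_post` are hypothetical and the attached file may well do this inline; only sem_destroy(), memset(), free(), sem_post() and perror() are real library calls.

```c
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Waiter side (hypothetical helper): destroy the semaphore, overwrite its
 * memory with 1 bytes so any later read sees obviously bogus data, then
 * release it. */
void destroy_poison_free(sem_t *s)
{
    sem_destroy(s);
    memset(s, 1, sizeof *s);
    free(s);
}

/* Poster side (hypothetical helper): report a failing sem_post() instead
 * of silently ignoring it. Returns 0 on success, -1 on failure. */
int checked_post(sem_t *s, const char *where)
{
    if (sem_post(s) != 0) {
        perror(where);  /* prints "<where>: <errno description>" */
        return -1;
    }
    return 0;
}
```

With `where` set to "sem_post() in poster", a failure with EINVAL would print the same message seen at the end of the gdb session; the caller can then exit with a non-zero status.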
you're using an ancient kernel/glibc neither of which we support. file a bug with redhat.
Thank you, I'll do that.
I was adding in response to Mark Loeser's request for a test case,
but I can see that this was probably never the right place for this bug to be reported in the first place.
I notice the status of this keeps getting changed to FIXED (which seems unlikely)
rather than WORKSFORME or INVALID--
INVALID seems most appropriate at this point.
the original test and your updated test do not cause problems on updated glibc. thus "FIXED" sounds appropriate. bugs that only exist in old glibc versions do not qualify as "INVALID" or "WONTFIX".
SpanKY, do you have any evidence that this bug is fixed in any version of anything?
Have you actually tried it?
I don't believe it's fixed--
I've just downloaded the current glibc source as of today
and compiled x86_64/sem_post.S and x86_64/sem_wait.S from it,
and I get the exact same problem; the only difference is
that I had to put the breakpoint at *(sem_post+18) instead of *(sem_post+4)
because some unrelated stuff apparently got added to the beginning of the function
since my ancient version of glibc.
And I believe my ancient kernel isn't relevant because by the time it gets
into the kernel code,
sem_post has already done the bad thing (namely accessed no-longer-valid
memory, which may have been already freed and/or even munmapped).
Again-- if you can't reproduce something despite your valiant efforts,
WORKSFORME might be appropriate.
If you're not interested because it's someone else's bug and doesn't belong
on this list (which I believe is the case here), INVALID might be appropriate.
Calling something FIXED when you have no evidence that it is indeed fixed is misleading.
of course i ran both tests and they both ran fine. just like other Gentoo maintainers.
so again, if you have a problem on your *redhat* system, file a bug with *redhat*. or if you have a problem with *upstream glibc*, file a bug with *upstream glibc*. i dont know why you're posting here at all considering Gentoo is neither redhat nor upstream glibc.
To answer your rhetorical non-question,
I posted the new information here because the new information
is relevant to this already-posted bug, which you guys (specifically Mark Loeser)
expressed interest in and were having a hard time reproducing.
I was actually trying to be helpful, though I never expected you guys
to be the ones to fix it.
The bug is not fixed, and your brief dismissal
leads me to suspect that you didn't look at what you were doing
closely enough for you to even have an opinion on it.
"They both ran fine" isn't enough;
as already stated, the bug is extremely hard to reproduce and
you need a debugger to do it reliably...
and, since the sources have changed slightly, you'd need to
understand what you're doing and set the breakpoint in a slightly different place
from where I set mine.
Of course you're under no obligation to care,
it's just annoying me that you keep calling it FIXED when it's clear that it's not.
I've filed the bug with redhat/glibc.
For reference, it's here: http://sources.redhat.com/bugzilla/show_bug.cgi?id=12674