On sparc32-SMP (but apparently not on sparc64-SMP), if I have a multithreaded program containing two threads simultaneously computing 'd=cos(d),' say, I pretty quickly get a segmentation fault (sorry, I haven't figured out where).

1. It seems that if I have one, three, four, ... threads, the failure never (or very seldom) occurs.
2. If instead of 'd=cos(d)' I use 'd=random()', the problem never (or seldom) occurs.
3. I'll attach the actual program to a followup; I don't see a way to attach it to an original.

Reproducible: Always

Steps to Reproduce:
1. gcc -O2 -o spin spin.c -lm -lpthread
2. spin 2 M
3. Wait a few seconds

Actual Results: Segmentation fault.

Expected Results: Redline both CPUs forever.

fmccor@dragonfly:SIMULATION/DASSF/APP [377]% emerge info
Portage 2.0.48-r1 (default-sparc-1.4, gcc-3.2.2, glibc-2.3.1-r4)
=================================================================
System uname: 2.4.21-sparc-r1 sparc sun4m
GENTOO_MIRRORS="http://gentoo.oregonstate.edu http://www.ibiblio.org/pub/Linux/distributions/gentoo"
CONFIG_PROTECT="/etc /var/qmail/control /usr/share/config /usr/kde/2/share/config /usr/kde/3/share/config /usr/X11R6/lib/X11/xkb:/usr/share/texmf/tex/generic/config/ /usr/share/texmf/tex/platex/config/"
CONFIG_PROTECT_MASK="/etc/gconf /etc/env.d"
PORTDIR="/usr/portage"
DISTDIR="/usr/portage/distfiles"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR_OVERLAY=""
USE="sparc crypt fbcon imlib jpeg mikmod motif mpeg ncurses nls png qt spell truetype xv zlib gdbm berkdb slang readline tetex java X tcpd pam libwww ssl perl python gtk opengl gif tiff -arts -kde -gnome mysql ruby tcltk Xaw3d -oggvorbis -alsa"
COMPILER="gcc3"
CHOST="sparc-unknown-linux-gnu"
CFLAGS="-mtune=v8 -O2 -pipe"
CXXFLAGS="-mtune=v8 -O2 -pipe -Wno-deprecated -fpermissive"
ACCEPT_KEYWORDS="sparc"
MAKEOPTS="-j3"
AUTOCLEAN="yes"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
FEATURES="ccache"
Created attachment 13968 [details] The source of the segfault generator program. This is the source program 'spin.c' referenced in the 'steps to reproduce.' (Unfortunately, I suspect you need sun4m-SMP to get a failure anytime soon. At least, it never fails for me with anything else.)
I notice that in the attached version of the program, at line 108 there are 18 "PTHREAD_MUTEX_INITIALIZER"s missing. This has no effect on the interesting case (2); in fact it has no effect on anything, since such an initializer expands into six words of 0, but it sure looks bad.
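For the record, the fully written-out form of that initializer list (assuming the array is meant to hold 20 mutexes, which is what the program actually indexes; the name `locks` is mine) would look like:

```c
#include <pthread.h>

#define NTHREADS 20  /* assumed array size; the attachment had [2] */

/* Every element gets its own initializer.  Omitted elements are
 * zero-filled, which on this platform happens to coincide with the
 * six zero words PTHREAD_MUTEX_INITIALIZER expands to, so the
 * omission is harmless in practice -- but it is not portable. */
static pthread_mutex_t locks[NTHREADS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};
```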
And, at about line 105, the [2] needs to be a [20]. Still no effect for the main+2-threads (i.e., failing) case, but it will write into random memory if you try more. (Once I got the failure with 2, which is what I wanted, I should have just left it at that.)
For what it's worth --- not much, I think:
1. Fixing the 2-should-have-been-20 problems does not cure the problem;
2. The program runs fine on a Linux (but not Gentoo) Ultra-10 UP;
3. The SS20 version (as built on the SS20) runs fine on the U2-SMP;
4. I'm building gdb for the SS20, but I do not expect to learn much;
5. Figuring out how libm gets built appears to be beyond my capabilities, beyond that something magic must happen in glibc-2.3.1/math.
Stay tuned...
Last comment for the day. If I run the program on a very busy system (like one that's building gdb), the 1-thread case fails immediately if it tries to do any of the math stuff. (It does not matter whether the program is linked -static or not.) So, we don't like to do 'libm' stuff from a thread on a busy system.
Created attachment 14040 [details] Simplest stand-alone failure demo I can create. Compile as shown. This version of the program is simpler than the previous one, and the semantics of the mutex use are correct: a mutex will be unlocked only by its owner. To get a standalone failure, I cannot remove anything (I need main + 2 threads, libm in each thread, and a mutex exercise between main and each thread). I always see a failure within 45 seconds; 10 is typical.
If you build this program with gcc 2.95, the crash does not occur. Also, gdb shows us these registers after the crash:

...
o5   0x40901083   1083183235
sp   0x10658      67160        <<< Reasonable PC, bogus SP
o7   0x1065c      67164        <<< Reasonable NPC
l0   0xa0843fff   -1601945601
...
pc   0xf000       61440
npc  0x31407ad    51644333

So, we are looking for something very similar to the swapon bug - some kind of stack saving/restoring bug. I might as well take this; nobody else will want it.
Sorry, I've let this get stale. It's not unlikely that it's been fixed with the swapon bug. If not, I'll check 2.6 and see. If it's fixed in 2.6 I'm going to close it, which probably isn't fair since 2.6 is ~ and will be for quite some time. Unfortunately we don't have the resources to do much with 2.4 other than keep it compiling. That is, unless you're volunteering. ;-)
Fine with me since the specific application can be restructured. (It's a DaSSF simulation to run across several systems, and threading as opposed to two copies of the same program doesn't buy much.) If I had a clue where to look, I'd play with it, but with very little more priority than you are... :-) Thanks for the update.
A *really* stale bug, should be probably just closed.
Fixed....I guess? wesolows is no longer with us, due in part to his employer, so final status on this will remain unknown, I guess. Changing to NEEDINFO.
As of kernel 2.4.28, failing program (Attachment 14040 [details]) fails consistently within 10 seconds on SS20(2x75). No problem on sparc64 systems.