On sparc32-SMP (but apparently not on sparc64-SMP), if I have a multithreaded program containing two threads simultaneously computing 'd=cos(d),' say, I pretty quickly get a segmentation fault (sorry, I haven't figured out where).

1. It seems that if I have one, three, four, ... threads, the failure never (or very seldom) occurs.
2. If instead of 'd=cos(d)' I use 'd=random()', the problem never (or seldom) occurs.
3. I'll attach the actual program to a followup; I don't see a way to attach it to an original.

Reproducible: Always

Steps to Reproduce:
1. gcc -O2 -o spin spin.c -lm -lpthread
2. spin 2 M
3. Wait a few seconds

Actual Results: Segmentation fault.

Expected Results: Redline both CPUs forever.

fmccor@dragonfly:SIMULATION/DASSF/APP [377]% emerge info
Portage 2.0.48-r1 (default-sparc-1.4, gcc-3.2.2, glibc-2.3.1-r4)
=================================================================
System uname: 2.4.21-sparc-r1 sparc sun4m
GENTOO_MIRRORS="http://gentoo.oregonstate.edu http://www.ibiblio.org/pub/Linux/distributions/gentoo"
CONFIG_PROTECT="/etc /var/qmail/control /usr/share/config /usr/kde/2/share/config /usr/kde/3/share/config /usr/X11R6/lib/X11/xkb:/usr/share/texmf/tex/generic/config/ /usr/share/texmf/tex/platex/config/"
CONFIG_PROTECT_MASK="/etc/gconf /etc/env.d"
PORTDIR="/usr/portage"
DISTDIR="/usr/portage/distfiles"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR_OVERLAY=""
USE="sparc crypt fbcon imlib jpeg mikmod motif mpeg ncurses nls png qt spell truetype xv zlib gdbm berkdb slang readline tetex java X tcpd pam libwww ssl perl python gtk opengl gif tiff -arts -kde -gnome mysql ruby tcltk Xaw3d -oggvorbis -alsa"
COMPILER="gcc3"
CHOST="sparc-unknown-linux-gnu"
CFLAGS="-mtune=v8 -O2 -pipe"
CXXFLAGS="-mtune=v8 -O2 -pipe -Wno-deprecated -fpermissive"
ACCEPT_KEYWORDS="sparc"
MAKEOPTS="-j3"
AUTOCLEAN="yes"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
FEATURES="ccache"
Created attachment 13968 [details] The source of the segfault generator program. This is the source program 'spin.c' referenced in the 'steps to reproduce.' (Unfortunately, I suspect you need sun4m-SMP to get a failure anytime soon. At least, it never fails for me with anything else.)
I notice that in the attached version of the program, at line 108 there are 18 "PTHREAD_MUTEX_INITIALIZER"s missing. This has no effect on the interesting case (2); in fact it has no effect on anything, since such an initializer expands into six words of 0, but it sure looks bad.
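For the record, the fully written-out form of that initializer list (assuming the array is meant to hold 20 mutexes, which is what the program actually indexes; the name `locks` is mine) would look like:

```c
#include <pthread.h>

#define NTHREADS 20  /* assumed array size; the attachment had [2] */

/* Every element gets its own initializer.  Omitted elements are
 * zero-filled, which on this platform happens to coincide with the
 * six zero words PTHREAD_MUTEX_INITIALIZER expands to, so the
 * omission is harmless in practice -- but it is not portable. */
static pthread_mutex_t locks[NTHREADS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};
```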
And, at about line 105, the [2] needs to be a [20]. Still no effect for the main+2-threads (i.e., failing) case, but it will write into random memory if you try more. (Once I got the failure with 2, which is what I wanted, I should have just left it at that.)
For what it's worth --- not much, I think:
1. Fixing the 2-should-have-been-20 problems does not cure the problem;
2. The program runs fine on a Linux (but not Gentoo) Ultra-10 UP;
3. The SS20 version (as built on the SS20) runs fine on the U2-SMP;
4. I'm building gdb for the SS20, but I do not expect to learn much;
5. Figuring out how libm gets built appears to be beyond my capabilities, beyond that something magic must happen in glibc-2.3.1/math.
Stay tuned...
Last comment for the day. If I run the program on a very busy system (like one that's building gdb), the 1-thread case fails immediately if it tries to do any of the math stuff. (It does not matter whether the program is linked -static or not.) So, we don't like to do 'libm' stuff from a thread on a busy system.
Created attachment 14040 [details] Simplest stand-alone failure demo I can create. Compile as shown. This version of the program is simpler than the previous one, and the semantics of the mutex use are correct: a mutex will be unlocked only by its owner. To get a standalone failure, I cannot remove anything (I need main + 2 threads, libm in each thread, and a mutex exercise between main and each thread). I always see a failure within 45 seconds; 10 is typical.
If you build this program with gcc 2.95, the crash does not occur. Also, gdb shows us these registers after the crash:

...
o5   0x40901083   1083183235
sp   0x10658      67160        <<< Reasonable PC, bogus SP
o7   0x1065c      67164        <<< Reasonable NPC
l0   0xa0843fff   -1601945601
...
pc   0xf000       61440
npc  0x31407ad    51644333

So, we are looking for something very similar to the swapon bug - some kind of stack saving/restoring bug. I might as well take this; nobody else will want it.
Sorry, I've let this get stale. It's not unlikely that it's been fixed with the swapon bug. If not, I'll check 2.6 and see. If it's fixed in 2.6 I'm going to close it, which probably isn't fair since 2.6 is ~ and will be for quite some time. Unfortunately we don't have the resources to do much with 2.4 other than keep it compiling. That is, unless you're volunteering. ;-)
Fine with me since the specific application can be restructured. (It's a DaSSF simulation to run across several systems, and threading as opposed to two copies of the same program doesn't buy much.) If I had a clue where to look, I'd play with it, but with very little more priority than you are... :-) Thanks for the update.
A *really* stale bug, should be probably just closed.
Fixed....I guess? wesolows is no longer with us, due in part to his employer, so final status on this will remain unknown, I guess. Changing to NEEDINFO.
As of kernel 2.4.28, failing program (Attachment 14040 [details]) fails consistently within 10 seconds on SS20(2x75). No problem on sparc64 systems.