Bug 273936 - =sys-kernel/gentoo-sources-2.6.30* randomly freezes (upstream SMP bug FIXED in GIT 2.6.31 kernels)
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: All Linux
Importance: High major
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL:
Whiteboard: linux-2.6.31
Keywords:
Depends on:
Blocks:
 
Reported: 2009-06-13 02:46 UTC by Roger
Modified: 2009-09-19 01:45 UTC

See Also:
Package list:
Runtime testing required: ---


Attachments
My Kernel Config (.config,72.46 KB, text/plain)
2009-06-13 02:48 UTC, Roger

Description Roger 2009-06-13 02:46:38 UTC
System totally locks up at random.


History:
2.6.28* The last stable for me on x86.
2.6.29* Encountered suspend/hibernate issues with net.eth0 started.
2.6.30* System randomly freezes.

Reproducible: Always

Steps to Reproduce:
Comment 1 Roger 2009-06-13 02:48:14 UTC
Created attachment 194467
My Kernel Config

For kicks, here's my kernel config.


Since the system log isn't showing much, all I can do is wait. :-/
Comment 2 Roger 2009-06-13 06:46:39 UTC
This seems to be more of a bug in the combination of the =nvidia-driver-180.60 driver and the =sys-kernel/gentoo-sources-2.6.30-r1 sources.


Using kernel 2.6.29, booting with framebuffer enabled did not hinder Xorg at all with the binary nvidia driver (although it has been known to do so on previous versions).

However, 2.6.30* booting with framebuffer enabled conflicts with X/Xorg usage causing a hard system freeze!


(To the best of my knowledge, this is the source of my bug, so I am renaming the title more appropriately!)
Comment 3 Roger 2009-06-13 06:47:39 UTC
If somebody is on the NVidia mailing list, please feel free to report upstream! 
Comment 4 Roger 2009-06-13 08:17:54 UTC
Well, I was chugging along for several hours on 2.6.30, and it froze without framebuffer enabled.

The only thing I had done within the past few minutes was change the symlink /usr/src/linux to point to /usr/src/linux-2.6.30-gentoo-r1.


(Changing the title back to a general hard system freeze on this kernel.  This can probably be further troubleshot by running without any tainted modules.  Past experience dictates it usually is the nvidia binary driver just acting up at very peculiar times!)
Comment 5 Jeroen Roovers (RETIRED) gentoo-dev 2009-06-13 14:58:45 UTC
Please reopen this bug report once you have figured out what causes your system to hang. Right now it looks like it could be anything, including hardware failures induced by newer kernels' features.

Extra hints:

1) What really /is/ a "hard freeze"?
2) Severity should certainly not be "Blocker".
3) You omitted posting `emerge --info'.
Comment 6 Roger 2009-06-13 22:43:43 UTC
> Please reopen this bug report once you have figured out what causes your system
> to hang. Right now it looks like it could be anything, including hardware
> failures induced by newer kernels' features.

Right now, that is ruled out by running kernel-2.6.29*.  I'm doing just fine with 2.6.29.  2.6.30* *is* crashing.  I'll gladly follow up stating I'm wrong, but 2.6.29* was up all last night compiling.

New kernel features are why I filed this bug.  If that is the cause, it could be a very, very long time until I track it down!  (So, file it and make some noise so others are aware something is going on.)


> Extra hints:

> 1) What really /is/ a "hard freeze"?
A "hard freeze" is shorthand for a computer freezing with no way to avert it or recover from it.  Basically, the SysRq keys stop functioning, as do the ttys (usually including external serial terminals).  However, I think it's a lot easier for experienced Linux users to just state something like "Hard Freeze".


Based on numerous past experiences with the NVidia binary driver, these hard system freezes with sparse debug info are usually caused by the proprietary binary drivers, which makes them a very good starting point for debugging.

Leaving as "NEEDINFO" as I have no additional relevant info besides being able to theorize it's likely the NVidia driver -- especially due to the numerous kernel changes, as well as past history of the NVidia binary driver!

*NOTE* this bug occurs without framebuffer enabled.  However, it makes itself more readily apparent at the start of X with it enabled.

Comment 7 wylda 2009-08-19 17:47:40 UTC
Hi,

You can check the following bug report to see whether there is some similarity.


Regards,
Pavel Vilim

[Bug 13933] System lockup on dual Pentium-3 with kernel

http://bugzilla.kernel.org/show_bug.cgi?id=13933
Comment 8 Roger 2009-08-20 07:16:30 UTC
I've run this down to two bugs with kernel-2.6.30

1) http://bugzilla.kernel.org/show_bug.cgi?id=13992
If the user has data=ordered among the ext3 mount options in /etc/fstab, booting halts because the kernel cannot mount the filesystem; the kernel feature is now optional and considered undesirable.  <shrugs> I prefer stability rather than corrupt photos.

2) http://bugzilla.kernel.org/show_bug.cgi?id=13991
The Intel e100 module is locking up the kernel randomly.  It is tough to debug because even the serial console freezes.  But this could be related to the other numerous PCI bus issues, such as the bug you suggested.
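
For item 1 above, a quick way to see whether a machine is exposed is to scan the fstab for ext3 entries that set a data= mount option. The following is only a minimal sketch: the function name and the sample fstab content are made up for illustration, and it assumes the usual whitespace-separated fstab layout (device, mount point, type, options).

```python
def ext3_entries_with_data_option(fstab_text):
    """Return (mount_point, data_mode) for each ext3 entry that sets data=..."""
    hits = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type != "ext3":
            continue
        for opt in options.split(","):
            if opt.startswith("data="):
                hits.append((mount_point, opt[len("data="):]))
    return hits


# Hypothetical sample, not taken from the reporter's system.
SAMPLE_FSTAB = """\
# <fs>     <mountpoint> <type> <opts>               <dump/pass>
/dev/sda1  /boot        ext2   noauto,noatime        1 2
/dev/sda3  /            ext3   noatime,data=ordered  0 1
/dev/sda4  /home        ext3   noatime               0 2
"""
```

On a real system one would pass in the contents of /etc/fstab instead of the sample string; any entry reported with "ordered" is one to look at before booting 2.6.30.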

Ditto, I do have dual P3's, but I got 16+ hours of uptime before rebooting without loading e100 (and with it blacklisted from loading).  So I'm pointing my finger at the e100 module, or maybe a PCI bus related issue.  But with the numerous changes to e100.c in the last kernel version, bets are on e100.c as the cause.
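
The comment does not say how e100 was blacklisted. Assuming the usual modprobe mechanism (a "blacklist e100" line in a file under /etc/modprobe.d/), a small sketch like the one below can confirm that the entry is really there before a long test run. The function names and sample text are illustrative only.

```python
def blacklisted_modules(conf_text):
    """Collect module names from 'blacklist <name>' lines of modprobe config text."""
    names = set()
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if parts[0] == "blacklist" and len(parts) >= 2:
            # modprobe treats '-' and '_' in module names as equivalent.
            names.add(parts[1].replace("-", "_"))
    return names


def is_blacklisted(conf_text, module):
    """True if `module` has a blacklist entry in the given config text."""
    return module.replace("-", "_") in blacklisted_modules(conf_text)


# Hypothetical config content for illustration.
SAMPLE_MODPROBE_CONF = """\
# keep the NIC driver out while testing for freezes
blacklist e100
"""
```

Note that a blacklist entry only stops automatic loading by alias; the module can still be loaded by hand, so an uptime test should also check that e100 is not currently loaded.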
Comment 9 Roger 2009-08-20 16:21:42 UTC
I've examined http://bugzilla.kernel.org/show_bug.cgi?id=13933
and it appears to be a bug that is triggered only on SMP systems.

A workaround (hack) is to boot with the "nosmp" kernel boot parameter.  If it resolves the freezes, the problem is related to Linux Kernel Bug 13933.  As of now, the only way to detect this bug is with the "nosmp" parameter, as they haven't been able to get standard debugging working with the kernel yet.  From what I've scanned over, it looks like they're using GIT to try out patches.
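
When running a multi-night test like this, it helps to confirm that the running kernel was really booted with the flag. A minimal sketch, assuming the boot parameters are read from /proc/cmdline (the helper names are made up here):

```python
def has_boot_flag(cmdline, flag="nosmp"):
    """True if `flag` appears as a whole word among the kernel boot parameters."""
    return flag in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Return the boot parameters of the running kernel as one string."""
    with open(path) as f:
        return f.read()
```

For example, has_boot_flag(read_cmdline()) on the test box tells you whether the current uptime counts as a "nosmp" run or not.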

For the next few nights, I'll simply be testing the "nosmp" flag, along with determining whether the e100.c code is in any way involved.

<shrugs> Would rather get decent GDB/KGDB output rather than playing games with the kernel.  ;-)
Comment 10 David W Noon 2009-08-30 17:21:45 UTC
I can add some further confirmation of this problem.

I have been experiencing random freezes on a machine with two AMD Athlon MP processors (i.e. an SMP system) and with an Intel 100-megabit Ether Express NIC chipset. These problems only began when I upgraded to a 2.6.30 kernel, specifically gentoo-sources-2.6.30-r4.

As of now, I have reverted to a 2.6.29 kernel.
Comment 11 Roger 2009-08-30 19:25:03 UTC
David, you need to see the Linux Kernel Bug I stated in Comment  #9.  Please go to the actual kernel.org url.  The wiki incorrectly assumes the shortened Bug # as a gentoo.org bug.  This should be a kernel.org bug.

And, this is fixed upstream in kernel.org.

Gentoo should probably backport the 2.6.31 fix to 2.6.30, as they have stated 2.6.30 as being rock stable (even after all of us SMP'ers complaining -- but will probably wait like me until 2.6.31 is out).
Comment 12 David W Noon 2009-08-30 21:39:57 UTC
(In reply to comment #11)
> David, you need to see the Linux Kernel Bug I stated in Comment  #9.  Please go
> to the actual kernel.org url.  The wiki incorrectly assumes the shortened Bug #
> as a gentoo.org bug.  This should be a kernel.org bug.

Thanks, Roger.  That bug reads like a whodunnit!

I'll wait and see what happens with the Gentoo kernels.
Comment 13 Roger 2009-08-30 21:54:45 UTC
In the meantime, I have reopened this bug, as it hasn't yet been fixed on Gentoo (downstream for 2.6.30) and it is a severe/critical issue on all SMP boxes.

Since gentoo-sources still has 2.6.30 as stable, I'll wait to mark it fixed/resolved until either 1) the patch is backported or 2) (more likely) gentoo-sources releases >2.6.30 with the patch incorporated.

I too am now sticking with either 2.6.29 (or git-sources-2.6.31*).

Thanks for tracking down this bug and verifying.
Comment 14 Wormo (RETIRED) gentoo-dev 2009-09-01 06:38:55 UTC
That upstream SMP bug does look very relevant. Assigning report to kernel team, I think they will want to see this...
Comment 15 Richard Gray 2009-09-06 02:21:38 UTC
(In reply to comment #14)
> That upstream SMP bug does look very relevant. Assigning report to kernel team,
> I think they will want to see this...

I've been experiencing this problem ever since I tried to update from 2.6.28-r5 straight to the 2.6.30-rx series. All the 30's seem to have this instability and I haven't been able to keep a machine running more than a day at best. I'm using old Dell server boxes (PIII's) with a frame-buffer, but I think it's an ATI Rage 128 video adapter, so I think the problem probably is not an nVidia issue. I do wonder about the e100 net card driver though - I am using that. On the 2.6.28 build I used the eepro100 driver, and I did have to delve into the kernel config to bring e100 into play after it appeared that eepro100 had been deleted (humph).

I wish I could shed more light on this, but there's sod-all in the logs so I'm as much in the dark as everyone else really. I only post this comment in the hope that some thread emerges from our respective experiences.

Comment 16 Roger 2009-09-06 05:44:58 UTC
<sigh!>

This is a KERNEL BUG!  It has been resolved upstream!

See for more info:
http://bugzilla.kernel.org/show_bug.cgi?id=13933

(FYI: This bug is still not fixed in 2.6.30* series.  It is fixed in GIT versions.)

I'm going to change the title of this bug to show it's resolved upstream.
Comment 17 Roger 2009-09-09 06:37:46 UTC
The fix for this bug:
x86: don't call '->send_IPI_mask()' with an empty mask

... has now been officially incorporated in 2.6.30.6, released today, Sep 09, 2009.

http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.30.6

... so we just need to wait for this release to be merged into Portage now. ;-)
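
For readers wondering what the patch title means, here is a toy model of that kind of guard. This is not the kernel patch: CPU masks are modelled as Python integers used as bitsets, send_ipi_mask() is a hypothetical stand-in for a low-level sender that must never see an empty mask, and the step that narrows the request to online CPUs is an assumption made only so the mask has a way to end up empty.

```python
def send_ipi_mask(mask, sent_log):
    """Hypothetical low-level sender; an empty mask is treated as a caller bug."""
    if mask == 0:
        raise ValueError("send_ipi_mask called with an empty CPU mask")
    sent_log.append(mask)


def notify_cpus(requested, online, sent_log):
    """Send to the requested CPUs that are online, skipping the call if none are."""
    target = requested & online  # bitwise AND keeps only CPUs present in both masks
    if target:
        # The guard: only reach the sender when at least one CPU bit is set.
        send_ipi_mask(target, sent_log)
    return target
```

The point of the sketch is just the "if target:" check: the caller filters the mask first and simply does nothing when the result is empty, instead of handing an empty mask to the sender.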
Comment 18 George Kadianakis (RETIRED) gentoo-dev 2009-09-16 14:51:54 UTC
The patch is now released with the new genpatches-2.6.30-8 release.
Comment 19 Roger 2009-09-18 22:30:35 UTC
I'm presuming a new gentoo-sources with this patch included will not be released for some time???

From what I'm seeing gentoo-sources-2.6.30-r6 still is *only* at patch "genpatches-2.6.30-7 release"!

As such, I think it's a bit premature to close the bug since the release hasn't been published yet.
Comment 20 Mike Pagano gentoo-dev 2009-09-18 23:40:08 UTC
(In reply to comment #19)
> I'm presuming a new gentoo-sources with this patch included will not be
> released for some time???
> 
> From what I'm seeing gentoo-sources-2.6.30-r6 still is *only* at patch
> "genpatches-2.6.30-7 release"!
> 

gentoo-sources-2.6.30-r7[1] contains genpatches-2.6.30-8 [2]. Not sure what you're looking at.

[1] http://sources.gentoo.org/viewcvs.py/gentoo-x86/sys-kernel/gentoo-sources/gentoo-sources-2.6.30-r7.ebuild?view=markup
[2] http://sources.gentoo.org/viewcvs.py/linux-patches/genpatches-2.6/tags/2.6.30-8/
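
Going by the comments above (the patch arrived in genpatches-2.6.30-8, which gentoo-sources-2.6.30-r7 contains), a small sketch can tell whether a given gentoo-sources package name is new enough. The function names are invented for illustration, and treating every version after 2.6.30 as fixed is an assumption based on the bug title saying the fix is in 2.6.31.

```python
import re

# Matches names such as "gentoo-sources-2.6.30-r7" or "gentoo-sources-2.6.31".
_PKG = re.compile(r"^gentoo-sources-(\d+(?:\.\d+)*)(?:-r(\d+))?$")


def parse_pkg(name):
    """Split a gentoo-sources package name into (version_tuple, revision)."""
    m = _PKG.match(name)
    if m is None:
        raise ValueError("not a gentoo-sources package name: %r" % name)
    version = tuple(int(part) for part in m.group(1).split("."))
    revision = int(m.group(2) or 0)  # no -rN suffix means revision 0
    return version, revision


def has_smp_fix(name):
    """True if the package should carry the fix, per the comments in this bug."""
    version, revision = parse_pkg(name)
    if version == (2, 6, 30):
        return revision >= 7  # -r7 is the first revision with genpatches-2.6.30-8
    # Assumption: anything newer than 2.6.30 has the fix; anything older never got it.
    return version > (2, 6, 30)
```

So has_smp_fix("gentoo-sources-2.6.30-r6") is False and has_smp_fix("gentoo-sources-2.6.30-r7") is True, which matches the r6 versus r7 confusion in the last few comments.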
Comment 21 Roger 2009-09-19 00:08:51 UTC
Latest gentoo-source ChangeLog entry:

*gentoo-sources-2.6.30-r7 (16 Sep 2009)

  16 Sep 2009; Mike Pagano <mpagano@gentoo.org>
  +gentoo-sources-2.6.30-r7.ebuild:
  Linux patch versions 2.6.30.6 and 2.6.30.7. Header fix for sysrq.h to
  include errno.h.
Comment 22 Mike Pagano gentoo-dev 2009-09-19 01:45:55 UTC
(In reply to comment #21)
> Latest gentoo-source ChangeLog entry:


Your point?