I decided to do a fresh install of Gentoo last weekend for fun. I got a kernel BUG about 5 times during an emerge -ev world. I rebooted between each BUG. I recompiled my kernel between some of them.

From dmesg:
-------------------------------------------------------------------
sh[15252]: segfault at 0000000000000004 rip 000000000041b901 rsp 00007fffc9a9fd08 error 4
Eeek! page_mapcount(page) went negative! (-1)
  page pfn = 3b077
  page->flags = 4000000000010068
  page->count = 1
  page->mapping = ffff81003d73e378
  vma->vm_ops = 0xffffffff805bbbc0
  vma->vm_ops->nopage = filemap_nopage+0x0/0x350
  vma->vm_file->f_op->mmap = xfs_file_mmap+0x0/0x30
------------[ cut here ]------------
kernel BUG at mm/rmap.c:588!
invalid opcode: 0000 [1] SMP
CPU 0
Modules linked in: w83627ehf i2c_isa k8temp hwmon i2c_dev i2c_core radeon drm hci_usb ehci_hcd uhci_hcd usbcore snd_emu10k1 snd_rawmidi snd_ac97_codec ac97_bus snd_pcm snd_timer snd_page_alloc snd_util_mem snd_hwdep snd
Pid: 15252, comm: sh Not tainted 2.6.20-gentoo-r8 #1
RIP: 0010:[<ffffffff8020acb5>] [<ffffffff8020acb5>] page_remove_rmap+0xf5/0x120
RSP: 0000:ffff810022911bd8 EFLAGS: 00010292
RAX: 0000000000000037 RBX: ffff810001ce9a08 RCX: ffffffff803b6150
RDX: 00000000ffffff01 RSI: 0000000000000000 RDI: ffffffff805adb7c
RBP: ffff81003439ff00 R08: 0000000000004e26 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000002 R12: 000000000041b000
R13: 00000000004b0000 R14: 0000000000000020 R15: 00000000003fbfe8
FS: 00002acde16c16d0(0000) GS:ffffffff805e0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000004 CR3: 000000002b11e000 CR4: 00000000000006e0
Process sh (pid: 15252, threadinfo ffff810022910000, task ffff81003dd88920)
Stack: 000000000041b000 ffff81002b8f50d8 ffff810001ce9a08 ffffffff802079fc
 6339613966643034 0000000000000000 ffff810022911ce8 ffffffffffffffff
 0000000000000000 ffff81003439ff00 ffff810022911cf0 0000000000000000
Call Trace:
 [<ffffffff802079fc>] unmap_vmas+0x44c/0x7c0
 [<ffffffff8023a299>] exit_mmap+0x79/0x100
 [<ffffffff8023cacc>] mmput+0x3c/0xd0
 [<ffffffff8021574a>] do_exit+0x20a/0x830
 [<ffffffff8028df08>] __dequeue_signal+0x168/0x1e0
 [<ffffffff8024b352>] do_group_exit+0x82/0x90
 [<ffffffff8022b918>] get_signal_to_deliver+0x418/0x450
 [<ffffffff8025e58e>] do_notify_resume+0xce/0x740
 [<ffffffff8028f1d5>] force_sig_info+0xb5/0xd0
 [<ffffffff8020a96b>] do_page_fault+0x60b/0x860
 [<ffffffff8023c629>] remove_wait_queue+0x19/0x60
 [<ffffffff80228bf5>] do_wait+0xaa5/0xbb0
 [<ffffffff8020cf1f>] dput+0x2f/0x170
 [<ffffffff80261af8>] retint_signal+0x3d/0x85

Code: 0f 0b eb fe 8b 77 18 48 83 c4 08 5b 5d 83 f6 01 83 e6 01 e9
RIP [<ffffffff8020acb5>] page_remove_rmap+0xf5/0x120
 RSP <ffff810022911bd8>
<1>Fixing recursive fault but reboot is needed!
---------------------------------------------------------

Here is more dmesg output:
http://qabe.net/kernel_bug/dmesg1
http://qabe.net/kernel_bug/dmesg2
http://qabe.net/kernel_bug/dmesg3

These all happened while recompiling world. The first hung sh. The latter 2 hung gcc. After the third, the kernel froze (got blinking keyboard lights) in the middle of a reboot and X still had my monitor, so I don't know what happened.

Reproducible: Couldn't Reproduce

After the first time, I decided to recompile my kernel, removing a few unneeded drivers and features. The second time, I decided to enable the "optimize for size" option to see if I could obscure the bug a little. The third time, I went back and compiled a bunch of drivers as modules.

Every reference to this bug elsewhere (earliest reference on LKML is 2.6.16) is on 32-bit, not 64-bit. I couldn't find a fix posted elsewhere.

The first time it happened, voluntary preemption was selected in the kernel. I selected preemption instead. I only have my most recent kernel config:
http://qabe.net/kernel_bug/config
This config is different from the first two, but not by that much.
My emerge was actually interrupted more than 3 times, but I got lazy. I'll try to be more methodical.
Can you reproduce this w/ 2.6.21-r3?
Wrote a shell script called loop.sh:

    #!/bin/sh
    if [ -n "$1" ] ; then
        while /bin/true
        do
            $1
            echo Press CTRL-C now.
            sleep 1
        done
    fi

and ran

    ./loop.sh emerge\ -v\ mplayer

Sure enough, another BUG! dmesg output at http://qabe.net/kernel_bug/dmesg4. I don't know how long it took.

I'm installing a newer kernel, but I found a message on lkml that looks like this bug is happening in 2.6.21, too. http://lkml.org/lkml/2007/5/2/277
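Since the loop above gives no indication of how long the BUG took to appear, a variant that stops on its own and reports the run count and elapsed time might look like the sketch below. It is only an illustration: the function names, the LOG_CMD and PAUSE variables, and the idea of matching the "kernel BUG at mm/rmap.c" line are assumptions made here, not something used in this report.

```shell
#!/bin/sh
# Sketch only: repeat a command until the kernel log shows the rmap BUG.
# LOG_CMD and PAUSE are hypothetical knobs; LOG_CMD defaults to dmesg and
# can be pointed at a saved log instead (e.g. LOG_CMD="cat /tmp/dmesg4").
LOG_CMD=${LOG_CMD:-dmesg}
PATTERN='kernel BUG at mm/rmap.c'

# Succeeds if the log currently contains the BUG line.
bug_seen() {
    $LOG_CMD 2>/dev/null | grep -F -q "$PATTERN"
}

# $1 is the command line to repeat, e.g. "emerge -v mplayer".
loop_until_bug() {
    start=$(date +%s)
    runs=0
    until bug_seen; do
        runs=$((runs + 1))
        sh -c "$1"
        sleep "${PAUSE:-1}"
    done
    echo "BUG seen after $runs run(s), $(( $(date +%s) - start )) seconds"
}
```

After sourcing the file, it would be called as: loop_until_bug "emerge -v mplayer". Note that the log is checked before the first run, so a BUG already present in dmesg from earlier in the same boot ends the loop immediately.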
Updated to gentoo-sources-2.6.21-r3. Looped emerging mplayer again... and the BUG is still there, although the line of code in rmap.c moved from 588 to 596. dmesg output at http://qabe.net/kernel_bug/dmesg5
This bug looks like a duplicate of 138366 and 138863, only with a newer kernel, 64-bit architecture, newer gcc, newer glibc, and my kernel isn't tainted. Also, I have ECC memory and error correction/detection is enabled in my BIOS. (Although I don't have the k8 EDAC patches in my kernel, so I don't know what's going on.)

I'll be disabling all unnecessary drivers, one at a time, to see if I can get a change in behavior. My girlfriend is leaving town for a month, so I should have some free time. I have a lot of experience debugging C from my last job. Is there an equivalent to breakpoints/gdb for the kernel?

I'd like to point out that this problem never occurred with my last install. The differences between this install and the last were: was 32-bit, now 64-bit; was primary drive PATA (non-libata VIA IDE drivers), now primary drive SATA (libata sata_via driver). When I'm compiling all day, the disk drivers are used the most (I'm guessing). Maybe I'll start there.
http://qabe.net/kernel_bug/lspci
I had libata VIA PATA support and libata VIA SATA support both enabled in my kernel. On a hunch, I disabled the libata VIA PATA support and rebooted. I have not had a single BUG in 24 hours of compiling. Note that the only things actually plugged into the PATA ports are my CD-ROM drives, and they are never used. I'll report again soon. I'm still using gentoo-sources-2.6.21-r3.
I'll try the current kernel with old VIA IDE drivers. I'll also try 2.6.22-rc4 when I get a chance. It looks like sata_via.c had a bunch of work done to it.
Update: I have been compiling for 36 hours without a BUG. I'm about to reboot and start testing with libata VIA SATA and libata VIA PATA support enabled in 2.6.22-rc4.
This bug has not appeared yet in 2.6.22-rc4. For the sake of this bug, I will continue testing for a few more hours, but I have my new motherboard in the other room and I'm growing impatient.

gentoo-sources-2.6.20-r8   <- BUG with libata VIA SATA and VIA PATA enabled.
gentoo-sources-2.6.21-r3   <- BUG with libata VIA SATA and VIA PATA enabled.
vanilla-sources-2.6.22_rc4 <- no BUG (yet) with libata VIA SATA and VIA PATA enabled.
The picture isn't complete without these cases as well.

gentoo-sources-2.6.20-r8 <- no BUG with libata VIA SATA enabled and VIA PATA disabled.
gentoo-sources-2.6.21-r3 <- no BUG with libata VIA SATA enabled and VIA PATA disabled.
Spoke too soon.

vanilla-sources-2.6.22-rc4 <- BUG with libata VIA SATA and PATA drivers enabled.

http://qabe.net/kernel_bug/dmesg6
Crap. I can't keep my kernels straight. Scratch that last comment. It's clear from my dmesg that I am running gentoo-sources-2.6.21-r3 with libata VIA SATA and PATA enabled. So, 2.6.22-rc4 is untested.
Could you try turning on the "Kernel hacking"->"Kernel debugging"->"Debug VM" option? And just to confirm the current state of play, the bug is reproducible with the SATA and PATA VIA drivers, but not with only the SATA driver, under all kernels tested so far, correct? BTW, nothing to do with the issue at hand I'm sure, but your dmesg3 shows a slightly different "BIOS-provided physical RAM map" than the others. Rather odd.
I'll try the "Debug VM" option. To confirm the current state of play: the bug is reproducible when both the libata SATA and PATA VIA drivers are enabled, but not when only the libata SATA VIA driver is enabled. This holds under all kernels tested so far (gentoo-sources-2.6.20-r8 and gentoo-sources-2.6.21-r3).
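To keep track of which combination a given kernel was actually built with, the relevant options can be read back from the kernel's .config. The sketch below is only an illustration: it assumes the two libata drivers correspond to the Kconfig symbols CONFIG_SATA_VIA and CONFIG_PATA_VIA and that "Debug VM" is CONFIG_DEBUG_VM; the function name and the default path /usr/src/linux/.config are likewise assumptions.

```shell
#!/bin/sh
# Sketch only: print the state of the options discussed in this bug.
# The symbol names and the default config path are assumptions.
via_driver_state() {
    # $1: path to a kernel config file
    cfg=${1:-/usr/src/linux/.config}
    for sym in CONFIG_SATA_VIA CONFIG_PATA_VIA CONFIG_DEBUG_VM; do
        # Prints y or m when the symbol is set; "# ... is not set"
        # lines and missing symbols both fall through to n.
        val=$(sed -n "s/^$sym=//p" "$cfg")
        echo "$sym=${val:-n}"
    done
}
```

Running via_driver_state against each kernel tree before a test run, and saving the output next to the matching dmesg capture, would record which driver combination each result belongs to.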
Please test 2.6.22.
There was a recent post from Alan Cox on LKML which may be of relevance here. He mentioned that "there are some cases where trying to load both old and new IDE support for the same chip will do strange things." So it might be that this is a known limitation, at least known by the high priests of IDE/libata. Maybe we should follow up whether it should be investigated or whether the solution is just "don't do that!" See: http://marc.info/?l=linux-kernel&m=118401976128199&w=2
(In reply to comment #16)
> There was a recent post from Alan Cox on LKML which may be of relevance here.
> He mentioned that "there are some cases where trying to load both old and new
> IDE support for the same chip will do strange things."

I don't have old IDE support enabled.
(In reply to comment #15)
> Please test 2.6.22.

I've been up for 7 days with 2.6.22-rc7 with no sign of this bug. I'm going to update to the official 2.6.22 in the next few minutes.
Ah, d'oh. Of course you don't, sorry. Thinko.
Well, gentoo-sources-2.6.22 is much better behaved on my computer than earlier kernels. I have seen no issues yet.
I didn't see this bug at all with the generic 2.6.22-rc7 kernel downloaded from kernel.org. I haven't seen this bug with gentoo-sources-2.6.22 either.
I can't see what would have caused or fixed this, and tracking down the actual fix would be a very lengthy process. I'm going to close this as an artifact fixed in 2.6.22. Thanks for reporting and keeping us up to date.