Bug#: 181757
Status: RESOLVED
Resolution: FIXED
Assigned To: Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel@gentoo.org>
Reporter: Abrahm Scully <abrahm.scully@gmail.com>


Description:   Opened: 2007-06-12 13:46 0000
I decided to do a fresh install of Gentoo last weekend for fun. I got a kernel
BUG about 5 times during an emerge -ev world. I rebooted between each BUG. I
recompiled my kernel between some of them.

from dmesg:
-------------------------------------------------------------------
sh[15252]: segfault at 0000000000000004 rip 000000000041b901 rsp 00007fffc9a9fd08 error 4
Eeek! page_mapcount(page) went negative! (-1)
  page pfn = 3b077
  page->flags = 4000000000010068
  page->count = 1
  page->mapping = ffff81003d73e378
  vma->vm_ops = 0xffffffff805bbbc0
  vma->vm_ops->nopage = filemap_nopage+0x0/0x350
  vma->vm_file->f_op->mmap = xfs_file_mmap+0x0/0x30
------------[ cut here ]------------
kernel BUG at mm/rmap.c:588!
invalid opcode: 0000 [1] SMP 
CPU 0 
Modules linked in: w83627ehf i2c_isa k8temp hwmon i2c_dev i2c_core radeon drm
hci_usb ehci_hcd uhci_hcd usbcore snd_emu10k1 snd_rawmidi snd_ac97_codec
ac97_bus snd_pcm snd_timer snd_page_alloc snd_util_mem snd_hwdep snd
Pid: 15252, comm: sh Not tainted 2.6.20-gentoo-r8 #1
RIP: 0010:[<ffffffff8020acb5>]  [<ffffffff8020acb5>] page_remove_rmap+0xf5/0x120
RSP: 0000:ffff810022911bd8  EFLAGS: 00010292
RAX: 0000000000000037 RBX: ffff810001ce9a08 RCX: ffffffff803b6150
RDX: 00000000ffffff01 RSI: 0000000000000000 RDI: ffffffff805adb7c
RBP: ffff81003439ff00 R08: 0000000000004e26 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000002 R12: 000000000041b000
R13: 00000000004b0000 R14: 0000000000000020 R15: 00000000003fbfe8
FS:  00002acde16c16d0(0000) GS:ffffffff805e0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000004 CR3: 000000002b11e000 CR4: 00000000000006e0
Process sh (pid: 15252, threadinfo ffff810022910000, task ffff81003dd88920)
Stack:  000000000041b000 ffff81002b8f50d8 ffff810001ce9a08 ffffffff802079fc
 6339613966643034 0000000000000000 ffff810022911ce8 ffffffffffffffff
 0000000000000000 ffff81003439ff00 ffff810022911cf0 0000000000000000
Call Trace:
 [<ffffffff802079fc>] unmap_vmas+0x44c/0x7c0
 [<ffffffff8023a299>] exit_mmap+0x79/0x100
 [<ffffffff8023cacc>] mmput+0x3c/0xd0
 [<ffffffff8021574a>] do_exit+0x20a/0x830
 [<ffffffff8028df08>] __dequeue_signal+0x168/0x1e0
 [<ffffffff8024b352>] do_group_exit+0x82/0x90
 [<ffffffff8022b918>] get_signal_to_deliver+0x418/0x450
 [<ffffffff8025e58e>] do_notify_resume+0xce/0x740
 [<ffffffff8028f1d5>] force_sig_info+0xb5/0xd0
 [<ffffffff8020a96b>] do_page_fault+0x60b/0x860
 [<ffffffff8023c629>] remove_wait_queue+0x19/0x60
 [<ffffffff80228bf5>] do_wait+0xaa5/0xbb0
 [<ffffffff8020cf1f>] dput+0x2f/0x170
 [<ffffffff80261af8>] retint_signal+0x3d/0x85


Code: 0f 0b eb fe 8b 77 18 48 83 c4 08 5b 5d 83 f6 01 83 e6 01 e9 
RIP  [<ffffffff8020acb5>] page_remove_rmap+0xf5/0x120
 RSP <ffff810022911bd8>
 <1>Fixing recursive fault but reboot is needed!
---------------------------------------------------------

Here is more dmesg output:
http://qabe.net/kernel_bug/dmesg1
http://qabe.net/kernel_bug/dmesg2
http://qabe.net/kernel_bug/dmesg3

These all happened while recompiling world. The first hung sh. The latter 2
hung gcc. After the third, the kernel froze (got blinking keyboard lights) in
the middle of a reboot and X still had my monitor, so I don't know what
happened.


Reproducible: Couldn't Reproduce




After the first time, I decided to recompile my kernel, removing a few unneeded
drivers and features. The second time, I decided to enable the "optimize for
size" option to see if I could obscure the bug a little. The third time, I went
back and compiled a bunch of drivers as modules.

Every reference to this bug elsewhere (earliest reference on LKML is 2.6.16) is
not on 64-bit, but on 32-bit. I couldn't find a fix posted elsewhere.

The first time it happened, voluntary preemption was selected in the kernel. I
selected preemption instead. I only have my most recent kernel config.

http://qabe.net/kernel_bug/config

This config is different from the first two, but not by that much. My emerge
was actually interrupted more than 3 times, but I got lazy. I'll try to be more
methodical.

------- Comment #1 From Jakub Moc (RETIRED) 2007-06-12 18:05:25 0000 -------
Can you reproduce this w/ 2.6.21-r3?

------- Comment #2 From Abrahm Scully 2007-06-12 23:10:41 0000 -------
Wrote a shell script called loop.sh:
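#!/bin/sh
# Re-run the command given as a single argument until interrupted.

if [ -n "$1" ] ; then
        while true
        do
                # Unquoted on purpose: "emerge -v mplayer" arrives as one
                # argument and must be split into words here.
                $1
                echo Press CTRL-C now.
                sleep 1
        done
fi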

and ran ./loop.sh emerge\ -v\ mplayer

Sure enough, another BUG! dmesg output at http://qabe.net/kernel_bug/dmesg4. I
don't know how long it took.

I'm installing a newer kernel, but I found a message on lkml that looks like
this bug is happening in 2.6.21, too. http://lkml.org/lkml/2007/5/2/277
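
Since the loop above does not record how long it ran before the BUG, here is a minimal sketch of a variant that logs a timestamp before every iteration. The helper names (log_iteration, elapsed_seconds, timed_loop) and the log file are hypothetical and not part of the original loop.sh. It also assumes a date that supports +%s, which GNU and BSD date both do.

```shell
#!/bin/sh
# Sketch only: log_iteration, elapsed_seconds and timed_loop are
# hypothetical helpers, not part of the original loop.sh.

# Append "<epoch seconds> <iteration number>" to the log file.
# Usage: log_iteration N LOGFILE
log_iteration() {
        printf '%s %s\n' "$(date +%s)" "$1" >> "$2"
}

# Print the seconds between the first and the last logged iteration.
# Usage: elapsed_seconds LOGFILE
elapsed_seconds() {
        awk 'NR == 1 { first = $1 } { last = $1 } END { if (NR) print last - first }' "$1"
}

# Run a command forever, logging a timestamp before every run.
# Usage: timed_loop LOGFILE COMMAND [ARGS...]
timed_loop() {
        log=$1
        shift
        n=0
        while true
        do
                n=$((n + 1))
                log_iteration "$n" "$log"
                "$@"
                echo Press CTRL-C now.
                sleep 1
        done
}
```

For example, with the functions loaded into the shell, run timed_loop /root/loop.log emerge -v mplayer and, once the BUG shows up in dmesg, run elapsed_seconds /root/loop.log. The result is only a lower bound on the time to failure, because the last logged timestamp is the start of the iteration that hung.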

------- Comment #3 From Abrahm Scully 2007-06-13 02:58:22 0000 -------
Updated to gentoo-sources-2.6.21-r3.

Looped emerging mplayer again... and the BUG is still there!
Although the line of code in rmap.c moved from 588 to 596.

dmesg output at http://qabe.net/kernel_bug/dmesg5
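
To compare several saved dmesg files (for instance to see the rmap.c line number change between kernels), something like the following sketch could pull out the relevant lines. The function names are hypothetical, and it assumes each dmesg output was saved to a local file first.

```shell
#!/bin/sh
# Sketch only: bug_location and oops_summary are hypothetical helpers.

# Print the file:line of every "kernel BUG at" entry in a saved dmesg file.
# Usage: bug_location FILE
bug_location() {
        sed -n 's/^.*kernel BUG at \([^!]*\)!.*$/\1/p' "$1"
}

# Print the BUG line plus the "Pid: ..." line, which names the kernel build.
# Usage: oops_summary FILE
oops_summary() {
        grep -E 'kernel BUG at|Pid: [0-9]+, comm:' "$1"
}
```

With the files saved locally as dmesg1 and dmesg5 (names assumed here), running oops_summary on each shows the BUG location next to the kernel version that produced it.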

------- Comment #4 From Abrahm Scully 2007-06-13 14:01:53 0000 -------
This bug looks like a duplicate of 138366 and 138863, only with a newer kernel,
64-bit architecture, newer gcc, newer glibc, and my kernel isn't tainted. Also,
I have ECC memory and error correction/detection is enabled in my bios.
(Although, I don't have the k8 EDAC patches in my kernel, so I don't know
what's going on.)

I'll be disabling all unnecessary drivers, one at a time, to see if I can get a
change in behavior. My girlfriend is leaving town for a month, so I should have
some free time.

I have a lot of experience debugging C from my last job. Is there an equivalent
to breakpoints/gdb for the kernel?

I'd like to point out that this problem never occurred with my last install.
The differences between this install and the last were: was 32-bit, now 64-bit;
was primary drive PATA (non-libata VIA IDE drivers), now primary drive SATA
(libata sata_via drivers). When I'm compiling all day, the disk drivers are used
the most (I'm guessing). Maybe I'll start there.

------- Comment #5 From Abrahm Scully 2007-06-13 14:04:08 0000 -------
http://qabe.net/kernel_bug/lspci

------- Comment #6 From Abrahm Scully 2007-06-14 16:16:21 0000 -------
I had libata VIA PATA support and libata VIA SATA support both enabled in my
kernel. On a hunch, I disabled the libata via PATA support and rebooted. I have
not had a single BUG in 24 hours of compiling. Note that the only things
actually plugged into the PATA ports are my CD-ROM drives, and they are never
used. I'll report again soon.

I'm still using gentoo-sources-2.6.21-r3.

------- Comment #7 From Abrahm Scully 2007-06-14 16:26:40 0000 -------
I'll try the current kernel with old VIA IDE drivers. I'll also try 2.6.22-rc4
when I get a chance. It looks like sata_via.c had a bunch of work done to it.

------- Comment #8 From Abrahm Scully 2007-06-14 20:28:29 0000 -------
update: compiling for 36 hours without a BUG.

I'm about to reboot and start testing with libata VIA SATA and libata VIA PATA
support enabled in 2.6.22-rc4.

------- Comment #9 From Abrahm Scully 2007-06-14 22:54:36 0000 -------
This bug has not appeared yet in 2.6.22-rc4. For the sake of this bug, I will
continue testing for a few more hours, but I have my new motherboard in the
other room and I'm growing impatient.

gentoo-sources-2.6.20-r8 <- BUG with libata VIA SATA and VIA PATA enabled.
gentoo-sources-2.6.21-r3 <- BUG with libata VIA SATA and VIA PATA enabled
vanilla-sources-2.6.22_rc4 <- no BUG (yet) with libata VIA SATA and VIA PATA
enabled.

------- Comment #10 From Abrahm Scully 2007-06-14 22:58:36 0000 -------
The picture isn't complete without these cases as well.

gentoo-sources-2.6.20-r8 <- no BUG with libata VIA SATA enabled and VIA PATA
disabled.
gentoo-sources-2.6.21-r3 <- no BUG with libata VIA SATA enabled and VIA PATA
disabled.

------- Comment #11 From Abrahm Scully 2007-06-14 23:12:09 0000 -------
Spoke too soon.

vanilla-sources-2.6.22-rc4 <- BUG with libata VIA SATA and PATA drivers enabled

http://qabe.net/kernel_bug/dmesg6

------- Comment #12 From Abrahm Scully 2007-06-14 23:15:58 0000 -------
Crap. I can't keep my kernels straight. Scratch that last comment. It's clear
from my dmesg that I am running gentoo-sources-2.6.21-r3 with libata VIA SATA
and PATA enabled.

So, 2.6.22-rc4 is untested.
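
To avoid this kind of mix-up, a small sketch like the one below could record which kernel is actually running and how the VIA drivers were configured before each test run. The function names are hypothetical, and CONFIG_SATA_VIA and CONFIG_PATA_VIA are assumed to be the config symbols behind the libata VIA SATA and PATA options, so check them against the kernel's own Kconfig.

```shell
#!/bin/sh
# Sketch only: via_driver_config and describe_run are hypothetical helpers.
# CONFIG_SATA_VIA / CONFIG_PATA_VIA are assumed symbol names for the
# libata VIA SATA and PATA drivers.

# Print the libata VIA driver lines ("=y", "=m" or "is not set") of a config.
# Usage: via_driver_config CONFIGFILE
via_driver_config() {
        grep -E '^(# )?CONFIG_(SATA_VIA|PATA_VIA)[ =]' "$1"
}

# Print the running kernel release followed by the VIA driver settings.
# Usage: describe_run CONFIGFILE
describe_run() {
        printf 'kernel: %s\n' "$(uname -r)"
        via_driver_config "$1"
}
```

For example, describe_run /usr/src/linux/.config >> runs.log before starting the emerge loop. Note that /usr/src/linux may not point at the sources of the running kernel. If the kernel was built with /proc/config.gz support, decompressing that file to a temporary location and passing it instead gives the configuration of the kernel that is actually booted.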

------- Comment #13 From Duane Griffin 2007-06-18 00:13:26 0000 -------
Could you try turning on the "Kernel hacking"->"Kernel debugging"->"Debug VM"
option? And just to confirm the current state of play, the bug is reproducible
with the SATA and PATA VIA drivers, but not with only the SATA driver, under
all kernels tested so far, correct?

BTW, nothing to do with the issue at hand I'm sure, but your dmesg3 shows a
slightly different "BIOS-provided physical RAM map" than the others. Rather
odd.

------- Comment #14 From Abrahm Scully 2007-06-18 02:23:04 0000 -------
I'll try the "debug vm" option.

To confirm the current state of play, the bug is reproducible when both the
libata SATA and PATA VIA drivers are enabled, but not when only the libata SATA
VIA driver is enabled. This holds under all kernels tested so far
(gentoo-sources-2.6.20-r8 and gentoo-sources-2.6.21-r3).

------- Comment #15 From Daniel Drake 2007-07-10 13:23:43 0000 -------
Please test 2.6.22.

------- Comment #16 From Duane Griffin 2007-07-10 18:43:13 0000 -------
There was a recent post from Alan Cox on LKML which may be of relevance here.
He mentioned that "there are some cases where trying to load both old and new
IDE support for the same chip will do strange things."

So it might be that this is a known limitation, at least known by the high
priests of IDE/libata. Maybe we should follow up on whether it should be
investigated or whether the solution is just "don't do that!"

See: http://marc.info/?l=linux-kernel&m=118401976128199&w=2

------- Comment #17 From Abrahm Scully 2007-07-10 21:19:08 0000 -------
(In reply to comment #16)
> There was a recent post from Alan Cox on LKML which may be of relevance here.
> He mentioned that "there are some cases where trying to load both old and new
> IDE support for the same chip will do strange things."
> 

I don't have old IDE support enabled.

------- Comment #18 From Abrahm Scully 2007-07-10 21:20:36 0000 -------
(In reply to comment #15)
> Please test 2.6.22.
> 

I've been up for 7 days with 2.6.22-rc7 with no sign of this bug. I'm going to
update to the official 2.6.22 in the next few minutes.

------- Comment #19 From Duane Griffin 2007-07-11 08:48:16 0000 -------
Ah, d'oh. Of course you don't, sorry. Thinko.

------- Comment #20 From Abrahm Scully 2007-07-12 21:35:32 0000 -------
Well, gentoo-sources-2.6.22 is much better behaved on my computer than earlier
kernels. I have seen no issues yet.

------- Comment #21 From Abrahm Scully 2007-07-16 04:45:24 0000 -------
I didn't see this bug at all with generic 2.6.22-rc7 kernel downloaded from
kernel.org.

I haven't seen this bug with gentoo-sources 2.6.22.

------- Comment #22 From Daniel Drake 2007-07-22 04:21:25 0000 -------
I can't see what would have caused or fixed this, and tracking down the actual
fix would be a very lengthy process. I'm going to close this as an artifact
fixed in 2.6.22. Thanks for reporting and keeping us up to date.
