Summary: | sys-kernel/gentoo-sources-3.1.3: kernel NULL pointer dereference in sym53c8xx | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Martin von Gagern <Martin.vGagern> |
Component: | [OLD] Core system | Assignee: | Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | gokturk |
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
URL: | http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=bf6f111b5e891b4cfbd4f966488fd824543ba2aa | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Bug Depends on: | |||
Bug Blocks: | 395535 | ||
Attachments: | emerge --info, make log, Stack trace screen photo, Fix NULL ptr dereference in sym53c8xx_2 |
(In reply to comment #0)
> BUG: unable to handle kernel NULL pointer dereference at 0000000000018

Can you recompile the kernel and look at the build log to see any compiler warnings pointing to this problem?

Created attachment 294377 [details]
make log

(In reply to comment #1)
> Can you recompile the kernel and look at the build log to see any compiler
> warnings pointing to this problem?

Output from "make" in the kernel source tree, after a previous "make clean". I see nothing suspicious there. There are a few warnings, but in what appears to be unrelated code.

Created attachment 294381 [details]
Stack trace screen photo
Won't type this yet, but here is the stack trace as montaged from my photographs.

OK, had a look at the code, and have located where this issue occurs. Quoting sym_glue.c starting at line 835:

    static void sym53c8xx_slave_destroy(struct scsi_device *sdev)
    {
        struct sym_hcb *np = sym_get_hcb(sdev->host);
        struct sym_tcb *tp = &np->target[sdev->id];
        struct sym_lcb *lp = sym_lp(tp, sdev->lun);
        unsigned long flags;

        spin_lock_irqsave(np->s.host->host_lock, flags);

        if (lp->busy_itlq || lp->busy_itl) {

At this point, lp (stored in R14) is NULL, so accessing its members will fail. The address 0x18 mentioned in the error message is exactly &((struct sym_lcb*)NULL)->busy_itlq. sym_lp appears to be a macro from sym_hipd.h, which on my machine takes the multi-LUN form.

My current impression is that something in the SCSI subsystem tries to destroy a device which was never created. But that's a wild guess. Will try to reproduce this with vanilla-sources next.

(In reply to comment #4)
> Will try to reproduce this with vanilla-sources next.

Could reproduce with vanilla-sources-3.1.4. Will try to git bisect this, although I can't promise I won't lose patience with all those reboots.

(In reply to comment #5)
> Will try to git bisect this,
> although I can't promise I won't lose patience with all those reboots.

Ten reboots later, I got the culprit:
http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=bf6f111b5e891b4cfbd4f966488fd824543ba2aa

Usually I'd take this to bugzilla.kernel.org, but that seems to be down at the moment. Feel free to forward this upstream once services are restored, or to some mailing list if you think that is more useful than bugzilla.

Created attachment 294661 [details, diff]
Fix NULL ptr dereference in sym53c8xx_2
Martin, can you test the attached patch? I have also sent it upstream. If it works, and if you want, a Reported/Tested-by tag could be added in the patch (either this one, if it works and gets accepted, or subsequent patches if needed).
Thanks.

(In reply to comment #7)
> Martin, can you test the attached patch?

Applied, system boots, dmesg looks sane:

sym53c8xx 0000:03:07.0: PCI INT A -> GSI 22 (level, low) -> IRQ 22
[...]
sym0: <875> rev 0x3 at pci 0000:03:07.0 irq 22
sym0: Symbios NVRAM, ID 7, Fast-20, SE, parity checking
sym0: open drain IRQ line driver, using on-chip SRAM
sym0: using LOAD/STORE-based firmware.
sym0: SCSI BUS has been reset.
scsi6 : sym-2.2.3
scsi target6:0:0: Scan at boot disabled in NVRAM
scsi: killing requests for dead queue
scsi target6:0:1: Multiple LUNs disabled in NVRAM

> If it works, and if you want, a Reported/Tested-by tag could be added in the
> patch

Yes, please do. Thanks for managing the upstream side of this, and for coming up with this patch!

I'm going to bump the upstream thread / patch, and if there's no response, I think we can add it to genpatches (until/if it gets merged upstream).

This has made it to the 3.0, 3.1 and 3.2 kernels.

Created attachment 294281 [details]
emerge --info

Upgrading from gentoo-sources 3.1.1 to 3.1.3, my system now fails to boot, printing a kernel stack trace instead. Took a set of photos, but manually copying three screens full of mostly hex data will take some time, so I'd like to postpone that until someone actually has a use for it. The essence appears to be this:

BUG: unable to handle kernel NULL pointer dereference at 0000000000018
IP: [<...>] sym53c8xx_slave_destroy+0x68/0x190 [sym53c8xx]
...
Pid: 791, comm: modprobe Not tainted 3.1.3-gentoo #1 ...
...
Call Trace:
[<...>] ? __scsi_remove_device+0x64/0xd0
[<...>] ? scsi_alloc_sdev+0x21c/0x260
...
[<...>] ? sys_init_module+...
[<...>] ? system_call_fastpath+...
...
note: modprobe[791] exited with preempt_count 1
udevd[661]: '/sbin/modprobe -bv pci:v00001000d0000000Fsv00000000sd00000000bc01sc00i00' [791] terminated by signal 9 (Killed)

As there appears to be no change to the code base of the sym53c8xx module, I guess some other and unrelated modification causes this problem.

The above was the second boot that failed for me; I'm not sure if the one before failed for the same reason, though, as scrollback didn't work in that case. So there might be different symptoms of the same underlying problem.

I'm back to 3.1.1 for now.