Summary: | =sys-kernel/gentoo-sources-3.10.1 - BUG: Bad page state in process libvirtd pfn:76000 | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Alexandr Tiurin <alexanderyt> |
Component: | [OLD] Core system | Assignee: | Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel> |
Status: | RESOLVED UPSTREAM | ||
Severity: | normal | Keywords: | UPSTREAM |
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
URL: | https://bugzilla.kernel.org/show_bug.cgi?id=60850 | ||
Whiteboard: | watch-linux-bugzilla | ||
Package list: | Runtime testing required: | --- | |
Attachments: |
dmesg
emerge --info libvirt |
Description
Alexandr Tiurin
2013-07-17 23:05:07 UTC
Created attachment 353546 [details]
dmesg
Created attachment 353548 [details]
emerge --info libvirt
And also with 3.2.46-hardened-r1.
This is not necessarily the same bug, but I found the same BUG message here: https://bugzilla.redhat.com/show_bug.cgi?id=789993 That is a totally different issue, yet it is useful as a comparison to understand where exactly this comes from. As we can see, both traces share the bad_page function, which is called when a bad page is found and is what prints the BUG message. The function above it is no longer common: here it is free_pages_prepare+0x13e/0x150, so this is the piece of code that calls bad_page because a bad page has been found. Reading the functions below that, we see that pages are being freed because a domain is exiting, which in turn happens because the driver has to be released.
Does this tell us anything? I'm afraid not. Where does this bad page come from? That is the question, and I have no idea how to figure that part out. As mentioned in the other bug, we are only looking at the consumer here (libvirtd in our case) and not at what causes this, so all we can conclude is that this looks like some kind of memory corruption. Where the corruption comes from is tricky to figure out, especially since the page does not record who made it. Long story short, that needs quite some debugging and kernel hacking, so let's not go down that road while we can avoid it.
Since you state it is always reproducible, it is better to look at which kernel was the last one that worked for you, to get an idea of the range of commits in which the bad commit lies. From there, a git bisect (http://wiki.gentoo.org/wiki/Kernel_git-bisect) can be done between those two kernels (picking the last working kernel as good and the first failing kernel as bad); after some tries that would point out the bad commit, and from that commit we can deduce how the memory corruption was introduced in your scenario.
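The bisect procedure described above can be tried out on a throwaway repository before spending hours on kernel builds. The following is only a minimal sketch of the mechanics: the repository, the tags last-good and first-bad, and the marker file regression are all made up for illustration and have nothing to do with the kernel tree. For a real kernel each step is manual: build and boot the commit git checks out, retry the detach, then run "git bisect good" or "git bisect bad".

```shell
#!/bin/sh
# Toy demonstration of git bisect on a throwaway repository.
# Nothing here touches a kernel tree; all names are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.invalid
git config user.name "bisect demo"

# Ten commits; commit 7 introduces the "regression" (a marker file).
i=1
while [ "$i" -le 10 ]; do
    echo "$i" > version
    if [ "$i" -eq 7 ]; then echo broken > regression; fi
    git add -A
    git commit -q -m "commit $i"
    if [ "$i" -eq 1 ]; then git tag last-good; fi
    i=$((i + 1))
done
git tag first-bad

# Mark the known endpoints: the bad revision first, then the good one.
git bisect start first-bad last-good >/dev/null

# The test command must exit 0 for a good commit and 1 for a bad one.
# Here "bad" simply means the marker file exists.
git bisect run sh -c '! test -e regression' >/dev/null

# After the run, refs/bisect/bad points at the first bad commit.
culprit=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad commit: $culprit"

# Return the work tree to where it was before bisecting.
git bisect reset >/dev/null 2>&1
```

In this toy history the script reports commit 7, the one that added the marker file. With roughly N commits between the good and the bad kernel, expect about log2(N) build-and-test rounds.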
All this assumes that you don't have hardware problems such as failing memory, so make sure to confirm that the old kernel where you did not have this problem actually works. If it also has the problem, I'd advise you to run an extensive memory test to find out whether your computer has faulty memory.
Thanks for the detailed answer. I tried detaching the NIC with linux-3.0.86-vanilla, linux-3.0.86-gentoo, linux-3.8.13-vanilla, linux-3.10.1-gentoo, linux-3.2.46-hardened-r1 and linux-2.6.32-openvz-078.28. Only with linux-2.6.32-openvz-078.28 does "virsh nodedev-dettach pci_0000_02_00_1" work fine. The OpenVZ kernel is very heavily patched, so I looked for the fix in https://launchpadlibrarian.net/142761680/linux_3.2.0-49.75.diff.gz, because on Ubuntu Precise detaching the NIC works fine. But linux_3.2.0-49.75.diff is very large and contains many patches to iommu.c and other files. I will continue to look for a fix.
While at it, could you please test the latest stable kernel gentoo-sources-3.10.6 (contains over 100 fixes, plus the fixes from earlier releases) as well as the development kernel git-sources-3.11_rc5? Thank you in advance.
Works fine with gentoo-sources-3.10.6 and git-sources-3.11_rc5 too. Thanks!
Sorry, my mistake: I forgot to add intel_iommu=on to the kernel boot command line while testing. So I confirm the bug with gentoo-sources-3.10.6 and git-sources-3.11_rc5.
I have not found anything upstream about this, so I would suggest you try the latest kernels again in the hope that this has since been fixed. If not, could you please file this bug upstream at http://bugzilla.kernel.org/ and leave us a link to that bug? Thank you in advance.
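The false "works fine" result above came from booting the test kernels without intel_iommu=on. A small guard run before each detach test can catch that. This is a sketch only: the helper name has_kernel_param is made up, and it does a plain whole-word match on the contents of /proc/cmdline, nothing more.

```shell
#!/bin/sh
# Return 0 if the kernel parameter $1 appears as a whole word in the
# command line string $2 (defaults to the running kernel's command line).
# has_kernel_param is a hypothetical helper, not part of any package.
has_kernel_param() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline 2>/dev/null)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_kernel_param intel_iommu=on; then
    echo "intel_iommu=on is set: the detach test is meaningful"
else
    echo "intel_iommu=on is missing: a clean detach proves nothing"
fi
```

Only when the guard passes is it worth running "virsh nodedev-dettach pci_0000_02_00_1" and checking dmesg for the "Bad page state" message.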
(In reply to Alexandr Tiurin from comment #10)
> https://bugzilla.kernel.org/show_bug.cgi?id=60850
Comment 1 at the upstream bug asks you to reassign. Given that this has to do with the network, as you are detaching a NIC (which "[ 72.706143] igb 0000:02:00.1: removed PHC on em2" shows right before the BUG message), I think you will want to assign this to the Product "Drivers" and the Component "Network"; for the assignee, you should have "drivers_network@kernel-bugs.osdl.org". Thank you for filing this upstream and good luck with the reassignment. |