Bug 477258 - =sys-kernel/gentoo-sources-3.10.1 - BUG: Bad page state in process libvirtd pfn:76000
Summary: =sys-kernel/gentoo-sources-3.10.1 - BUG: Bad page state in process libvirtd ...
Status: RESOLVED UPSTREAM
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: All Linux
Importance: Normal normal
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL: https://bugzilla.kernel.org/show_bug....
Whiteboard: watch-linux-bugzilla
Keywords: UPSTREAM
Depends on:
Blocks:
 
Reported: 2013-07-17 23:05 UTC by Alexandr Tiurin
Modified: 2013-09-13 16:05 UTC (History)
0 users

See Also:
Package list:
Runtime testing required: ---


Attachments
dmesg (dmesg,134.60 KB, text/plain)
2013-07-17 23:05 UTC, Alexandr Tiurin
emerge --info libvirt (eminfo,14.49 KB, text/plain)
2013-07-17 23:09 UTC, Alexandr Tiurin

Description Alexandr Tiurin 2013-07-17 23:05:07 UTC
Also with gentoo-sources-3.8.13

Reproducible: Always

Steps to Reproduce:
1. lspci | grep -i ethernet
02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
2. relaxed_acs_check = 1 in /etc/libvirt/qemu.conf
3. libvirtd start
4. virsh nodedev-dettach pci_0000_02_00_1

Actual Results:  
...cut...

[   84.371595] igb 0000:02:00.1: removed PHC on eth1
[   84.385984] BUG: Bad page state in process libvirtd  pfn:76000
[   84.385991] page:ffffea0001d80000 count:0 mapcount:0 mapping:          (null) index:0x0
[   84.385994] page flags: 0x3ff00000000400(reserved)
[   84.385999] Modules linked in: mperf coretemp kvm_intel kvm crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw iTCO_wdt iTCO_vendor_support igb i2c_algo_bit sb_edac edac_core pcspkr hpilo i2c_core microcode lpc_ich mfd_core hpwdt joydev ioatdma dca acpi_power_meter dm_thin_pool dm_bio_prison dm_persistent_data dm_service_time dm_round_robin dm_queue_length dm_multipath dm_log_userspace dm_flakey dm_delay dm_bufio iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi tg3 ptp pps_core e1000 fuse xfs nfs fscache dns_resolver lockd sunrpc jfs reiserfs btrfs libcrc32c zlib_deflate multipath linear raid0 dm_raid raid10 raid1 raid456 async_raid6_recov async_pq async_xor xor raid6_pq async_memcpy async_tx dm_crypt hid_sunplus hid_sony hid_samsung hid_pl hid_petalynx hid_gyration sl811_hcd usb_storage
[   84.386080]  mpt2sas raid_class aic94xx libsas lpfc qla2xxx megaraid_sas megaraid_mbox megaraid_mm megaraid aacraid sx8 DAC960 hpsa cciss 3w_9xxx 3w_xxxx mptsas scsi_transport_sas mptfc scsi_transport_fc scsi_tgt mptspi mptscsih mptbase atp870u dc395x qla1280 dmx3191d sym53c8xx gdth advansys initio BusLogic arcmsr aic7xxx aic79xx scsi_transport_spi sata_inic162x sata_sil24
[   84.386123] CPU: 9 PID: 9514 Comm: libvirtd Not tainted 3.10.1-gentoo #1
[   84.386125] Hardware name: HP ProLiant DL380e Gen8, BIOS P73 08/20/2012
[   84.386128]  0000000000000001 ffff88036d101b78 ffffffff81655219 ffff88036d101b98
[   84.386135]  ffffffff81139d57 ffff88036d101b98 ffffea0001d80000 ffff88036d101be8
[   84.386140]  ffffffff8113a38e ffff88036d101c68 0000000000000000 ffff88036d096840
[   84.386145] Call Trace:
[   84.386156]  [<ffffffff81655219>] dump_stack+0x19/0x1b
[   84.386164]  [<ffffffff81139d57>] bad_page+0xc7/0x120
[   84.386170]  [<ffffffff8113a38e>] free_pages_prepare+0x13e/0x150
[   84.386175]  [<ffffffff8113bd70>] free_hot_cold_page+0x40/0x150
[   84.386181]  [<ffffffff8113becf>] __free_pages+0x4f/0x80
[   84.386186]  [<ffffffff8113c146>] free_pages+0x66/0x70
[   84.386193]  [<ffffffff81528369>] dma_pte_free_pagetable+0x1e9/0x2f0
[   84.386199]  [<ffffffff8152960c>] domain_exit+0x7c/0x1a0
[   84.386204]  [<ffffffff8152b693>] device_notifier+0x93/0xa0
[   84.386212]  [<ffffffff8165f08d>] notifier_call_chain+0x4d/0x70
[   84.386219]  [<ffffffff810892e8>] __blocking_notifier_call_chain+0x58/0x80
[   84.386226]  [<ffffffff81089326>] blocking_notifier_call_chain+0x16/0x20
[   84.386233]  [<ffffffff813e607a>] __device_release_driver+0xca/0xe0
[   84.386237]  [<ffffffff813e60bc>] device_release_driver+0x2c/0x40
[   84.386242]  [<ffffffff813e4eb1>] driver_unbind+0xa1/0xc0
[   84.386247]  [<ffffffff813e4344>] drv_attr_store+0x24/0x40
[   84.386254]  [<ffffffff8120d49f>] sysfs_write_file+0xef/0x170
[   84.386261]  [<ffffffff8119b79e>] vfs_write+0xce/0x1e0
[   84.386266]  [<ffffffff8119bc82>] SyS_write+0x52/0xa0
[   84.386272]  [<ffffffff81663799>] system_call_fastpath+0x16/0x1b
[   84.386275] Disabling lock debugging due to kernel taint
[   84.386278] BUG: Bad page state in process libvirtd  pfn:76200
[   84.386281] page:ffffea0001d88000 count:0 mapcount:0 mapping:          (null) index:0x0
[   84.386283] page flags: 0x3ff00000000400(reserved)

...cut...

[   84.389640] ------------[ cut here ]------------
[   84.390602] kernel BUG at mm/page_alloc.c:2724!
[   84.391540] invalid opcode: 0000 [#1] SMP 
[   84.392455] Modules linked in: mperf coretemp kvm_intel kvm crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw iTCO_wdt iTCO_vendor_support igb i2c_algo_bit sb_edac edac_core pcspkr hpilo i2c_core microcode lpc_ich mfd_core hpwdt joydev ioatdma dca acpi_power_meter dm_thin_pool dm_bio_prison dm_persistent_data dm_service_time dm_round_robin dm_queue_length dm_multipath dm_log_userspace dm_flakey dm_delay dm_bufio iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi tg3 ptp pps_core e1000 fuse xfs nfs fscache dns_resolver lockd sunrpc jfs reiserfs btrfs libcrc32c zlib_deflate multipath linear raid0 dm_raid raid10 raid1 raid456 async_raid6_recov async_pq async_xor xor raid6_pq async_memcpy async_tx dm_crypt hid_sunplus hid_sony hid_samsung hid_pl hid_petalynx hid_gyration sl811_hcd usb_storage
[   84.409936]  mpt2sas raid_class aic94xx libsas lpfc qla2xxx megaraid_sas megaraid_mbox megaraid_mm megaraid aacraid sx8 DAC960 hpsa cciss 3w_9xxx 3w_xxxx mptsas scsi_transport_sas mptfc scsi_transport_fc scsi_tgt mptspi mptscsih mptbase atp870u dc395x qla1280 dmx3191d sym53c8xx gdth advansys initio BusLogic arcmsr aic7xxx aic79xx scsi_transport_spi sata_inic162x sata_sil24
[   84.417960] CPU: 9 PID: 9514 Comm: libvirtd Tainted: G    B        3.10.1-gentoo #1
[   84.419537] Hardware name: HP ProLiant DL380e Gen8, BIOS P73 08/20/2012
[   84.420897] task: ffff88035e29c4a0 ti: ffff88036d100000 task.ti: ffff88036d100000
[   84.422445] RIP: 0010:[<ffffffff8113c14d>]  [<ffffffff8113c14d>] free_pages+0x6d/0x70
[   84.424099] RSP: 0018:ffff88036d101c58  EFLAGS: 00010246
[   84.425191] RAX: 0000000000000000 RBX: ffff880078000000 RCX: ffff88037ffea000
[   84.426667] RDX: ffff88037ffea1e0 RSI: 0000000000000000 RDI: 0000000000000000
[   84.428137] RBP: ffff88036d101c78 R08: 00000000000001ff R09: 0000000000000000
[   84.429907] R10: 00000000000006be R11: 00000000000006bd R12: 0000000000078200
[   84.431381] R13: 0000000000000200 R14: 0000000fffffffff R15: ffff880000000000
[   84.432849] FS:  00007f82d21fb700(0000) GS:ffff88037f520000(0000) knlGS:0000000000000000
[   84.434516] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   84.435700] CR2: 00007fc6a7713350 CR3: 000000035bc07000 CR4: 00000000000407e0
[   84.437168] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   84.438635] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   84.440111] Stack:
[   84.440524]  ffff88036d096800 0000000000000000 ffff88036d101ca8 ffff88036909ee00
[   84.442194]  ffff88036d101ce8 ffffffff81528369 00000000000001ff 0000000000000000
[   84.443866]  ffffffffffffffff ffff88036d083500 ffff88036909e000 0000000200000004
[   84.445535] Call Trace:
[   84.446043]  [<ffffffff81528369>] dma_pte_free_pagetable+0x1e9/0x2f0
[   84.447362]  [<ffffffff8152960c>] domain_exit+0x7c/0x1a0
[   84.448466]  [<ffffffff8152b693>] device_notifier+0x93/0xa0
[   84.449614]  [<ffffffff8165f08d>] notifier_call_chain+0x4d/0x70
[   84.450834]  [<ffffffff810892e8>] __blocking_notifier_call_chain+0x58/0x80
[   84.452254]  [<ffffffff81089326>] blocking_notifier_call_chain+0x16/0x20
[   84.453644]  [<ffffffff813e607a>] __device_release_driver+0xca/0xe0
[   84.454943]  [<ffffffff813e60bc>] device_release_driver+0x2c/0x40
[   84.456198]  [<ffffffff813e4eb1>] driver_unbind+0xa1/0xc0
[   84.457313]  [<ffffffff813e4344>] drv_attr_store+0x24/0x40
[   84.458455]  [<ffffffff8120d49f>] sysfs_write_file+0xef/0x170
[   84.513599]  [<ffffffff8119b79e>] vfs_write+0xce/0x1e0
[   84.565410]  [<ffffffff8119bc82>] SyS_write+0x52/0xa0
[   84.604657]  [<ffffffff81663799>] system_call_fastpath+0x16/0x1b
[   84.643869] Code: 0f 42 05 e7 6e ad 00 48 bf 00 00 00 00 00 ea ff ff 48 01 c3 48 c1 eb 0c 48 c1 e3 06 48 01 df e8 3a fd ff ff 48 83 c4 18 5b 5d c3 <0f> 0b 90 66 66 66 66 90 55 48 89 e5 41 54 4c 8d a6 ff 0f 00 00 
[   84.726419] RIP  [<ffffffff8113c14d>] free_pages+0x6d/0x70
[   84.766629]  RSP <ffff88036d101c58>
[   84.805974] ---[ end trace 5f8ee43b2d6a75dc ]---
Comment 1 Alexandr Tiurin 2013-07-17 23:05:59 UTC
Created attachment 353546 [details]
dmesg
Comment 2 Alexandr Tiurin 2013-07-17 23:09:21 UTC
Created attachment 353548 [details]
emerge --info libvirt
Comment 3 Alexandr Tiurin 2013-07-18 20:48:38 UTC
And also with 3.2.46-hardened-r1.
Comment 4 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-07-19 16:05:17 UTC
Not that it is the same bug, but I found the same BUG message here:

https://bugzilla.redhat.com/show_bug.cgi?id=789993

That's a totally different issue there; yet, it is useful as a comparison to understand where exactly this comes from. As we can see, both traces share the bad_page function, which is called when a page in a bad state is found; it is this function that prints the BUG message.

The caller of bad_page is where the two traces stop matching; here it is free_pages_prepare+0x13e/0x150, so this is the piece of code that calls bad_page because a bad page has been found. If we read the functions below that in the trace, we see that it is freeing pages because it is exiting a domain (domain_exit), and it is doing that because it has to release the driver. Okay, but does this tell us anything? I'm afraid not: where does this bad page come from? That's the question, and I have no idea how to figure that part out.
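To make that reading order concrete, here is a minimal Python sketch that pulls the function names out of a call trace and lists them from the outermost caller to the innermost callee. The regular expression and the call_chain helper are my own illustration, not part of any kernel tooling, and the sample is a few frames copied from the trace in the description.

```python
import re

# One frame of the call trace as printed in the log above, for example:
#   [<ffffffff81139d57>] bad_page+0xc7/0x120
# Groups: address, function name, offset into function, function size.
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def call_chain(dmesg_text):
    """Return function names ordered from outermost caller to innermost callee."""
    frames = [match.group(2) for match in FRAME_RE.finditer(dmesg_text)]
    # The kernel prints the innermost frame first, so reverse for call order.
    return list(reversed(frames))


sample = """
[   84.386164]  [<ffffffff81139d57>] bad_page+0xc7/0x120
[   84.386170]  [<ffffffff8113a38e>] free_pages_prepare+0x13e/0x150
[   84.386193]  [<ffffffff81528369>] dma_pte_free_pagetable+0x1e9/0x2f0
[   84.386199]  [<ffffffff8152960c>] domain_exit+0x7c/0x1a0
[   84.386233]  [<ffffffff813e607a>] __device_release_driver+0xca/0xe0
"""

print(" -> ".join(call_chain(sample)))
```

Read left to right, the output goes from the driver release, through the domain exit and the page table teardown, down to the point where the bad page is detected.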

As you can see in the other bug, they mention that we're just looking at the consumer here (libvirtd being the consumer in our case) and not at what causes this; so, all we can conclude is that this looks more like some kind of memory corruption. Where this memory corruption comes from is a bit tricky to figure out, especially since the page itself won't tell us who created it. Long story short, that needs quite some debugging and kernel hacking to figure out; so, let's not go down that road while we can avoid it...

Since you state it is always reproducible, we had better look at which kernel was the last one that worked for you, to get an idea of the range of commits in which the bad commit could lie. From there, a git bisect (http://wiki.gentoo.org/wiki/Kernel_git-bisect) can be done between those two kernels (marking the last working kernel as good and the first failing kernel as bad); after some tries that would point out the bad commit, and from that commit we can deduce how the memory corruption was introduced in your scenario.
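As a rough illustration of why a bisect only takes "some tries", here is a small Python sketch of the binary search that bisecting amounts to. It is a simplified model on a linear list of commits, not what git does internally on a real history with merges; the commit names, the bisect_first_bad helper and the position of the regression are all made up for the example.

```python
def bisect_first_bad(commits, is_bad):
    """Return (first bad commit, number of kernels that had to be tested).

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is bad.
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1  # one kernel build plus one reproduction attempt
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad], tests


# Hypothetical history of 1000 commits with the regression introduced at c0637.
history = ["c%04d" % i for i in range(1000)]
culprit, builds = bisect_first_bad(history, lambda commit: commit >= "c0637")
print(culprit, builds)
```

The number of kernels to build and test grows with the logarithm of the number of commits in the range, which is why it matters to narrow down the good and bad kernels first.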

All this assumes that you don't have hardware problems such as failing memory; so, make sure you confirm that the old kernel, where you did not have this problem, actually still works. If it also has the problem, I'd advise you to run an extensive memory test to find out whether your computer has faulty memory.
Comment 5 Alexandr Tiurin 2013-07-20 10:38:56 UTC
Thanks for the detailed answer. I tried to detach the NIC with linux-3.0.86-vanilla, linux-3.0.86-gentoo, linux-3.8.13-vanilla, linux-3.10.1-gentoo, linux-3.2.46-hardened-r1 and linux-2.6.32-openvz-078.28. Only with linux-2.6.32-openvz-078.28 does "virsh nodedev-dettach pci_0000_02_00_1" work fine. The OpenVZ kernel is very heavily patched. So I tried to find the relevant patch in https://launchpadlibrarian.net/142761680/linux_3.2.0-49.75.diff.gz, because on Ubuntu Precise detaching the NIC works fine. But linux_3.2.0-49.75.diff is very big and contains many patches to iommu.c and other files. I will continue to look for the fix.
Comment 6 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-08-12 21:26:40 UTC
While at it, could you please test the latest stable kernel gentoo-sources-3.10.6 (contains over 100 fixes, plus the fixes from earlier releases) as well as test the development kernel git-sources-3.11_rc5?

Thank you in advance.
Comment 7 Alexandr Tiurin 2013-08-13 15:06:54 UTC
Works fine with gentoo-sources-3.10.6 and git-sources-3.11_rc5 too. Thanks!
Comment 8 Alexandr Tiurin 2013-08-22 17:56:17 UTC
Sorry, that was my mistake. I forgot to add intel_iommu=on to the kernel boot command line while testing. So, I confirm that the bug is still present with gentoo-sources-3.10.6 and git-sources-3.11_rc5.
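A check like the following can guard against repeating that mistake before each test run. It is only a sketch: the intel_iommu_enabled helper is a name I made up, it looks at the command line only (on Linux the running kernel's command line can be read from /proc/cmdline), and it will not notice a kernel that was built with the IOMMU enabled by default.

```python
def intel_iommu_enabled(cmdline):
    """Return True if the command line string switches intel_iommu on.

    The option takes a comma-separated value such as "on,igfx_off".
    If the option is given several times, the last occurrence wins here.
    """
    enabled = False
    for param in cmdline.split():
        key, _, value = param.partition("=")
        if key != "intel_iommu":
            continue
        options = value.split(",")
        if "on" in options:
            enabled = True
        elif "off" in options:
            enabled = False
    return enabled


# Example command lines (hypothetical): one test boot with the option, one without.
print(intel_iommu_enabled("root=/dev/sda2 ro intel_iommu=on"))
print(intel_iommu_enabled("root=/dev/sda2 ro"))
```

On the test machine, the string to pass in would be the contents of /proc/cmdline.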
Comment 9 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-09-03 17:30:11 UTC
I have not found anything upstream about this; so, I would suggest that you try the latest kernels again in the hope that this has since been fixed. If not, could you please file this bug upstream at http://bugzilla.kernel.org/ and leave us a link to that bug? Thank you in advance.
Comment 10 Alexandr Tiurin 2013-09-04 21:11:40 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=60850
Comment 11 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-09-13 16:05:14 UTC
(In reply to Alexandr Tiurin from comment #10)
> https://bugzilla.kernel.org/show_bug.cgi?id=60850

Comment 1 at the upstream bug asks you to reassign; given that this has to do with the network, as you are detaching a NIC (which "[   72.706143] igb 0000:02:00.1: removed PHC on em2" shows right before it), I think you will want to assign this to the Product "Drivers" and the Component "Network"; for the assignee, you should use "drivers_network@kernel-bugs.osdl.org".

Thank you for filing this upstream and good luck with the reassignment.