We have been using Gentoo for years now on a large number of VMs/Bare-Metal.
But for the last several months, possibly years, we have been hitting a bug with live migration ("Live-Motion") of Gentoo guests on XenServer hypervisors, i.e. moving a running VM from one host to another inside a pool without downtime.
-> Citrix support says: Gentoo bug, we can't help
-> We can't reproduce this behaviour on other systems (CentOS/Ubuntu/Debian/Windows Server)
-> We didn't have this bug a long time ago (could not say when it started)
-> We have reproduced this bug on multiple different XenServer clusters (different hardware, network and storage) and versions since 6.5
-> We can reproduce this bug with multiple different kernel versions
-> The bug seems to be less likely when the VM is freshly booted
This bug, fixed a long time ago, describes more or less what we see with Gentoo + XenServer: https://bugs.gentoo.org/564276
Steps to Reproduce:
1. Run a fresh Gentoo guest with the latest kernel on XenServer 7.x
2. Use up memory (cat /dev/xvda > /dev/null to fill the buffers)
3. Live-migrate the guest
After the migration is done, the load rises without bound and the VM dies. Only a "hard" reboot resolves the issue.
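Since the crash does not happen on every migration, step 3 may need to be repeated many times. A minimal sketch of such a loop follows. The VM and host names are placeholders, and it assumes the xe CLI's vm-migrate command with the vm=, host= and live=true parameters, run from a host in the pool. It also assumes the VM starts on the last host of the list, so that the first migration actually moves it.

```python
import itertools
import subprocess


def build_migrate_cmd(vm, host):
    """Return the xe command line for one live migration of `vm` to `host`."""
    return ["xe", "vm-migrate", f"vm={vm}", f"host={host}", "live=true"]


def migration_plan(vm, hosts, rounds):
    """Return `rounds` migration commands, cycling through the destination hosts."""
    targets = itertools.islice(itertools.cycle(hosts), rounds)
    return [build_migrate_cmd(vm, host) for host in targets]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the loop at the first failed migration.
            subprocess.run(cmd, check=True)
```

Calling run_plan(migration_plan("gentoo-test", ["host-a", "host-b"], 100)) only prints the commands. Passing dry_run=False would execute them.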
Any help or lead to resolution would be much appreciated.
Do you run gentoo-sources on them? Which versions have you tested?
We have tested multiple versions since 4.x.
Latest tested is 4.20.4. Same bug on all of them.
(In reply to Gaetan from comment #0)
> We have been using Gentoo for years now on a large number of VMs/Bare-Metal.
> But for the last months/years, we have been facing a bug.
This is a bit confusing. You've been using it for years, but the bug seems to have been present for years too? So it's not a fresh thing; hasn't it been there since the beginning?
Which flavor of kernel do you use? gentoo-sources / vanilla-sources / ... ?
The bug was NOT present a long time ago (more than 2 years), and it has been present ever since. Until now we were convinced it came from our infrastructure or from Xen, but our latest tests with other OSes suggest we were wrong. That's why we are opening this bug with Gentoo now.
We are using gentoo-sources flavor.
Do you remember the last kernel version it worked on?
I could not tell which was the last working kernel.
Let me share some info we already shared with Citrix.
1/ Here is one of the strange things we could observe during a migration. The dmesg output shows "weird" dates:
[Wed Mar 6 12:54:15 2019] Freezing user space processes ... (elapsed 0.001 seconds) done.
[Wed Mar 6 12:54:15 2019] OOM killer disabled.
[Wed Mar 6 12:54:15 2019] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[Wed Mar 6 12:54:15 2019] suspending xenstore...
[Sat Jun 29 15:14:46 2019] xen:events: Xen HVM callback vector for event delivery is enabled
[Sat Jun 29 15:14:46 2019] Xen Platform PCI: I/O protocol version 1
[Sat Jun 29 15:14:46 2019] xen:grant_table: Grant tables using version 1 layout
[Sat Jun 29 15:14:46 2019] xen: --> irq=9, pirq=16
[Sat Jun 29 15:14:46 2019] xen: --> irq=8, pirq=17
[Sat Jun 29 15:14:46 2019] xen: --> irq=12, pirq=18
[Sat Jun 29 15:14:46 2019] xen: --> irq=1, pirq=19
[Sat Jun 29 15:14:46 2019] xen: --> irq=6, pirq=20
[Sat Jun 29 15:14:46 2019] xen: --> irq=4, pirq=21
[Sat Jun 29 15:14:46 2019] xen: --> irq=7, pirq=22
[Sat Jun 29 15:14:46 2019] xen: --> irq=23, pirq=23
[Sat Jun 29 15:14:46 2019] xen: --> irq=30, pirq=24
[Sat Jun 29 15:14:46 2019] usb usb1: root hub lost power or was reset
[Sat Jun 29 15:14:46 2019] ata2.01: configured for MWDMA2
[Sat Jun 29 15:14:46 2019] usb 1-2: reset full-speed USB device number 2 using uhci_hcd
[Sat Jun 29 15:14:46 2019] OOM killer enabled.
[Sat Jun 29 15:14:46 2019] Restarting tasks ... done.
[Sat Jun 29 15:14:46 2019] Setting capacity to 41943040
2/ Sometimes we can live-migrate a VM 100 times without a crash. Sometimes it crashes instantly. We could not determine what causes one behaviour or the other.
3/ Sometimes a VM crashes a few hours or days after the migration.
4/ Most of the time, the crash is a "kernel panic".
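The date jump from item 1/ can be spotted mechanically in a saved log. The sketch below assumes the bracketed timestamp format shown in the excerpt above (as printed by dmesg -T) and an English/C locale for the day and month names. The one-hour threshold is an arbitrary choice for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches a leading stamp such as "[Wed Mar 6 12:54:15 2019]".
STAMP = re.compile(r"^\[(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\]")


def find_jumps(lines, threshold=timedelta(hours=1)):
    """Yield (previous, current, gap) for consecutive stamps further apart than threshold."""
    previous = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip lines without a readable timestamp
        current = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
        if previous is not None and abs(current - previous) > threshold:
            yield previous, current, current - previous
        previous = current
```

Feeding it the lines of the excerpt from item 1/ should report a single jump, between "suspending xenstore..." and the first line logged after the resume.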
Have you tried increasing the grant table size?
Thanks for the suggestion, Tomáš. The description matches more or less what we are observing.
But the documentation is unclear. Should gnttab_max_frames be set on the Dom0 or the DomU side? I suppose Dom0 (the Citrix XenServer side).
By the way, I found this report, which describes more or less the same issue on Debian with Xen (not XenServer): https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=880554
It was written for Xen, and the setting depends on which version is used.
For Xen 4.9 it was global, but in 4.10+ it's a per-domU setting.
Please consult the XenServer docs.
XenServer (as of 7.6) is using Xen 4.7. We will be testing the "grant table size" tuning on Dom0's side and get back to you if we have any interesting feedback.
Let me know if you have any other leads to explore until then.
Please reopen if the problem persists.
After a long period of migrations and upgrades, I can confirm that raising gnttab_max_frames to 256 does not solve the issue.
We are still affected by random VM crashes when live-migrating Gentoo VMs between XenServer hosts in the same pool...
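To rule out a setting that never reached the hypervisor, the active value can be read back on each host. The sketch below assumes that the output of xl info on Dom0 contains a xen_commandline line and that the option appears there as gnttab_max_frames=N. The function only parses text that has already been captured, and the sample string in the usage note is made up for illustration.

```python
import re


def gnttab_max_frames(xl_info_text):
    """Return gnttab_max_frames from the xen_commandline line, or None if absent."""
    for line in xl_info_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "xen_commandline":
            match = re.search(r"\bgnttab_max_frames=(\d+)\b", value)
            return int(match.group(1)) if match else None
    return None
```

For example, gnttab_max_frames("xen_commandline : dom0_mem=4096M gnttab_max_frames=256") gives 256, while a command line without the option gives None.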