679532 – sys-kernel/gentoo-sources: Live-Migration fails on XenServer

Status:	RESOLVED NEEDINFO

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	AMD64 Linux

Importance:	Normal normal
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2019-03-05 16:46 UTC by Gaetan
Modified:	2021-09-07 22:32 UTC
CC List:	1 user

See Also:	564276
Package list:
Runtime testing required:	---

Description Gaetan 2019-03-05 16:46:20 UTC

Hello Everyone,

We have been using Gentoo for years now on a large number of VMs/Bare-Metal.

But for the last months/years, we have been facing a bug related to live migration of Gentoo guests on XenServer hypervisors, that is, moving a VM from one host to another inside a pool without downtime.

-> Citrix support says: Gentoo bug, we can't help
-> We can't reproduce this behaviour on other systems (CentOS/Ubuntu/Debian/Windows Server)
-> We didn't have this bug a long time ago (we cannot say when it started)
-> We have reproduced this bug on multiple different XenServer clusters (different hardware, network and storage) and on versions since 6.5
-> We can reproduce this bug with multiple different kernel versions
-> The bug seems to be less likely when the VM is freshly booted

This bug, fixed a long time ago, describes more or less what we see on Gentoo + XenServer: https://bugs.gentoo.org/564276

Reproducible: Always

Steps to Reproduce:
1. Run a fresh Gentoo guest with the latest kernel on XenServer 7.x
2. Use up memory (cat /dev/xvda > /dev/null to fill the buffers)
3. Live-migrate the guest

After the migration is done, the load average climbs without bound and the VM dies. Only a "hard" reboot resolves the issue.
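
Step 2 and the load observation can be scripted inside the guest. What follows is a minimal sketch, not something taken from the report: the function names and the `max_bytes` parameter are hypothetical, and it assumes a Linux guest where /proc/loadavg is available. `fill_page_cache` does the same job as `cat /dev/xvda > /dev/null`, and `current_load` reads the load averages so they can be compared before and after the migration.

```python
def fill_page_cache(path, chunk_size=1 << 20, max_bytes=None):
    """Read `path` sequentially and discard the data, like `cat path > /dev/null`.

    Returns the number of bytes read. `max_bytes` is a hypothetical
    convenience parameter that stops early, e.g. for a trial on a small file.
    """
    total = 0
    with open(path, "rb") as f:
        while max_bytes is None or total < max_bytes:
            want = chunk_size if max_bytes is None else min(chunk_size, max_bytes - total)
            data = f.read(want)
            if not data:
                break
            total += len(data)
    return total


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg content."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def current_load(path="/proc/loadavg"):
    """Read and parse the load averages of the running system."""
    with open(path) as f:
        return parse_loadavg(f.read())
```

A possible use is to call `fill_page_cache("/dev/xvda")` as root before triggering the migration, then poll `current_load()` afterwards and log the values until the VM becomes unresponsive.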

Any help or lead to resolution would be much appreciated.

Thanks,

Comment 1 Tomáš Mózes 2019-03-05 17:06:35 UTC

Do you run gentoo-sources on them? Which versions have you tested?

Comment 2 Gaetan 2019-03-06 10:12:12 UTC

We have tested multiple versions since 4.x.

The latest one tested is 4.20.4. Same bug on all of them.

Comment 3 Tomáš Mózes 2019-03-06 11:28:09 UTC

(In reply to Gaetan from comment #0)
> We have been using Gentoo for years now on a large number of VMs/Bare-Metal.
> 
> But for the last months/years, we have been facing a bug.

This is a bit confusing. You've been using it for years, but the bug seems to have been present for years too? So it's not a recent thing; hasn't it been there since the beginning?

Comment 4 Tomáš Mózes 2019-03-06 11:28:34 UTC

Which flavor of kernel do you use? gentoo-sources / vanilla-sources / ... ?

Comment 5 Gaetan 2019-03-06 11:30:46 UTC

The bug was NOT present a long time ago (>2y), and it has been present ever since. Until now, we were convinced it came from our infrastructure or from Xen, but our latest tests with other OSes suggest we were wrong. That's why we are opening this bug with Gentoo now.

We are using the gentoo-sources flavor.

Comment 6 Tomáš Mózes 2019-03-06 11:38:18 UTC

Do you remember the last kernel version it worked on?

Comment 7 Gaetan 2019-03-06 12:00:08 UTC

I cannot tell which was the last working kernel.

Let me share some info we already shared with Citrix.


1/ Here is one of the "strange" things we could observe during a live migration. The dmesg output shows "weird" dates:

[Wed Mar  6 12:54:15 2019] Freezing user space processes ... (elapsed 0.001 seconds) done.
[Wed Mar  6 12:54:15 2019] OOM killer disabled.
[Wed Mar  6 12:54:15 2019] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[Wed Mar  6 12:54:15 2019] suspending xenstore...
[Sat Jun 29 15:14:46 2019] xen:events: Xen HVM callback vector for event delivery is enabled
[Sat Jun 29 15:14:46 2019] Xen Platform PCI: I/O protocol version 1
[Sat Jun 29 15:14:46 2019] xen:grant_table: Grant tables using version 1 layout
[Sat Jun 29 15:14:46 2019] xen: --> irq=9, pirq=16
[Sat Jun 29 15:14:46 2019] xen: --> irq=8, pirq=17
[Sat Jun 29 15:14:46 2019] xen: --> irq=12, pirq=18
[Sat Jun 29 15:14:46 2019] xen: --> irq=1, pirq=19
[Sat Jun 29 15:14:46 2019] xen: --> irq=6, pirq=20
[Sat Jun 29 15:14:46 2019] xen: --> irq=4, pirq=21
[Sat Jun 29 15:14:46 2019] xen: --> irq=7, pirq=22
[Sat Jun 29 15:14:46 2019] xen: --> irq=23, pirq=23
[Sat Jun 29 15:14:46 2019] xen: --> irq=30, pirq=24
[Sat Jun 29 15:14:46 2019] usb usb1: root hub lost power or was reset
[Sat Jun 29 15:14:46 2019] ata2.01: configured for MWDMA2
[Sat Jun 29 15:14:46 2019] usb 1-2: reset full-speed USB device number 2 using uhci_hcd
[Sat Jun 29 15:14:46 2019] OOM killer enabled.
[Sat Jun 29 15:14:46 2019] Restarting tasks ... done.
[Sat Jun 29 15:14:46 2019] Setting capacity to 41943040

2/ Sometimes, we can live-migrate a VM 100 times without a crash. Sometimes it crashes instantly. We could not determine what causes one behaviour or the other.

3/ Sometimes, a VM crashes a few hours/days after the migration.

4/ Most of the time, the crash is a "kernel panic".
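
The date jump in the dmesg excerpt under 1/ can be measured rather than eyeballed. The sketch below is illustrative only: the function names are hypothetical, and it assumes the human-readable `[Wed Mar  6 12:54:15 2019] message` layout shown above with English month names. It reports the largest gap between consecutive lines and the two messages around it. Keep in mind that the dmesg man page warns that human-readable timestamps may be inaccurate after a suspend/resume, so the gap may say more about the kernel's clock handling than about real elapsed time.

```python
import re
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Matches the "[Wed Mar  6 12:54:15 2019] message" layout from the excerpt.
LINE_RE = re.compile(
    r"^\[\w{3} (\w{3})\s+(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) (\d{4})\] (.*)$")


def parse_line(line):
    """Return (timestamp, message), or None if the line has another layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    mon, day, hh, mm, ss, year, msg = m.groups()
    stamp = datetime(int(year), MONTHS[mon], int(day), int(hh), int(mm), int(ss))
    return stamp, msg


def largest_jump(lines):
    """Return (delta, message_before, message_after) for the biggest gap
    between consecutive timestamped lines, or None if fewer than two parse."""
    parsed = [p for p in map(parse_line, lines) if p is not None]
    best = None
    for (t0, m0), (t1, m1) in zip(parsed, parsed[1:]):
        delta = t1 - t0
        if best is None or delta > best[0]:
            best = (delta, m0, m1)
    return best
```

Fed the excerpt above, this should place the jump between "suspending xenstore..." and the first "xen:events" line, i.e. exactly across the suspend/resume performed by the migration.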

Comment 8 Tomáš Mózes 2019-03-06 14:00:18 UTC

Have you tried increasing the grant table size?

https://wiki.gentoo.org/wiki/Xen#Xen_domU_hanging_with_kernel_4.3.2B

Comment 9 Gaetan 2019-03-06 15:26:08 UTC

Thanks for the suggestion, Tomáš. The description is more or less what we are observing.

However, the documentation is unclear: should gnttab_max_frames be set on the Dom0 or the DomU side? I suppose Dom0 (the Citrix XenServer side).

By the way, I found this report, which describes more or less the same issue on Debian with Xen (not XenServer): https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=880554

Gaëtan

Comment 10 Tomáš Mózes 2019-03-06 16:23:32 UTC

It was written for Xen, and the setting depends on which version is used.

For Xen 4.9 it was global, but in 4.10+ it's a per-domU setting.

Please consult the XenServer docs.
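
The version split described in this comment can be written down as a small lookup. This is only a sketch under assumptions: the function name is hypothetical, and the option names (`gnttab_max_frames` as a Xen hypervisor boot option, `max_grant_frames` as a per-domain xl.cfg setting) come from upstream Xen documentation, not from this thread. XenServer manages guests through its own toolstack, so how either value is applied there is not covered.

```python
def grant_frames_scope(xen_version):
    """Map a Xen (major, minor) version to where the grant-frame limit is set,
    following the 4.9 / 4.10+ split described in this comment."""
    if tuple(xen_version) >= (4, 10):
        # Assumption from upstream Xen docs: per-domain setting in xl.cfg.
        return ("per-domU", "max_grant_frames")
    # Assumption from upstream Xen docs: hypervisor boot option, applies to all guests.
    return ("global", "gnttab_max_frames")
```

For example, `grant_frames_scope((4, 9))` yields the global hypervisor option, while `grant_frames_scope((4, 11))` points at the per-domU setting.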

Comment 11 Gaetan 2019-03-06 17:21:50 UTC

XenServer (as of 7.6) uses Xen 4.7. We will test the "grant table size" tuning on the Dom0 side and get back to you if we have any interesting feedback.

Let me know if you have any other leads to explore until then.

Comment 12 Tomáš Mózes 2019-03-15 13:48:24 UTC

Please reopen if the problem persists.

Comment 13 Gaetan 2019-05-07 15:30:29 UTC

Hello,

After a long period of migrations and upgrades, I can confirm that raising gnttab_max_frames to 256 does not solve the issue.

We are still affected by random VM crashes when live-migrating Gentoo VMs between XenServer hosts of the same pool.

Comment 14 Mike Pagano gentoo-dev 2021-08-24 21:46:34 UTC

Is this still an issue?