Bug 744349 - app-emulation/xen crashes dom0 under heavy disk load
Summary: app-emulation/xen crashes dom0 under heavy disk load
Status: RESOLVED NEEDINFO
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: AMD64 Linux
Importance: Normal normal
Assignee: Tomáš Mózes
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-09-23 14:59 UTC by Krzysztof
Modified: 2020-11-30 14:15 UTC
CC List: 3 users

See Also:
Package list:
Runtime testing required: ---


Description Krzysztof 2020-09-23 14:59:19 UTC
Hello,
Xen 4.13 and 4.14 crash dom0 (hard lock) under heavy disk load, with gentoo-sources-5.4.60 in dom0. Steps to reproduce:

- limit dom0 memory with dom0_mem=1024M,max:1024M and run no VMs
OR
- run some VMs, so there is not much memory left in dom0
AND
- turn on swap in dom0 with swapon
- compile gcc in dom0 with 15 threads (MAKEOPTS="-j15")
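
A minimal sketch for confirming that a dom0 matches the conditions listed above (about 1 GiB of memory, swap enabled) before starting the gcc build. It assumes the usual /proc/meminfo layout with values in kB; the function names and the 1 GiB default threshold are illustrative, not part of the report.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def matches_repro_conditions(meminfo, max_mem_kib=1024 * 1024):
    """Return True if total memory is at most max_mem_kib and swap is on.

    The 1 GiB default mirrors dom0_mem=1024M,max:1024M from the report;
    it is an assumption, adjust it for other setups.
    """
    mem_total = meminfo.get("MemTotal")
    swap_total = meminfo.get("SwapTotal", 0)
    return mem_total is not None and mem_total <= max_mem_kib and swap_total > 0
```

In dom0 this could be called as matches_repro_conditions(parse_meminfo(open("/proc/meminfo").read())).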

I can reproduce this on two Intel-based machines.

xen 4.12.3-r3 was removed from portage a few days ago; Xen 4.12 is perfectly stable under the above conditions.

Is it possible to put Xen 4.12 back in portage until this is resolved?

Regards,
Krzysztof
Comment 1 Tomáš Mózes 2020-09-23 15:16:32 UTC
@mgorny, can you please restore Xen 4.12 for some time? I know it's python2 only, but I had 2 user reports stating 4.13+ doesn't work well for them.
Comment 2 Tomáš Mózes 2020-09-23 15:18:19 UTC
(In reply to Krzysztof from comment #0)
> hello,
> xen 4.13 and 4.14 crashes (hard lock) dom0 under heavy disk load,
> gentoo-sources-5.4.60 in dom0, steps to reproduce:
> 
> - limit dom0 memory with dom0_mem=1024M,max:1024M and run no VMs
> OR
> - run some VMs, so there is not much memory left in dom0
> AND
> - turn on swap in dom0 with swapon
> - compile gcc in dom0 with 15 threads (MAKEOPTS="-j15")
> 
> I have two Intel based machines that I can reproduce with.
> 
> xen 4.12.3-r3 was removed from portage few days ago, xen 4.12 is perfectly
> stable in above conditions.
> 
> It is possible to put xen 4.12 in portage until resolution?
> 
> Regards,
> Krzysztof

Thanks for the report. Which scheduler are you running (credit/credit2)?
Comment 3 Krzysztof 2020-09-23 15:22:57 UTC
@Tomáš Mózes 
I restored 4.12 by overwriting the portage tree with a snapshot from 15.09.2020 and re-emerging xen-4.12.3-r3.

Currently I am forcing credit2, but 4.13 with forced sched=credit2 also crashes.
Comment 4 Tomáš Mózes 2020-09-23 15:23:48 UTC
@Krzysztof, how large did you set your swap to?
Comment 5 Tomáš Mózes 2020-09-23 15:24:22 UTC
(In reply to Krzysztof from comment #3)
> @Tomáš Mózes 
> I restored 4.12 by overwriting portage tree with snapshot @ 15.09.2020 and
> reemerging xen-4.12.3-r3
> 
> currently I 'am forcing credit2, but 4.13 with forced sched=credit2 also
> crashes.

Can you please try credit (legacy)? We had issues with credit2 before.
Comment 6 Krzysztof 2020-09-23 15:25:52 UTC
My swap is 4GB. I will re-emerge 4.13 and try sched=credit, give me 15 min.
Comment 7 Krzysztof 2020-09-23 15:48:02 UTC
So I'm on xen-4.13.1-r3 with sched=credit forced in the bootloader.

Compiling gcc, we should know if it's stable in 10 min.
Comment 8 Krzysztof 2020-09-23 16:21:30 UTC
So xen-4.13.1-r3 with forced sched=credit (in grub.cfg) seems to work :)

I will update here in a few hours, whether it crashed or not.
Comment 9 Krzysztof 2020-09-23 16:23:10 UTC
It crashed :/
Comment 10 Tomáš Mózes 2020-09-23 17:39:10 UTC
How long did it take to crash?
Comment 11 Krzysztof 2020-09-23 18:06:15 UTC
On my machine it takes 10 to 18 min of compilation.
Comment 12 Tomáš Mózes 2020-09-23 18:14:21 UTC
Just testing with the latest Xen patches (https://github.com/gentoo/gentoo/pull/17638) with 1G of RAM on dom0 and 10G of swap, and so far no lockup. My grub config is:

GRUB_CMDLINE_LINUX="panic=30 net.ifnames=0"
GRUB_CMDLINE_XEN="dom0_mem=1G gnttab_max_frames=256 ucode=scan loglvl=all guest_loglvl=all console_to_ring console_timestamps=date conring_size=1m smt=true sched=credit iommu=no-intremap"
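
To make command lines like the one above easier to compare between machines, here is a minimal sketch that splits a Xen command line into a dict. The function name is hypothetical, and it only splits on whitespace and the first "="; Xen's own parser has more rules (for example boolean prefixes), which this does not model.

```python
def parse_xen_cmdline(cmdline):
    """Split a Xen command line into {option: value}.

    Options without "=" (such as console_to_ring) map to True.
    Only the first "=" splits, so "dom0_mem=1024M,max:1024M"
    keeps "1024M,max:1024M" as its value.
    """
    options = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        options[key] = value if sep else True
    return options
```

Comparing the key sets of two parsed command lines gives the options present on one host but not on the other.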
Comment 13 Krzysztof 2020-09-23 18:33:40 UTC
I'm testing on bare metal (a second time, without Xen) to be sure.

I'll get back in an hour.
Comment 14 Krzysztof 2020-09-23 19:44:40 UTC
So it's like this:
gentoo-sources-5.4.60 without Xen, with kernel parameter mem=1G (to simulate low RAM): works OK

xen-4.13.1-r3 with or without sched=credit/credit2: crashes
xen-4.12.3-r3 with or without sched=credit/credit2: crashes, but takes more time

Now I'm trying https://github.com/gentoo/gentoo/pull/17638

I'll get back.
Comment 15 Tomáš Mózes 2020-09-23 19:48:56 UTC
So it's probably an older issue, not just with 4.13+.

My Intel machine got really slow after 2 hours, but I managed to ssh in and kill the gcc compilation (load was around 25), so I cannot reproduce it here for now.

Mind sharing your kernel/xen command line options? From grub or whatever you use.
Comment 16 Krzysztof 2020-09-23 21:36:47 UTC
Yeah, I'm using a custom GRUB EFI build, signed with custom keys for Secure Boot.

It then loads xen, vmlinuz and initrd, which are also signed, this time with gpg. So it's a bit complicated, but tested and rock solid.

So for xen.gz it's only dom0_mem=1024M,max:1024M
for the kernel it's root=... ro root=UUID=... rd.luks.uuid=... quiet splash softlevel=xen

What's funny is that it looks like yet another platform with a borked BIOS.

I'm sitting at a BOXNUC7PJYH2, the Intel NUC.

So I'm running https://github.com/gentoo/gentoo/pull/17638 with xen.gz parameters like you proposed: dom0_mem=1G gnttab_max_frames=256 ucode=scan loglvl=all guest_loglvl=all console_to_ring console_timestamps=date conring_size=1m smt=true sched=credit iommu=no-intremap

Believe it or not, it is working now :) or at least I've been compiling for 32 min now.

Tomorrow I'll try to isolate the parameter that makes the difference. I'm betting on iommu=no-intremap.
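
One way to isolate a single parameter is to boot with each option removed in turn and repeat the gcc build. A minimal sketch that generates those command lines; the function name and the choice to always keep dom0_mem (so the low-memory condition stays in place) are assumptions for illustration.

```python
def leave_one_out(cmdline, keep=("dom0_mem",)):
    """Return {option_name: cmdline_without_that_option}.

    Options named in `keep` are never removed, so every variant
    still boots dom0 with the same memory limit.
    """
    tokens = cmdline.split()
    variants = {}
    for i, token in enumerate(tokens):
        name = token.partition("=")[0]
        if name in keep:
            continue
        variants[name] = " ".join(tokens[:i] + tokens[i + 1:])
    return variants
```

If only the variant missing one particular option locks up, that option is the likely workaround. With crashes taking anywhere from minutes to over an hour, each variant needs a long run before it can be called stable.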

I'll get back
Comment 17 Tomáš Mózes 2020-09-24 05:26:24 UTC
The gnttab_max_frames parameter was mostly needed for domU (https://wiki.gentoo.org/wiki/Xen#Xen_domU_hanging_with_kernel_4.3.2B).

iommu=no-intremap is a workaround for the server itself - it's an HP DL360G7 server (https://support.citrix.com/article/CTX136517).

sched=credit is a workaround for the default scheduler changing from credit to credit2 (https://wiki.gentoo.org/wiki/Xen#Xen_domU_hanging_with_Xen_4.12.2B).
Comment 18 Krzysztof 2020-09-24 08:16:49 UTC
So it crashed after 99min.

At this point I think it's platform specific.

Main takeaway for anybody reading: unless you use swap, you should be safe. It's not unusual for my machines to see 300+ days of uptime with 8-10 domUs, some of them pushing 50GB of traffic on their NICs. It's only recently that I had to use swap and discovered this bug.

I will test some more and share my findings here.

Massive THANK YOU to Tomáš for his help.
Comment 19 Tomáš Mózes 2020-09-24 09:47:54 UTC
(In reply to Krzysztof from comment #18)
> So it crashed after 99min.
> 
> At this point I think it's platform specific.
> 
> Main take away for anybody reading: unless you use swap, you should be safe,
> it's not strange for my machines to see 300+ days uptime, 8-10 domU's, some
> of them make 50GB of traffic on NICs. It's only recently that I had to use
> swap and discovered this bug.
> 
> I will test some more and share my discoveries here
> 
> massive THANK YOU to Tomáš, for his help

Are you doing live patching or missing security fixes? :)

If you reboot your machine, will it behave the same? Maybe it only appears if you have such a high uptime. If you observe the same behavior after the reboot, it would be worth writing to the xen-users mailing list for advice.
Comment 20 Krzysztof 2020-09-24 14:14:15 UTC
Yes, I use all kinds of trickery to keep it up to date and not reboot :)

The machine I'm testing now is rebooted after every crash.

I sent an e-mail to xen-users; we'll see if someone can help.

I'll get back.
Comment 21 Tomáš Mózes 2020-10-21 09:09:32 UTC
Any news here?