Summary: | app-emulation/xen crashes dom0 under heavy disk load | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Krzysztof <calypso2k> |
Component: | Current packages | Assignee: | Tomáš Mózes <hydrapolic> |
Status: | RESOLVED NEEDINFO | ||
Severity: | normal | CC: | mgorny, proxy-maint, xen |
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | AMD64 | ||
OS: | Linux | ||
Whiteboard: | |||
Package list: | | Runtime testing required: | --- |
Description
Krzysztof
2020-09-23 14:59:19 UTC
@mgorny, can you please restore Xen 4.12 for some time? I know it's python2 only, but I had 2 user reports stating 4.13+ doesn't work well for them.

(In reply to Krzysztof from comment #0)
> hello,
> xen 4.13 and 4.14 crashes (hard lock) dom0 under heavy disk load,
> gentoo-sources-5.4.60 in dom0, steps to reproduce:
>
> - limit dom0 memory with dom0_mem=1024M,max:1024M and run no VMs
> OR
> - run some VMs, so there is not much memory left in dom0
> AND
> - turn on swap in dom0 with swapon
> - compile gcc in dom0 with 15 threads (MAKEOPTS="-j15")
>
> I have two Intel based machines that I can reproduce with.
>
> xen 4.12.3-r3 was removed from portage few days ago, xen 4.12 is perfectly
> stable in above conditions.
>
> It is possible to put xen 4.12 in portage until resolution?
>
> Regards,
> Krzysztof

Thanks for the report. Which scheduler are you running (credit/credit2)?

@Tomáš Mózes
I restored 4.12 by overwriting the portage tree with a snapshot from 15.09.2020 and re-emerging xen-4.12.3-r3.

Currently I am forcing credit2, but 4.13 with forced sched=credit2 also crashes.

@Krzysztof, how large did you set your swap to?

(In reply to Krzysztof from comment #3)
Can you please try credit (legacy)? We had issues with credit2 before.

My swap is 4GB. I will re-emerge 4.13 and try sched=credit, give me 15 min.

So I'm on xen-4.13.1-r3 with forced sched=credit at the bootloader. Compiling gcc, we should know if it's stable in 10 min.

So xen-4.13.1-r3 with forced sched=credit (in grub.cfg) seems to work :) I will update here in a few hours, whether it crashed or not.

It crashed :/

How long did it take to crash?

On my machine it's 10 to 18 min of compilation.

Just testing with the latest xen patches (https://github.com/gentoo/gentoo/pull/17638) with 1G of RAM on dom0 and 10G of swap, and so far no lockup. My grub config is:
GRUB_CMDLINE_LINUX="panic=30 net.ifnames=0"
GRUB_CMDLINE_XEN="dom0_mem=1G gnttab_max_frames=256 ucode=scan loglvl=all guest_loglvl=all console_to_ring console_timestamps=date conring_size=1m smt=true sched=credit iommu=no-intremap"

I'm testing bare metal (second time, without xen) to be sure. I'll get back in an hour.

So it's like:
- gentoo-sources-5.4.60 without xen, with kernel param mem=1G (to simulate low RAM): works ok
- xen-4.13.1-r3 with or without sched=credit/credit2: crashes
- xen-4.12.3-r3 with or without sched=credit/credit2: crashes, takes more time

Now I'm trying https://github.com/gentoo/gentoo/pull/17638. I'll get back.

So it's probably an older issue, not just with 4.13+. My Intel machine got really slow after 2 hours, but I managed to ssh in and kill the gcc compilation (load was around 25), so I cannot reproduce it here for now. Mind sharing your kernel/xen command line options? From grub or whatever you use.

Yea, I'm using a custom grub EFI build, signed by custom keys (secure boot). It then loads xen, vmlinuz and initrd, also signed, this time by gpg. So it's a bit complicated, but tested and rock solid.
So for xen.gz it's only dom0_mem=1024M,max:1024M
for the kernel it's root=... ro root=UUID=... rd.luks.uuid=...
quiet splash softlevel=xen

What's funny, it looks like it's another platform with a borked BIOS. I'm sitting at a BOXNUC7PJYH2, the Intel NUC.

So I'm running https://github.com/gentoo/gentoo/pull/17638 with xen.gz parameters like you propose:
dom0_mem=1G gnttab_max_frames=256 ucode=scan loglvl=all guest_loglvl=all console_to_ring console_timestamps=date conring_size=1m smt=true sched=credit iommu=no-intremap
Believe it or not, it is working now :) or at least I've been compiling for 32 min now.
Tomorrow I'll try to isolate the parameter that makes the difference, I'm betting on iommu=no-intremap.
I'll get back.

The gnttab_max_frames parameter was mostly needed for domU (https://wiki.gentoo.org/wiki/Xen#Xen_domU_hanging_with_kernel_4.3.2B). iommu=no-intremap is a workaround for the server itself, an HP DL360G7 (https://support.citrix.com/article/CTX136517). sched=credit is a workaround for the default scheduler changing from credit to credit2 (https://wiki.gentoo.org/wiki/Xen#Xen_domU_hanging_with_Xen_4.12.2B).

So it crashed after 99 min.

At this point I think it's platform specific.

Main takeaway for anybody reading: unless you use swap, you should be safe. It's not strange for my machines to see 300+ days of uptime with 8-10 domUs, some of them pushing 50GB of traffic over the NICs. It's only recently that I had to use swap and discovered this bug.

I will test some more and share my discoveries here.

Massive THANK YOU to Tomáš for his help.

(In reply to Krzysztof from comment #18)
Are you doing live patching or missing security fixes?
:) If you reboot your machine, will it behave the same? Maybe it only appears if you have such a high uptime. If you observe the same behavior after the reboot, it would be worth writing to the xen-users mailing list for advice.

Yes, I use all kinds of trickery to keep it up to date and not reboot :) The machine I'm testing now is rebooted after every crash.

I sent an e-mail to xen-users, will see if someone can help. I'll get back.

Any news here?
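
For anyone retesting this, the scheduler question asked earlier in the thread (credit vs. credit2) can be checked from dom0, since `xl info` prints a `xen_scheduler` line. Below is a minimal sketch, assuming the xl toolstack is installed and that the line has the usual `xen_scheduler : <name>` form. The helper name `xen_scheduler` is made up for illustration and is not part of Xen.

```shell
# Hypothetical helper (not part of Xen): read `xl info` output on stdin
# and print the active scheduler name, e.g. "credit" or "credit2".
# Assumes the usual "xen_scheduler : <name>" line format.
xen_scheduler() {
    awk -F: '$1 ~ /^xen_scheduler[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Usage in dom0 (needs root and the xl toolstack):
#   xl info | xen_scheduler
```

This is useful for confirming that a `sched=credit` or `sched=credit2` entry on the Xen command line actually took effect after a reboot, before starting a long gcc build.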