Created attachment 868237 [details]
crash log from 20-08-2023
On one of my machines i sometimes have trouble with emerge when running "emerge -avuDN @world". Although i had 3 crashes in a row today, i don't think it's a problem with portage-3.0.50 itself (which is why i decided to write only sys-apps/portage in the summary). Similar crashes happened in the past. Since it works on the second try most of the time, i didn't bother to file a bug report.
Even now this bug is not really important for me since portage works without problems.
Some more Details:
- This machine is a virtual machine. It runs on a gentoo host with 4 other vm's. These other vm's (also gentoo boxes) don't have these random crashes, so i think there is no hardware problem.
- The crashes seem to happen after some time of running. Usually i update (and restart) the vm's every few days or so (i would say twice a week, sometimes longer). Today the machine had been running for 2 weeks.
- Apart from the crashes everything works as expected. As mentioned, emerge usually works after the second try anyway.
The crashes from today (i had 3 crashes when calling emerge -avuDN @world) i put into a file called emerge-crash-20-08-23.log. There you can see the 3 crashes from portage, the dmesg output and the output of free.
If you need further details please let me know. The vm is also still running in case i should collect some more details.
The dmesg output seems to be always the same:
emerge: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
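For reference, the "order:5" in that line is the size class of the failed request: the kernel's page allocator hands out blocks of 2^order contiguous pages. A minimal sketch of the arithmetic follows; the helper name order_to_kib is made up for illustration, and the 4 KiB page size is an assumption (the usual x86_64 base page size), not something taken from the logs.

```shell
#!/bin/sh
# Size in KiB of an order-N page allocation: 2^N contiguous pages.
# order_to_kib is a hypothetical helper name.
# The second argument is the page size in KiB; it defaults to 4,
# which is an assumption (typical x86_64 base page size).
order_to_kib() {
    order=$1
    page_kib=${2:-4}
    echo $(( (1 << order) * page_kib ))
}

order_to_kib 5   # order:5 from the dmesg line; prints 128
```

So the failed request was for 32 physically contiguous pages, 128 KiB under the 4 KiB assumption.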
I've also attached logs from 21.06.23 (dmesg, emerge crash and emerge --info) as well as from 27.06.23 (only the dmesg output).
Created attachment 868238 [details]
crash log from 21-06-2023
Created attachment 868239 [details]
crash log from 27-06-2023
Userland applications shouldn't be able to crash the kernel.
Your error makes me wonder if it is OOMing or something though.
* Could you humour me and do a memtest on the host when you get a chance?
* Please upgrade to the latest 6.1.x kernel.
(In reply to Sam James from comment #3)
> Userland applications shouldn't be able to crash the kernel.
> Your error makes me wonder if it is OOMing or something though.
> * Could you humour me and do a memtest on the host when you get a chance?
> * Please upgrade to the latest 6.1.x kernel.
Yes, i thought of this too. The host only has 4GB RAM and 4GB swap. But used RAM is mostly around 500MB, and since emerge only installs binary packages (from my internal binhost) it never compiles packages either.
BTW, i also haven't over provisioned anything (CPU or RAM). The vm's should have no problem with their resources.
Memtest will be difficult. This is a host from hetzner, so i have no physical access. I can order a remote console but that would take time and since there are other services running on it it's going to be difficult.
Sure, i'm going to update the kernel. This was on my todo list anyway. FYI: right now the host and the guest are both running 6.1.38.
Two days ago i had another crash. Two in a row to be precise. I saved both crashes in a log including the relevant dmesg output. The system had been running for 6 days at that time. On the 3rd start emerge worked as expected, however i decided not to update.
Today i've run another emerge -avuDN @world (no updates or reboots since then) and it crashed again once (second time now worked again). Again i saved the relevant information in the logs.
I haven't made a memtest yet but it's on my todo list.
All vm's are running 6.1.46-gentoo. Actually they all use the same kernel, as i'm passing the kernel directly via qemu.
Something like: qemu-system-x86_64 -name vs4 -monitor unix:/run/kvm/vs4.sock,server=on,wait=off -pidfile /run/kvm/vs4.pid -device virtio-balloon -m 4096 -smp cores=2,threads=1,sockets=1 -machine q35 -k de -cpu host -accel kvm -runas nobody -kernel /mnt/data/vm/kernel/gentoo-latest -initrd /mnt/data/vm/kernel/initrd-v2.cpio.gz ...
Created attachment 870476 [details]
crash log from 11-09-2023
Created attachment 870477 [details]
crash log from 13-09-2023
Some news about this issue.
I found out this isn't a problem with emerge/portage and i will close this issue.
Today i got these crashes again. (got them in the past too, but haven't looked into it) Looking a bit deeper into this issue i found that a simple
# ls /var/db/repos/gentoo
already triggers these crashes with the following error:
ls: reading directory '/var/db/repos/gentoo/': Cannot allocate memory
Now, as before, there is still plenty of memory available. The kernel is at 6.6.13 (as i'm doing regular updates) and everything else works as expected.
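Plenty of free memory in total does not rule out an order:5 failure like the one in dmesg, because that request needs one contiguous block. Whether fragmentation is really what happens here is not confirmed anywhere in this report; the following is only a sketch for checking it. It sums the free blocks of at least a given order per zone from /proc/buddyinfo, where the counts for order 0 upwards start at the fifth field of each line. The function name free_blocks_at_least is made up for illustration.

```shell
#!/bin/sh
# Sum the free blocks of order >= min_order per zone.
# Input is /proc/buddyinfo-style text on stdin, one line per zone:
#   Node 0, zone   Normal   c0 c1 c2 ... c10
# Field 4 is the zone name, field 5 is the count for order 0.
# free_blocks_at_least is a hypothetical helper name.
free_blocks_at_least() {
    min_order=$1
    awk -v min="$min_order" '
        $1 == "Node" && $3 == "zone" {
            total = 0
            for (i = 5 + min; i <= NF; i++) total += $i
            printf "%s %d\n", $4, total
        }'
}

# On a live system: free blocks that could satisfy an order:5 request.
if [ -r /proc/buddyinfo ]; then
    free_blocks_at_least 5 < /proc/buddyinfo
fi
```

A zone that shows 0 here at the moment `ls` fails would point towards fragmentation; a large number would point away from it.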
/var/db/repos/gentoo/ is actually a 9P filesystem from the host (mounted only ro).
Even a remount (umount & mount) of the 9P filesystem didn't fix the issue. Only after some time and repeated tries with `ls` did the filesystem list again. Other directories could be listed without trouble; only the 9P mount didn't work.
I have no idea what the problem is. The host itself is also at the latest stable kernel and was rebooted previously too (but i didn't have the time to do a memory check). The exact same configuration is used on the 4 other vm's with no issues at all.
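To compare the 9P setup of this vm with the 4 working ones, the mount point and mount options of every 9p mount can be pulled out of /proc/mounts. This is only a sketch: the function name list_9p_mounts is made up, and the actual mount options used here are not given in this report.

```shell
#!/bin/sh
# Print "mountpoint options" for every 9p mount.
# Input is /proc/mounts-style text on stdin:
#   device mountpoint fstype options dump pass
# list_9p_mounts is a hypothetical helper name.
list_9p_mounts() {
    awk '$3 == "9p" { print $2, $4 }'
}

if [ -r /proc/mounts ]; then
    list_9p_mounts < /proc/mounts
fi
```

Running this on each guest and diffing the output would show whether the failing vm really mounts the share with the same options as the others.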
However since i want to switch to virtio-fs anyway i hope this error will be gone for good, which is why i'm not going to dig deeper.