912660 – sys-apps/portage: random crashes when running emerge

Summary: sys-apps/portage: random crashes when running emerge

Status:	RESOLVED OBSOLETE

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Portage team

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2023-08-20 08:16 UTC by Michael Mair-Keimberger (iamnr3)
Modified:	2024-02-18 10:55 UTC
CC List:	2 users

See Also:
Package list:
Runtime testing required:	---

Attachments
crash log from 20-08-2023 (emerge-crash-20-08-23.log,21.76 KB, text/x-log) 2023-08-20 08:16 UTC, Michael Mair-Keimberger (iamnr3)
crash log from 21-06-2023 (emerge-crash-21-06-23.log,16.03 KB, text/x-log) 2023-08-20 08:17 UTC, Michael Mair-Keimberger (iamnr3)
crash log from 27-06-2023 (emerge-crash-27-06-23.log,5.42 KB, text/x-log) 2023-08-20 08:17 UTC, Michael Mair-Keimberger (iamnr3)
crash log from 11-09-2023 (crash-09-11-2023.log,17.17 KB, text/x-log) 2023-09-13 08:57 UTC, Michael Mair-Keimberger (iamnr3)
crash log from 13-09-2023 (crash-09-13-2023.log,15.18 KB, text/x-log) 2023-09-13 08:57 UTC, Michael Mair-Keimberger (iamnr3)

Description Michael Mair-Keimberger (iamnr3) 2023-08-20 08:16:54 UTC

Created attachment 868237 [details]
crash log from 20-08-2023

Hi,

On one of my machines I sometimes have trouble with emerge when running "emerge -avuDN @world". Although I had 3 crashes in a row today, I don't think it's a problem with portage-3.0.50 itself (which is why I decided to write only sys-apps/portage in the summary). Similar crashes happened in the past. Since it works on the second try most of the time, I didn't bother to file a bug report.
Even now this bug is not really important for me since portage works without problems.

Some more details:
- This machine is a virtual machine. It runs on a Gentoo host with 4 other VMs. These other VMs (also Gentoo boxes) don't have these random crashes, so I think there is no hardware problem.
- The crashes seem to happen after the machine has been running for some time. Usually I update (and restart) the VMs every few days or so (I would say twice a week, sometimes longer). Today the machine had been running for 2 weeks.
- Apart from the crashes everything works as expected. As mentioned, emerge usually works after the second try anyway.

I put the crashes from today (I had 3 crashes when calling emerge -avuDN @world) into a file called emerge-crash-20-08-23.log. There you can see the 3 crashes from portage, the dmesg output and the output of free.

If you need further details please let me know. The VM is still running in case I should collect more information.

The dmesg output seems to be always the same:
emerge: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
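
A note on reading that dmesg line: `order:5` means the kernel was asked for 2^5 = 32 physically contiguous pages in a single allocation. The sketch below pulls the order and GFP flags out of such a line so the attached logs can be compared quickly. It is only an illustration: the function name `parse_alloc_failure` is made up here, and the 4 KiB page size is an assumption (the usual x86_64 default), not something stated in the logs.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual x86_64 default

# Matches lines such as:
#   emerge: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|...), ...
# An optional "[  123.456]" dmesg timestamp prefix is tolerated.
_FAILURE_RE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s*)?"
    r"(?P<comm>.+?): page allocation failure: "
    r"order:(?P<order>\d+), "
    r"mode:(?P<mode>0x[0-9a-fA-F]+)\((?P<flags>[^)]*)\)"
)


def parse_alloc_failure(line):
    """Return details of a 'page allocation failure' line, or None if no match."""
    match = _FAILURE_RE.search(line)
    if match is None:
        return None
    order = int(match.group("order"))
    pages = 1 << order  # an order-n request is 2**n contiguous pages
    return {
        "comm": match.group("comm"),
        "order": order,
        "pages": pages,
        "bytes": pages * PAGE_SIZE,
        "mode": match.group("mode"),
        "flags": match.group("flags").split("|"),
    }
```

For the line above this gives 32 pages, i.e. 131072 bytes (128 KiB) under the 4 KiB page assumption. A request of that size can fail when memory is fragmented even though `free` still reports plenty of available memory.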

I've also attached logs from 21.06.23 (dmesg, emerge crash and emerge --info) as well as from 27.06.23 (only the dmesg output).


Regards

Comment 1 Michael Mair-Keimberger (iamnr3) 2023-08-20 08:17:22 UTC

Created attachment 868238 [details]
crash log from 21-06-2023

Comment 2 Michael Mair-Keimberger (iamnr3) 2023-08-20 08:17:36 UTC

Created attachment 868239 [details]
crash log from 27-06-2023

Comment 3 Sam James archtester 2023-08-20 08:20:22 UTC

Userland applications shouldn't be able to crash the kernel.

Your error makes me wonder if it is OOMing or something though.

* Could you humour me and do a memtest on the host when you get a chance?
* Please upgrade to the latest 6.1.x kernel.

Comment 4 Michael Mair-Keimberger (iamnr3) 2023-08-20 14:49:00 UTC

(In reply to Sam James from comment #3)
> Userland applications shouldn't be able to crash the kernel.
> 
> Your error makes me wonder if it is OOMing or something though.
> 
> * Could you humour me and do a memtest on the host when you get a chance?
> * Please upgrade to the latest 6.1.x kernel.

Yes, I thought of this too. The host only has 4GB RAM and 4GB swap, but used RAM is mostly around 500MB, and since emerge only installs binary packages (from my internal binhost) it never compiles packages either.

BTW, I also haven't overprovisioned anything (CPU or RAM). The VMs should have no problem with their resources.
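
On the memory question above: low overall usage does not rule out a failed high-order allocation, because such a request needs contiguous free pages. One way to look at this on the guest is `/proc/buddyinfo`, which lists, per zone, the number of free blocks of each order. The following is a minimal sketch, assuming the usual `Node N, zone NAME c0 c1 ...` layout of that file; the function names are illustrative only.

```python
def free_blocks_by_order(buddyinfo_text, min_order):
    """Sum the free blocks of order >= min_order for each (node, zone).

    Each data line of /proc/buddyinfo looks like:
        Node 0, zone   Normal   c0 c1 c2 ...
    where cN is the number of free blocks of order N.
    """
    result = {}
    for line in buddyinfo_text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue  # skip anything that is not a zone line
        node = parts[1].rstrip(",")
        zone = parts[3]
        counts = [int(field) for field in parts[4:]]
        result[(node, zone)] = sum(counts[min_order:])
    return result


def read_buddyinfo(path="/proc/buddyinfo"):
    """Read the buddy allocator statistics from the running kernel."""
    with open(path) as handle:
        return handle.read()
```

Calling `free_blocks_by_order(read_buddyinfo(), 5)` right after a failure would show whether any order-5 or larger blocks were free at that moment. A zero in the relevant zone would point to fragmentation rather than an actual lack of memory.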

Memtest will be difficult. This is a host from Hetzner, so I have no physical access. I can order a remote console, but that would take time, and since there are other services running on the host it's going to be difficult.

Sure, I'm going to update the kernel. This was on my todo list anyway. FYI: right now, both the host and the guest are running 6.1.38.

Comment 5 Michael Mair-Keimberger (iamnr3) 2023-09-13 08:56:56 UTC

Some updates:

Two days ago I had another crash, two in a row to be precise. I saved both crashes in a log, including the relevant dmesg output. The system had been running for 6 days at that point. On the 3rd attempt emerge worked as expected, however I decided not to update.
Today I ran another emerge -avuDN @world (no updates or reboots since then) and it crashed once again (the second attempt worked). Again I saved the relevant information in the logs.

I haven't run a memtest yet but it's on my todo list.

All VMs are running 6.1.46-gentoo. Actually they all use the same kernel, as I'm passing the kernel directly via qemu.
Something like: qemu-system-x86_64 -name vs4 -monitor unix:/run/kvm/vs4.sock,server=on,wait=off -pidfile /run/kvm/vs4.pid -device virtio-balloon -m 4096 -smp cores=2,threads=1,sockets=1 -machine q35 -k de -cpu host -accel kvm -runas nobody -kernel /mnt/data/vm/kernel/gentoo-latest -initrd /mnt/data/vm/kernel/initrd-v2.cpio.gz ...

Comment 6 Michael Mair-Keimberger (iamnr3) 2023-09-13 08:57:27 UTC

Created attachment 870476 [details]
crash log from 11-09-2023

Comment 7 Michael Mair-Keimberger (iamnr3) 2023-09-13 08:57:44 UTC

Created attachment 870477 [details]
crash log from 13-09-2023

Comment 8 Michael Mair-Keimberger (iamnr3) 2024-02-18 10:55:54 UTC

Some news about this issue.

I found out this isn't a problem with emerge/portage, so I will close this issue.

Some background:
Today I got these crashes again (I got them in the past too, but hadn't looked into it). Looking a bit deeper into this issue, I found that a simple

# ls /var/db/repos/gentoo

already triggers these crashes with the following error:

ls: reading directory '/var/db/repos/gentoo/': Cannot allocate memory

Now, as before, there is still plenty of memory available. The kernel is at 6.6.13 (as I update regularly) and everything else works as expected.
/var/db/repos/gentoo/ is actually a 9P filesystem from the host (mounted read-only).
Even a remount (umount & mount) of the 9P filesystem didn't fix the issue. Only after some time and repeated attempts with `ls` did the filesystem list again. Other directories could be listed without trouble; only the 9P mount didn't work.
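
The behaviour described above (a directory listing that fails with "Cannot allocate memory", i.e. ENOMEM, and then succeeds again after a few attempts) can be reproduced without emerge. Below is a minimal sketch for doing so; the function name, the number of attempts and the delay are arbitrary choices for illustration, not something taken from the logs.

```python
import errno
import os
import time


def list_with_retry(path, attempts=5, delay=1.0):
    """List a directory, retrying only when the kernel reports ENOMEM.

    Returns (attempt_number, names) on success. Any other OSError is
    re-raised immediately; if every attempt fails with ENOMEM, the last
    error is raised.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return attempt, os.listdir(path)
        except OSError as exc:
            if exc.errno != errno.ENOMEM:
                raise
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    raise last_error
```

Running `list_with_retry("/var/db/repos/gentoo")` on the affected VM and on one of the unaffected VMs would show how many attempts are needed on each, which might help narrow the problem down to the 9P mount itself.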

I have no idea what the problem is. The host itself is also on the latest stable kernel and was rebooted previously (but I didn't have the time to do a memory check). The exact same configuration is used on the 4 other VMs with no issues at all.

However, since I want to switch to virtio-fs anyway, I hope this error will be gone for good, which is why I'm not going to dig deeper.