752351 – sys-kernel/genkernel-4.1.2-r3 building kernel with broken lvm

Summary: sys-kernel/genkernel-4.1.2-r3 building kernel with broken lvm

Status:	RESOLVED INVALID

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Gentoo Genkernel Maintainers

Reported:	2020-11-02 14:12 UTC by r7l
Modified:	2020-11-17 13:58 UTC
CC List:	2 users

Attachments
Genkernel 4.0.10 log (genkernel-4.0.10.log.gz, 417.98 KB, text/plain) 2020-11-04 13:37 UTC, r7l
Genkernel 4.1.2-r3 log (genkernel-4.1.2-r3.log.gz, 449.43 KB, application/gzip) 2020-11-04 13:39 UTC, r7l
init.log (init.log.gz, 1.37 KB, application/gzip) 2020-11-17 11:19 UTC, r7l
udev.log (udevd.log.gz, 25.99 KB, application/gzip) 2020-11-17 11:20 UTC, r7l

Description r7l 2020-11-02 14:12:08 UTC

On a system with LVM on top of dm-crypt and mdadm RAID, using sys-kernel/genkernel-4.1.2-r3, I am not able to create a kernel and initramfs with properly working LVM.

When starting the system, the kernel boots as expected, the RAID arrays are assembled and I am able to unlock the partition. It already shows a few errors early on about missing kernel modules, but once it reaches LVM it gets stuck completely with lots of messages like these:

WARNING: Device /dev/sda not initialized in udev database even after waiting 10000000 microseconds.
WARNING: Device /dev/md0 not initialized in udev database even after waiting 10000000 microseconds.

The same kernel .config and genkernel.conf (except for etc-update changes) work as expected after downgrading to 4.0.10. The kernel I am using is 5.4.72.

Reproducible: Always

Comment 1 Jonas Stein gentoo-dev

2020-11-02 18:30:57 UTC

Could you please try to collect more information with the help of our IRC channel #gentoo?
Thanks.

Comment 2 Thomas Deutschmann (RETIRED) gentoo-dev

2020-11-02 19:13:33 UTC

Are you using SELinux?

Comment 3 r7l 2020-11-03 10:32:22 UTC

What kind of information should be collected? I am currently rebuilding the kernel using sys-kernel/genkernel-4.1.2-r3 in order to be able to provide a log file from it. I will also rebuild it with 4.0.10 later on and provide a log file from that.

But it will take some time as the system is really slow (just a Celeron) and I can't work on it too much as it is still in production use during the daytime.

I am not sure how to obtain anything from booting other than rc.log or dmesg, but I might check in on IRC.

It's not running SELinux. It's running a pretty simple OpenRC-based Gentoo with RAID and dm-crypt. I am also using genkernel's SSH remote unlock feature, which also still works as expected.

Comment 4 Thomas Deutschmann (RETIRED) gentoo-dev

2020-11-03 23:19:17 UTC

In Genkernel >=4.1 we are now using UDEV in the initramfs to initialize devices. UDEV will create /run/udev/data. This data must be preserved so that UDEV from the real system can continue to use the devices.

Whenever this is not possible, for example there is currently a known problem with SELinux where /run becomes unusable due to missing labels, you will see problems like this.

So start with debugging the mount/udev services of the real system and verify that /run/udev/data is preserved, available and gets loaded on the real system.
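
A minimal sketch of such a check, meant to be run on the booted system. The function name `udev_db_status` is hypothetical; only the path /run/udev/data comes from the comment above. It merely reports whether the directory is missing, empty or populated and says nothing about whether the entries are valid.

```shell
#!/bin/sh
# Report the state of a udev database directory.
# $1 is optional and defaults to /run/udev/data; passing another
# directory makes it possible to try the function on a copy.
udev_db_status() {
    dir=${1:-/run/udev/data}
    if [ ! -d "$dir" ]; then
        echo "missing"
    elif [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
        # Directory exists but holds no entries at all.
        echo "empty"
    else
        echo "populated"
    fi
}

# Usage (hypothetical): udev_db_status
```

If this reports "missing" or "empty" on the real system, that would suggest the /run set up by the initramfs was replaced or hidden somewhere during boot.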

Comment 5 r7l 2020-11-04 13:37:36 UTC

Created attachment 669992
Genkernel 4.0.10 log

This is the log of the working 4.0.10 version.

Comment 6 r7l 2020-11-04 13:39:04 UTC

Created attachment 669995
Genkernel 4.1.2-r3 log

This is the log of the broken 4.1.2-r3 version.

Comment 7 r7l 2020-11-04 13:49:30 UTC

I've just added the logs of both versions. Both of them are running with pretty much the same configuration and I've activated the cleanup options so that nothing gets reused. I hope it does what it says.

Other than that, I guess your last comment points to the issue. While I'm not using SELinux, all the errors I see are related to either /run or udev.

As said, I can't really check for errors right now as I've just finished rebuilding the kernel with 4.0.10, and I would need to reinstall 4.1.2-r3 and redo the kernel there. From what I can recall, there is an error from OpenRC complaining about something in /run, and another error about a single service not being able to create a PID file in /run while all other services seem to be able to. Other than that it's mostly udev issues, as said initially.

It will take a bit as the kernel compiles for hours and I need the system to run right now.

Is there any easy way to debug the mount? I've never done any debugging of the boot process. The system is a real system btw.

Comment 8 Thomas Deutschmann (RETIRED) gentoo-dev

2020-11-05 13:27:27 UTC

Start with checking /etc/fstab and also compare enabled services. If your installation matches a stage3 system (=strictly following *official* handbook), you shouldn't see such an issue.
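
One thing that could be checked in /etc/fstab is whether any entry mounts something on /run, since such a mount would hide the data that udev created in the initramfs. A rough sketch, assuming a conventionally formatted fstab; `find_run_mounts` is a made-up helper name:

```shell
#!/bin/sh
# Print every active fstab line whose mount point (second field) is /run.
# Comment lines and blank lines are skipped.
# $1 is optional and defaults to /etc/fstab.
find_run_mounts() {
    awk '$1 !~ /^#/ && NF >= 2 && ($2 == "/run" || $2 == "/run/") { print }' \
        "${1:-/etc/fstab}"
}

# Usage (hypothetical): find_run_mounts
```

No output means no fstab entry targets /run directly. It does not rule out other services or scripts mounting over /run later in the boot.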

Add some debug code to /etc/init.d/bootmisc to list /run content before and after bootmisc service.
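
A rough sketch of what such debug code could look like. `dump_dir` and the log path /run-debug.log are hypothetical, and the sketch assumes the root filesystem is writable at the point where it is called; where exactly to call it inside /etc/init.d/bootmisc depends on the installed OpenRC version.

```shell
#!/bin/sh
# Append a labelled listing of a directory to a log file.
# $1 = label, e.g. "before bootmisc"
# $2 = directory to list (default: /run)
# $3 = log file (default: /run-debug.log, a hypothetical path that is
#      deliberately outside /run so it survives /run being replaced)
dump_dir() {
    label=$1
    dir=${2:-/run}
    log=${3:-/run-debug.log}
    {
        echo "=== $label: $dir ==="
        # Errors from ls (e.g. missing directory) are logged as well.
        ls -la "$dir" 2>&1
    } >> "$log"
}

# Usage (hypothetical):
#   dump_dir "before bootmisc"
#   dump_dir "after bootmisc"
```

Comparing the two listings should show whether entries such as /run/initramfs or /run/udev disappear while the service runs.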

Comment 9 r7l 2020-11-06 23:30:53 UTC

The system has been running for a number of years. I can't really say how long, and I've also migrated it from one piece of hardware to another. So I am not entirely sure, but I am pretty certain I followed the install guide during the initial setup for the most part. I might have tampered with it here and there over the years, though.

I've tried a couple of things and added some debug output to /etc/init.d/bootmisc to list /run.

There is a message right before OpenRC starts, after unlocking the partition and resuming boot:

Something about "udevd still running! Trying to kill it"

The error message i see during boot is:

fopen(/run/openrc/rc.log) failed: No such file or directory

Other than that, it's errors about missing kernel modules (which are there once booted with the older genkernel) and this PID creation error of a single service.

I've added some debugging to /etc/init.d/bootmisc as you said, but I am not sure what to look for. There is something in /run prior to the first error, and it gets called again right after that error with different output.

I'm not sure whether I've found something. At some point I added tmpfs mounts for /run and /tmp to fstab. I think I found that in some forum and it was never an issue. But I am going to remove it and rerun the genkernel build. I will report back once that's done and say whether that's what caused the issues.

Comment 10 Thomas Deutschmann (RETIRED) gentoo-dev

2020-11-07 01:29:58 UTC

> Something about "udevd still running! Trying to kill it"

This sounds like https://gitweb.gentoo.org/proj/genkernel.git/tree/defaults/linuxrc?h=v4.1.2#n1317 which would already indicate some kind of abnormal behavior.

Do you see /run/initramfs with content at all?

Comment 11 r7l 2020-11-07 09:53:59 UTC

It works now, and the reason it failed was that line in /etc/fstab that mounted tmpfs on /run. As said before, I am not sure how long ago I added it. But once it was removed and after rerunning genkernel, it boots fine now.

I am still seeing this warning and yes, it's the exact message you've linked to in git. I'm not sure how to look into this issue as it occurs even before OpenRC or any service is started. It's right after entering "resume-boot" in SSH.

Other than that, I guess this LVM issue is resolved for me. No more errors and warnings, and the system is running fine again.

Thanks a lot for your help!

Comment 12 Thomas Deutschmann (RETIRED) gentoo-dev

2020-11-07 14:01:16 UTC

OK, you should now see /run/initramfs after boot.

Please add "gk.udev.debug=yes udev.children-max=1" to kernel command-line and reboot. Please share /run/initramfs/init.log and /run/initramfs/udevd.log afterwards.
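
To confirm after the reboot that both arguments actually reached the kernel, something like the following could be used. `has_cmdline_arg` is a hypothetical helper; /proc/cmdline is the standard location of the running kernel's command line.

```shell
#!/bin/sh
# Succeed if argument $1 appears as a whole word on the kernel command line.
# $2 is optional and defaults to /proc/cmdline.
has_cmdline_arg() {
    # Put each argument on its own line, then match the full line literally.
    tr ' \t' '\n\n' < "${2:-/proc/cmdline}" | grep -Fxq -- "$1"
}

# Usage (hypothetical):
#   has_cmdline_arg gk.udev.debug=yes && has_cmdline_arg udev.children-max=1 \
#       && echo "debug arguments are active"
```

The match is on the whole argument, so `udev.children-max=10` would not be mistaken for `udev.children-max=1`.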

Comment 13 r7l 2020-11-17 11:19:51 UTC

Created attachment 671827
init.log

Comment 14 r7l 2020-11-17 11:20:14 UTC

Created attachment 671830
udev.log

Comment 15 r7l 2020-11-17 11:25:40 UTC

This last remaining error message seems to be less consistent after fixing the issue with /run. I've done a number of kernel rebuilds and restarts and it appears every now and then, but I wasn't able to make it appear with the debug options added to the kernel parameters. I've not seen it in the debug logs either.

But I'm submitting them anyway as there might be a clue to what is wrong.

Besides the message itself, I can't see any issue. The system boots fine now and works as expected.

Comment 16 Thomas Deutschmann (RETIRED) gentoo-dev

2020-11-17 13:58:43 UTC

Well, these logs are looking good. But this is not surprising given that they belong to a successful run.

Maybe keep these kernel command-line arguments set for a while and see if you can catch the error and report back. Would be interesting to understand why

> udevadm control --exit

fails for you sometimes.

I am closing this issue as INVALID for now because the reported issue was caused by your /etc/fstab overmounting /run. As said, keep posting to this bug in case you get new logs showing an error.

Thanks!