
Bug 910058

Summary: x11-drivers/nvidia-drivers: No working tty/framebuffer on a machine with two GPUs (any other DRM driver loaded)
Product: Gentoo Linux
Reporter: zurabid2016
Component: Current packages
Assignee: Ionen Wolkens <ionen>
Status: RESOLVED FIXED
Severity: normal
CC: soap
Priority: Normal
Version: unspecified
Hardware: All
OS: Linux
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=216303
https://github.com/NVIDIA/open-gpu-kernel-modules/issues/341
Whiteboard:
Package list:
Runtime testing required: ---

Description zurabid2016 2023-07-08 12:08:01 UTC
This has already been brought up both on GitHub (https://github.com/NVIDIA/open-gpu-kernel-modules/issues/341) and kernel bugzilla (https://bugzilla.kernel.org/show_bug.cgi?id=216303).
**The problem is** that the current version of the ebuild (https://gitweb.gentoo.org/repo/gentoo.git/plain/x11-drivers/nvidia-drivers/nvidia-drivers-535.54.03.ebuild) warns about a potentially non-working TTY, but the cause is not SimpleFB (which, by the way, has been working exceptionally well for me for a few months; it should be preferred over {efi,vesa}fb), but rather any **other DRM module being loaded**.

Current versions of the NVIDIA drivers rely on other fbdev drivers (efifb, vesafb, simplefb) to provide framebuffer/TTY support. A quote (see https://bugzilla.kernel.org/show_bug.cgi?id=216303#c5):

> As mentioned. It is the Nvidia driver fault because it should register all the
> needed interfaces. This includes not only a DRI device for KMS/DRM but also a
> fbdev, if they want to support fbcon and virtual terminals.
> 
> That's what all other DRM drivers do, no other driver AFAICT was relying on a
> different fbdev driver for this. So the real fix would be for the Nvidia
> driver to register an emulated fbdev as mentioned.
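
To see which fbdev driver (if any) is currently backing the console, one can look at /proc/fb, which lists one registered framebuffer per line as an index followed by the driver's identifier. The following is a minimal sketch; the function names are hypothetical, and the sample identifiers in the comments are only examples of what a system might report.

```python
from pathlib import Path


def parse_proc_fb(text: str) -> dict[int, str]:
    """Map framebuffer index to driver identifier from /proc/fb-style text."""
    devices = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line looks like "0 EFI VGA": index, a space, then the identifier.
        index, _, name = line.partition(" ")
        devices[int(index)] = name
    return devices


def registered_framebuffers(path: str = "/proc/fb") -> dict[int, str]:
    """Read the given file (default /proc/fb); return {} if it cannot be read."""
    try:
        return parse_proc_fb(Path(path).read_text())
    except OSError:
        return {}


if __name__ == "__main__":
    framebuffers = registered_framebuffers()
    if not framebuffers:
        print("no fbdev device registered, so fbcon has nothing to draw on")
    for index, name in sorted(framebuffers.items()):
        print(f"fb{index}: {name}")
```

An empty result on an affected machine would be consistent with the firmware framebuffer having been removed while no other driver registered an fbdev for that GPU.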

Here is a full explanation from an NVIDIA engineer (https://bugzilla.kernel.org/show_bug.cgi?id=216303#c28):

> I've been looking at this from the NVIDIA side.
> 
> If I understand the regression commit ee7a69aa3 correctly, the original
> problem it solves is a race between sysfb_init registering a platform device
> for the system boot console, and a drm driver calling into
> do_remove_conflicting_framebuffers: if sysfb_init is called before drm
> initializes, then remove_conflicting_framebuffers will remove the platform
> device sysfb adds. But if drm initializes before sysfb_init is called, then
> there will be no platform device to remove and then sysfb will still add one
> later. Is that a correct understanding of the motivation behind the new call
> to sysfb_disable in this change?
> 
> To recap this bug, the problem here is that there are two graphics devices in
> the system, one of which is displaying a framebuffer console that was set up
> by the boot firmware. In this bug report, the boot console is on an NVIDIA
> device and the other device is an Intel integrated GPU, but we've also seen
> this problem on server or workstation systems that have an ASPEED device
> alongside an NVIDIA one. In this situation, the following sequence happens:
> 
>  1. EFI firmware sets a mode on the NVIDIA GPU and uses it for the boot
>    console.
>  2. The kernel starts, reads the EFI boot payload, and loads efifb.
>  3. The kernel calls sysfb_init, which creates a platform device for the efifb
>     console on the NVIDIA GPU.
>  4. The kernel detects the Intel GPU and loads i915, which registers its own
>     framebuffer. Since this is not the boot device,
>     remove_conflicting_framebuffers does not find the efifb framebuffer as
>     conflicting and does not remove it.
> 
> Commit ee7a69aa3 adds a new step 3.5:
> 
>  3.5. Call sysfb_disable, which removes the efifb framebuffer console on the
>    NVIDIA GPU even though it does not conflict with the i915 framebuffer.
> 
> This is what breaks the console on these configurations, regardless of whether
> or not the NVIDIA driver is installed or loaded.
> 
> I'm looking at making the NVIDIA driver install its own framebuffer console in
> order to work around this problem, but that will take a little while to
> develop and get it into production. In the meantime, would it make sense to
> make sysfb_disable a little smarter about which device the boot framebuffer is
> on and only disable itself if a framebuffer that actually conflicts with it is
> being enabled? This would also help affected users who choose not to install
> the NVIDIA driver at all.
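
The sequence described in the quote can be illustrated with a toy model. This is only a sketch of the reasoning, not kernel code: `ToySystem`, `simulate` and the `always_disable_sysfb` flag are hypothetical names, and the model assumes a firmware console on the NVIDIA GPU followed by i915 loading for the Intel GPU.

```python
from dataclasses import dataclass, field


@dataclass
class ToySystem:
    """Toy model of which GPU has a framebuffer console (not kernel code)."""
    boot_gpu: str                                  # GPU the firmware set a mode on
    consoles: dict = field(default_factory=dict)   # gpu -> framebuffer provider

    def sysfb_init(self) -> None:
        # Steps 1-3: the firmware framebuffer is registered on the boot GPU.
        self.consoles[self.boot_gpu] = "efifb"

    def load_drm_driver(self, gpu: str, driver: str, always_disable_sysfb: bool) -> None:
        # Step 4: a DRM driver loads and registers its own framebuffer.
        conflicts = gpu == self.boot_gpu
        # Step 3.5 (modelled by always_disable_sysfb=True): the firmware
        # framebuffer is removed even when the new driver does not conflict.
        if always_disable_sysfb or conflicts:
            if self.consoles.get(self.boot_gpu) == "efifb":
                del self.consoles[self.boot_gpu]
        self.consoles[gpu] = driver


def simulate(always_disable_sysfb: bool) -> dict:
    system = ToySystem(boot_gpu="nvidia")
    system.sysfb_init()
    system.load_drm_driver("intel", "i915", always_disable_sysfb)
    return system.consoles


if __name__ == "__main__":
    print("conflict-only removal:", simulate(always_disable_sysfb=False))
    print("unconditional removal:", simulate(always_disable_sysfb=True))
```

With conflict-only removal the NVIDIA GPU keeps its efifb console; with unconditional removal it is left without one, which matches the breakage described above.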

I can confirm that with the **i915 driver blacklisted** on my dual-GPU desktop (NVIDIA + Intel), the TTY/framebuffer/console (whatever it's called) works perfectly. I think this should be reflected in the ebuild by removing the part about SimpleFB and either checking somehow for any other DRM driver being enabled or showing the warning unconditionally.
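
For reference, a rough way to see which DRM drivers are actually bound on a running system is to follow the `device/driver` symlinks under /sys/class/drm. This is a minimal sketch under the assumption of the usual sysfs layout; `drm_drivers` and `other_drm_drivers` are hypothetical helper names, and this is not what the ebuild does.

```python
import os
import re
from pathlib import Path

# Match card0, card1, ... but skip connector entries such as card0-HDMI-A-1.
CARD_RE = re.compile(r"^card\d+$")


def drm_drivers(sysfs_drm: str = "/sys/class/drm") -> dict[str, str]:
    """Map each DRM card to the name of the kernel driver bound to it."""
    root = Path(sysfs_drm)
    if not root.is_dir():
        return {}
    found = {}
    for entry in sorted(root.iterdir()):
        if not CARD_RE.match(entry.name):
            continue
        driver_link = entry / "device" / "driver"
        if driver_link.exists():
            # "driver" is a symlink; the last component of its target is the name.
            found[entry.name] = os.path.basename(os.path.realpath(driver_link))
    return found


def other_drm_drivers(drivers: dict[str, str]) -> set[str]:
    """Return bound drivers that are not NVIDIA's (assumed to start with 'nvidia')."""
    return {name for name in drivers.values() if not name.startswith("nvidia")}


if __name__ == "__main__":
    bound = drm_drivers()
    for card, driver in bound.items():
        print(f"{card}: {driver}")
    print("other DRM drivers:", sorted(other_drm_drivers(bound)) or "none")
```

Note that a driver being bound does not prove that its device drives any display, so a check like this can only hint at the situation.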
Comment 1 Ionen Wolkens gentoo-dev 2023-07-08 12:37:09 UTC
Even if the ebuild checked for DRM devices, there's no guarantee they are even being used, and it'd just be noise (similar deal for FB, but the scope is more limited).

And I wouldn't want to be noisy about this for everyone on every emerge. At best I could add something to the README.gentoo, not that I have any dual-GPU setups myself to test what I'd be talking about.

Things like this feel better suited for the wiki written by people affected by the issue. README.gentoo already points there.

My preferred solution to the ebuild would be to drop all the current FB warnings, and refer users to the wiki if they have framebuffer issues more clearly in README.gentoo. But that's assuming someone updates it.

wrt simplefb, I did have issues with it myself last I tried and know a few others who did as well -- but heard it was fine for others, so I remained vague in the message ("may", "feel free to ignore"). May need to retest sometime though.
Comment 2 Larry the Git Cow gentoo-dev 2023-07-08 14:17:26 UTC
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=f7ea967d210c66a14adf12c5c6b177a76aba6b1b

commit f7ea967d210c66a14adf12c5c6b177a76aba6b1b
Author:     Ionen Wolkens <ionen@gentoo.org>
AuthorDate: 2023-07-08 14:02:56 +0000
Commit:     Ionen Wolkens <ionen@gentoo.org>
CommitDate: 2023-07-08 14:16:14 +0000

    x11-drivers/nvidia-drivers: drop additional framebuffer/drm warnings
    
    Warnings were mostly added to help the transition to newer
    kernels for unsuspecting users. May have helped in some cases,
    not in others.
    
    But with 6.1.x being stable for a while, there's little reason
    to keep this wall of warnings *here* and try to keep it accurate
    and updated -- especially when we can't tell what's really in-use
    or what the user needs (this was just vague suggestions).
    
    For initial setting up issues, it sounds better to refer to the
    Wiki. So if anyone has anything to share with their experience
    with FB (or other issues) feel free to edit it and improve it so
    it can help others.
    
    Also drop the "builtin" nouveau check that was part of this block.
    Module is already blacklisted and, if users went out of their way
    to make it builtin, then let's assume they know what they're doing.
    
    Closing #910058 but rather than a fix it's more of a dissociation.
    
    Closes: https://bugs.gentoo.org/910058
    Signed-off-by: Ionen Wolkens <ionen@gentoo.org>

 .../nvidia-drivers/nvidia-drivers-390.157.ebuild   | 68 +---------------------
 .../nvidia-drivers-470.199.02.ebuild               | 68 +---------------------
 .../nvidia-drivers-525.125.06.ebuild               | 68 +---------------------
 .../nvidia-drivers/nvidia-drivers-525.47.27.ebuild | 68 +---------------------
 .../nvidia-drivers/nvidia-drivers-535.54.03.ebuild | 68 +---------------------
 5 files changed, 15 insertions(+), 325 deletions(-)
Comment 3 zurabid2016 2023-07-21 09:05:43 UTC
So, I reworked the wiki page a bit (https://wiki.gentoo.org/wiki/NVIDIA/nvidia-drivers), including some more recommendations about the required kernel configuration. This very issue is fixed by kernel commit 5ae3716cfdcd286268133867f67d0803847acefc ("video/aperture: Only remove sysfb on the default vga pci device") [see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5ae3716cfdcd286268133867f67d0803847acefc], which should be present in 6.5 kernels and may be backported to linux-stable (it isn't yet).
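
A quick way to check whether the running kernel is at least 6.5 is sketched below. `kernel_at_least` is a hypothetical helper, and a version comparison is only a heuristic: it cannot tell whether a given stable or distribution kernel carries a backport of the commit.

```python
import platform
import re


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Compare the leading 'major.minor' of a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


if __name__ == "__main__":
    release = platform.release()
    if kernel_at_least(release, 6, 5):
        print(f"{release}: 6.5 or newer, the fix should be present")
    else:
        print(f"{release}: older than 6.5, the fix is present only if backported")
```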
Comment 4 Ionen Wolkens gentoo-dev 2023-07-21 09:14:49 UTC
Many thanks for your work on the wiki.

And yeah, I've seen the reply on the GitHub issue wrt the commit. Haven't tested it, but glad to hear there should be less to worry about.