Summary: | sys-kernel/gentoo-sources-3.10.7 in drm_kms_helper / nouveau - BUG: unable to handle kernel paging request at NNNNNNNNNNNNNNNN | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Fabian Köster <gentoo> |
Component: | [OLD] Core system | Assignee: | Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel> |
Status: | RESOLVED UPSTREAM | ||
Severity: | normal | ||
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | AMD64 | ||
OS: | Linux | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: | screenshot 1, screenshot 2, dmesg output |
Description
Fabian Köster
2013-08-27 15:52:47 UTC
Created attachment 357176 [details]
screenshot 1
Created attachment 357178 [details]
screenshot 2
I will try =sys-kernel/gentoo-sources-3.10.9 and see if the problem is present there.

(In reply to Fabian Köster from comment #3)
> I will try =sys-kernel/gentoo-sources-3.10.9 and see if the problem is
> present there.

Though the changelogs for .8 and .9 do not seem to include a fix for this.

Thank you for your report, will try to look at this soon.

Also happens with sys-kernel/gentoo-sources-3.10.9.

I will now switch back to sys-kernel/gentoo-sources-3.9.11-r1 and see if the problem is really not present there (and this bug is a regression).

(In reply to Fabian Köster from comment #6)
> Also happens with sys-kernel/gentoo-sources-3.10.9.
>
> I now switch back to sys-kernel/gentoo-sources-3.9.11-r1 and see if the
> problem is really not present there (and this bug a regression).

The problem also occurs with kernel sys-kernel/gentoo-sources-3.9.11-r1, but if I disable the nvidia graphics (using just the internal intel graphics) the issue seems to be gone.

TL;DR: What was the last working kernel version?

Let's see what calls those debug functions...

drivers/gpu/drm/nouveau/core/engine/graph/nvc0.c:

    if (!nv_wait(priv, 0x409800, 0x80000000, 0x80000000)) {
            nv_error(priv, "HUB_INIT timed out\n");
            nvc0_graph_ctxctl_debug(priv);
            return -EBUSY;
    }

Ah, some kind of HUB_INIT timed out. Let's see, 0x409800 is PGRAPH.CTXCTL.CC_SCRATCH[0] for NVC0.

https://github.com/envytools/envytools/blob/master/rnndb/nvc0_pgraph.xml#L693

That doesn't say much on its own, let's see what CC_SCRATCH is used for.

http://cgit.freedesktop.org/nouveau/linux-2.6/plain/drivers/gpu/drm/nouveau/core/engine/graph/fuc/gpc.fuc
http://cgit.freedesktop.org/nouveau/linux-2.6/plain/drivers/gpu/drm/nouveau/core/engine/graph/fuc/hup.fuc

GPC stands for Graphics Processing Cluster. I don't know what FUC stands for, but it appears to be the microcode language (as you can deduce by the extension) that gets loaded into the GPC.
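For readers unfamiliar with this kind of check: the quoted call reads like a masked register poll with a timeout. The sketch below only models that idea in Python. It assumes the arguments of nv_wait are (object, register address, mask, expected value), which is how the call above reads; `wait_eq`, `HUB_INIT_DONE`, the two fake register readers and the timeout values are hypothetical names and numbers chosen for illustration, not taken from the driver.

```python
import time

def wait_eq(read_reg, addr, mask, value, timeout_s=1.0, poll_s=0.001):
    """Poll read_reg(addr) until (reg & mask) == value, or give up on timeout.

    Returns True if the condition was met, False if the timeout expired.
    The timeout is an arbitrary placeholder, not the driver's real value.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if (read_reg(addr) & mask) == value:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)

CC_SCRATCH0 = 0x409800      # register address from the quoted snippet
HUB_INIT_DONE = 0x80000000  # hypothetical name for the bit being waited on

def hub_never_ready(addr):
    # Fake register: the top bit never gets set, as in the failing case.
    return 0x00000000

def hub_ready(addr):
    # Fake register: the top bit is set, other bits are ignored by the mask.
    return 0x80000001

stuck = wait_eq(hub_never_ready, CC_SCRATCH0, HUB_INIT_DONE, HUB_INIT_DONE, timeout_s=0.05)
ok = wait_eq(hub_ready, CC_SCRATCH0, HUB_INIT_DONE, HUB_INIT_DONE, timeout_s=0.05)
print("stuck case met condition:", stuck)
print("ready case met condition:", ok)
```

Under that reading, "HUB_INIT timed out" would mean the top bit of CC_SCRATCH[0] never became set within the timeout, i.e. the HUB microcode never signalled that it had finished initialising.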
Actually, if we go back to nvc0.c we see comments before the wait that read:

    /* load HUB microcode */
    /* load GPC microcode */
    /* start HUB ucode running, it'll init the GPCs */

So, the HUB is something that will init the GPCs; so it is a hub of GPCs. Since the code around there wasn't really touched in 2012 the problem does not lie with the loading code; so, the problem more likely lies with the microcode.

    # git blame drivers/gpu/drm/nouveau/core/engine/graph/fuc/gpcnvc0.fuc | grep 2013
    3f196a04 drivers/gpu/drm/nouveau/core/engine/graph/fuc/gpcnvc0.fuc (Ben Skeggs 2013-03-30 21:56:26 +1000 90) .b8 0xd7 0 0 0
    3f196a04 drivers/gpu/drm/nouveau/core/engine/graph/fuc/gpcnvc0.fuc (Ben Skeggs 2013-03-30 21:56:26 +1000 91) .b16 #nvd9_gpc_mmio_head
    3f196a04 drivers/gpu/drm/nouveau/core/engine/graph/fuc/gpcnvc0.fuc (Ben Skeggs 2013-03-30 21:56:26 +1000 92) .b16 #nvd9_gpc_mmio_tail
    3f196a04 drivers/gpu/drm/nouveau/core/engine/graph/fuc/gpcnvc0.fuc (Ben Skeggs 2013-03-30 21:56:26 +1000 93) .b16 #nvd9_tpc_mmio_head
    3f196a04 drivers/gpu/drm/nouveau/core/engine/graph/fuc/gpcnvc0.fuc (Ben Skeggs 2013-03-30 21:56:26 +1000 94) .b16 #nvd9_tpc_mmio_tail

    # git blame drivers/gpu/drm/nouveau/core/engine/graph/fuc/hubnvc0.fuc | grep 2013
    3f196a04 drivers/gpu/drm/nouveau/core/engine/graph/fuc/hubnvc0.fuc (Ben Skeggs 2013-03-30 21:56:26 +1000 65) .b8 0xd7 0 0 0
    3f196a04 drivers/gpu/drm/nouveau/core/engine/graph/fuc/hubnvc0.fuc (Ben Skeggs 2013-03-30 21:56:26 +1000 66) .b16 #nvd9_hub_mmio_head
    3f196a04 drivers/gpu/drm/nouveau/core/engine/graph/fuc/hubnvc0.fuc (Ben Skeggs 2013-03-30 21:56:26 +1000 67) .b16 #nvd9_hub_mmio_tail

Oh, look, upstream those files were changed on 30 March 2013; commit log:

    commit 3f196a045e2f7e0b7c5302d359a9772c1567d55b
    Author: Ben Skeggs <bskeggs@redhat.com>
    Date:   Sat Mar 30 21:56:26 2013 +1000

        drm/nve0: magic up some support for GF117

        Seen in the wild, don't have the hardware but this hacks things up to treat it the same as GF119 for now.
        Should be relatively safe, I'd be very surprised if anything major changed outside of PGRAPH. PGRAPH (3D etc) is disabled by default however until it's confirmed working.

        Signed-off-by: Ben Skeggs <bskeggs@redhat.com>

Now let's see which version this commit was introduced in: 3.10-rc1.

    # git tag --contains 3f196a045e2f7e0b7c5302d359a9772c1567d55b | tr '\n' ' '
    v3.10 v3.10-rc1 v3.10-rc2 v3.10-rc3 v3.10-rc4 v3.10-rc5 v3.10-rc6 v3.10-rc7 v3.10.1 v3.10.10 v3.10.2 v3.10.3 v3.10.4 v3.10.5 v3.10.6 v3.10.7 v3.10.8 v3.10.9 v3.11 v3.11-rc1 v3.11-rc2 v3.11-rc3 v3.11-rc4 v3.11-rc5 v3.11-rc6 v3.11-rc7

You have a GF119 and if you look that up in drivers/gpu/drm/nouveau/core/engine/device/nvc0.c you see it is 0xd9, whereas the above microcode seems to deal with 0xd7; so, we're not there yet, let's look at the commit to see whether it broke functionality for you.

http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=3f196a045e2f7e0b7c5302d359a9772c1567d55b

Bummer, no change in logic for that! That's not what we are looking for...
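A side note on reading `git tag --contains` output: the list is sorted as plain text, so v3.10.10 comes before v3.10.2 and release candidates come after their release. A small helper can pick out the earliest tag by version order instead. This is only a sketch; `tag_key` and `earliest_tag` are hypothetical names, and the pattern assumes tags of the form vX.Y, vX.Y.Z or vX.Y-rcN as they appear in the output above.

```python
import re

TAG_RE = re.compile(r"v(\d+)\.(\d+)(?:\.(\d+))?(?:-rc(\d+))?")

def tag_key(tag):
    """Sort key that orders kernel tags by version, with -rcN before the release."""
    m = TAG_RE.fullmatch(tag)
    if m is None:
        raise ValueError(f"unrecognised tag: {tag}")
    major, minor, patch, rc = m.groups()
    if rc is not None:
        # Release candidates sort before vX.Y and every vX.Y.Z.
        return (int(major), int(minor), 0, int(rc))
    return (int(major), int(minor), 1, int(patch or 0))

def earliest_tag(tags):
    """Return the tag with the lowest version among the given tags."""
    return min(tags, key=tag_key)

# Output of the `git tag --contains 3f196a04...` command quoted above.
tags = (
    "v3.10 v3.10-rc1 v3.10-rc2 v3.10-rc3 v3.10-rc4 v3.10-rc5 v3.10-rc6 "
    "v3.10-rc7 v3.10.1 v3.10.10 v3.10.2 v3.10.3 v3.10.4 v3.10.5 v3.10.6 "
    "v3.10.7 v3.10.8 v3.10.9 v3.11 v3.11-rc1 v3.11-rc2 v3.11-rc3 v3.11-rc4 "
    "v3.11-rc5 v3.11-rc6 v3.11-rc7"
).split()

print(earliest_tag(tags))  # v3.10-rc1
```

The result matches the 3.10-rc1 read off by eye above; the same helper works on longer lists where the earliest tag is buried in the middle.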
:( So, back to `git log drivers/gpu/drm/nouveau/core/engine/graph/nvc0.c`; when I look there, this commit catches my eye because it mentions GF119 in graph FUC:

    commit 902530693ef38f3bb007efae594e54443d84fa56
    Author: Ben Skeggs <bskeggs@redhat.com>
    Date:   Thu Dec 20 12:50:52 2012 +1000

        drm/nvc0/graph: fix fuc, and enable acceleration on GF119

        Signed-off-by: Ben Skeggs <bskeggs@redhat.com>

    # git tag --contains 902530693ef38f3bb007efae594e54443d84fa56 | tr '\n' ' '
    v3.10 v3.10-rc1 v3.10-rc2 v3.10-rc3 v3.10-rc4 v3.10-rc5 v3.10-rc6 v3.10-rc7 v3.10.1 v3.10.10 v3.10.2 v3.10.3 v3.10.4 v3.10.5 v3.10.6 v3.10.7 v3.10.8 v3.10.9 v3.11 v3.11-rc1 v3.11-rc2 v3.11-rc3 v3.11-rc4 v3.11-rc5 v3.11-rc6 v3.11-rc7 v3.8 v3.8-rc2 v3.8-rc3 v3.8-rc4 v3.8-rc5 v3.8-rc6 v3.8-rc7 v3.8.1 v3.8.10 v3.8.11 v3.8.12 v3.8.13 v3.8.2 v3.8.3 v3.8.4 v3.8.5 v3.8.6 v3.8.7 v3.8.8 v3.8.9 v3.9 v3.9-rc1 v3.9-rc2 v3.9-rc3 v3.9-rc4 v3.9-rc5 v3.9-rc6 v3.9-rc7 v3.9-rc8 v3.9.1 v3.9.10 v3.9.11 v3.9.2 v3.9.3 v3.9.4 v3.9.5 v3.9.6 v3.9.7 v3.9.8 v3.9.9

Ugh, that's long, this appears to have been introduced in v3.8-rc2; so, now I wonder if your last working kernel was before or after v3.8.

So, could you tell us what your last working kernel was so we can restrict the commit log and know more exactly in which range of time to look?

(In reply to Tom Wijsman (TomWij) from comment #8)

Wow, thanks for your in-depth analysis!

> Ugh, that's long, this appears to have been introduced in v3.8-rc2; so, now
> I wonder if your last working kernel was before or after v3.8.
>
> So, could you tell us what your last working kernel was so we can restrict
> the commit log and know more exactly in which range of time to look?

The problem did at least not occur when I was using kernel 3.8, but I assume it was just not triggered before my switch to GNOME 3 (previously I was using Xfce and just stable Xorg packages).

When I am back at home on Friday I can do more testing of this issue, e.g. reverting commit 902530693ef38f3bb007efae594e54443d84fa56 and seeing if this fixes my problem. Or do you maybe have other ideas?

(In reply to Fabian Köster from comment #9)
> (In reply to Tom Wijsman (TomWij) from comment #8)
>
> When I am back at home on Friday I can do more testing of this issue, e.g.
> reverting commit 902530693ef38f3bb007efae594e54443d84fa56 and see if this
> fixes my problem. Or do you maybe have other ideas?

I built a kernel from commit 5ddf4d4a543dd3303b20d7e9a4b3549589c5f095 (the one before the commit mentioned above) and I will test it now. I am not sure, though, whether this bug could really have been hidden for such a long time...

(In reply to Fabian Köster from comment #10)
> I built a Kernel from commit 5ddf4d4a543dd3303b20d7e9a4b3549589c5f095 (the
> one before the commit mentioned above) and I will test it now.

The bug now also happened with commit 5ddf4d4a543dd3303b20d7e9a4b3549589c5f095, so commit 902530693ef38f3bb007efae594e54443d84fa56 seems not to be the root of the problem. Any other ideas?

Created attachment 358104 [details]
dmesg output
I noticed I still have access to the machine when this issue happens, so I am attaching the relevant output of dmesg.
I just found two bug reports in Red Hat's tracker:

https://bugzilla.redhat.com/show_bug.cgi?id=917202
https://bugzilla.redhat.com/show_bug.cgi?id=994291

Maybe they are related?

(In reply to Fabian Köster from comment #12)
> Created attachment 358104 [details]
> dmesg output
>
> I noticed I have still access to the machine when this issue happens so I
> attach the relevant output of dmesg.

Hmm, it reads this:

> [ 5417.069919] Code: Bad RIP value.
> [ 5417.069947] RIP [< (null)>] (null)
> [ 5417.069980] RSP <ffff88022d825b60>
> [ 5417.070001] CR2: 0000000000000000
> [ 5417.070029] ---[ end trace 13059c79dd277520 ]---
> [ 5417.070056] Fixing recursive fault but reboot is needed!

RIP is null; well, that is kind of odd and means we don't have an instruction to refer to for the point at which it went wrong. We do have a stack trace however; but because RIP is null, I wonder if the integrity of that stack trace is still alright. Given that it then says that this is a recursive fault, I kind of have a doubt about that.

Some questions:

1. Is this something that floods the dmesg?
2. Could you provide me the earliest BUG / OOPS / TRACE output it gives?
3. If you reproduce this, do the function names in the trace stay the same?

Thank you in advance. As for more ideas, I'll look at what more we can do soon...

(In reply to Tom Wijsman (TomWij) from comment #14)
>
> Some questions:
>
> 1. Is this something that floods the dmesg?

No, from what I have seen so far it is just a single message.

> 2. Could you provide me the earliest BUG / OOPS / TRACE output it gives?

If I reproduce it again, I will check if there is something.

> 3. If you reproduce this, do the function names in the trace stay the same?

No, I am pretty sure I have already seen different function names in the trace.

> RIP is null; well, that is kind of odd and means we don't have an
> instruction to refer to at which point it went wrong. We do have a stack
> trace however; but because RIP is null, I wonder if the integrity of that
> stack trace is still alright. Given that it then says that this is a
> recursive fault I kind of have a doubt about that.

Because it is so odd, and because not many people seem to have this issue, I fear that this may be caused by a hardware defect in my discrete graphics unit. Is this possible?

This bug seems to be gone, probably since upgrading the kernel to the current 3.12 release candidates. This new major kernel version includes many changes for Optimus hardware.
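For anyone triaging a similar report, the oops tail quoted earlier in the thread can be summarised mechanically. The sketch below pulls out the three facts discussed there: whether RIP is null, the CR2 value (the faulting address on x86), and whether the kernel reported a recursive fault. `summarize_oops` is a hypothetical helper, and the patterns only assume the line shapes seen in the quoted excerpt; other kernels may format these lines differently.

```python
import re

# The oops tail as quoted in the thread.
OOPS_TAIL = """\
[ 5417.069919] Code: Bad RIP value.
[ 5417.069947] RIP [< (null)>] (null)
[ 5417.069980] RSP <ffff88022d825b60>
[ 5417.070001] CR2: 0000000000000000
[ 5417.070029] ---[ end trace 13059c79dd277520 ]---
[ 5417.070056] Fixing recursive fault but reboot is needed!
"""

def summarize_oops(text):
    """Extract a few fields from an oops tail shaped like the excerpt above."""
    rip = re.search(r"RIP\s+\[<\s*([^>]*?)\s*>\]", text)
    cr2 = re.search(r"CR2:\s*([0-9a-fA-F]+)", text)
    return {
        "rip_is_null": rip is not None and rip.group(1) == "(null)",
        "cr2": int(cr2.group(1), 16) if cr2 else None,
        "recursive_fault": "Fixing recursive fault" in text,
    }

summary = summarize_oops(OOPS_TAIL)
print(summary)
```

A null RIP together with CR2 at address 0 is often what a call through a NULL function pointer looks like, though that is only one possible reading and the recursive fault makes the rest of the trace harder to trust, as noted above.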