Created attachment 349992
output of lspci -v
Kernels 3.7 and 3.8 crash frequently when using the nouveau driver with the nVidia GeForce7025/nForce630a chipset (see the attachment for detailed PCI information) and applications that use graphics acceleration (e.g. googleearth).
No dmesg output or other logs are available because the system hangs completely and the screen gets garbled into "stripes".
Related information:
Forum entry: http://forums.gentoo.org/viewtopic-t-952224.html
Looks related to:
https://bugzilla.kernel.org/show_bug.cgi?id=50091
https://bugs.freedesktop.org/show_bug.cgi?id=61321
Kernel 3.6.11 is 100% stable on the same system. The NVIDIA binary drivers also work on the same system.
Does this still happen on recent kernels? Please try =sys-kernel/gentoo-sources-3.10.6 (upstream stable kernel) and =sys-kernel/git-sources-3.11_rc5 (upstream development kernel); if you know how to use git, please consider trying http://cgit.freedesktop.org/nouveau/linux-2.6/ too (`git clone http://anongit.freedesktop.org/git/nouveau/linux-2.6`). It seems the upstream bugs did not progress, so I hope my comment woke them up. Thank you very much in advance!
Still happens on gentoo-sources-3.10.7 (stable for amd64) and git-sources-3.11_rc6. I couldn't get a compilable kernel using http://cgit.freedesktop.org/nouveau/linux-2.6/
Still no answer on the upstream bugs. :( I wonder if this is problematic:
[ 9.136306] nouveau [ VBIOS][0000:01:00.0] checking PRAMIN for image...
[ 9.201109] nouveau [ VBIOS][0000:01:00.0] ... checksum invalid
[ 9.201116] nouveau [ VBIOS][0000:01:00.0] checking PROM for image...
[ 9.201142] nouveau [ VBIOS][0000:01:00.0] ... signature not found
[ 9.201145] nouveau [ VBIOS][0000:01:00.0] checking ACPI for image...
[ 9.201149] nouveau [ VBIOS][0000:01:00.0] ... signature not found
[ 9.201152] nouveau [ VBIOS][0000:01:00.0] checking PCIROM for image...
[ 9.202204] nouveau [ VBIOS][0000:01:00.0] ... checksum invalid
Every check reports either "checksum invalid" or "signature not found". It might be a false positive, since it loads the image anyway; but still, checksums are there for a purpose.
Can you try to obtain a dmesg with the REISUB method? http://en.wikipedia.org/wiki/Magic_SysRq_key#Uses
Alternatively, a rescue Linux kernel might be an option.
The only other things I can suggest are to bump the upstream bugs and to try compiling the latest Nouveau kernel again. (If it fails, feel free to provide the error output.)
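Before relying on REISUB, it can help to check that the running kernel permits those SysRq functions at all. The following is a minimal sketch, assuming the bitmask values documented for `kernel.sysrq` in the kernel's SysRq documentation; the helper name `reisub_allowed` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report whether a kernel.sysrq value permits the
# full REISUB sequence. Bit values are assumed from the kernel's SysRq
# documentation:
#   4 = keyboard control (R: unraw)    16 = sync (S)
#  32 = remount read-only (U)          64 = signal processes (E, I)
# 128 = reboot/poweroff (B)
reisub_allowed() {
    value=$1
    needed=$((4 + 16 + 32 + 64 + 128))
    if [ "$value" -eq 1 ]; then
        echo yes    # 1 enables every SysRq function
    elif [ $((value & needed)) -eq "$needed" ]; then
        echo yes    # all bits needed for REISUB are set
    else
        echo no
    fi
}

# Check the running system, if the sysctl file is readable.
if [ -r /proc/sys/kernel/sysrq ]; then
    current=$(cat /proc/sys/kernel/sysrq)
    echo "kernel.sysrq=$current, REISUB allowed: $(reisub_allowed "$current")"
fi
```

If this prints `no`, running `sysctl -w kernel.sysrq=1` as root enables all functions until the next reboot. A hard hang of the machine can still leave the keys unresponsive regardless of this setting.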
The VBIOS error does not appear here (it was a different user in the forum who reported this). This is the output of dmesg | grep nouveau with kernel 3.10.7-r1:
[ 0.338491] nouveau 0000:00:0d.0: setting latency timer to 64
[ 0.339055] nouveau [ DEVICE][0000:00:0d.0] BOOT0 : 0x04c000a2
[ 0.339271] nouveau [ DEVICE][0000:00:0d.0] Chipset: C61 (NV4C)
[ 0.339488] nouveau [ DEVICE][0000:00:0d.0] Family : NV40
[ 0.340203] nouveau [ VBIOS][0000:00:0d.0] checking PRAMIN for image...
[ 0.378954] nouveau [ VBIOS][0000:00:0d.0] ... appears to be valid
[ 0.379181] nouveau [ VBIOS][0000:00:0d.0] using image from PRAMIN
[ 0.379550] nouveau [ VBIOS][0000:00:0d.0] BIT signature found
[ 0.379767] nouveau [ VBIOS][0000:00:0d.0] version 05.61.32.28.01
[ 0.380182] nouveau [ PFB][0000:00:0d.0] RAM type: unknown
[ 0.380381] nouveau [ PFB][0000:00:0d.0] RAM size: 64 MiB
[ 0.380579] nouveau [ PFB][0000:00:0d.0] ZCOMP: 0 tags
[ 0.406150] nouveau [ PTHERM][0000:00:0d.0] FAN control: none / external
[ 0.406355] nouveau [ PTHERM][0000:00:0d.0] fan management: disabled
[ 0.406554] nouveau [ PTHERM][0000:00:0d.0] internal sensor: no
[ 0.427757] nouveau [ DRM] VRAM: 61 MiB
[ 0.427958] nouveau [ DRM] GART: 512 MiB
[ 0.428174] nouveau [ DRM] TMDS table version 1.1
[ 0.428371] nouveau [ DRM] DCB version 3.0
[ 0.428570] nouveau [ DRM] DCB outp 00: 01000310 00000023
[ 0.428769] nouveau [ DRM] DCB outp 01: 00110204 942b0003
[ 0.428968] nouveau [ DRM] DCB conn 00: 0000
[ 0.429191] nouveau [ DRM] DCB conn 01: 1131
[ 0.429407] nouveau [ DRM] DCB conn 02: 0110
[ 0.429624] nouveau [ DRM] DCB conn 03: 0111
[ 0.429841] nouveau [ DRM] DCB conn 04: 0113
[ 0.430182] nouveau [ DRM] Saving VGA fonts
[ 0.468494] nouveau W[ DRM] DCB type 4 not known
[ 0.468694] nouveau W[ DRM] Unknown-1 has no encoders, removing
[ 0.471037] nouveau [ DRM] 1 available performance level(s)
[ 0.471248] nouveau [ DRM] 0: core 425MHz shader 425MHz fanspeed 100%
[ 0.471447] nouveau [ DRM] c:
[ 0.472808] nouveau [ DRM] MM: using M2MF for buffer copies
[ 0.521734] nouveau [ DRM] allocated 1920x1200 fb: 0x9000, bo ffff88011b313800
[ 0.522124] fbcon: nouveaufb (fb0) is primary device
[ 0.593803] nouveau 0000:00:0d.0: fb0: nouveaufb frame buffer device
[ 0.593813] nouveau 0000:00:0d.0: registered panic notifier
[ 0.593824] [drm] Initialized nouveau 1.1.1 20120801 for 0000:00:0d.0 on minor 0
[ 8.320638] nouveau E[ PBUS][0000:00:0d.0] MMIO write of 0x004a0001 FAULT at 0x00b000
I tried the REISUB method, but as soon as the crash happens, even the magic SysRq keys stop working. I'll try to compile the newest nouveau kernel, but it will take some days until I have an opportunity to do so.
I now emerged newest sys-kernel/git-sources-3.12_rc6 and retrieved and compiled nouveau using git clone http://anongit.freedesktop.org/git/nouveau/linux-2.6 which results in a kernel 3.12.0-rc3+ Both crashed the same way (3.12.0-rc3+ had additional graphics anomalies, but I think this is due to non-stable tree).
Given that upstream appears to not respond, the way to proceed on this would be to run a bisect between a good 3.6.x kernel and a broken 3.7.x kernel in order to find the bad commit that caused this; you can find instructions on how to do this at https://wiki.gentoo.org/wiki/Kernel_git-bisect
I would try to bisect but I do not understand how to do this with the gentoo kernel sources (which are not on the git repository given in the wiki). gentoo-sources 3.6.11 are installed and work. First tested gentoo-sources which did not work were 3.7.9 (unfortunately these are not on my hdd any more so I need to look where to get them from). kernel.org does not have gentoo sources, and the old vanilla kernels are also gone. So which git source can I use to bisect between these gentoo kernels? (I hope it is OK to ask this here)
(In reply to tomtom69 from comment #7)
> I would try to bisect but I do not understand how to do this with the gentoo
> kernel sources (which are not on the git repository given in the wiki).
> gentoo-sources 3.6.11 are installed and work.
> First tested gentoo-sources which did not work were 3.7.9 (unfortunately
> these are not on my hdd any more so I need to look where to get them from).
> kernel.org does not have gentoo sources, and the old vanilla kernels are
> also gone.
> So which git source can I use to bisect between these gentoo kernels?
> (I hope it is OK to ask this here)
We don't patch this; so, you can use the linux-stable listed on the Gentoo Wiki.
I now used linux-stable to bisect between 3.6.11 and 3.7.9, which are the versions I know to work and not to work, respectively. Using git bisect as described in the wiki I was able to do 9 steps and decide good/bad after each step. After that, the resulting kernel switches to a black screen when starting nouveau, so I am not able to continue bisecting. The bisect.log up to this step looks as follows:
Bisecting: a merge base must be tested
[a0d271cbfed1dd50278c6b06bead3d00ba0a88f9] Linux 3.6
Bisecting: 6830 revisions left to test after this (roughly 13 steps)
[3f0f0133747368fe0fcf3908f788b53591bff4e0] aoe: use packets that work with the smallest-MTU local interface
Bisecting: 3398 revisions left to test after this (roughly 12 steps)
[11801e9de26992d37cb869cc74f389b6a7677e0e] Merge tag 'soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Bisecting: 1654 revisions left to test after this (roughly 11 steps)
[aecdc33e111b2c447b622e287c6003726daa1426] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Bisecting: 856 revisions left to test after this (roughly 10 steps)
[3a494318b14b1bc0f59d2d6ce84c505c74d82d2a] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Bisecting: 397 revisions left to test after this (roughly 9 steps)
[268d28371cd326be4dfcd7eba5917bf4b9d30c8f] Merge branch 'drm-nouveau-next' of git://anongit.freedesktop.org/git/nouveau/linux-2.6 into drm-next
Bisecting: 229 revisions left to test after this (roughly 8 steps)
[8c0bd3c02d52eff11396e81b4d217ee668e03528] drm/i915: placeholder getparam
Bisecting: 99 revisions left to test after this (roughly 7 steps)
[8ff1f792dd68ad46f3cfe01e01a375b402cf08da] Merge branch 'drm-next-3.7' of git://people.freedesktop.org/~agd5f/linux into drm-next
Bisecting: 49 revisions left to test after this (roughly 6 steps)
[43b1e9c9899ece92c1f68d45ae0d7b98d009f5d0] drm/nouveau/device: return proper error codes if ioremap fails
Bisecting: 24 revisions left to test after this (roughly 5 steps)
[73a60c0d218a292f8ef29d3467726ff26ed366fc] drm/nouveau/gpuobj: remove flags for vm-mappings
As a test I continued with "good" despite the black screen, but this results in a kernel panic during boot, so I think it does not make sense to continue this way.
Usually this needs a fix that came in a later version; in cases like this you would want to know which commit caused the screen to go black, such that you can use it as a patch in your existing bisect. So, figure out the last good commit (not this broken one) and the last bad commit; then do `git bisect reset` followed by `git bisect start BAD GOOD` where you replace the BAD and GOOD by the respective commit SHAs, then try again with `git bisect bad` this time. If you're lucky, you might be able to determine the commit regardless; if you're unlucky, you'll want to focus on finding the commit that causes the black screen instead. If you don't know the last good and bad commit; you can also edit the log file and remove the commits at the end where you went wrong, then run `git bisect reset` and then do `git bisect replay LOG` where LOG is the log file in order to continue from the point before the wrong commits.
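The two recovery routes described above (restarting with explicit endpoints, and replaying an edited log) can be tried safely in a throwaway repository first. This is a minimal sketch: the repository, its file names, the number of commits, and the commit that introduces the "bug" are all invented for illustration, and `git bisect run` stands in for the manual build-and-boot test of a real kernel.

```shell
#!/bin/sh
set -eu

# Throwaway repository: 8 linear commits, the 6th introduces the "bug".
# Everything here is invented for the demonstration.
repo=$(mktemp -d)
logs=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "bisect-demo@example.invalid"
git config user.name "bisect demo"
for i in 1 2 3 4 5 6 7 8; do
    if [ "$i" -ge 6 ]; then echo crashes > state; else echo works > state; fi
    echo "$i" > counter
    git add state counter
    git commit -q -m "commit $i"
done
good=$(git rev-parse HEAD~7)
bad=$(git rev-parse HEAD)

# Route 1: start with explicit endpoints, BAD first and GOOD second.
git bisect start "$bad" "$good" >/dev/null
midpoint=$(git rev-parse HEAD)

# Simulate a wrong verdict on the first commit offered for testing.
git bisect bad >/dev/null
git bisect log > "$logs/full.log"

# Route 2: drop the log lines that mention the misjudged commit,
# then replay the remaining log to get back to the earlier state.
grep -v "$midpoint" "$logs/full.log" > "$logs/fixed.log"
git bisect reset >/dev/null 2>&1
git bisect replay "$logs/fixed.log" >/dev/null
restored=$(git rev-parse HEAD)

# Finish with a test command: exit 0 means good, exit 1 means bad.
git bisect run grep -q works state >/dev/null
first_bad=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad: $first_bad"
git bisect reset >/dev/null 2>&1
```

After the replay, the working tree is back on the commit that was misjudged, so it can be tested again and marked correctly. In the kernel tree the same `git bisect log`, edit, `git bisect replay` cycle applies, with each verdict coming from booting the built kernel.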
I bisected between the last good commit and the last bad commit using
git bisect start 43b1e9c9899ece92c1f68d45ae0d7b98d009f5d0 8ff1f792dd68ad46f3cfe01e01a375b402cf08da
What I found out is that
70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9 is the first bad commit
commit 70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9
Author: Ben Skeggs <bskeggs@redhat.com>
Date: Fri Jul 13 16:49:49 2012 +1000
drm/nv04-nv40/fifo: remove use of nouveau_gpuobj_new_fake()
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
:040000 040000 29a991b723d037cfe7fb7a5dd3a34b8321e489d1 cb531c96db341f2340f62511cb7dc1c2b84cefc5 M drivers
Up to this commit everything works, including programs that make heavy use of graphics. After this commit my screen turns black during boot (the machine does not hang: I am able to blindly switch to a console, log in and reboot). However, up to the bad commit the problem described in this bug report does not occur. It seems that a later commit causes it, but I cannot bisect further than this commit at the moment.
The resulting git bisect log is here:
# bad: [43b1e9c9899ece92c1f68d45ae0d7b98d009f5d0] drm/nouveau/device: return proper error codes if ioremap fails
# good: [8ff1f792dd68ad46f3cfe01e01a375b402cf08da] Merge branch 'drm-next-3.7' of git://people.freedesktop.org/~agd5f/linux into drm-next
git bisect start '43b1e9c9899ece92c1f68d45ae0d7b98d009f5d0' '8ff1f792dd68ad46f3cfe01e01a375b402cf08da'
# good: [979570e02981d4a8fc20b3cc8fd651856c98ee9d] Linux 3.6-rc7
git bisect good 979570e02981d4a8fc20b3cc8fd651856c98ee9d
# good: [97956605d8f2a0d17706cbd338a6cfe8de1920e9] Merge branch 'for-linus-3.6-rc-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
git bisect good 97956605d8f2a0d17706cbd338a6cfe8de1920e9
# good: [70c0f263cc2eb12e02506eb75f0a71490e7dea4d] drm/nouveau/bios: pull in basic vbios subdev, more to come later
git bisect good 70c0f263cc2eb12e02506eb75f0a71490e7dea4d
# bad: [017e6e2955a8b290653aa71bd321609d0d4b1486] drm/nv04/disp: kick all private state out to own header
git bisect bad 017e6e2955a8b290653aa71bd321609d0d4b1486
# good: [0134a97979a0abc1c756b0fe491e074693c2bdf5] drm/nv50-/instmem: allocate vram for kernel objects from end of vram
git bisect good 0134a97979a0abc1c756b0fe491e074693c2bdf5
# bad: [9da226f698c01b268b9172050df4150f269a7613] drm/nvc0/fifo: handle bar1 control regs much like fifo/nve0
git bisect bad 9da226f698c01b268b9172050df4150f269a7613
# good: [af7afbd2e1409168698bde2f2846848b07d05d12] drm/nv04-nv40/instmem: duplicate nv04 code as nv40, remove alternate paths
git bisect good af7afbd2e1409168698bde2f2846848b07d05d12
# bad: [70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9] drm/nv04-nv40/fifo: remove use of nouveau_gpuobj_new_fake()
git bisect bad 70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9
# good: [5787640db6ae722aeadb394d480c7ca21b603e34] drm/nv04-nv40/instmem: remove use of nouveau_gpuobj_new_fake()
git bisect good 5787640db6ae722aeadb394d480c7ca21b603e34
# first bad commit: [70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9] drm/nv04-nv40/fifo: remove use of nouveau_gpuobj_new_fake()
If I understand bisect correctly, the next useful step is to investigate the commits between the last good commit and the kernel known not to work:
git bisect reset
git bisect start
git bisect good 5787640db6ae722aeadb394d480c7ca21b603e34
git bisect bad v3.7.9
I tried this and got some success, see the log below. However, I do not understand what influence AoE could have on nouveau, and I have always had ATA over Ethernet support disabled in the kernel config. I made two further bisect tests, which resulted in the same bad commit:
(1) replay the last 2 bisects and do more tests
(2) to rule out hardware defects, I attached the HDD to a different system (same motherboard type) and repeated the last 2 bisect tests
Both tests confirmed the commit below to be the cause of the instability. I do not understand why the disabled AoE changes something in nouveau. The only thing I could imagine is that the problem depends on the memory layout or the (physical) base address of the driver, which in turn depends on totally different drivers.
git bisect logfile:
git bisect start
# good: [5787640db6ae722aeadb394d480c7ca21b603e34] drm/nv04-nv40/instmem: remove use of nouveau_gpuobj_new_fake()
git bisect good 5787640db6ae722aeadb394d480c7ca21b603e34
# bad: [5b7be6344b4177fa55d128de75b0e5b42229fd37] Linux 3.7.9
git bisect bad 5b7be6344b4177fa55d128de75b0e5b42229fd37
# good: [4bcce1a355c8248fb5661cb78bb14b9e19475cd4] aoe: retain static block device numbers for backwards compatibility
git bisect good 4bcce1a355c8248fb5661cb78bb14b9e19475cd4
# bad: [35bafbee4b4732a2820bbd0ef141c8192ff29731] Merge tag 'disintegrate-mips-20121009' of git://git.infradead.org/users/dhowells/linux-headers into mips-for-linux-next
git bisect bad 35bafbee4b4732a2820bbd0ef141c8192ff29731
# bad: [d43b7167d4c74137f9a6c61fdcead127d60357f9] Merge branch 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
git bisect bad d43b7167d4c74137f9a6c61fdcead127d60357f9
# good: [bd0d10498826ed150da5e4c45baf8b9c7088fb71] Merge branch 'staging/for_v3.7' into v4l_for_linus
git bisect good bd0d10498826ed150da5e4c45baf8b9c7088fb71
# bad: [84424026c0a910886064049d414a12a4f4dd125e] Merge tag 'defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect bad 84424026c0a910886064049d414a12a4f4dd125e
# bad: [5f3d2f2e1a63679cf1c4a4210f2f1cc2f335bef6] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
git bisect bad 5f3d2f2e1a63679cf1c4a4210f2f1cc2f335bef6
# good: [9aae341287f55d5fc71f8a884e671f9058ad3388] Merge remote-tracking branch 'agust/next' into next
git bisect good 9aae341287f55d5fc71f8a884e671f9058ad3388
# good: [0dd96360e21ec7963aeba253261db87a32e728c6] mfd: rc5t583: Fix warning messages
git bisect good 0dd96360e21ec7963aeba253261db87a32e728c6
# bad: [11126c611e10abb18b6f1ed0300c0548c3906b54] Merge branch 'akpm' (Andrew's patch-bomb)
git bisect bad 11126c611e10abb18b6f1ed0300c0548c3906b54
# good: [837c8293ba24d08cd7438d82ad9bb8d2fb0f8a5b] mfd: 88pm860x: Use irqdomain
git bisect good 837c8293ba24d08cd7438d82ad9bb8d2fb0f8a5b
# bad: [c99b6841d74a5c7d3698cc2a3ec44241fe64b769] omfs: convert to use beXX_add_cpu()
git bisect bad c99b6841d74a5c7d3698cc2a3ec44241fe64b769
# bad: [1ac9e602625817b0c16cc70ea496875f7bd58a4d] aoe: remove unused code
git bisect bad 1ac9e602625817b0c16cc70ea496875f7bd58a4d
# bad: [08b60623510aebddd9ac4bf61dbe2d39313dddfd] aoe: make dynamic block minor numbers the default
git bisect bad 08b60623510aebddd9ac4bf61dbe2d39313dddfd
# bad: [7159e969d1963f19e7550aafd234b0c5361e5d69] aoe: update and specify AoE address guards and error messages
git bisect bad 7159e969d1963f19e7550aafd234b0c5361e5d69
# first bad commit: [7159e969d1963f19e7550aafd234b0c5361e5d69] aoe: update and specify AoE address guards and error messages
Created attachment 367242
reversed.patch
You can try to apply this reversed patch on the bad commit to see if it fixes it; if that's the case, then you indeed found the bad commit. It probably has something to do with the change in error logic or the range; and yeah, I think it has to do with the memory allocation ranges or something like that.
Download the patch, put it in the folder where you did the bisect and then do:
patch -p1 < reversed.patch
Then rebuild the kernel and test it.
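As an alternative to a hand-made reversed patch, git can undo a suspect commit in the working tree directly. This is a minimal sketch in a throwaway repository; the files and commits are invented, and in the kernel tree the suspect would be the commit blamed by the bisect. It is not the method used in this report, only an equivalent under those assumptions.

```shell
#!/bin/sh
set -eu

# Throwaway repository with a suspect commit followed by unrelated work.
repo=$(mktemp -d)
patchfile=$(mktemp)
cd "$repo"
git init -q .
git config user.email "revert-demo@example.invalid"
git config user.name "revert demo"

echo works > state
echo 1 > counter
git add state counter
git commit -q -m "baseline"
echo crashes > state
git commit -q -a -m "suspect change"
suspect=$(git rev-parse HEAD)
echo 3 > counter
git commit -q -a -m "unrelated later change"

# Variant 1: undo only the suspect commit and leave the result
# uncommitted, so the tree can be rebuilt and tested first.
git revert --no-commit "$suspect"
after_revert=$(cat state)
echo "state after revert: $after_revert"

# Back to the pristine tree before trying the second variant.
git reset -q --hard

# Variant 2: write the suspect commit out as a diff and apply it in
# reverse. --check only verifies that the patch would apply cleanly.
git diff "$suspect~1" "$suspect" > "$patchfile"
git apply --reverse --check "$patchfile"
git apply --reverse "$patchfile"
echo "state after reverse apply: $(cat state)"
```

Either way, `git status` and `git diff` then show exactly what was undone, which helps confirm that the kernel being rebuilt really contains the reverted change.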
I tried the patch with the last bisected (bad) version, but the crash remains even with the patch :-( This is really strange. All I can reliably say at the moment is that everything works when I git bisect replay up to (and including) this point:
[beginning of identical bisect.log from above deleted]
# bad: [11126c611e10abb18b6f1ed0300c0548c3906b54] Merge branch 'akpm' (Andrew's patch-bomb)
git bisect bad 11126c611e10abb18b6f1ed0300c0548c3906b54
Because it is stable at this point I say
$~ git bisect good
Bisecting: 7 revisions left to test after this (roughly 3 steps)
[c99b6841d74a5c7d3698cc2a3ec44241fe64b769] omfs: convert to use beXX_add_cpu()
which results in these new lines in bisect.log:
# good: [837c8293ba24d08cd7438d82ad9bb8d2fb0f8a5b] mfd: 88pm860x: Use irqdomain
git bisect good 837c8293ba24d08cd7438d82ad9bb8d2fb0f8a5b
After this step each bisected kernel shows the problem up to the end, where the mentioned commit in aoe is blamed. git bisect says that 7 revisions are left at this point, so I assume there are 7 commits that were added since the last known good state, and at least one of them causes the problem to appear. Is this correct?
Make absolutely sure that the reverse patch test is correct and that you have replaced the kernel; alternatively, you could skip the reverse patch and instead stop the bisect and run `git checkout -f 4bcce1a355c8248fb5661cb78bb14b9e19475cd4`, which will take the commit before the one marked bad. It's this commit that needs to work; otherwise the one we marked as bad isn't broken and something else went wrong.
If that's not helping, here is an alternative idea to proceed with a new bisect:
- As the bad commit, now pick the bad commit we have found (7159e969d1963f19e7550aafd234b0c5361e5d69); we've verified that this commit is broken, so it definitely fits as a new upper limit of the commits to check.
- As the good commit, try an earlier (in time) commit than you have selected before; if you start with a different commit, you get to test different commits and thus the output may be more accurate.
This interval will likely be smaller and thus shouldn't take too long to try.
Your bisect log in comment #12 is odd. If I do:
# git bisect start
# git bisect bad 11126c611e10abb18b6f1ed0300c0548c3906b54
# git bisect good 837c8293ba24d08cd7438d82ad9bb8d2fb0f8a5b
then it tells me:
Bisecting: 4318 revisions left to test after this (roughly 12 steps)
These are roughly 12 steps; however, your comment only shows 5 steps. I'm scratching my head over what's going on here; I've only got one more tip to maybe make this easier, which is to check `git bisect visualize` to get an idea of what's going on. You can see after each step how it cuts the range in half.
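One way to cross-check a "revisions left" figure is to count the commits between two endpoints directly with `git rev-list`. The sketch below uses a throwaway repository whose contents are invented; in the kernel tree, `good` and `bad` would be the two SHAs given to `git bisect`. The halving loop is only a rough estimate and is not claimed to match git's own "roughly N steps" output.

```shell
#!/bin/sh
set -eu

# Throwaway repository with 8 linear commits (invented for the demo).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "count-demo@example.invalid"
git config user.name "count demo"
for i in 1 2 3 4 5 6 7 8; do
    echo "$i" > counter
    git add counter
    git commit -q -m "commit $i"
done
good=$(git rev-parse HEAD~7)
bad=$(git rev-parse HEAD)

# Commits reachable from bad but not from good: the bisect candidates.
candidates=$(git rev-list --count "$good..$bad")
echo "candidates: $candidates"

# Rough number of halvings needed to narrow the range to one commit.
steps=0
n=$candidates
while [ "$n" -gt 1 ]; do
    n=$(( (n + 1) / 2 ))
    steps=$((steps + 1))
done
echo "about $steps halvings"

# The same range, oldest first, one line per commit.
git log --oneline --reverse "$good..$bad"
```

In a history with merges the range can be far larger than the number of lines between two entries in a bisect log suggests, because every commit brought in by a merge counts as a candidate.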
FYI, I posted my patch upstream: https://bugzilla.kernel.org/attachment.cgi?id=131911 I am now writing from this PC running 3.12.16; all is good so far, and I believe the patch is right. But please test it too: previously I sometimes got a good uptime with new kernels before a crash.
I tried the patch but the crashes remain. However, I cannot say whether the patch improves stability, because the crashes do not occur at regular intervals.
Watching upstream bug