Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 125368 - nvidia-kernel: bug with NVIDIA_kernel-1.0-7167-disable-preempt-on-smp_processor_id.patch
Summary: nvidia-kernel: bug with NVIDIA_kernel-1.0-7167-disable-preempt-on-smp_process...
Status: VERIFIED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system (show other bugs)
Hardware: All Linux
: High normal (vote)
Assignee: X11 External Driver Maintainers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-03-07 08:49 UTC by BlaisorBlade
Modified: 2006-05-03 02:52 UTC (History)
3 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description BlaisorBlade 2006-03-07 08:49:51 UTC
That patch is bust, useless and introduces a bug. Zander's NVIDIA_kernel-1.0-6629-1182399.patch, integrated upstream, is better. It was rolled after yours, but you didn't notice that yours should have been dropped.

smp_processor_id is called after doing spin_lock_irq, which disables preemption - how can that give the warnings you describe in the ChangeLog, with the current code?

  Indeed, the patch that in 6629 fixed (IMHO) that "smp_processor_id in preemptable code" is NVIDIA_kernel-1.0-6629-1182399.patch, which introduced the use of get_cpu() instead of the first smp_processor_id. At least, it's sufficient.

  The current version of 1.0.6629/nv-disable-preempt-on-smp_processor_id.patch is therefore dummy; however, I can assume it made at least a bit of sense and fixed the warnings (while still having the imbalance bug) before Zander rolled his patch, and that you updated it wrongly.

  However, let's suppose I'm wrong, and let's go on.
  While inspecting the patch, I found that the patched code (only with the patch) also contains a preemption imbalance on an early return path, for the case where spinlock recursion is detected:

void NV_API_CALL nv_lock_rm(
    nv_state_t *nv
)
{
    nv_linux_state_t *nvl;
    int cpu;

    nvl = NV_GET_NVL_FROM_NV_STATE(nv);
+    preempt_disable();
    cpu = get_cpu();

    if (nvl->rm_lock_cpu == cpu)
    {
        nvl->rm_lock_count++;
        put_cpu();
//XXXXXXX: no preempt_enable()!!!
        return;
    }

    put_cpu();
    spin_unlock_wait(&nvl->rm_lock);
    spin_lock_irq(&nvl->rm_lock);

    nvl->rm_lock_cpu = smp_processor_id();
    nvl->rm_lock_count = 1;
//Location END
+    preempt_enable();
}

  Additionally, since on 2.4 the patch isn't needed and on 2.6 get_cpu() calls preempt_disable(), the patch is bad style: instead of adding preempt_enable() to "Location END", it would be a bit better to move put_cpu() there. This would also avoid the preemption imbalance.

  The patch to do this (which replaces *totally* the above one) is like:
--- nv.orig/nv.c
+++ nv/nv.c
@@ -3361,12 +3361,12 @@ void NV_API_CALL nv_lock_rm(
         return;
     }

-    put_cpu();
     spin_unlock_wait(&nvl->rm_lock);
     spin_lock_irq(&nvl->rm_lock);

     nvl->rm_lock_cpu = smp_processor_id();
     nvl->rm_lock_count = 1;
+    put_cpu();
 }

 void NV_API_CALL nv_unlock_rm(

*)  Additionally, the thing (either the existing patch or the one above) has latency problems - that spin_unlock_wait() call has the only meaning to wait that the spinlock is released while keeping preemption and irqs enabled; while irqs are still kept enabled until the lock is released by someone else, preemption is disabled, with the patch.
Comment 1 BlaisorBlade 2006-03-07 08:52:25 UTC
Please assign this bug to azarah which I cc'ed, since he introduced the patch.
Comment 2 Jeremy Huddleston (RETIRED) gentoo-dev 2006-03-24 14:00:45 UTC
Do you have a link to Zander's NVIDIA_kernel-1.0-6629-1182399.patch that I can compare to ours?
Comment 3 BlaisorBlade 2006-03-24 16:02:08 UTC
That link is 

/usr/portage/media-video/nvidia-kernel/files/1.0.6629/NVIDIA_kernel-1.0-6629-1182399.patch

(I wouldn't search for the web-link now, it's much more awkward, but the patch 
is the same).

You (Gentoo) fetched that from the forum, in fact. See my first comment for 
the description of the supposed patch history, where I say:

==
The current version of 1.0.6629/nv-disable-preempt-on-smp_processor_id.patch
is therefore dummy; however, I can assume it made at least a bit of sense and
fixed the warnings (while still having the imbalance bug) before Zander rolled
his patch, and that you updated it wrongly.
==

Bye (btw, I tried to send this by answering email, but it seems this doesn't work here - if it does this will be duplicated, sorry for that).
Comment 4 BlaisorBlade 2006-04-24 08:02:40 UTC
Sorry, but any news on this? On my side I'm just waiting a response, and I'm still confident on what I said. Need any further info or explaination?
Comment 5 Martin Schlemmer (RETIRED) gentoo-dev 2006-04-25 00:53:22 UTC
I rather wanted to abstain from replying, as I haven't really touched the nvidia drivers since I added a _fixed_ version of the patch _before_ Zander released his version, and some other reasons below.

Whoever incorporated Zander's fix, and then reverted to the broken version of the patch (mismatched preempt_enable()) - see files/1.0.6629/nv-disable-preempt-on-smp_processor_id-2.patch in viewcvs.

So basically, the patch is faulty as pointed out, and also unneeded as pointed out.

I should point out that the reporter could actually have checked what really happened instead of trying to throw all the blame at me.  Whatever the patch looks like now, it is entirely the work of other people, and in fact did not even have get_cpu() call back then (and yes, I did a thinko with the first one, but it was fixed with the second version of the patch).  Nobody checked with me about the patch back then, and unfortunately I get stretches that I really cannot do much in regards to Gentoo (like now) - as was the problem back then.
Comment 6 Kris Kersey (RETIRED) gentoo-dev 2006-04-26 06:08:26 UTC
Should I just remove the patch then and all will be corrected?
Comment 7 BlaisorBlade 2006-04-26 07:29:23 UTC
I'm answering to both Kris (for his question) and Martin (to excuse myself).

For Kris: yes, everything I described will be resolved by removing the patch, and no further action is needed. I think that Martin also agrees, by reading his comment (I guess your doubt was only about whether another fix was needed).

Indeed, every version except 6111 contains the patch we're going to remove; Zander's patch is applied explicitly in 6629 ebuild and should be already included in Nvidia sources for subsequent releases. So we can indeed remove it from all nvidia-kernel ebuilds - note there are 2 copies of the patch, both can be removed:

$ ls /usr/portage/media-video/nvidia-kernel/files/1.0.*/*-disable-preempt-on-smp_processor_id.patch

/usr/portage/media-video/nvidia-kernel/files/1.0.6629/nv-disable-preempt-on-smp_processor_id.patch
/usr/portage/media-video/nvidia-kernel/files/1.0.7167/NVIDIA_kernel-1.0-7167-disable-preempt-on-smp_processor_id.patch

===

For Martin:
First, thanks for replying, even if I've maybe been a bit rude - now the bug can be solved easily.

I didn't want to throw all the blame onto you, I thought you were involved since I saw the entry in the ChangeLog (or whatever, I don't recall).

I'm very sorry if I've been a bit rude in describing the bug (and I must admit I've been).

I didn't want to fully track it down, I've just reported a bug; yes I could, but since I wanted mainly to discuss the merit of the code and not to judge anybody, I didn't.
Comment 8 Martin Schlemmer (RETIRED) gentoo-dev 2006-05-02 06:25:08 UTC
(In reply to comment #6)
> Should I just remove the patch then and all will be corrected?
> 

Yup
Comment 9 Kris Kersey (RETIRED) gentoo-dev 2006-05-02 07:16:51 UTC
Sorry for the delay.  I have committed the suggested change to all ebuilds.  I didn't bump any versions because I didn't think it would be necessary for everyone to rebuild.
Comment 10 BlaisorBlade 2006-05-03 02:52:16 UTC
If the bug triggers we get a potential instability (I think you need to be an SMP box at least to experience it); so I'm not sure that not rebuilding is a good idea. However I'm not sure either,so feel free to decide.