Hi,

I've found that a simple Perl one-liner that tries to use a large amount of memory makes my system unresponsive, and it never recovers. I tested this with gentoo-sources-2.4.19-r9. I also tried vanilla-sources-2.4.19 and vanilla-sources-2.4.20; neither of these kernels causes the lockup I see with gentoo-sources. Under both of them the process ends up being killed, so I can only assume that one of the patches applied to gentoo-sources is responsible. My system uses the ~x86 keyword and is completely up to date.

The Perl command that causes the lockup is:

perl -we 'print 0x7cff_ffff .. 0x7fff_ffff'

This is perl 5.8.0 (the latest for ~x86). If you would like, I can attach the .config file for each of the kernels I tried.

Thanks,
Scott
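For a sense of scale, the range in the one-liner above can be counted with plain shell arithmetic. This is only a rough sketch: the function name `range_count` is made up for illustration, and the 16 bytes per element is an assumed figure, not a measurement of what perl 5.8.0 actually allocates per list element.

```shell
# Inclusive count of the integers from $1 to $2 (hex constants are accepted).
range_count() {
    echo $(( $2 - $1 + 1 ))
}

# Same bounds as the one-liner, without perl's underscore digit separators.
count=$(range_count 0x7cffffff 0x7fffffff)
echo "elements: $count"    # prints "elements: 50331649"

# ASSUMPTION: 16 bytes per element, chosen purely for illustration.
bytes_per_element=16
echo "approx MiB at ${bytes_per_element} B/element: $(( count * bytes_per_element / 1048576 ))"
```

If perl builds the whole list before printing it, even a small per-element cost multiplied by roughly 50 million elements is enough to exhaust memory on a machine with about 1GB of RAM.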
I played around with this a little more and compiled a few more kernels. I didn't have much time to spend on it, but what I tried ought to be helpful.

First, I modified the ebuild to apply only patches 00 through 08 and recompiled the kernel. This did not fix it. Thinking that it might have something to do with the preemptive kernel (just a guess, I admit), I also compiled the same kernel with preemption turned off. This did not fix it either.

However, while testing I noticed something very interesting as I watched with top running at nice -20. Top seemed to work perfectly well _until_ the load average got to 10, at which point it stopped responding. But that is not the interesting part. What's interesting is that the perl process reached a point where it occupied pretty much all free memory (about 980MB, if I remember correctly) and then stopped growing. CPU usage wasn't high: generally 0%, with a spike to full CPU every couple of seconds (top was running with 0.1s intervals).

I should also point out that the system does not become totally unresponsive. You can still ping it, and a Ctrl-C on the Linux console _did_ kill the process, though it took anywhere from a couple of seconds to 30 seconds. In a terminal under X (tested with xterm and gnome-terminal), Ctrl-C did nothing, even after 10 minutes (at which point I became too impatient and did a hard reset).

Now, on a vanilla kernel, when a process tries to grab too much memory, it dies and "Terminated" is generally displayed on the terminal. From what I've seen here, however, it looks as though the kernel is no longer killing processes that run out of memory. So, as best Scott and I can guess, memory allocation is not failing properly, and that causes the lockup. This might also be why top stopped responding: when the load reached 10, its load-average line became a character longer, so it probably tried to reallocate memory for that line.
Also, a recommendation to anyone who tries to reproduce this - turn off swap first! Not doing so just makes the system take unnecessarily long to fill the swap device before exhibiting the above behaviour.
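A quick way to confirm that swap really is off before running the test is to look at /proc/swaps. The sketch below is only illustrative: `active_swap_count` is a made-up helper name, and it assumes the usual /proc/swaps layout of one header line followed by one line per active swap area.

```shell
# Count the active swap areas in a /proc/swaps-style file
# (first line is a header, one swap area per line after it).
active_swap_count() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "${1:-/proc/swaps}"
}

# Only report when the file is readable (it exists on Linux).
if [ -r /proc/swaps ]; then
    if [ "$(active_swap_count)" -gt 0 ]; then
        echo "swap is active; run 'swapoff -a' as root before testing"
    else
        echo "no active swap"
    fi
fi
```

After testing, `swapon -a` (as root) re-enables the swap areas listed in /etc/fstab.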
This is (or at least seems to be) fixed in gentoo-sources-2.4.20-r2 (not sure about -r1, since it refused to work on _several_ systems in our office, with different hardware and configurations). My guess would be something in these patches:

* removed ck4 O(1) sched, ll, preempt patch
* added rml preempt & ll

Note that this bug still affects ck-sources, at least 2.4.20-r3, and possibly -r4.
Glad to hear the new sources work. I'll try your little one-liner on subsequent gentoo-sources kernel releases.

Thanks,
Jay