Summary: | Slow kernel memory leak | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Bruce Guenter <bruce> |
Component: | [OLD] Core system | Assignee: | Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel> |
Status: | RESOLVED UPSTREAM | ||
Severity: | normal | CC: | gentoobugs |
Priority: | High | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: | Output of free, proc/meminfo, proc/slabinfo |
Description
Bruce Guenter
2005-05-05 14:09:31 UTC
Please test vanilla-sources-2.6.12_rc3

I have booted vanilla-sources-2.6.12_rc3, and it still appears to be leaking, possibly worse than before. I am down to just over 1GB of free memory after two days of uptime. Anything else?

Next suggestion would be to mail the linux kernel list like you have already done. Provide any info that they ask for, and reopen this bug once you find a solution for the problem.

I have been running vanilla-sources-2.6.12_rc3 (with one small patch to track page ownership) for almost 10 days now, and no leaks are showing up. The only kernels I can conclusively state leaked memory are the gentoo-sources series, specifically 2.6.11-r4 and -r6.

I have presumably the same problem with practically all kernel versions of gentoo-sources and hardened-sources, at least since 2.6.8* (I haven't tried earlier ones yet), on an amd64 with both 32-bit and 64-bit kernels/installations. I am wondering why nobody else seems to have this problem. Unfortunately, the reproducibility is not so good and the computer has to run rather long until the problem happens (I tried many kernel configurations, and sometimes I thought the problem had vanished, but then all of a sudden it was back). However, in my case the memory usually fills (sometimes) when compiling C++ projects. For example, a complete kde compile will often not succeed without some random processes being killed (usually some of the compiler tasks themselves are killed, so that the emerge ends during "make" with "internal error: killed"). Surprisingly, increasing the swap space seems to have no influence at all: in one test a task was killed after only 30 minutes of uptime, even with an additional 16 gig swapfile (although the kernel swapped like crazy).
[For a while I suspected a thermal hardware problem, but this does not seem to be the case either: "nicing" the processes and limiting the CPU frequency, while also opening the tower and adding extra cooling, had no influence. Moreover, the reproducibility seems too good to be a hardware problem.]

So, Bruce, maybe you can provoke/speed up the problem by compiling kde several times? (Do not forget to make sure that no compiler cache is used, by renaming /usr/bin/ccache if you installed it - IIRC only removing ccache from the FEATURES list was not enough.)

Maybe this bug is a duplicate of 58969 (at least my above comments seem to be related to that bug). Please see my comments there. I have now observed the problem with vanilla-sources as well (I tested with 2.6.12_rc5 and used genkernel --udev without changing anything in the default kernel .config).

If you can reproduce it on 2.6.12-rc5 then it is an upstream issue, not one caused by gentoo's kernel patches. Read Bruce's discussion and gather some information about your problem: http://thread.gmane.org/gmane.linux.kernel/301432

Then write your own report to the linux kernel mailing list.

(In reply to comment #7)
> If you can reproduce it on 2.6.12-rc5 then it is an upstream issue, not one
> caused by gentoo's kernel patches.

Yes, it is not caused by the *kernel* patches. But the problem only happens with the Gentoo-compiled kernel: it seems that when I boot my SuSE system and chroot to the Gentoo partition, there are no problems (it *might* be accidental, but I retried several times, successfully compiling the "usual suspects"). And today I observed something even stranger: I copied the kernel generated from gentoo-sources-2.6.9-r14 from an old backup, and it also worked!
However, after recompiling the *same* version (well, almost: I recompiled 2.6.9-r9 because the other one is not in the portage tree anymore), using /proc/config.gz from the running 2.6.9-r14 configuration (and using genkernel), I got a kernel which exhibits the memory leak again! I really have no idea how this is possible (but I tried both kernels several times, and the "old" 2.6.9-r14 always worked and the "newly compiled" 2.6.9-r9 always failed). My only idea is that my toolchain produces a wrong kernel which, however, works perfectly except for this memory leak - this does not sound very likely to me. I am currently re-bootstrapping my toolchain (using only the most stable versions with no optimization) and will then recompile the kernel. When I find something new, I will let you know (but I am very busy these days, so it might take some time).

Just for the record: no difference with the current stable toolchain.

It's very unlikely - nothing in userspace can directly cause a kernel memory leak (but then again, you haven't actually posted any numbers, so it might not be the kernel that is leaking...)

It's not a fair comparison with suse unless you are running exactly the same kernel on both. Are you?

There is also no point playing with old kernels like 2.6.9. Reproduce it on the current development version and provide some numbers to the kernel developers. That's the only way this will get solved.

Created attachment 60657 [details]
Output of free, proc/meminfo, proc/slabinfo
This is the output after many "emerge"s, when the system is swapping so heavily
that it is almost dead, for no apparent reason.
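
For anyone collecting similar numbers, here is a minimal Python sketch of how two snapshots of /proc/meminfo and /proc/slabinfo could be compared to see which slab caches grow over time. It is not taken from the attachment. The function names are hypothetical, and it assumes the slabinfo 2.x column order (name, active_objs, num_objs, objsize, ...). The per-cache size is only an estimate (num_objs * objsize), and reading /proc/slabinfo may require root.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}.

    Values are the first number on each line (kB for most fields).
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def parse_slabinfo(text):
    """Parse /proc/slabinfo-style text into {cache_name: estimated_bytes}.

    Assumption: slabinfo 2.x layout, i.e. the columns are
    name, active_objs, num_objs, objsize, ...
    The estimate num_objs * objsize ignores per-slab overhead.
    """
    usage = {}
    for line in text.splitlines():
        if line.startswith("slabinfo") or line.startswith("#"):
            continue  # skip the version line and the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        name, num_objs, objsize = fields[0], fields[2], fields[3]
        if num_objs.isdigit() and objsize.isdigit():
            usage[name] = int(num_objs) * int(objsize)
    return usage


def growth(before, after, top=10):
    """Return the `top` entries that grew most between two snapshots."""
    deltas = {key: after[key] - before.get(key, 0) for key in after}
    grown = [(key, delta) for key, delta in deltas.items() if delta > 0]
    return sorted(grown, key=lambda item: item[1], reverse=True)[:top]


def read_snapshot():
    """Read the live files; call this twice, some hours apart."""
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    with open("/proc/slabinfo") as f:
        slab = parse_slabinfo(f.read())
    return mem, slab
```

To use it, take one `read_snapshot()` shortly after boot and another after the suspected leak has had time to build up, then pass the two slab dictionaries to `growth()`. A cache that keeps growing across several snapshots is a candidate to mention in a report to the kernel list.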
You need to post this to the Linux kernel list like Bruce did.

Somehow my additional comment seems to have been lost, so I am reposting it (sorry if it now appears twice).

(In reply to comment #12)
> You need to post this to the Linux kernel list like Bruce did.

I understand what you mean, but as I wrote, the SuSE kernel and the old gentoo kernel (from practically the same sources with the same .config) seem to work, but a kernel freshly compiled under gentoo does not. So the cause is probably not in the gentoo/vanilla-sources themselves but rather in their interplay with gentoo - to me it is completely mysterious. But if there are no other ideas, maybe I will write to the kernel list anyway.

(In reply to comment #10)
> It's very unlikely - nothing in userspace can directly cause a kernel memory
> leak (but then again, you haven't actually posted any numbers, so it might
> not be the kernel that is leaking...)

I wrote this thing about the toolchain because the only explanation I can see for the different behaviour is that something is wrong with the compilation process itself. But even after re-bootstrapping the toolchain (i.e. re-emerging linux-headers, gcc, binutils, glibc sufficiently often) a freshly compiled kernel does not work (and I tried several kernel versions - older and newer ones).

Concerning the missing data: there are actually two effects which I believe have the same cause, but I might be wrong:

1. The only effect which I can provoke is that when compiling certain .cc files with makeopts="-j2" and optimization C*FLAGS, compilation usually dies with "internal error: killed" (or sometimes processes of other users are killed instead).
2. The other effect happens only after compiling many (~100 or more) .cc projects: the system slows down dramatically with lots of hard disk access and is often practically dead (response time for a keypress maybe minutes). The output of comment #11 is from such a situation.

If in 2. the system is not dead, effect 1. happens much more often - that's why I believe it is actually the same problem.

> It's not a fair comparison with suse unless you are running exactly the same
> kernel on both. Are you?

I did not want to compare; I simply have no explanation: SuSE's kernel and the old gentoo kernel (which I have now lost due to a stupid mistake) were the only "working" kernels which did not show effect 1. - instead, they start swapping at about the same point during compilation at which the newly compiled kernels (older and newer) would usually start killing random processes.

Regardless of which distro you see a leak on, if the latest unmodified development kernel (vanilla-sources-2.6.12_rc5) is leaking then it is a kernel bug. This may be triggered by a scenario present in Gentoo that is not present in SUSE, but no user space program should be able to make the kernel leak (and if this is the case, then it's a kernel bug). If a big leak is triggered in user space, it is usually regarded as a DoS (denial of service) attack because a standard user account can easily bring down the box.

I found the main cause: the nvidia-kernel module (the problem also occurred without X - therefore I had not thought of this cause - but I had the nvidia module listed in /etc/modules.autoload.d and my scripts had always compiled the module). The earlier gentoo and SuSE kernels of course used different nvidia-kernel versions, which explains the different behaviour. With nvidia-kernel-1.0.7664 the reproducible part of the problem has vanished. Anyway, some memory still seems to vanish, but currently I have no time for further investigation (and it seems hopeless anyway, since the vanishing is too slow for systematic experiments).
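
Since the culprit here was a module that was loaded even without X running, a quick check of which modules are actually loaded can rule such causes in or out before comparing kernels. The following is a minimal Python sketch under stated assumptions: the function names and the default suspect list are hypothetical, and it only relies on the first two whitespace-separated fields of each /proc/modules line being the module name and its size.

```python
def parse_modules(text):
    """Parse /proc/modules-style text into {module_name: size_in_bytes}.

    Assumption: each line starts with the module name followed by its size.
    """
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            modules[fields[0]] = int(fields[1])
    return modules


def suspect_modules(text, suspects=("nvidia",)):
    """Return the loaded modules whose names start with a suspect prefix.

    The default prefix list is illustrative only; pass whatever out-of-tree
    modules are installed on the machine being tested.
    """
    modules = parse_modules(text)
    return sorted(
        name for name in modules
        if any(name.startswith(prefix) for prefix in suspects)
    )


def read_loaded_suspects(suspects=("nvidia",)):
    """Check the running system (reads /proc/modules)."""
    with open("/proc/modules") as f:
        return suspect_modules(f.read(), suspects)
```

If `read_loaded_suspects()` returns a non-empty list on the leaking kernel, it is worth repeating the test with those modules unloaded before reporting the leak upstream.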