I am seeing a slow memory leak in the kernel. I am using gentoo-sources 2.6.11-r6, but also observed it in 2.6.11-r4. Over the course of several days, the amount of available memory on the server in question (free plus buffers and cache) gradually decreases. The rate is about 150MB per day (the system has 2GB of RAM total). The working set of processes stays the same through the whole period, at between 50 and 150MB (depending on whether you count VSZ or RSS). Nothing shows up in dmesg except for a couple of one-time lockd and nfs messages (the system uses two remote filesystems). The local filesystems are ReiserFS on a 3Ware 7500-4 controller, and the NIC is an Intel E100.

# free
             total       used       free     shared    buffers     cached
Mem:       2076180    2024068      52112          0     166760      93200
-/+ buffers/cache:    1764108     312072
Swap:      1028152         56    1028096

# cat /proc/meminfo
MemTotal:      2076180 kB
MemFree:         63080 kB
Buffers:        158776 kB
Cached:          91664 kB
SwapCached:          4 kB
Active:        1055244 kB
Inactive:       874660 kB
HighTotal:     1179072 kB
HighFree:          640 kB
LowTotal:       897108 kB
LowFree:         62440 kB
SwapTotal:     1028152 kB
SwapFree:      1028096 kB
Dirty:             768 kB
Writeback:           0 kB
Mapped:          12648 kB
Slab:            69872 kB
CommitLimit:   2066240 kB
Committed_AS:    26316 kB
PageTables:       1492 kB
VmallocTotal:   114680 kB
VmallocUsed:      4700 kB
VmallocChunk:   109784 kB

# lsmod
Module                  Size  Used by
nfs                    91180  2
lockd                  58920  2 nfs
sunrpc                125764  5 nfs,lockd
e100                   31872  0
mii                     4352  1 e100

# lspci
0000:00:00.0 Host bridge: Intel Corp. E7500 Memory Controller Hub (rev 03)
0000:00:00.1 Class ff00: Intel Corp. E7500/E7501 Host RASUM Controller (rev 03)
0000:00:02.0 PCI bridge: Intel Corp. E7500/E7501 Hub Interface B PCI-to-PCI Bridge (rev 03)
0000:00:1e.0 PCI bridge: Intel Corp. 82801 PCI Bridge (rev 42)
0000:00:1f.0 ISA bridge: Intel Corp. 82801CA LPC Interface Controller (rev 02)
0000:00:1f.1 IDE interface: Intel Corp. 82801CA Ultra ATA Storage Controller (rev 02)
0000:00:1f.3 SMBus: Intel Corp. 82801CA/CAM SMBus Controller (rev 02)
0000:01:1c.0 PIC: Intel Corp. 82870P2 P64H2 I/OxAPIC (rev 03)
0000:01:1d.0 PCI bridge: Intel Corp. 82870P2 P64H2 Hub PCI Bridge (rev 03)
0000:01:1e.0 PIC: Intel Corp. 82870P2 P64H2 I/OxAPIC (rev 03)
0000:01:1f.0 PCI bridge: Intel Corp. 82870P2 P64H2 Hub PCI Bridge (rev 03)
0000:03:01.0 RAID bus controller: 3ware Inc 3ware Inc 3ware 7xxx/8xxx-series PATA/SATA-RAID (rev 01)
0000:04:03.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
0000:04:04.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0d)
0000:04:05.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0d)

I would be happy to provide any additional information. As it stands, I have to reboot about once a week to clear the RAM or else it thrashes itself to death.

Reproducible: Always

Steps to Reproduce:

# emerge info
Portage 2.0.51.19 (default-linux/x86/2005.0, gcc-3.3.5-20050130, glibc-2.3.4.20041102-r1, 2.6.11-gentoo-r6 i686)
=================================================================
System uname: 2.6.11-gentoo-r6 i686 Intel(R) Xeon(TM) CPU 2.00GHz
Gentoo Base System version 1.4.16
Python: dev-lang/python-2.3.4-r1 [2.3.4 (#1, Mar 28 2005, 01:06:34)]
dev-lang/python: 2.3.4-r1
sys-apps/sandbox: [Not Present]
sys-devel/autoconf: 2.59-r6, 2.13
sys-devel/automake: 1.7.9-r1, 1.8.5-r3, 1.5, 1.4_p6, 1.6.3, 1.9.4
sys-devel/binutils: 2.15.92.0.2-r7
sys-devel/libtool: 1.5.14
virtual/os-headers: 2.6.8.1-r2
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CFLAGS="-O3 -mcpu=i686 -fomit-frame-pointer -fstack-protector"
CHOST="i386-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3/share/config /usr/local/clockspeed/etc /usr/share/config /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-O3 -mcpu=i686 -fomit-frame-pointer -fstack-protector"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoaddcvs autoconfig ccache collision-protect digest distlocks notitles sandbox sfperms strict userpriv usersandbox"
GENTOO_MIRRORS="http://distfiles.gentoo.org http://distro.ibiblio.org/pub/Linux/distributions/gentoo"
MAKEOPTS="-j1"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/portage/BG /usr/portage/FQ"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="x86 alsa apache2 apm berkdb bitmap-fonts crypt emacs emboss encode ethereal fortran gdbm gif gtk2 imlib ipv6 jpeg libg++ libwww mp3 mysql ncurses nls pam perl png python readline skey snmp spell ssl tcpd truetype-fonts type1-fonts xml2 zlib userland_GNU kernel_linux elibc_glibc"
Unset: ASFLAGS, CBUILD, CTARGET, LANG, LC_ALL, LDFLAGS, LINGUAS
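The "available memory" figure discussed in this report can be recomputed from /proc/meminfo as MemFree + Buffers + Cached, which is the same sum that free shows in its "-/+ buffers/cache" row. The following is only a minimal sketch of that arithmetic in Python; the function names are made up for illustration, and the sample text is just the four relevant fields copied from the meminfo output in this report.

```python
# Sketch: recompute "available" memory the way free's "-/+ buffers/cache"
# row does. Function names are illustrative, not part of any tool.

SAMPLE = """\
MemTotal: 2076180 kB
MemFree: 63080 kB
Buffers: 158776 kB
Cached: 91664 kB
"""


def parse_meminfo(text):
    """Turn /proc/meminfo-style text into a dict of field name -> kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_kb(fields):
    """Free pages plus buffers and page cache (both normally reclaimable)."""
    return fields["MemFree"] + fields["Buffers"] + fields["Cached"]


def unavailable_kb(fields):
    """Everything that is neither free nor buffers/cache."""
    return fields["MemTotal"] - available_kb(fields)


info = parse_meminfo(SAMPLE)
print("available kB:  ", available_kb(info))
print("unavailable kB:", unavailable_kb(info))
```

On a live system the same functions could be fed the contents of /proc/meminfo instead of SAMPLE. If the "unavailable" number keeps growing while the processes' RSS stays flat, that is the symptom described here.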
Please test vanilla-sources-2.6.12_rc3
I have booted vanilla-sources-2.6.12_rc3, and it still appears to be leaking, possibly worse than before. I am down to just over 1GB of free memory after two days of uptime. Anything else?
Next suggestion would be to mail the linux kernel list like you have already done. Provide any info that they ask for, and reopen this bug once you find a solution for the problem.
I have been running vanilla-sources-2.6.12_rc3 (with one small patch to track page ownership) for almost 10 days now, and no leaks are showing up. The only kernels I can conclusively state leaked memory are the gentoo-sources series, specifically 2.6.11-r4 and -r6.
I presumably have the same problem with practically all kernel versions of gentoo-sources and hardened-sources, at least since 2.6.8* (I haven't tried earlier ones yet), on an amd64 with both 32-bit and 64-bit kernels/installations. I am wondering why nobody else seems to have this problem. Unfortunately, the reproducibility is not so good, and the computer has to run rather long before the problem happens (I tried many kernel configurations, and sometimes I thought the problem had vanished, but then all of a sudden it was back). In my case, however, the memory tends to fill up when compiling C++ projects. For example, a complete KDE compile will often not succeed without some random processes being killed (usually some of the compiler tasks themselves are killed, so that the emerge ends during "make" with "internal error: killed"). Surprisingly, increasing the swap space seems to have no influence at all: in one test a task was killed after only 30 minutes of uptime, even with an additional 16 gig swapfile (although the kernel swapped like crazy). [For a while I suspected a thermal hardware problem, but this does not seem to be the case either, since "nicing" the processes and limiting the CPU frequency while simultaneously opening the tower and using additional cooling also had no influence. Moreover, the reproducibility seems too good for a hardware problem.] So, Bruce, maybe you can provoke/speed up the problem by compiling KDE several times? (Do not forget to make sure that no compiler cache is used, by renaming /usr/bin/ccache if you have it installed - IIRC only removing ccache from the FEATURES list was not enough.)
Maybe this bug is a duplicate of 58969 (at least my comments above seem to be related to that bug). Please see my comments there. I have now also observed the problem with vanilla-sources (I tested with 2.6.12_rc5 and used genkernel --udev without changing anything in the default kernel .config).
If you can reproduce it on 2.6.12-rc5 then it is an upstream issue, not one caused by gentoo's kernel patches. Read Bruce's discussion and gather some information about your problem: http://thread.gmane.org/gmane.linux.kernel/301432 Then write your own report to the linux kernel mailing list.
(In reply to comment #7)
> If you can reproduce it on 2.6.12-rc5 then it is an upstream issue, not one
> caused by gentoo's kernel patches.

Yes, it is not caused by the *kernel* patches. But the problem only happens with the Gentoo-compiled kernel: it seems that when I boot my SuSE system and chroot to the Gentoo partition, there are no problems (it *might* be accidental, but I retried several times, successfully compiling the "usual suspects").

And today I observed something even stranger: I copied the kernel generated from gentoo-sources-2.6.9-r14 from an old backup, and it also worked! However, after recompiling the *same* version (well, almost: I recompiled 2.6.9-r9 because the other one is not in the portage tree anymore), using /proc/config.gz from the running 2.6.9-r14 configuration (and using genkernel), I got a kernel which exhibits the memory leak again! I really have no idea how this is possible (but I tried both kernels several times, and the "old" 2.6.9-r14 always worked and the "newly compiled" 2.6.9-r9 always failed).

My only idea is that my toolchain produces a wrong kernel which nevertheless works perfectly except for this memory leak - this does not sound very likely to me. I am currently re-bootstrapping my toolchain (using only the most stable versions with no optimization) and will then recompile the kernel. When I find something new, I will let you know (but I am very busy these days, so it might take some time).
Just for the record: No difference with the current stable toolchain.
It's very unlikely - nothing in userspace can directly cause a kernel memory leak (but then again, you haven't actually posted any numbers, so it might not be the kernel that is leaking...)

It's not a fair comparison with SuSE unless you are running exactly the same kernel on both. Are you?

There is also no point playing with old kernels like 2.6.9. Reproduce it on the current development version and provide some numbers to the kernel developers. That's the only way this will get solved.
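One way to turn "it seems to leak" into a number is to record available memory at intervals and fit a straight line through the samples. The sketch below does an ordinary least-squares fit in plain Python. The function name and the sample values are invented for illustration (they are not measurements from this bug), and how the samples are collected is left open.

```python
# Sketch: estimate a leak rate from (unix_seconds, available_kB) samples
# with an ordinary least-squares slope. All sample values are synthetic.

SECONDS_PER_DAY = 86400


def leak_rate_kb_per_day(samples):
    """Return the fitted slope in kB/day; negative means memory is shrinking."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_a = sum(a for _, a in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must be taken at different times")
    cov = sum((t - mean_t) * (a - mean_a) for t, a in samples)
    return cov / var_t * SECONDS_PER_DAY


# Synthetic example: one sample per day for three days.
samples = [
    (0 * SECONDS_PER_DAY, 2000000),
    (1 * SECONDS_PER_DAY, 1850000),
    (2 * SECONDS_PER_DAY, 1700000),
]
print("fitted rate (kB/day):", leak_rate_kb_per_day(samples))
```

A fit over many samples is less sensitive to short-term noise (a big compile, a cron job filling the page cache) than comparing two single readings.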
Created attachment 60657
Output of free, proc/meminfo, proc/slabinfo

This is the output after many "emerge"s, when the system is almost swapping itself to death for no apparent reason.
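When reading a /proc/slabinfo dump like the one attached, it can help to rank the caches by approximate size, so that one that keeps growing between two dumps stands out. The following is a rough sketch that assumes the slabinfo 2.x column order (name, active_objs, num_objs, objsize, ...). The function name and the sample rows are made up, and num_objs * objsize is only an approximation that ignores per-slab overhead.

```python
# Sketch: rank slab caches by approximate size from /proc/slabinfo text.
# Assumes slabinfo 2.x columns: name active_objs num_objs objsize ...
# The sample rows below are synthetic, not taken from the attachment.

SAMPLE = """\
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables ...
buffer_head 500 1024 64 59 1 : tunables 120 60 8 : slabdata 18 18 0
dentry_cache 1000 2048 128 30 1 : tunables 120 60 8 : slabdata 69 69 0
"""


def top_slab_caches(text, limit=10):
    """Return up to `limit` (cache_name, approx_kB) pairs, largest first."""
    rows = []
    for line in text.splitlines():
        if not line or line.startswith("#") or line.startswith("slabinfo"):
            continue  # skip blank lines and the two header lines
        parts = line.split()
        if len(parts) < 4:
            continue
        try:
            num_objs = int(parts[2])
            objsize = int(parts[3])
        except ValueError:
            continue  # not a data row in the expected format
        rows.append((parts[0], num_objs * objsize // 1024))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


for name, kb in top_slab_caches(SAMPLE):
    print(name, kb, "kB")
```

Comparing the ranking from two dumps taken some hours apart would show whether any slab cache accounts for the missing memory. If none does, the leak is probably in pages allocated outside the slab allocator.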
You need to post this to the Linux kernel list like Bruce did.
Somehow my additional comment seems to have got lost, so I am reposting it (sorry if it now appears twice).

(In reply to comment #12)
> You need to post this to the Linux kernel list like Bruce did.

I understood what you mean, but as I wrote, the SuSE kernel and the old gentoo kernel (from practically the same sources with the same .config) seem to work, but a kernel freshly compiled under gentoo does not. So the reason is probably not in the gentoo/vanilla-sources themselves but rather in their interplay with gentoo - to me it is completely mysterious. But if there are no other ideas, maybe I will write to the kernel list anyway.

(In reply to comment #10)
> It's very unlikely - nothing in userspace can directly cause a kernel memory
> leak (but then again, you haven't actually posted any numbers, so it might
> not be the kernel that is leaking...)

I wrote about the toolchain because the only explanation I can see for the different behaviour is that something is wrong with the compilation process itself. But even after re-bootstrapping the toolchain (i.e. re-emerging linux-headers, gcc, binutils, glibc sufficiently often), a freshly compiled kernel does not work (and I tried several kernel versions - older and newer ones).

Concerning the missing data: There are actually two effects which I believe have the same cause, but I might be wrong:

1. The only effect which I can provoke is that when compiling certain .cc files with MAKEOPTS="-j2" and optimization C*FLAGS, the compilation usually dies with "internal error: killed" (or sometimes processes of other users are killed instead).

2. The other effect happens only after compiling many (~100 or more) .cc projects: The system slows down dramatically with lots of hard disk access and is often practically dead (response time for a keypress maybe minutes). The output in comment #11 is from such a situation.

If in 2. the system is not dead, effect 1. happens much more often - that's why I believe it is actually the same problem.

> It's not a fair comparison with suse unless you are running exactly the same
> kernel on both. Are you?

I did not want to compare; I simply have no explanation: SuSE's kernel and the old gentoo kernel (which I have now lost due to a stupid mistake) were the only "working" kernels which did not show effect 1. - instead, they start swapping at about the same point during compilation at which the newly compiled kernels (older and newer) would usually start killing random processes.
Regardless of which distro you see a leak on, if the latest unmodified development kernel (vanilla-sources-2.6.12_rc5) is leaking then it is a kernel bug. This may be triggered by a scenario present in Gentoo that is not present in SuSE, but no user space program should be able to make the kernel leak (and if one can, then that is itself a kernel bug). If a big leak can be triggered from user space, it is usually regarded as a DoS (denial of service) vulnerability, because a standard user account can easily bring down the box.
I found the main cause: The nvidia-kernel module (the problem occurred also without X - therefore I had not thought of this cause - but I had the nvidia module listed in /etc/modules.autoload.d and my scripts had always compiled the module). The earlier gentoo and SuSE kernels of course used different nvidia-kernel versions, which explains the different behaviour. With nvidia-kernel-1.0.7664 the reproducible part of the problem has vanished. Some memory still seems to vanish, but currently I have no time for further investigation (and it seems hopeless anyway, since the vanishing is too slow for systematic experiments).
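Since the culprit here turned out to be an out-of-tree module that was loaded without the reporter thinking about it, a quick sanity check before reporting a kernel leak is to compare the loaded modules against the set you expect. The sketch below parses lsmod-style text. The function names and the "expected" set are hypothetical, and the nvidia row in the sample (including its size) is invented for illustration.

```python
# Sketch: flag loaded modules that are not in an expected set.
# The sample mixes rows from the first report with an invented nvidia row.

SAMPLE_LSMOD = """\
Module                  Size  Used by
nvidia               3464668  0
nfs                    91180  2
lockd                  58920  2 nfs
sunrpc                125764  5 nfs,lockd
e100                   31872  0
mii                     4352  1 e100
"""


def loaded_modules(lsmod_text):
    """Return module names from lsmod-style output, skipping the header row."""
    names = []
    for line in lsmod_text.splitlines():
        parts = line.split()
        if len(parts) < 3 or parts[0] == "Module":
            continue
        names.append(parts[0])
    return names


def unexpected_modules(lsmod_text, expected):
    """Return loaded modules that are missing from the `expected` set."""
    return [name for name in loaded_modules(lsmod_text) if name not in expected]


# Hypothetical set of modules known to come from the kernel tree in use.
expected = {"nfs", "lockd", "sunrpc", "e100", "mii"}
print("unexpected modules:", unexpected_modules(SAMPLE_LSMOD, expected))
```

Anything this check flags is worth unloading (or blacklisting from autoload) before retesting, so that a leak reported upstream is reproduced on a kernel without third-party modules.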