
Bug 91615

Summary: Slow kernel memory leak
Product: Gentoo Linux    Reporter: Bruce Guenter <bruce>
Component: [OLD] Core system    Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel>
Status: RESOLVED UPSTREAM
Severity: normal    CC: gentoobugs
Priority: High    
Version: unspecified   
Hardware: All   
OS: Linux   
Whiteboard:
Package list:
Runtime testing required: ---
Attachments: Output of free, proc/meminfo, proc/slabinfo

Description Bruce Guenter 2005-05-05 14:09:31 UTC
I am seeing a slow memory leak in the kernel. I am using gentoo-sources 2.6.11-r6, but I also observed it in 2.6.11-r4. Over the course of several days, the amount of available memory on the server in question (the "free" figure on the "-/+ buffers/cache" line of free, i.e. free plus buffers and cache) gradually decreases. The rate is about 150MB per day (the system has 2GB of RAM in total). The working set of processes remains the same throughout at between 50 and 150MB (depending on whether you count VSZ or RSS). Nothing shows up in dmesg except a couple of one-time lockd and nfs messages (the system uses two remote NFS filesystems). The local filesystems are ReiserFS on a 3Ware 7500-4 controller, and the NIC is an Intel E100.

# free
             total       used       free     shared    buffers     cached
Mem:       2076180    2024068      52112          0     166760      93200
-/+ buffers/cache:    1764108     312072
Swap:      1028152         56    1028096
# cat /proc/meminfo 
MemTotal:      2076180 kB
MemFree:         63080 kB
Buffers:        158776 kB
Cached:          91664 kB
SwapCached:          4 kB
Active:        1055244 kB
Inactive:       874660 kB
HighTotal:     1179072 kB
HighFree:          640 kB
LowTotal:       897108 kB
LowFree:         62440 kB
SwapTotal:     1028152 kB
SwapFree:      1028096 kB
Dirty:             768 kB
Writeback:           0 kB
Mapped:          12648 kB
Slab:            69872 kB
CommitLimit:   2066240 kB
Committed_AS:    26316 kB
PageTables:       1492 kB
VmallocTotal:   114680 kB
VmallocUsed:      4700 kB
VmallocChunk:   109784 kB
# lsmod
Module                  Size  Used by
nfs                    91180  2 
lockd                  58920  2 nfs
sunrpc                125764  5 nfs,lockd
e100                   31872  0 
mii                     4352  1 e100
# lspci
0000:00:00.0 Host bridge: Intel Corp. E7500 Memory Controller Hub (rev 03)
0000:00:00.1 Class ff00: Intel Corp. E7500/E7501 Host RASUM Controller (rev 03)
0000:00:02.0 PCI bridge: Intel Corp. E7500/E7501 Hub Interface B PCI-to-PCI Bridge (rev 03)
0000:00:1e.0 PCI bridge: Intel Corp. 82801 PCI Bridge (rev 42)
0000:00:1f.0 ISA bridge: Intel Corp. 82801CA LPC Interface Controller (rev 02)
0000:00:1f.1 IDE interface: Intel Corp. 82801CA Ultra ATA Storage Controller (rev 02)
0000:00:1f.3 SMBus: Intel Corp. 82801CA/CAM SMBus Controller (rev 02)
0000:01:1c.0 PIC: Intel Corp. 82870P2 P64H2 I/OxAPIC (rev 03)
0000:01:1d.0 PCI bridge: Intel Corp. 82870P2 P64H2 Hub PCI Bridge (rev 03)
0000:01:1e.0 PIC: Intel Corp. 82870P2 P64H2 I/OxAPIC (rev 03)
0000:01:1f.0 PCI bridge: Intel Corp. 82870P2 P64H2 Hub PCI Bridge (rev 03)
0000:03:01.0 RAID bus controller: 3ware Inc 3ware Inc 3ware 7xxx/8xxx-series PATA/SATA-RAID (rev 01)
0000:04:03.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
0000:04:04.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0d)
0000:04:05.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0d)
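
The "available" figure tracked here can be re-derived from either dump above. On the "-/+ buffers/cache" line of free, used = 2024068 - 166760 - 93200 = 1764108 kB and free = 52112 + 166760 + 93200 = 312072 kB. A minimal Python sketch, assuming the 2.6-era /proc/meminfo layout shown above (later kernels also export MemAvailable, which uses a different estimate), computes the same MemFree + Buffers + Cached sum; the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value (kB)."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        name, rest = line.split(":", 1)
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def available_kb(fields):
    """Free memory counting buffers and page cache as reclaimable,
    i.e. the "free" column of the "-/+ buffers/cache" row of free(1)."""
    return fields["MemFree"] + fields["Buffers"] + fields["Cached"]


def read_available_kb(path="/proc/meminfo"):
    """Read the live value on a running Linux system."""
    with open(path) as f:
        return available_kb(parse_meminfo(f.read()))
```

Applied to the meminfo dump above, this gives 63080 + 158776 + 91664 = 313520 kB; the small difference from free's 312072 kB is because the two snapshots were taken at slightly different moments.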

I would be happy to provide any additional information.  As it stands, I have to reboot about once a week to clear the RAM or else it thrashes itself to death.

Reproducible: Always
Steps to Reproduce:




# emerge info
Portage 2.0.51.19 (default-linux/x86/2005.0, gcc-3.3.5-20050130,
glibc-2.3.4.20041102-r1, 2.6.11-gentoo-r6 i686)
=================================================================
System uname: 2.6.11-gentoo-r6 i686 Intel(R) Xeon(TM) CPU 2.00GHz
Gentoo Base System version 1.4.16
Python:              dev-lang/python-2.3.4-r1 [2.3.4 (#1, Mar 28 2005, 01:06:34)]
dev-lang/python:     2.3.4-r1
sys-apps/sandbox:    [Not Present]
sys-devel/autoconf:  2.59-r6, 2.13
sys-devel/automake:  1.7.9-r1, 1.8.5-r3, 1.5, 1.4_p6, 1.6.3, 1.9.4
sys-devel/binutils:  2.15.92.0.2-r7
sys-devel/libtool:   1.5.14
virtual/os-headers:  2.6.8.1-r2
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CFLAGS="-O3 -mcpu=i686 -fomit-frame-pointer -fstack-protector"
CHOST="i386-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3/share/config
/usr/local/clockspeed/etc /usr/share/config /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-O3 -mcpu=i686 -fomit-frame-pointer -fstack-protector"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoaddcvs autoconfig ccache collision-protect digest distlocks
notitles sandbox sfperms strict userpriv usersandbox"
GENTOO_MIRRORS="http://distfiles.gentoo.org
http://distro.ibiblio.org/pub/Linux/distributions/gentoo"
MAKEOPTS="-j1"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/portage/BG /usr/portage/FQ"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="x86 alsa apache2 apm berkdb bitmap-fonts crypt emacs emboss encode ethereal
fortran gdbm gif gtk2 imlib ipv6 jpeg libg++ libwww mp3 mysql ncurses nls pam
perl png python readline skey snmp spell ssl tcpd truetype-fonts type1-fonts
xml2 zlib userland_GNU kernel_linux elibc_glibc"
Unset:  ASFLAGS, CBUILD, CTARGET, LANG, LC_ALL, LDFLAGS, LINGUAS
Comment 1 Daniel Drake (RETIRED) gentoo-dev 2005-05-06 10:54:29 UTC
Please test vanilla-sources-2.6.12_rc3
Comment 2 Bruce Guenter 2005-05-08 20:59:09 UTC
I have booted vanilla-sources-2.6.12_rc3, and it still appears to be leaking, possibly worse than before.  I am down to just over 1GB of free memory after two days of uptime.  Anything else?
Comment 3 Daniel Drake (RETIRED) gentoo-dev 2005-05-09 06:57:17 UTC
My next suggestion would be to mail the Linux kernel mailing list, as you have already done. Provide any information they ask for, and reopen this bug once you find a solution to the problem.
Comment 4 Bruce Guenter 2005-05-19 11:50:31 UTC
I have been running vanilla-sources-2.6.12_rc3 (with one small patch to track
page ownership) for almost 10 days now, and no leaks are showing up.  The only
kernels I can conclusively state leaked memory are the gentoo-sources series,
specifically 2.6.11-r4 and -r6.
Comment 5 Martin Väth 2005-05-25 12:50:07 UTC
I presumably have the same problem with practically all kernel versions of
gentoo-sources and hardened-sources, at least since 2.6.8* (I haven't tried
earlier ones yet), on an amd64 with both 32-bit and 64-bit kernels/installations.
I am wondering why nobody else seems to have this problem. Unfortunately,
reproducibility is poor and the computer has to run rather long before the
problem appears (I tried many kernel configurations; sometimes I thought the
problem had vanished, but then all of a sudden it was back).
In my case, however, the memory usually fills up when compiling C++ projects
(though not every time). For example, a complete KDE compile will often not
succeed without some random processes being killed (usually the compiler tasks
themselves, so that the emerge ends during "make" with "internal error: killed").
Surprisingly, increasing the swap space seems to have no influence at all: in
one test a task was killed after only 30 minutes of uptime, even with an
additional 16 GB swapfile (although the kernel swapped like crazy). [For a
while I suspected a thermal hardware problem, but this does not seem to be the
case either: "nicing" the processes and limiting the CPU frequency, while also
opening the tower and adding extra cooling, had no influence. Moreover, the
problem seems too reproducible to be a hardware fault.]
So, Bruce, maybe compiling KDE several times will help you provoke or speed up
the problem? (Make sure that no compiler cache is used, by renaming
/usr/bin/ccache if you have it installed; IIRC, merely removing ccache from
FEATURES was not enough.)
 
Comment 6 Martin Väth 2005-05-30 02:02:15 UTC
Maybe this bug is a duplicate of bug 58969 (at least my comments above seem to
be related to that bug). Please see my comments there.

I have now also observed the problem with vanilla-sources (I tested 2.6.12_rc5,
built with genkernel --udev without changing anything in the default kernel
.config).
 
Comment 7 Daniel Drake (RETIRED) gentoo-dev 2005-05-31 16:06:10 UTC
If you can reproduce it on 2.6.12-rc5 then it is an upstream issue, not one
caused by gentoo's kernel patches.

Read Bruce's discussion and gather some information about your problem:
http://thread.gmane.org/gmane.linux.kernel/301432

Then write your own report to the linux kernel mailing list.
Comment 8 Martin Väth 2005-06-01 15:24:08 UTC
(In reply to comment #7) 
> If you can reproduce it on 2.6.12-rc5 then it is an upstream issue, not one 
> caused by gentoo's kernel patches. 
 
Yes, it is not caused by the *kernel* patches. But the problem only happens
with the Gentoo-compiled kernel: when I boot my SuSE system and chroot into the
Gentoo partition, there seem to be no problems (this *might* be accidental, but
I retried several times, successfully compiling the "usual suspects").
And today I observed something even stranger: I copied the kernel built from
gentoo-sources-2.6.9-r14 from an old backup, and it also worked! However, after
recompiling the *same* version (well, almost: I recompiled 2.6.9-r9 because the
other one is no longer in the portage tree), using /proc/config.gz from the
running 2.6.9-r14 configuration (and using genkernel), I got a kernel that
exhibits the memory leak again!
I really have no idea how this is possible (but I tried both kernels several
times, and the "old" 2.6.9-r14 always worked while the "newly compiled" 2.6.9-r9
always failed). My only idea is that my toolchain produces a broken kernel that
nevertheless works perfectly except for this memory leak; that does not sound
very likely to me.
I am currently re-bootstrapping my toolchain (using only the most stable
versions, with no optimization) and will then recompile the kernel. When I find
something new, I will let you know (but I am very busy these days, so it might
take some time).
 
Comment 9 Martin Väth 2005-06-02 01:16:43 UTC
Just for the record: no difference with the current stable toolchain.
Comment 10 Daniel Drake (RETIRED) gentoo-dev 2005-06-02 03:58:54 UTC
It's very unlikely - nothing in userspace can directly cause a kernel memory
leak (but then again, you haven't actually posted any numbers, so it might not
be the kernel that is leaking...)

It's not a fair comparison with suse unless you are running exactly the same
kernel on both. Are you?

There is also no point playing with old kernels like 2.6.9. Reproduce it on the
current development version and provide some numbers to the kernel developers.
That's the only way this will get solved.
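
One way to turn a slow leak into the kind of numbers asked for here is to sample the available-memory figure periodically (for example from cron) and fit a straight line through the samples. The sketch below is a hypothetical helper, not an existing tool; it assumes samples are (seconds since start, available kB) pairs and reports the least-squares slope in kB per day, so a leak of about 150 MB/day shows up as roughly -153600.

```python
def leak_rate_kb_per_day(samples):
    """Least-squares slope of available memory over time.

    samples: sequence of (elapsed_seconds, available_kb) pairs.
    Returns kB/day; a negative value means available memory is shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one instant")
    return cov / var * 86400  # convert per-second slope to per-day


# Synthetic (made-up) data: losing 153600 kB (150 MB) per day, sampled daily.
samples = [(day * 86400, 1800000 - 153600 * day) for day in range(5)]
print(round(leak_rate_kb_per_day(samples)))  # prints -153600
```

A regression over many samples is less sensitive to short-term cache fluctuations than comparing two snapshots, which matters when the leak is only a few percent of RAM per day.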
Comment 11 Martin Väth 2005-06-05 09:18:48 UTC
Created attachment 60657 [details]
Output of free, proc/meminfo, proc/slabinfo

This is the output after many "emerge"s, when the system is almost swapping
itself to death for no apparent reason.
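
When an attachment includes /proc/slabinfo, a useful first step is to rank the slab caches by the memory they occupy. A minimal sketch, assuming the 2.x slabinfo layout of 2.6 kernels (each line starts with name, active_objs, num_objs, objsize), estimates each cache's footprint as num_objs × objsize; this ignores per-slab overhead, so treat the figures as approximate. The helper names are hypothetical, and reading the live file may require root on later kernels.

```python
def slab_usage(text):
    """Return [(cache_name, approx_bytes)] from slabinfo text, largest first."""
    usage = []
    for line in text.splitlines():
        # Skip the version header, the "# name ..." column header and blanks.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        name = fields[0]
        num_objs = int(fields[2])  # allocated objects, active or not
        objsize = int(fields[3])   # bytes per object
        usage.append((name, num_objs * objsize))
    usage.sort(key=lambda item: item[1], reverse=True)
    return usage


def top_slabs(path="/proc/slabinfo", n=10):
    """Read the live file and return the n largest caches."""
    with open(path) as f:
        return slab_usage(f.read())[:n]
```

Comparing the Slab total in /proc/meminfo against the leak size first tells you whether slab caches can account for the missing memory at all.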
Comment 12 Daniel Drake (RETIRED) gentoo-dev 2005-06-05 09:24:28 UTC
You need to post this to the Linux kernel list like Bruce did.
Comment 13 Martin Väth 2005-06-05 09:59:33 UTC
Somehow my additional comment seems to have been lost, so I am reposting it
(sorry if it now appears twice).
 
 (In reply to comment #12) 
> You need to post this to the Linux kernel list like Bruce did. 
 
I understood what you mean, but as I wrote, the SuSE kernel and the old Gentoo
kernel (from practically the same sources with the same .config) seem to work,
while a kernel freshly compiled under Gentoo does not. So the cause is probably
not in gentoo-sources/vanilla-sources themselves but in their interplay with
Gentoo; to me it is completely mysterious. But if there are no other ideas,
maybe I will write to the kernel list anyway.
 
(In reply to comment #10) 
> It's very unlikely - nothing in userspace can directly cause a kernel memory 
> leak (but then again, you haven't actually posted any numbers, so it might 
> not be the kernel that is leaking...) 
 
I wrote this about the toolchain because the only explanation I can see for the
different behaviour is that something is wrong with the compilation process
itself. But even after re-bootstrapping the toolchain (i.e. re-emerging
linux-headers, gcc, binutils and glibc sufficiently often), a freshly compiled
kernel does not work (and I tried several kernel versions, both older and
newer ones).
 
Concerning the missing data: there are actually two effects which I believe
have the same cause, but I might be wrong:
 
1. The only effect I can provoke deliberately: when compiling certain .cc
files with MAKEOPTS="-j2" and optimizing C*FLAGS, compilation usually dies with
"internal error: killed" (or sometimes processes of other users are killed
instead).
 
2. The other effect happens only after compiling many (~100 or more) C++
projects: the system slows down dramatically with lots of hard disk access and
is often practically dead (the response time for a keypress may be minutes).
The output in comment #11 is from such a situation.
 
If in case 2 the system is not dead, effect 1 happens much more often; that's
why I believe it is actually the same problem.
 
> It's not a fair comparison with suse unless you are running exactly the same 
> kernel on both. Are you? 
 
I did not want to compare; I simply have no explanation. SuSE's kernel and the
old Gentoo kernel (which I have now lost due to a stupid mistake) were the only
"working" kernels that did not show effect 1. Instead, they start swapping at
about the same point during compilation where the newly compiled kernels (older
and newer) would usually start killing random processes.
 
Comment 14 Daniel Drake (RETIRED) gentoo-dev 2005-06-05 10:17:12 UTC
Regardless of which distro you see a leak on, if the latest unmodified
development kernel (vanilla-sources-2.6.12_rc5) is leaking, then it is a kernel
bug. It may be triggered by a scenario present in Gentoo but not in SUSE, but no
user-space program should be able to make the kernel leak (if one can, that is
itself a kernel bug). A big leak that can be triggered from user space is
usually regarded as a DoS (denial of service) vulnerability, because a standard
user account can easily bring down the box.
Comment 15 Martin Väth 2005-06-18 02:55:01 UTC
I found the main cause: the nvidia-kernel module. (The problem also occurred
without X, which is why I had not thought of this cause, but the nvidia module
was listed in /etc/modules.autoload.d and my scripts had always compiled the
module.) The earlier Gentoo and SuSE kernels of course used different
nvidia-kernel versions, which explains the different behaviour.
 
With nvidia-kernel-1.0.7664 the reproducible part of the problem has vanished. 
 
Some memory still seems to disappear, but I currently have no time for further
investigation (and it seems hopeless anyway, since the loss is too slow for
systematic experiments).
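
Since the culprit turned out to be the binary nvidia module, it can be worth checking whether the kernel is tainted before filing a leak report. A minimal sketch, assuming the /proc/sys/kernel/tainted interface, where bit 0 is set once a module with a proprietary license (such as the nvidia driver) has been loaded; other taint bits are deliberately not decoded here, and the helper names are hypothetical.

```python
TAINT_PROPRIETARY_MODULE = 1 << 0  # shown as 'P' in oops/taint strings


def read_taint(path="/proc/sys/kernel/tainted"):
    """Return the kernel taint bitmask as an integer."""
    with open(path) as f:
        return int(f.read().strip())


def has_proprietary_module(taint):
    """True if a proprietary-licensed module has tainted the kernel."""
    return bool(taint & TAINT_PROPRIETARY_MODULE)
```

On a live system, `has_proprietary_module(read_taint())` answers the question directly. Taint flags are sticky: unloading the module does not clear them until the next reboot, so a positive result means such a module was loaded at some point since boot.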