236237 – gentoo-sources 2.6.25-r7 (and many earlier versions) appear to leak sysfs_dir_cache and size-32 structures

Status:	RESOLVED NEEDINFO

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	All Linux

Importance:	High major
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

Reported:	2008-08-31 00:44 UTC by Stephan Sokolow
Modified:	2008-12-24 18:09 UTC
CC List:	0 users

Attachments
Kernel config, fresh from /proc/config.gz (config.gz, 12.25 KB, text/plain) 2008-08-31 06:34 UTC, Stephan Sokolow

Description Stephan Sokolow 2008-08-31 00:44:44 UTC

I am currently running gentoo-sources 2.6.25-r7 and, since some time during my use of 2.6.23 or 2.6.24 (I don't remember which), I've been experiencing what appears to be a kernel memory leak.

Over the course of a week or two, the memory usage slowly but steadily climbs until my machine starts swapping constantly and it may take over a minute just to get UI responses so I can reboot it. 

Keep in mind that I have 4GiB of RAM which I acquired as part of a (still un-started) plan to experiment with virtualization, so I suspect a more typical machine would have to reboot after less than a week.

Using top and/or free's -/+ buffers/cache line, I can confirm that memory usage remains at the 3GiB+ mark even when I've killed off everything except the bare minimum. (/usr/bin/init, six copies of agetty, a copy of /usr/bin/login, a copy of /bin/sh, and one or two other processes like udevd which I don't know how to safely kill)

Here are the first five entries in the output of slabtop:

    OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
15459451 15459430  99%    0.10K 417823       37   1671292K sysfs_dir_cache
10821639 10821604  99%    0.05K 161517       67    646068K size-32
   58021    52384  90%    0.23K   3413       17     13652K dentry
   48870    46212  94%    0.12K   1629       30      6516K buffer_head
   47285    45109  95%    0.77K   9457        5     37828K ext3_inode_cache
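
To put those figures in perspective, here is a minimal Python sketch that sums the CACHE SIZE column for the two suspect caches. The row data is copied from the slabtop output above; the helper name `parse_rows` is made up for illustration, and the "K" suffix is assumed to mean KiB.

```python
# Rows copied from the slabtop output above (header line omitted).
SLABTOP_ROWS = """\
15459451 15459430  99%    0.10K 417823       37   1671292K sysfs_dir_cache
10821639 10821604  99%    0.05K 161517       67    646068K size-32
   58021    52384  90%    0.23K   3413       17     13652K dentry
   48870    46212  94%    0.12K   1629       30      6516K buffer_head
   47285    45109  95%    0.77K   9457        5     37828K ext3_inode_cache
"""


def parse_rows(text):
    """Return {cache_name: cache_size_in_KiB} for each slabtop data row.

    Hypothetical helper: it assumes exactly eight whitespace-separated
    fields per row, with CACHE SIZE in field 7 and NAME in field 8.
    """
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # skip anything that is not a data row
        cache_size, name = fields[6], fields[7]
        sizes[name] = int(cache_size.rstrip("K"))
    return sizes


sizes = parse_rows(SLABTOP_ROWS)
suspect_kib = sizes["sysfs_dir_cache"] + sizes["size-32"]
suspect_gib = suspect_kib / (1024 * 1024)
print(f"sysfs_dir_cache + size-32: {suspect_kib} KiB ({suspect_gib:.2f} GiB)")
```

Under those assumptions the two caches together hold roughly 2.2 GiB, which accounts for most of the 3GiB+ that stays in use after everything else is killed.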

I'm willing to help however possible, but I don't know where to go from here and most of my programming experience is in Python, bash script, and PHP, so I can't really poke around in the kernel source.

Reproducible: Always

Steps to Reproduce:

Comment 1 Stephan Sokolow 2008-08-31 00:47:35 UTC

Oh, I forgot to mention. I did google around for a while, but beyond enabling CONFIG_DEBUG_SLAB_LEAK and discovering slabtop, I wasn't really able to find anything helpful.

I did discover kmemleak, but the newest patch is for 2.6.20-rc1 and I didn't want to try getting it to apply while I still had safer options. (Given that I have to perform my usual day-to-day activities on this thing)

Comment 2 Wormo (RETIRED) gentoo-dev 2008-08-31 06:07:05 UTC

Wow, that is a huge amount of memory, especially for sysfs_dir_cache

Please post your kernel config, as some driver that you use is the most likely culprit. 

Also, for a debugging strategy: 
Start with your bare minimum of services and use slabtop to make sure memory is not increasing. 
Then start turning services on one by one while keeping an eye on slabtop, and see when usage starts to climb.
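
The strategy above can be sketched as a small Python script that compares two snapshots of /proc/slabinfo and lists the caches that grew in between. This is only an illustration under stated assumptions: the function names (`parse_slabinfo`, `growth`, `watch`) are made up, the parser assumes the usual slabinfo layout where the first column is the cache name and the second is the active object count, and reading /proc/slabinfo typically requires root.

```python
import time


def parse_slabinfo(text):
    """Return {cache_name: active_objs} from the text of /proc/slabinfo.

    Assumes the first column is the cache name and the second column
    is the number of active objects; header lines are skipped.
    """
    counts = {}
    for line in text.splitlines():
        if line.startswith("slabinfo") or line.startswith("#"):
            continue  # version banner and column-header lines
        fields = line.split()
        if len(fields) < 3:
            continue
        counts[fields[0]] = int(fields[1])
    return counts


def growth(before, after):
    """Return [(delta, cache_name), ...] for caches that grew, largest first."""
    deltas = {name: after[name] - before.get(name, 0) for name in after}
    return sorted(((d, n) for n, d in deltas.items() if d > 0), reverse=True)


def watch(interval_seconds=60, path="/proc/slabinfo"):
    """Take two snapshots interval_seconds apart and print what grew."""
    with open(path) as f:
        before = parse_slabinfo(f.read())
    time.sleep(interval_seconds)
    with open(path) as f:
        after = parse_slabinfo(f.read())
    for delta, name in growth(before, after):
        print(f"{name}: +{delta} active objects")
```

Calling `watch()` as root after starting each service gives a per-service view of which caches are climbing; normal activity makes some caches fluctuate, so only steady growth over several runs is meaningful.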

Comment 3 Stephan Sokolow 2008-08-31 06:34:07 UTC

Created attachment 164156
Kernel config, fresh from /proc/config.gz

In addition to this kernel config, I also have the nVidia binary drivers, LIRC, gspca, and the zaptel driver... though the problem has been occurring for a while and the zaptel driver was only added recently.

Comment 4 Stephan Sokolow 2008-08-31 06:35:36 UTC

Oops. Sorry about that. I'm used to bugzilla setups which autodetect the mimetype. You'll have to manually gunzip it.

Comment 5 Stephan Sokolow 2008-08-31 07:55:59 UTC

I've identified one of the triggers for the problem. When I killed sanebuttonsd (from kscannerbuttons in my local overlay), the leak stopped.

However, I know it wasn't the only one because the leak was going on before I added sanebuttonsd, so something I killed before sanebuttonsd must also be triggering the leak. (On the plus side, at least I know that sanebuttonsd is a major contributor to the problem, accounting for exactly 111 leaked sysfs_dir_cache structures per slabtop update interval)
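
As a rough back-of-the-envelope check, the leak rate can be turned into a daily memory figure. The sketch below rests on assumptions that are not stated in this report: it takes the slabtop update interval to be 3 seconds (slabtop's default delay; the interval actually in use is not given here), and it derives the per-slab size from the sysfs_dir_cache row of the slabtop output in the description. The function name `leak_per_day` is made up for illustration.

```python
LEAKED_PER_INTERVAL = 111   # from the comment above
INTERVAL_SECONDS = 3        # ASSUMPTION: slabtop's default delay, not confirmed here
OBJS_PER_SLAB = 37          # sysfs_dir_cache row, OBJ/SLAB column
SLABS = 417823              # sysfs_dir_cache row, SLABS column
CACHE_KIB = 1671292         # sysfs_dir_cache row, CACHE SIZE column

SECONDS_PER_DAY = 24 * 60 * 60


def leak_per_day(objs_per_interval, interval_seconds, objs_per_slab, kib_per_slab):
    """Return (objects_per_day, kib_per_day) for a constant leak rate."""
    objs_per_day = objs_per_interval / interval_seconds * SECONDS_PER_DAY
    kib_per_day = objs_per_day / objs_per_slab * kib_per_slab
    return objs_per_day, kib_per_day


kib_per_slab = CACHE_KIB / SLABS  # size of one slab, derived from the table
objs_per_day, kib_per_day = leak_per_day(
    LEAKED_PER_INTERVAL, INTERVAL_SECONDS, OBJS_PER_SLAB, kib_per_slab
)
print(f"{objs_per_day:.0f} objects/day, about {kib_per_day / 1024:.1f} MiB/day")
```

If the interval assumption holds, 111 objects every 3 seconds is 37 objects per second, i.e. one full sysfs_dir_cache slab per second. A different interval scales the result proportionally.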

I may have next to no experience with C and C++, but I'll see if I can find time to take a look inside sanebuttonsd some time in the next few days. Given how consistently precise the leak rate is, I suspect whatever system call is leaking (or poorly designed, but I hope not because that's a lot harder to get fixed) is being called from inside a polling loop.

Comment 6 Stephan Sokolow 2008-08-31 08:21:05 UTC

Oh, I forgot to mention. I also tried building my kernel with SLUB instead of SLAB a few weeks ago and there was no change in behaviour.

Comment 7 Wormo (RETIRED) gentoo-dev 2008-09-01 06:50:28 UTC

Looks like a related problem was reported against scanbuttond...
http://www.uwsg.iu.edu/hypermail/linux/kernel/0708.2/2879.html

I wonder if you guys are using the same scanner. Apparently it doesn't affect all USB scanners, because Andrew Morton failed to reproduce this bug with his scanner.

Comment 8 Stephan Sokolow 2008-09-01 08:36:03 UTC

No clue what the other guy's using, but I'm using a Canon CanoScan LiDE20 flatbed. (plustek driver)

Of course, there's just as much chance that it's something else which differs between our setups and Andrew Morton's. Definitely one of the more annoying parts of computing technology.

My summer vacation just ended, so I'm not sure how long it'll take me, but I'll see if I can find time to poke around in the scanbuttond source code at some point.

Comment 9 Mike Pagano gentoo-dev 2008-10-19 19:22:59 UTC

Have you seen this happen with gentoo-sources-2.6.27?

Comment 10 Stephan Sokolow 2008-10-19 20:11:16 UTC

I usually wait for gentoo-sources to go stable first, so I'm still on 2.6.25-r7. Also, I disabled sanebuttonsd because I've needed long runtimes without memory leaking.

I'll try to clear some time in the next week or two to test it out.

Comment 11 Daniel Drake (RETIRED) gentoo-dev 2008-11-27 20:37:29 UTC

So you are running scanbuttond as well?
Does the leak stop if you stop running scanbuttond?

Comment 12 Stephan Sokolow 2008-11-28 00:52:21 UTC

I'm not running either at the moment and haven't been since my last reboot. (50 days ago) I value uptime a lot more than scanner buttons.

Comment 13 Robert Lewis 2008-12-03 01:30:05 UTC

(In reply to comment #10)
> I usually wait for gentoo-sources to go stable first, so I'm still on
> 2.6.25-r7. Also, I disabled sanebuttonsd because I've needed long runtimes
> without memory leaking.
> 
> I'll try to clear some time in the next week or two to test it out.
> 

Could you paste your emerge --info or say what arch you are using?

gources-2.6.entoo-s26-r3 is marked as stable under x86 and amd64. Please see if you can reproduce the bug with this kernel.

Comment 14 Robert Lewis 2008-12-03 01:33:33 UTC

(In reply to comment #13)
> (In reply to comment #10)
> > I usually wait for gentoo-sources to go stable first, so I'm still on
> > 2.6.25-r7. Also, I disabled sanebuttonsd because I've needed long runtimes
> > without memory leaking.
> > 
> > I'll try to clear some time in the next week or two to test it out.
> > 
> 
> Could you paste your emerge --info or say what arch you are using?
> 
> gources-2.6.entoo-s26-r3 is marked as stable under x86 and amd64.Please see if
> you can reproduce the bug with this kernel.
> 

That did not come out right for some reason: gentoo-sources-2.6.26-r3

Comment 15 Stephan Sokolow 2008-12-03 02:13:04 UTC

I'm on amd64 stable, and 2.6.26 has been stable for a little while now. However, I messed up my time management and I'm currently rushing to get my assignments in and my materials studied in prep for exams, so the absolute earliest I can allocate time to configure a new kernel and reboot my system is December 19th, and possibly as late as January 1st.

I'll leave the e-mail notification of your request in my inbox as a TODO note and get to it then.

Comment 16 Daniel Drake (RETIRED) gentoo-dev 2008-12-24 18:09:58 UTC

Please reopen when you have time to test the latest kernel, which will be 2.6.28 very soon, or 2.6.29-rc1 in about 2 weeks' time.