I've hitherto held off upgrading to 2.6.9 because, soon after the release,
it became apparent in various circles that there was some weirdness on at
least two counts:
1) Swap thrashing/high CPU usage for no apparent reason
2) OOM killer kicking in where it shouldn't
I also heard (unconfirmed) reports that (depending on the VM overcommit policy
in effect), the kernel could crash hard if physical RAM and swap were
saturated, as opposed to simply killing a memory hogging process.
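For anyone wanting to check which overcommit policy is in effect before comparing notes, here is a minimal Python sketch. It assumes the usual /proc/sys/vm/overcommit_memory and overcommit_ratio files are present; the function name and the proc_vm parameter are made up for illustration.

```python
from pathlib import Path

# Meaning of the values in overcommit_memory, per the kernel's VM sysctl docs.
MODES = {
    0: "heuristic overcommit (default)",
    1: "always overcommit",
    2: "strict accounting (no overcommit)",
}


def read_overcommit(proc_vm="/proc/sys/vm"):
    """Return (mode, ratio, description) for the VM overcommit policy.

    proc_vm is a parameter only so the function can be pointed at a
    different directory for testing; it is not a kernel interface.
    """
    base = Path(proc_vm)
    mode = int((base / "overcommit_memory").read_text().strip())
    ratio = int((base / "overcommit_ratio").read_text().strip())
    return mode, ratio, MODES.get(mode, "unknown mode")
```

Calling read_overcommit() on the affected box and quoting the result alongside a crash report would at least show whether the hard crashes correlate with a particular policy.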
As for the two primary issues, I've seen various reports in both the Gentoo
Forums and the -ck mailing list (not related to -ck itself). Then, in the
2.6.9-ck2 announcement, Con Kolivas pointed out this patch:
A nasty bug that caused kswapd to get stuck consuming heaps of cpu which
was in mainline 2.6.9 was tracked down by some of my users (thanks!) and
fixed by Nick Piggin (thanks!).
That patch is available here: http://ck.kolivas.org/patches/2.6/2.6.9/2.6.9-ck2/patches/vm-pages_scanned-active_list.patch
I am less certain as to the precise situation with the OOM killer, but I know of one person who was experiencing a consistent (and unmerited) OOM condition when trying to build UML under 2.6.9 (vanilla), which did not occur under 2.6.8.1. I noticed that Alan Cox is back in action on his -ac patchset (providing "Correct fixes for real problems" as he puts it). Someone kindly split the patches for 2.6.9-ac4 out here:
and the 2.6.9-oom-kill-fix.patch file looks interesting ;) For that matter, _all_ the patches in -ac look interesting (aic-7xxx fix being of particular interest to me) ... perhaps the g-d-s maintainers might consider taking a closer look?
In any case, my main concern is that Nick Piggin's patch makes it into g-d-s if possible.
I'm currently waiting for the patch to make it into the upstream 2.6.10 tree, then I'll add it to our patchset. It hasn't been applied by Linus yet. However, a patch has been applied which looks like it might be the same fix done in a different way. Perhaps you could revert the one you posted and see if this one helps:
It was merged earlier today. Will include in future gentoo-dev-sources release.
Thank you very much, both for the rapid response and the heads-up. I notice that
you have not marked the bug as closed; if you discover any more information prior
to closure that could be relevant to the issues raised here, I would be most
grateful if you could post again on this bug (time, energy and inclination permitting, of course), as it is of great interest to me, at least ;). Cheers.
It will be closed once we release a new gentoo-dev-sources version containing this patch.
I'm using 2.6.9-gentoo-r8 now and still have this problem.
Especially on a notebook this is a nasty problem.
Don't know if this kernel is already patched.
Can you please define "this problem"? There are a few mentioned on this bug.
I've a Dell Inspiron 8000 Notebook with Gentoo on it.
Unfortunately, I need some M$ Win programs so I've installed vmware on it.
Most times I start this virtual machine, the kswapd0 process takes up a lot of
CPU. The RAM is not fully used. I've appended the head of the top output below.
Current kernel is 2.6.9-gentoo-r8
top - 17:19:50 up  2:03,  4 users,  load average: 1.21, 1.10, 1.09
Tasks: 126 total,   2 running, 124 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.2% us, 92.6% sy,  0.0% ni,  2.4% id,  1.7% wa,  0.1% hi,  0.1% si
Mem:    514552k total,   469580k used,    44972k free,     7340k buffers
Swap:   987988k total,    36184k used,   951804k free,   360164k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   38 root      25   0     0    0    0 R 95.2  0.0 111:36.02 kswapd0
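To back up the "RAM is not fully used" remark with numbers, here is a rough Python sketch of the arithmetic on the figures above. It treats buffers and page cache as reclaimable, which is only an approximation (not all cached pages can be dropped), and the function name is just for illustration.

```python
def roughly_available_kb(free_kb, buffers_kb, cached_kb):
    """Approximate memory the kernel could hand out without swapping.

    Assumption: buffers and page cache are counted as reclaimable,
    which overstates things somewhat on a real system.
    """
    return free_kb + buffers_kb + cached_kb


# Figures copied from the Mem:/Swap: lines of the top output above.
total_kb = 514552
available_kb = roughly_available_kb(free_kb=44972, buffers_kb=7340, cached_kb=360164)

print(f"{available_kb}k of {total_kb}k is free or reclaimable")
```

That comes to 412476k, so most of the 469580k reported as "used" is cache, which makes kswapd0 sitting at 95% CPU look all the more suspicious.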
Could you please test development-sources-2.6.10-rc2 and see if the problem exists there?
Firstly, I wonder if you're using any experimental kernel features such as 4k
stacks or "Use register arguments". Not that I know of any possible side effect,
but 4k stacks in particular change the way in which the VM works. With
proprietary software such as vmware, it's best to stick to a "regular"
configuration for the testing case.
The sources do include the patch mentioned in this bug. Can you confirm that it
is a problem that (1) does not occur in 2.6.8.1 (2) does *or* doesn't occur in
I've started using Alan Cox's 2.6.9 branch as a basis for my kernels because he
seems to be focussing on bug fixing/stabilisation in general. I'd be interested
to know if it happens in 2.6.9-ac11 also (2.6.9-ac12 is experimental by his
Perhaps, if it transpires that it does not occur in one of the other (newer)
branches, it might be worth trying to isolate the change that fixes the problem
and backporting it. Then again, maybe it's one of those corner cases and you
might be better off just waiting for the situation to settle (and using 2.6.8.1
in the meantime).
Another suggestion is to try using the "mapped watermark" patches from the 2.6.9
-ck set, which seem to regulate swap usage pretty effectively (at least for
desktop systems). It's been a while since I've used vmware but I recall that it
stresses the system very hard! It may or may not help.
One other thing: I noticed before that if you're not using a real partition for a host's virtual disk, then vmware seems to be quite sensitive to the filesystem being used. In particular, it really seems to stink with reiserfs! I'm aware that that shouldn't pertain to the swap issue but thought it worthy of mention.
Did some testing with different kernels.
First of all, let me say that I do not use the 4k stack option.
The problem does NOT occur with 2.6.8.1
The problem still exists with 2.6.10-rc3
Haven't tried the "watermark patches". Will try to do so next week if possible.
vmware is NOT used with its own partition! So comment #12 might be an issue.
I decided to do so because of easier backups...
As this issue is also in upstream, there is nothing we can do here in the gentoo tree.
Please open a bug at bugzilla.kernel.org for this.
Andre: Perhaps you could try this patch
I think it solves the issue you are describing
Daniel: thanks - that patch is good! I took your gentoo-dev-sources-2.6.9-r12
release and added 5 good patches that were applied upstream at some point or
another, with the exception of the first:
* The "1G lowmem" patch from -ck (well, I have exactly 1G RAM).
* The aforementioned "include total_scanned" patch from Andrew Morton.
* A fix from Jens Axboe to prevent blk_recalc_rq_segments from indulging in bad
segment coalescing (due to not taking ->max_segment_size into account).
* A fix from Arjan van de Ven to change the "hysteresis" for the queue
congestion to be an additional 1/16th of the number of requests.
* A fix from Marcelo Tosatti to limit the amount of memory which is under
pageout writeout to be a little more than the amount of memory at which
balance_dirty_pages() callers will synchronously throttle. Apparently, this prevents a simple dd operation from driving the system nuts.
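To make the 1/16th hysteresis in Arjan's fix a bit more concrete, here is a rough Python sketch of the arithmetic as I understand it. Only the "additional 1/16th of the number of requests" gap comes from the description above; the function name, the 7/8 "congested" threshold and the off-by-one adjustments are my assumptions for illustration and are not taken from the patch itself.

```python
def congestion_thresholds(nr_requests):
    """Return (on, off) request counts for a queue of nr_requests slots.

    Assumed model: the queue is flagged congested once roughly 7/8 of
    the slots are in use, and the flag is only cleared again after usage
    drops a further 1/16 of nr_requests below that point.
    """
    # Assumed "congestion on" point: about 7/8 full, capped at queue size.
    on = min(nr_requests - nr_requests // 8 + 1, nr_requests)
    # "Congestion off" point: an additional 1/16 lower, never below 1.
    off = max(nr_requests - nr_requests // 8 - nr_requests // 16 - 1, 1)
    return on, off


on, off = congestion_thresholds(128)
print(f"congested at {on} requests, uncongested again at {off}")
```

With a 128-slot queue this model flags congestion at 113 requests and clears it at 103, so the queue has to drain by about ten requests before writers are let loose again, rather than flapping around a single threshold.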
Despite 2.6.9 being the only 2.6 kernel ever to cause a catastrophic crash here
(on the first occasion that I tried it), I took the plunge and rebooted my main
server with this kernel. It's not been up for long yet, hence I am still
keeping a close eye on things. Nonetheless, performance is great and none of
the usual oddities that I have come to associate with 2.6.9 have made
themselves apparent - particularly with respect to the numerous OOM/swap/VM
issues that have been under broad discussion of late.
Having said that, later posts in the thread that you linked to still hint at
problems under certain circumstances. One can only hope that they've been
adequately resolved in the newly released 2.6.10 ;) Anyway, thanks for your
insights - Merry Christmas.
The pages_scanned fix is now in Linus' tree and will be included in gentoo-dev-sources-2.6.10-r3
Splendid news and not before time I might add ;)
Thanks for the update.