69076 – gentoo-dev-sources-2.6.9 does not rectify nasty VM/kswapd issue in mainline

Status:	VERIFIED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	All Linux

Importance:	High major
Assignee:	Daniel Drake (RETIRED)

URL:	http://ck.kolivas.org/patches/2.6/2.6...
Whiteboard:
Keywords:	InVCS

Depends on:
Blocks:

Reported:	2004-10-26 15:30 UTC by kfm
Modified:	2005-01-04 01:12 UTC
CC List:	2 users

See Also:
Package list:
Runtime testing required:	---

Description kfm 2004-10-26 15:30:34 UTC

Hi,

I've hitherto held off upgrading to 2.6.9 because, soon after the release,
it became apparent in various circles that there was some weirdness on two
counts that I know of:

1) Swap thrashing/high CPU usage for no apparently good reason
2) OOM killer kicking in where it shouldn't

I also heard (unconfirmed) reports that (depending on the VM overcommit policy
in effect), the kernel could crash hard if physical RAM and swap were
saturated, as opposed to simply killing a memory hogging process.
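
For reference, the overcommit policy mentioned above is exposed on Linux through /proc/sys/vm/overcommit_memory. The following is only a minimal Python sketch for reading and interpreting it; the helper names are illustrative and are not taken from any of the reports.

```python
from pathlib import Path

# Meaning of each value of the vm.overcommit_memory sysctl.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def describe_overcommit(raw):
    """Translate the raw sysctl contents (for example '0') into a description."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def current_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Read the live setting; only meaningful on a Linux host."""
    return describe_overcommit(Path(path).read_text())
```

Calling current_overcommit() on the affected machine would show which policy was in effect when a crash or OOM kill was observed.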

With regard to the two primary issues, I've seen various reports in both the
Gentoo Forums and the -ck mailing list (not related to -ck itself). Then, in
the 2.6.9-ck2 announcement, Con Kolivas pointed out this patch:

    +vm-pages_scanned-active_list.patch
  A nasty bug that caused kswapd to get stuck consuming heaps of cpu which
  was in mainline 2.6.9 was tracked down by some of my users (thanks!) and
  fixed by Nick Piggin (thanks!).

That patch is available here: http://ck.kolivas.org/patches/2.6/2.6.9/2.6.9-ck2/patches/vm-pages_scanned-active_list.patch

I am less certain as to the precise situation with the OOM killer, but I know of one person who was experiencing a consistent (and unmerited) OOM condition when trying to build UML under 2.6.9 (vanilla) which did not occur under 2.6.8.1. I noticed that Alan Cox is back in action on his -ac patchset (providing "Correct fixes for real problems" as he puts it). Someone kindly split the patches for 2.6.9-ac4 out here:

  http://kem.p.lodz.pl/~peter/2.6.9-ac/

and the 2.6.9-oom-kill-fix.patch file looks interesting ;) For that matter, _all_ the patches in -ac look interesting (aic-7xxx fix being of particular interest to me) ... perhaps the g-d-s maintainers might consider taking a closer look?

In any case, my main concern is that Nick Piggin's patch makes it into g-d-s if possible.

Comment 1 Daniel Drake (RETIRED) gentoo-dev

2004-10-26 15:50:45 UTC

I'm currently waiting for the patch to make it into the upstream 2.6.10 tree, then I'll add it to our patchset. It hasn't been applied by Linus yet. However, a patch has been applied which looks like it might be the same fix done in a different way. Perhaps you could revert the one you posted and see if this one helps:

http://linux.bkbits.net:8080/linux-2.6/diffs/mm/vmscan.c@1.231?nav=index.html|src/|src/mm|hist/mm/vmscan.c

Comment 2 Daniel Drake (RETIRED) gentoo-dev

2004-10-28 12:56:59 UTC

http://linux.bkbits.net:8080/linux-2.6/cset@1.2263

It was merged earlier today. Will include in future gentoo-dev-sources release.

Comment 3 kfm 2004-10-29 06:09:53 UTC

Thank you very much, both for the rapid response and the heads-up. I notice that
you have not marked the bug as closed; if you discover any more information prior
to closure that could be relevant to the issues raised here, I would be most
grateful if you could post again on this bug (time, energy and inclination
permitting, of course), as it is of great interest to me, at least ;) Cheers.

Comment 4 Daniel Drake (RETIRED) gentoo-dev

2004-10-29 10:57:44 UTC

It will be closed once we release a new gentoo-dev-sources version containing this patch.

Comment 5 Daniel Drake (RETIRED) gentoo-dev

2004-10-31 13:44:44 UTC

In gentoo-dev-sources-2.6.9-r2

Comment 6 kfm 2004-11-02 07:44:31 UTC

Thanks, Daniel.

Comment 7 Andre Hinrichs 2004-12-03 07:51:24 UTC

I'm using 2.6.9-gentoo-r8 now and still have this problem.
Especially on a notebook this is a nasty problem.
Don't know if this kernel is already patched.

Comment 8 Daniel Drake (RETIRED) gentoo-dev

2004-12-03 07:59:35 UTC

Can you please define "this problem"? There are a few mentioned on this bug.

Comment 9 Andre Hinrichs 2004-12-03 08:25:43 UTC

Sure.
I've a Dell Inspiron 8000 Notebook with Gentoo on it.
Unfortunately, I need some M$ Win programs so I've installed vmware on it.
Most times I start this virtual machine, the kswapd0 process takes lots of CPU
load. The RAM is not fully used. I've added the head of the top output at the end.
Current kernel is 2.6.9-gentoo-r8.


top - 17:19:50 up  2:03,  4 users,  load average: 1.21, 1.10, 1.09
Tasks: 126 total,   2 running, 124 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.2% us, 92.6% sy,  0.0% ni,  2.4% id,  1.7% wa,  0.1% hi,  0.1% si
Mem:    514552k total,   469580k used,    44972k free,     7340k buffers
Swap:   987988k total,    36184k used,   951804k free,   360164k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   38 root      25   0     0    0    0 R 95.2  0.0 111:36.02 kswapd0
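
The figures in the output above can be checked with a few lines of Python. This is a rough sketch rather than a diagnostic tool: the function names are made up for illustration, and treating free + buffers + cached as "memory that could be freed cheaply" is a simplifying assumption.

```python
import re

# Summary lines copied from the top output above.
SAMPLE = """\
Cpu(s):  3.2% us, 92.6% sy,  0.0% ni,  2.4% id,  1.7% wa,  0.1% hi,  0.1% si
Mem:    514552k total,   469580k used,    44972k free,     7340k buffers
Swap:   987988k total,    36184k used,   951804k free,   360164k cached
"""


def parse_summary(text):
    """Parse the Cpu(s), Mem and Swap summary lines into nested dicts."""
    summary = {}
    for line in text.splitlines():
        label, _, rest = line.partition(":")
        key = label.strip().lower()
        if key.startswith("cpu"):
            pairs = re.findall(r"([\d.]+)%\s*(\w+)", rest)
            summary["cpu"] = {name: float(value) for value, name in pairs}
        elif key in ("mem", "swap"):
            pairs = re.findall(r"(\d+)k\s+(\w+)", rest)
            summary[key] = {name: int(value) for value, name in pairs}
    return summary


def easily_freed_kib(summary):
    """Free memory plus buffers plus page cache (top prints 'cached' on the Swap line)."""
    mem, swap = summary["mem"], summary["swap"]
    return mem["free"] + mem["buffers"] + swap["cached"]


def cpu_time_seconds(time_plus):
    """Convert top's TIME+ field (minutes:seconds.hundredths) to seconds."""
    minutes, seconds = time_plus.split(":")
    return int(minutes) * 60 + float(seconds)


summary = parse_summary(SAMPLE)
print(summary["cpu"]["sy"])        # system CPU time, in percent
print(easily_freed_kib(summary))   # KiB that are free, buffers or cache

uptime_seconds = 2 * 3600 + 3 * 60  # "up 2:03" in the header line
print(round(cpu_time_seconds("111:36.02") / uptime_seconds, 3))
```

For this sample, free + buffers + cached comes to 412476k of 514552k, and kswapd0 has accumulated CPU time equal to roughly 91% of the uptime, which is consistent with the report that it spins although RAM is not exhausted.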

Comment 10 Daniel Drake (RETIRED) gentoo-dev

2004-12-03 08:40:05 UTC

Could you please test development-sources-2.6.10-rc2 and see if the problem exists there?

Comment 11 kfm 2004-12-03 08:47:51 UTC

Firstly, I wonder if you're using any experimental kernel features such as 4k
stacks or "Use register arguments". Not that I know of any possible side effect,
but 4k stacks in particular change the way in which the VM works. With
proprietary software such as vmware, it's best to stick to a "regular"
configuration for the testing case.
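
One way to check for the two features mentioned above is to scan the kernel .config. The sketch below assumes the options are named CONFIG_4KSTACKS and CONFIG_REGPARM (believed to be the symbols behind "4k stacks" and "Use register arguments" on i386, but worth verifying against the kernel tree in use); the function name and the sample config are purely illustrative.

```python
def enabled_options(config_text, options=("CONFIG_4KSTACKS", "CONFIG_REGPARM")):
    """Return the subset of `options` that are set to 'y' in a kernel .config."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name in options and value == "y":
            enabled.add(name)
    return enabled


# Hypothetical fragment of a .config, for illustration only.
example = """\
# CONFIG_4KSTACKS is not set
CONFIG_REGPARM=y
CONFIG_HIGHMEM4G=y
"""
print(sorted(enabled_options(example)))
```

On a real system the text would be read from the .config of the running kernel's source tree instead of the sample string.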

The sources do include the patch mentioned in this bug. Can you confirm that it
is a problem that (1) does not occur in 2.6.8.1 (2) does *or* doesn't occur in
2.6.10-rc2?

I've started using Alan Cox's 2.6.9 branch as a basis for my kernels because he
seems to be focussing on bug fixing/stabilisation in general. I'd be interested
to know if it happens in 2.6.9-ac11 also (2.6.9-ac12 is experimental by his
standards).

Perhaps, if it transpires that it does not occur in one of the other (newer)
branches, it might be worth trying to isolate the change that fixes the problem
and backporting it. Then again, maybe it's one of those corner cases and you
might be better off just waiting for the situation to settle (and using 2.6.8.1
in the meantime).

Another suggestion is to try using the "mapped watermark" patches from the
2.6.9-ck set, which seem to regulate swap usage pretty effectively (at least for
desktop systems). It's been a while since I've used vmware but I recall that it
stresses the system very hard! It may or may not help.

---

http://ck.kolivas.org/patches/2.6/2.6.9/2.6.9-ck3/patches/mwII.diff
http://ck.kolivas.org/patches/2.6/2.6.9/2.6.9-ck3/patches/mwII-oc.diff

Comment 12 kfm 2004-12-03 08:51:53 UTC

One other thing: I noticed before that if you're not using a real partition for a host's virtual disk, then vmware seems to be quite sensitive to the filesystem being used. In particular, it really seems to stink with reiserfs! I'm aware that that shouldn't pertain to the swap issue but thought it worthy of mention.

Comment 13 Andre Hinrichs 2004-12-05 07:32:49 UTC

Did some testing with different kernels.
First of all, let me say that I do not use the 4k stack option.
The problem does NOT occur with 2.6.8.1.
The problem still exists with 2.6.10-rc3.
Haven't tried the "watermark patches". Will try to do so next week if possible.

vmware is NOT used with its own partition! So comment #12 might be an issue.
I decided to do so because of easier backups...

Comment 14 Greg Kroah-Hartman (RETIRED) gentoo-dev

2004-12-06 09:22:59 UTC

As this issue is also present upstream, there is nothing we can do here in the Gentoo tree.

Please open a bug at bugzilla.kernel.org for this.

Comment 15 Daniel Drake (RETIRED) gentoo-dev

2004-12-23 05:15:14 UTC

Andre: Perhaps you could try this patch
http://marc.theaimsgroup.com/?l=linux-kernel&m=110357628419245&w=2

I think it solves the issue you are describing

Comment 16 kfm 2004-12-24 18:05:50 UTC

Daniel: thanks - that patch is good! I took your gentoo-dev-sources-2.6.9-r12
release and added 5 good patches that were applied upstream at some point or
another, with the exception of the first:

* The "1G lowmem" patch from -ck (well, I have exactly 1G RAM).

* The aforementioned "include total_scanned" patch from Andrew Morton.

* A fix from Jens Axboe to prevent blk_recalc_rq_segments from indulging in bad
segment coalescing (due to not taking ->max_segment_size into account).

* A fix from Arjan van de Ven to change the "hysteresis" for the queue
congestion to be an additional 1/16th of the number of requests.

* A fix from Marcelo Tosatti to limit the amount of memory which is under
pageout writeout to be a little more than the amount of memory at which
balance_dirty_pages() callers will synchronously throttle. Apparently, this
prevents a simple dd operation from driving the system nuts.
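
To illustrate what "hysteresis" means for the queue congestion fix listed above, here is a small Python sketch. Only the extra 1/16th margin comes from the description; the 1/8 base threshold, the +1/-1 offsets and all names are assumptions made for the sake of the example and are not the kernel's actual code.

```python
def congestion_thresholds(nr_requests):
    """Return (on, off) queue depths; the constants are illustrative assumptions."""
    # Mark the queue congested once it is roughly 7/8 full.
    on = min(nr_requests, nr_requests - nr_requests // 8 + 1)
    # Only clear the flag after draining an additional 1/16th of the requests.
    off = max(1, nr_requests - nr_requests // 8 - nr_requests // 16 - 1)
    return on, off


def track_congestion(depths, nr_requests):
    """Replay a series of queue depths and report the congested flag after each."""
    on, off = congestion_thresholds(nr_requests)
    congested = False
    states = []
    for depth in depths:
        if not congested and depth >= on:
            congested = True
        elif congested and depth < off:
            congested = False
        states.append(congested)
    return states


print(congestion_thresholds(128))
print(track_congestion([100, 113, 110, 104, 102, 112], 128))
```

With 128 requests the sketch turns congestion on at 113 and off below 103, so a queue hovering around the upper threshold does not flip the flag on every request.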

Despite 2.6.9 being the only 2.6 kernel ever to cause a catastrophic crash here
(on the first occasion that I tried it), I took the plunge and rebooted my main
server with this kernel. It's not been up for long yet, hence I am still
keeping a close eye on things. Nonetheless, performance is great and none of
the usual oddities that I have come to associate with 2.6.9 have made
themselves apparent - particularly with respect to the numerous OOM/swap/VM
issues that have been under broad discussion of late.

Having said that, later posts in the thread that you linked to still hint at
problems under certain circumstances. One can only hope that they've been
adequately resolved in the newly released 2.6.10 ;) Anyway, thanks for your
insights - Merry Christmas.

Comment 17 Daniel Drake (RETIRED) gentoo-dev

2005-01-03 23:44:30 UTC

The pages_scanned fix is now in Linus' tree and will be included in gentoo-dev-sources-2.6.10-r3

Comment 18 kfm 2005-01-04 01:12:23 UTC

Splendid news and not before time I might add ;)

Thanks for the update.