Bug 285901 - System unusable during emerges / makes
Summary: System unusable during emerges / makes
Status: RESOLVED WORKSFORME
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: All Linux
Importance: High normal
Assignee: Gentoo Linux bug wranglers
URL: many
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-09-22 06:24 UTC by Robert Bradbury
Modified: 2009-10-07 20:16 UTC
CC List: 1 user

See Also:
Package list:
Runtime testing required: ---


Attachments
emerge --info (EmrgInfo.lst,4.10 KB, text/plain)
2009-09-22 07:01 UTC, Robert Bradbury

Description Robert Bradbury 2009-09-22 06:24:14 UTC
Large emerges (firefox, glibc, gcc, openoffice (which can take a full day), etc.) that involve many small compile steps make the system effectively unusable from an end-user standpoint.

User programs such as firefox become unresponsive, and semi-real-time programs (mplayer, alsaplayer) become "jumpy" (they drop video frames or stutter).  This happens *even* when the emerges (and therefore the make and compile sequences) are run at "nice -19".

Reproducible: Always

Steps to Reproduce:
1. emerge any-large-package (taking 10 minutes to a day of full CPU use)
2. Watch user programs become either painful to use or essentially unusable.

Actual Results:  
User programs become unusable.  For example, scroll bars don't work in Firefox, typing in forms (e.g. the bug entry form) echoes characters only after multi-second delays, alsaplayer stutters on simple mp3 files, mplayer starts dropping frames when playing a TV stream from a PVR-150, etc.

Expected Results:  
The system should be as usable from a user-interface / user-program standpoint when running emerges / makes as it is under non-load conditions.

This is a 2.8 GHz Pentium IV system with 3GB of main memory and the system "spread out" (with respect to caches, swap partitions, temp directories, etc.) over 3-4 hard drives.  I can watch the gnome system monitor and see that there is little disk I/O, little network I/O (except when distribution files are downloading), and memory still available.  CPU use is ~100%, but 45-55% of it (as reported by the monitor and top) is "niced" CPU usage (i.e. make/gcc).  IMO, this is entirely a latency / scheduling problem.  Since Gentoo Linux is commonly used as a Development+Applications system, the "out-of-the-box" configuration should be able to handle this situation properly.

I posed this problem on the LKML and got some responses I didn't understand about CPU scheduling "classes", which I had never encountered before under UNIX/Linux (and which do not appear to be a default part of the Gentoo emerge system).
Comment 1 Justin Lecher (RETIRED) gentoo-dev 2009-09-22 06:27:33 UTC
Please attach the output of emerge --info.

Did you ever try using:

PORTAGE_IONICE_COMMAND="ionice -c 3 -p \${PID}" 
PORTAGE_NICENESS="19"

in your make.conf?
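
For builds that are started by hand rather than through Portage (a plain make, for instance), roughly the same effect could be sketched by prefixing the command with nice and ionice. This is only an illustration in Python: `low_priority_argv` is a made-up helper name, not part of Portage, and it assumes the `nice` and `ionice` utilities are installed.

```python
def low_priority_argv(cmd, niceness=19, io_class=3):
    """Prefix cmd so that it runs at the given niceness and I/O class.

    io_class 3 is ionice's "idle" class, mirroring `ionice -c 3` above.
    Hypothetical helper for illustration only.
    """
    return ["nice", "-n", str(niceness), "ionice", "-c", str(io_class), *cmd]


# Possible use (the build command is a placeholder):
#   import subprocess
#   subprocess.run(low_priority_argv(["make"]), check=False)
```

Unlike the make.conf settings, which Portage applies to itself, this wraps one command at launch, so every child process it spawns inherits the same niceness and I/O class.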
Comment 2 Robert Bradbury 2009-09-22 07:01:56 UTC
Created attachment 204886
emerge --info

As an example of how slow the system is: during a seamonkey emerge, with CPU use @ 100% but 50+% of it niced, it takes 30-40 real-time seconds to switch workspaces and redraw an active firefox window.  It takes 30-40 seconds to bring up a "Browse..." sub-window to select the emerge --info file, etc.  One "paradox" is that when the system is in this state, gnome-terminal windows and their shells are still fairly responsive (delays of only a few seconds), so one wonders whether this is a working set / cache size switching problem that is inherent in the Intel architecture.

It might be useful for people with multi-core machines to see if they run into this problem (e.g. emerge seamonkey and try to use firefox at the same time), and then try the same experiment with only a single core/cache enabled.  Feedback from someone with an older single-core ARM/VIA/non-Intel netbook/notebook might be helpful as well.
Comment 3 Robert Bradbury 2009-09-22 08:01:06 UTC
Justin, I run all of my nightly emerges as well as independent CVS firefox makes at nice -19 (in the shell script which runs the actual commands).  This is confirmed by "top", which shows all cc1plus, as and gmake commands with NI=19 and PR=39.  It is also confirmed by the gnome-panel system-monitor Processor sub-window, which shows 50-60% of the CPU in nice, 20-30% in user (Xorg, Gnome, firefox and all of the other "normal" system & user daemons, etc.) and the remainder in system (kernel) mode.
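
The same check can be made without top. The following is a minimal sketch, assuming Linux and Python 3; `top_pr` and `describe` are illustrative names, and the `20 + NI` relation applies only to normal (non-real-time) tasks, which matches the NI=19 / PR=39 pair quoted above.

```python
import os

# Names for the Linux scheduling policies exposed by the os module.
POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_BATCH: "SCHED_BATCH",
    os.SCHED_IDLE: "SCHED_IDLE",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
}


def top_pr(nice_value):
    """PR column that top shows for a normal (non-real-time) task."""
    return 20 + nice_value


def describe(pid):
    """Return the nice value and scheduling policy name of a process."""
    policy = os.sched_getscheduler(pid)
    return {
        "nice": os.getpriority(os.PRIO_PROCESS, pid),
        "policy": POLICY_NAMES.get(policy, f"unknown ({policy})"),
    }
```

Calling `describe()` on the PID of a running cc1plus would show whether the compiler really is at nice 19, and whether it is still in the default SCHED_OTHER policy.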

Generally disk I/O is minimal over the long period these builds take. Almost all programs, libraries, directories, etc. are cached in system buffers or locked in shared memory (3GB is a *lot* for "typical" system usage).  I will try the PORTAGE_IONICE_COMMAND for future builds, but I doubt it will solve problems that seem to be related to CPU latency / cache refilling / page sets / the kernel scheduler.  It's very simple: the keyboard, the mouse and the active desktop window (the X server and program(s) attached to that window) should have *absolute* CPU / swap / disk priority over any other processes except critical system daemons (e.g. syslog-ng).  This could be extended to pseudo-real-time programs such as mplayer playing music or displaying a video (TV) stream. The emerges and related sub-processes should only run when there is real "idle" time.

I don't care if my emerges finish at 4 AM or 6 AM.  I do care (a lot) if I have to sit and sip coffee while waiting for the gentoo bug entry form to display the multiple lines of characters I've typed ahead.  When a system cannot echo characters at the rate that I type, there is a real response time problem.

Note, the type-ahead problem does not seem to exist in gnome-terminal; it seems to be a firefox "feature" (but it does not occur when I'm not emerging a package).
Comment 4 Robert Bradbury 2009-09-22 08:28:44 UTC
Side note: the PORTAGE_IONICE_COMMAND may impact a revdep-rebuild load as that is relatively I/O intensive and did kick in (due to cron schedules) towards the tail end of my seamonkey emerge.

Following the completion of the seamonkey emerge AND the revdep-rebuild, the system returned to a 7-20% CPU idle state and Firefox + mplayer returned to "normal" reasonably responsive operation.

Note that several years ago I glanced at an IBM technical report that seemed to suggest that a flaw existed in Linux scheduling through which clever programming could bypass system nice / priority restrictions.  As I recall, it had to do with programs executing within a single (or only a few) clock ticks (scheduler slice?) before "niceness" could take "effect".  In this situation the only solution would be to attempt to decrease the kernel scheduling time slice (so "niced" programs can't effectively become "unniced").  I am worried that high clock rate CPUs (e.g. 2.8 GHz Pentium IVs or overclocked gaming machines) might be running into this problem, in that an "in memory" compile might run to completion before Linux scheduling (based on 1/60 - 1/100 sec slices?) kicks in with respect to "niceness".  A large number of fast compiles (typical of large emerges) could effectively run "unniced".  Unfortunately I've been unable to locate said technical report.
Comment 5 Justin Lecher (RETIRED) gentoo-dev 2009-09-22 09:03:35 UTC
Hi Robert,
I am really sorry, but I cannot see the bug in your report. I can understand that this behaviour really sucks, but as long as you cannot point to an exact problem, we cannot help you.
I suggest inspecting your hardware (your problems will likely come from that), trying more generic kernels (different schedulers etc.), benchmarking your system (to check whether CPU load or I/O load produces the bad responsiveness) and perhaps upgrading your hardware.
If you find the problem and it is related to the packages we provide, please report back and reopen the bug.
Comment 6 Robert Bradbury 2009-09-22 19:40:39 UTC
Justin, did you try the test I suggested?  I.e. a large emerge/reemerge (e.g. seamonkey) while at the same time trying to enter a bug report in Firefox and play music via mplayer/alsaplayer?

As I attempted to state, the last two activities work *fine* on an unloaded to moderately loaded system.  They only cease to work fine on a loaded system with 100% CPU use (40-60% of that use being "niced").

If you think it is a Linux configuration problem then please state your specific hardware and attach your kernel .config file.  I would also appreciate it if you could ask some of the other Gentoo developers with configurations similar to mine (or those I suggest as helpful from a diagnostic standpoint) to run this test.

I will do my part: at some point I will attempt to boot the same hardware under Ubuntu and see if the same problems are encountered (a gentoo emerge might be tricky, but "making" a firefox source distribution is trivial and will max out the CPU for an hour or more).  If the problems still occur with Ubuntu, then it is a kernel architecture [1] and/or hardware problem.  If it turns out to be a kernel scheduler problem or misbehavior of "nice", then the appropriate Gentoo response would be to update any Linux configuration documentation to make note of this and/or distribute the kernel sources with patches that prevent default kernel configurations from exhibiting this behavior.

1. It isn't a "recent" Linux kernel configuration problem, as I've noticed the situation for months/years (through 4+ 2.6.N kernel releases) and generally simply tolerated it (read: didn't use the system when emerges were taking place).
Comment 7 Justin Lecher (RETIRED) gentoo-dev 2009-09-22 19:56:59 UTC
Sorry Robert, I have three boxes in daily use: a netbook, a hyperthreaded Pentium D and a Quad. On none of those have I ever had problems like yours. And I don't use any niceness settings in the make.conf. I would suggest you join #gentoo-portage or #gentoo-bugs on freenode and talk to zmedico about that.
I can assign the bug to the portage team, but they will close it as well.
Comment 8 Jeremy Olexa (darkside) (RETIRED) archtester gentoo-dev Security 2009-09-22 20:06:31 UTC
(In reply to comment #6)

> that it is a kernel scheduler or misbehavedness of "nice" problem, then the
> appropriate Gentoo response would be to update any Linux configuration
> documentation to make note of this and/or distribute the kernel sources with
> patches that prevent default kernel configurations from exhibiting this
> behavior.

There is no one magic answer to this. I'm trying to think of an analogy..

Let's say you have a hay bale and a needle that is poking you, where every individual piece of hay is a different user configuration. You want us to document which piece of hay is good vs bad in the entire bale to solve the needle-poking-you problem. Impossible, I say.

Ok, maybe a bad analogy.. :)
Comment 9 Robert Bradbury 2009-10-07 20:16:35 UTC
[Warning: if you don't like reading my long messages, then skip this.  If you are having this problem and are interested, this may be of some use.]

Following a response to my post to the ratpoison-devel mailing list, Sam Bobroff has suggested that I reschedule many of the processes using the "chrt" command to set higher SCHED_FIFO or SCHED_RR priorities for time-sensitive processes, e.g. mplayer, alsaplayer, Xorg, perhaps some gnome processes and probably the browser.  This could be used hand in hand with running all emerges using "chrt" so that they run with only SCHED_IDLE priority.  In theory this only becomes problematic if one has emerges competing against some other SCHED_IDLE processes (e.g. large "blast" or "hmmer" CPU-intensive jobs), though one may be able to work around that to some extent by using niced SCHED_IDLE (though I'm not sure if niceness is taken into account when scheduling SCHED_IDLE processes).
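
A minimal sketch of what those chrt invocations might look like, written as Python helpers that only build the argument lists. The helper names, the PID and the priority value are hypothetical; the `--idle`, `--rr`, `--fifo` and `--pid` options are those of chrt from util-linux, and raising a process to SCHED_RR or SCHED_FIFO normally requires root.

```python
def idle_argv(cmd):
    """Command line that starts cmd under SCHED_IDLE (priority must be 0)."""
    return ["chrt", "--idle", "0", *cmd]


def realtime_pid_argv(pid, priority, policy="rr"):
    """Command line that moves an already running process to SCHED_RR/FIFO."""
    if policy not in ("rr", "fifo"):
        raise ValueError("policy must be 'rr' or 'fifo'")
    if not 1 <= priority <= 99:
        raise ValueError("real-time priority must be between 1 and 99")
    return ["chrt", f"--{policy}", "--pid", str(priority), str(pid)]


# Possible use (PID 1234 and priority 10 are placeholders):
#   import subprocess
#   subprocess.run(idle_argv(["emerge", "seamonkey"]), check=False)
#   subprocess.run(realtime_pid_argv(1234, 10), check=False)
```

On the open question about niced SCHED_IDLE: the sched(7) manual page states that the nice value has no influence under SCHED_IDLE, so nice is unlikely to help separate competing idle-class jobs.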

What isn't clear to me is why nobody else is encountering this problem.  Is everyone running their emerges on stand-alone or multi-core machines?  I don't see how people aren't impacted if they try to use moderately complex desktop loads at the same time as they are maxing out the CPU for hours doing emerges.  Or are Gentoo users/developers simply not trying to do nightly upgrades to follow the packages (on average perhaps 5-10 per night) at the current release level?

One may have a problem with emerges (small programs) running up against browsers (large programs) if some of the browser's pages have been paged out (even though I have 3GB of memory).  If the browsers are constantly being suspended due to page-ins of heap pages or poll() waits (which they do a lot) while the python/gmake/gcc executables are running for their full "quantum", then it may not matter whether the emerges are "niced".  What matters here is what fraction of a CPU quantum a process uses before another process, perhaps even a lower priority process, gets the CPU.  If things are working out such that browsers (sound players, etc.) are only getting 0.2 * quantum of CPU while compiles get 1.0 * quantum of CPU (even though they are niced), then it is going to look like the compiles are hogging the system.  It looks like the only way around this may be to run all of the processes (emerges + browsers) at SCHED_RR priorities so that they always get their full quantum of CPU time.  (See sched_setscheduler(2) if these comments do not make sense, though I will not claim to fully understand this yet.)
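
The arithmetic behind that 0.2 versus 1.0 example can be sketched as follows. This is a deliberately simplified model, not a description of the actual kernel scheduler: it assumes the two processes strictly alternate, that each turn lasts the stated fraction of one equal-length quantum, and that priorities play no other role. `cpu_shares` is an illustrative name.

```python
def cpu_shares(used_fraction):
    """Share of total CPU time per process under strict alternation.

    used_fraction maps a process name to the fraction of its quantum
    that it actually runs before giving up the CPU.
    """
    total = sum(used_fraction.values())
    return {name: frac / total for name, frac in used_fraction.items()}


shares = cpu_shares({"browser": 0.2, "compile": 1.0})
# Under these assumptions the browser gets 0.2 / 1.2 of the CPU,
# i.e. one sixth, and the compile gets the remaining five sixths.
```

Under this model the niced compile ends up with about five times the CPU time of the browser, which is the effect described above.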