Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 236038 - randomly a process hogs a CPU indefinitely
Summary: randomly a process hogs a CPU indefinitely
Status: RESOLVED NEEDINFO
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Unspecified (show other bugs)
Hardware: AMD64 Linux
: High major (vote)
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-08-28 15:50 UTC by BlinkEye
Modified: 2009-09-16 13:47 UTC (History)
2 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments
taskstate requested earlier in bug report (taskstate.txt,29.92 KB, text/plain)
2009-02-05 17:19 UTC, Tim Mattison
Details

Note You need to log in before you can comment on or make changes to this bug.
Description BlinkEye 2008-08-28 15:50:56 UTC
randomly a process hogs the CPU for a very long time and then continues. the latest example is while compiling wireshark - amidst the process, 'sed' hangs for more than 17 minutes:

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7760 root      39  19  7768  360  256 R  100  0.0  16:57.61 sed

but after that, the compilation succeeds anyway. I tried to identify the state sed was in:

# ps ax -T -o ppid,pid,args,stat,wchan | grep sed
 7756  7760 /bin/sed -e 1s/^X// -e s/\( RN+  -

but that doesn't help me much. I've got 4 CPU's and didn't do anything besides waiting for emerge to finish. 

I don't know how to reproduce this issue, but the last 2 days it was happening with a bash script I used for stress-testing inotify, like:

#!/bin/sh

# very strange. this script just hangs randomly on an up-to-date gentoo.
# be it in a `seq` call (top confirms that) or in `touch`.
# btw - touch seriously degrades performance!!!

OUTPUT=.

for j in `seq 0 9999`; do
   for i in `seq 0 49`; do
#       touch $OUTPUT/file$i.txt
      echo 1 > $OUTPUT/file$i.txt
      echo 1 > $OUTPUT/file$i.txt
      echo 1 > $OUTPUT/file$i.txt
   done
done

Randomly 'seq' or 'touch' were hogging up one of the CPUs - but would continue after a very long pause.

I've compiled the system from scratch (emerge -e) at least 2 times during the last 3 weeks and was never using any fancy CFLAGS anyway. I was running Debian testing for 4 month beforehand on the very same hardware and did not have such a behaviour.


Reproducible: Sometimes

Steps to Reproduce:
1. run some cpu intensive program
2. wait indefinitely until a program hogs the cpu
3.

Actual Results:  
randomly gets stuck and hogs CPU - but will continue after a very long time.


Portage 2.1.4.4 (default/linux/amd64/2008.0/desktop, gcc-4.1.2, glibc-2.6.1-r0, 2.6.25-gentoo-r7 x86_64)
=================================================================
System uname: 2.6.25-gentoo-r7 x86_64 Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
Timestamp of tree: Thu, 28 Aug 2008 07:04:01 +0000
ccache version 2.4 [enabled]
app-shells/bash:     3.2_p33
dev-lang/python:     2.5.2-r6
dev-python/pycrypto: 2.0.1-r6
dev-util/ccache:     2.4-r7
sys-apps/baselayout: 1.12.11.1
sys-apps/sandbox:    1.2.18.1-r2
sys-devel/autoconf:  2.13, 2.61-r2
sys-devel/automake:  1.5, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10.1
sys-devel/binutils:  2.18-r3
sys-devel/gcc-config: 1.4.0-r4
sys-devel/libtool:   1.5.26
virtual/os-headers:  2.6.23-r3
ACCEPT_KEYWORDS="amd64"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-mtune=nocona -O2 -pipe -ggdb"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/share/config"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/revdep-rebuild /etc/terminfo /etc/texmf/web2c /etc/udev/rules.d"
CXXFLAGS="-mtune=nocona -O2 -pipe -ggdb"
DISTDIR="/usr/portage/distfiles"
FEATURES="buildpkg ccache distlocks fixpackages metadata-transfer parallel-fetch sandbox sfperms splitdebug strict unmerge-orphans userfetch"
GENTOO_MIRRORS="ftp://mirror.switch.ch/mirror/gentoo/ ftp://mirror.ing.unibo.it/gentoo/ ftp://mirror.qubenet.net/mirror/gentoo/ ftp://ftp-stud.fht-esslingen.de/pub/Mirrors/gentoo/ ftp://pandemonium.tiscali.de/pub/gentoo/"
LC_ALL="en_US"
LDFLAGS="-Wl,-O1"
LINGUAS="en de"
MAKEOPTS="-j5"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/portage/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/portage/local/layman/desktop-effects"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="X acl acpi alsa amd64 bash-completion berkdb bluetooth branding bzip2 cairo cdr cli cracklib crypt cups dbus dri dvd dvdr dvdread eds emboss encode esd evo fam firefox fortran gdbm gif gpm gstreamer hal iconv ipv6 isdnlog jpeg kde kerberos libnotify lm_sensors locales mad midi mikmod mmx mp3 mpeg mudflap multilib ncurses nls nptl nptlonly ogg opengl openmp pam pcre pdf perl png ppds pppd python qt qt3 qt3support qt4 quicktime readline reflection sdl session spell spl sse sse2 ssl startup-notification svg sysfs tcpd tiff truetype unicode usb userlocales utf8 vim-syntax vorbis xcomposite xinerama xml xorg xv zlib" ALSA_CARDS="hda_intel" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" ELIBC="glibc" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="en de" USERLAND="GNU" VIDEO_CARDS="fglrx"
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, FFLAGS, INSTALL_MASK, LANG, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS
Comment 1 BlinkEye 2008-08-29 07:40:40 UTC
looks like a similar problem to: http://kerneltrap.org/mailarchive/linux-kernel/2007/10/19/348832

but there was no solution provided ...
Comment 2 BlinkEye 2008-08-29 07:54:24 UTC
It's happening again - this time with 'touch' of the simple shell script above:

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
16692 foobar  20   0  7816  408  312 R  100  0.0   6:03.72 touch


The kernel link above suggested to cat the stat of the process:

# while true; do cat /proc/16692/stat ; sleep 5; done
16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 17803 2 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 0 0 0 0 0
16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 18303 2 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 0 0 0 0 0
16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 18803 2 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 0 0 0 0 0
16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 19303 2 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315432 0 0 0 0 0 0 0 17 3 0 0 0 0 0
16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 19802 3 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 0 0 0 0 0
16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 20302 3 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 0 0 0 0 0
16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 20802 3 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315441 0 0 0 0 0 0 0 17 3 0 0 0 0 0
16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 21302 3 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 0 0 0 0 0
16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 21802 3 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 0 0 0 0 0
Comment 3 BlinkEye 2008-08-29 07:56:57 UTC
12 minutes later the shell script continues. Here's the output of strace:

# strace -p 16692
Process 16692 attached - interrupt to quit
brk(0)                                  = 0x60d000
brk(0x62e000)                           = 0x62e000
close(0)                                = 0
open("./file66.txt", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 0
utimensat(0, NULL, NULL, 0)             = 0
close(0)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
Process 16692 detached
Comment 4 BlinkEye 2008-08-29 08:56:14 UTC
Tried gentoo-sources-2.6.26-r1 - happens too.
Comment 5 Michael Spiegle 2008-12-02 23:04:06 UTC
(In reply to comment #4)
> Tried gentoo-sources-2.6.26-r1 - happens too.
> 

Just a couple stabs in the dark, but you could try following forks in strace.  I tend to do something like:

strace -f -s 4096 -T -p <PID>

This will follow all forks, show you a little more data (optional), and give you timing information on each syscall (cool feature IMO).

Also, how does your disk look while you are waiting?  I agree that 10+ minutes is too long in any case, but an iostat against your device would be interesting.
Comment 6 Eddie Mishelevich 2008-12-07 09:08:02 UTC
(In reply to comment #4)
> Tried gentoo-sources-2.6.26-r1 - happens too.
> 

Couldn't reproduce your bug.
I'd also like to see the strace to see if it waits for anything.

I would try running your bash script against another file system, against another hard disk and maybe even against tmpfs

mount -t tmpfs tmpfs -o size=10M /mnt/tmpfs

Also, could you post your .config? 
Comment 7 Daniel Drake (RETIRED) gentoo-dev 2008-12-07 13:33:30 UTC
Please enable CONFIG_MAGIC_SYSRQ and increase CONFIG_LOG_BUF_SHIFT by 1 or 2 so that it's plenty big enough to store a lot of data.

Then, when you see this bug reproduced (a process 'frozen' and hogging CPU) please note down its PID and then run:
   echo t > /proc/sysrq-trigger
   dmesg > taskstate.txt

Then attach taskstate.txt to this bug, and tell us the noted PID. This will indicate exactly where the kernel is stuck. Thanks!
Comment 8 Daniel Drake (RETIRED) gentoo-dev 2008-12-24 18:08:01 UTC
Please reopen with the info requested above.
Comment 9 Tim Mattison 2009-02-05 17:19:35 UTC
Created attachment 181057 [details]
taskstate requested earlier in bug report
Comment 10 Tim Mattison 2009-02-05 17:20:09 UTC
I have this same problem.  Currently on my system the following command is taking up 100% of the CPU:

cat /proc/cpuinfo

It has been running for 13+ minutes.  Executing this command in another terminal works just fine.  I've seen this happen with cat, sed, and grep.  I ran the commands suggested and dumped taskstate.txt.  It is attached.
Comment 11 Tim Mattison 2009-02-05 17:24:02 UTC
Comment on attachment 181057 [details]
taskstate requested earlier in bug report

The relevant PID for "cat /proc/sysinfo" is 20935
Comment 12 BlinkEye 2009-02-09 20:14:19 UTC
Reopening since another user has the same problem and provided necessary/desired information.
Comment 13 George Kadianakis (RETIRED) gentoo-dev 2009-09-01 04:13:47 UTC
Okay, since this is one of the oldest active bugs in here it's time for some attention.

Can you, gentlemen, reproduce the issue even with the latest stable gentoo-sources (gentoo-sources-2.6.30-r5)?
Comment 14 Tim Mattison 2009-09-01 13:32:38 UTC
This is some form of heisenbug.  I haven't seen it in a long time now.  I'll be updating my kernel next week so I can post something more substantial then.
Comment 15 George Kadianakis (RETIRED) gentoo-dev 2009-09-16 13:47:08 UTC
I'm gonna close this bug as NEEDINFO.
If you ever get the same issue, feel free to update this report.