Summary: | randomly a process hogs a CPU indefinitely | | |
---|---|---|---|
Product: | Gentoo Linux | Reporter: | BlinkEye <BlinkEye> |
Component: | [OLD] Unspecified | Assignee: | Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel> |
Status: | RESOLVED NEEDINFO | | |
Severity: | major | CC: | mike, tim |
Priority: | High | | |
Version: | unspecified | | |
Hardware: | AMD64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Package list: | | Runtime testing required: | --- |
Attachments: | taskstate requested earlier in bug report | | |
Description
BlinkEye
2008-08-28 15:50:56 UTC
looks like a similar problem to: http://kerneltrap.org/mailarchive/linux-kernel/2007/10/19/348832 but there was no solution provided ...

It's happening again - this time with 'touch' of the simple shell script above:

    PID   USER   PR NI VIRT RES SHR S %CPU %MEM TIME+   COMMAND
    16692 foobar 20 0  7816 408 312 R 100  0.0  6:03.72 touch

The kernel link above suggested to cat the stat of the process:

    # while true; do cat /proc/16692/stat ; sleep 5; done
    16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 17803 2 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 0 0 0 0 0
    16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 18303 2 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 0 0 0 0 0
    16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 18803 2 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 0 0 0 0 0
    16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 19303 2 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315432 0 0 0 0 0 0 0 17 3 0 0 0 0 0
    16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 19802 3 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 0 0 0 0 0
    16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 20302 3 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 0 0 0 0 0
    16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 20802 3 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315441 0 0 0 0 0 0 0 17 3 0 0 0 0 0
    16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 21302 3 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 0 0 0 0 0
    16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 21802 3 0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 0 0 0 0 0

12 minutes later the shell script continues. Here's the output of strace:

    # strace -p 16692
    Process 16692 attached - interrupt to quit
    brk(0) = 0x60d000
    brk(0x62e000) = 0x62e000
    close(0) = 0
    open("./file66.txt", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 0
    utimensat(0, NULL, NULL, 0) = 0
    close(0) = 0
    close(1) = 0
    close(2) = 0
    exit_group(0) = ?
    Process 16692 detached

Tried gentoo-sources-2.6.26-r1 - happens too.

(In reply to comment #4)
> Tried gentoo-sources-2.6.26-r1 - happens too.
>

Just a couple stabs in the dark, but you could try following forks in strace. I tend to do something like:

    strace -f -s 4096 -T -p <PID>

This will follow all forks, show you a little more data (optional), and give you timing information on each syscall (cool feature IMO).

Also, how does your disk look while you are waiting? I agree that 10+ minutes is too long in any case, but an iostat against your device would be interesting.

(In reply to comment #4)
> Tried gentoo-sources-2.6.26-r1 - happens too.
>

Couldn't reproduce your bug. I'd also like to see the strace to see if it waits for anything. I would try running your bash script against another file system, against another hard disk and maybe even against tmpfs

    mount -t tmpfs tmpfs -o size=10M /mnt/tmpfs

Also, could you post your .config?

Please enable CONFIG_MAGIC_SYSRQ and increase CONFIG_LOG_BUF_SHIFT by 1 or 2 so that it's plenty big enough to store a lot of data.
Then, when you see this bug reproduced (a process 'frozen' and hogging CPU) please note down its PID and then run:

    echo t > /proc/sysrq-trigger
    dmesg > taskstate.txt

Then attach taskstate.txt to this bug, and tell us the noted PID. This will indicate exactly where the kernel is stuck. Thanks!

Please reopen with the info requested above.

Created attachment 181057
taskstate requested earlier in bug report
I have this same problem. Currently on my system the following command is taking up 100% of the CPU:

    cat /proc/cpuinfo

It has been running for 13+ minutes. Executing this command in another terminal works just fine. I've seen this happen with cat, sed, and grep. I ran the commands suggested and dumped taskstate.txt. It is attached.

Comment on attachment 181057
taskstate requested earlier in bug report
The relevant PID for "cat /proc/cpuinfo" is 20935
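For anyone hitting this again, the capture steps requested earlier (note the PID, write `t` to /proc/sysrq-trigger, save dmesg to taskstate.txt) could be scripted so the PID never gets left out of the report. This is only a rough sketch: the function name `capture_task_state`, its parameters and the idea of prepending the PID as a first line of the file are my own assumptions, not something the kernel team asked for. It needs root and a kernel built with CONFIG_MAGIC_SYSRQ.

```python
import subprocess
from pathlib import Path


def capture_task_state(pid, trigger="/proc/sysrq-trigger",
                       out="taskstate.txt", read_log=None):
    """Trigger a SysRq-t task dump and save the kernel log next to the PID.

    `read_log` is a hypothetical hook returning the kernel log as text;
    by default it runs `dmesg`, as in the steps requested in this bug.
    """
    if read_log is None:
        def read_log():
            result = subprocess.run(["dmesg"], capture_output=True,
                                    text=True, check=True)
            return result.stdout

    # Same effect as: echo t > /proc/sysrq-trigger
    Path(trigger).write_text("t\n")

    # Same effect as: dmesg > taskstate.txt, with the PID noted on top
    # (the header line is an addition of this sketch).
    log = read_log()
    Path(out).write_text(f"# PID of the spinning process: {pid}\n" + log)
    return Path(out)
```

Run as root while the process is still spinning, e.g. `capture_task_state(20935)`, then attach the resulting taskstate.txt.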
Reopening since another user has the same problem and provided necessary/desired information.

Okay, since this is one of the oldest active bugs in here it's time for some attention. Can you, gentlemen, reproduce the issue even with the latest stable gentoo-sources (gentoo-sources-2.6.30-r5)?

This is some form of heisenbug. I haven't seen it in a long time now. I'll be updating my kernel next week so I can post something more substantial then.

I'm gonna close this bug as NEEDINFO. If you ever get the same issue, feel free to update this report.
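If the issue does come back, the /proc/<pid>/stat samples of the kind posted in the description can be reduced to CPU-time deltas, which shows whether the time is being accounted as user or system time. Below is a minimal sketch using the first two samples from the description. Per proc(5), fields 14 and 15 are utime and stime in clock ticks. The helper names, the 5 s interval (taken from the `sleep 5` in the loop, ignoring the time `cat` itself takes) and the default of 100 ticks per second are assumptions; check the latter with `os.sysconf("SC_CLK_TCK")` on the affected machine.

```python
# First two samples copied from the description, taken about 5 s apart.
SAMPLES = [
    "16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 17803 2 "
    "0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 "
    "140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 "
    "0 0 0 0 0",
    "16692 (touch) R 22679 22677 7314 34821 22677 4194304 180 0 0 0 18303 2 "
    "0 0 20 0 1 0 15147700 8003584 102 18446744073709551615 4194304 4242292 "
    "140736481217664 18446744073709551615 140401332315429 0 0 0 0 0 0 0 17 3 "
    "0 0 0 0 0",
]


def parse_cpu_ticks(stat_line):
    """Return (utime, stime) in clock ticks from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces, so split
    # after the last ')' instead of splitting the whole line on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11], rest[12].
    return int(rest[11]), int(rest[12])


def cpu_percent(prev_line, cur_line, interval_s=5.0, ticks_per_s=100):
    """Return (user %, system %) of one CPU used between two samples.

    interval_s and ticks_per_s are assumptions of this sketch.
    """
    prev_user, prev_sys = parse_cpu_ticks(prev_line)
    cur_user, cur_sys = parse_cpu_ticks(cur_line)
    budget = interval_s * ticks_per_s  # ticks available in the interval
    return (100.0 * (cur_user - prev_user) / budget,
            100.0 * (cur_sys - prev_sys) / budget)


user, system = cpu_percent(SAMPLES[0], SAMPLES[1])
print(f"user {user:.1f}%  system {system:.1f}%")
```

Under those assumptions the two samples work out to 500 ticks of user time and 0 ticks of system time in 5 s, which matches the 100% that top reported and suggests the time was being charged to user mode during that interval.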