Summary: | detect orphaned processes (sys-devel/gcc:hang during emerging) | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | yegle <cnyegle> |
Component: | [OLD] Development | Assignee: | Portage team <dev-portage> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | andrey.vihrov, esigra, luciano, stefan.andreas.bauer |
Priority: | High | Keywords: | InVCS |
Version: | unspecified | ||
Hardware: | x86 | ||
OS: | Linux | ||
See Also: | https://bugs.gentoo.org/show_bug.cgi?id=919072 | ||
Bug Blocks: | 184128, 257279, 335925 | ||
Attachments: |
qlop -l output
make ebuild.sh clean up orphaned processes |
Description
yegle
2009-07-24 02:59:18 UTC
Would you do 'ps axf' when it is hung, and show all child processes of your emerge -- there really ought to be more than just the defunct sandbox?

Also, try 'strace -p <emerge pid> -o TRACE' to look for clues about what the parent process is doing.

As far as I can remember, sandbox is the only process whose parent is emerge. I'm emerging gcc again to see if I can give you more hints.

(In reply to comment #1)
> Would you do 'ps axf' when it is hung, and show all child processes of your
> emerge -- there really ought to be more than just the defunct sandbox?
>
> Also, try 'strace -p <emerge pid> -o TRACE' to look for clues about what the
> parent process is doing.

(In reply to comment #1)

ps axf result:

 7873 ?        Sl     4:00 yakuake -session 1014cd7d2d4000124377837600000039660015_1248640792_778987
20977 pts/5    Ss     0:00  \_ /bin/bash
13750 pts/5    SN+    0:11  |   \_ /usr/bin/python /usr/bin/emerge gcc -1
14965 pts/5    ZN+    0:00  |       \_ [sandbox] <defunct>
11425 pts/6    Ss     0:00  \_ /bin/bash
11440 pts/6    R+     0:00      \_ ps axf

$ sudo strace -p 13750
Process 13750 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>

Ok, how about another test, as I try to figure out who should get assigned this interesting problem (portage, sandbox, toolchain devs)...

Run your emerge with --debug, save the output to a file, and attach it here.

Also, could you post the output of 'qlop -l' to show what got upgraded before you got stuck on the gcc emerge?

Created attachment 199416 [details]
qlop -l output
this is my qlop -l output~
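
For anyone repeating the 'ps axf' check requested earlier in this thread, the same information can be read directly from /proc on Linux. The following is a minimal Python sketch, not part of Portage; the function names (parse_stat, list_processes, children_of) are made up for illustration. It lists the direct children of a given PID together with their state, so a defunct child shows up with state Z.

```python
import os


def parse_stat(stat_text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left].strip())
    comm = stat_text[left + 1:right]
    fields = stat_text[right + 1:].split()
    state, ppid = fields[0], int(fields[1])
    return pid, comm, state, ppid


def list_processes(proc_root="/proc"):
    """Return (pid, comm, state, ppid) tuples for every readable process."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                found.append(parse_stat(f.read()))
        except (OSError, ValueError, IndexError):
            # The process exited during the scan, or the line was malformed.
            continue
    return found


def children_of(parent_pid, processes):
    """Filter a process list down to the direct children of parent_pid."""
    return [p for p in processes if p[3] == parent_pid]
```

Calling children_of(13750, list_processes()) while emerge is hung would list what is still attached to it; an entry whose state is "Z" corresponds to the `<defunct>` marker in the ps output.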
+ return 0
+ touch /home/yegle/temp/portage/sys-devel/gcc-4.4.1/.compiled
+ vecho '>>> Source compiled.'
+ quiet_mode
+ [[ '' -eq 1 ]]
+ echo '>>> Source compiled.'
>>> Source compiled.
+ ebuild_phase post_src_compile
+ declare -F post_src_compile
+ trap - SIGINT SIGQUIT
+ set +x
^C

These are the last lines of the emerge --debug log. The complete log is too large, so I uploaded it here: http://yegle.net/emerge_debug_log

(In reply to comment #4)
> Ok how about another test, as I try to figure out who should get assigned this
> interesting problem (portage, sandbox, toolchain devs)...
>
> Run your emerge with --debug, save the output to a file, and attach here
>
> Also, could you post output of 'qlop -l' to show what got upgraded before you
> got stuck on gcc emerge

Ok, it doesn't look like a problem launching sandbox -- the compile phase finished, and the install phase didn't start yet. Assigning to portage team.

There must be some process forked from the ebuild running in the background. If `ps axf` doesn't show it, maybe try pstree. You can also try running it with FEATURES=userpriv and search for processes running under the 'portage' user.

(In reply to comment #8)
> There must be some process forked from the ebuild running in the background. If
> `ps axf` doesn't show it, maybe try pstree. You can also try running it with
> FEATURES=userpriv and search for processes running under the 'portage' user.

Actually, there is no other process forked from ebuild...

I tried to emerge gcc with FEATURES=userpriv, and the result seems a little different.

pstree:

`-yakuake-+-bash---pstree
          |-bash---emerge---ebuild.sh
          `-{yakuake}

$ ps -ef|grep ^portage
portage 6626 5306 0 16:11 pts/1 00:00:00 [ebuild.sh] <defunct>
portage 13581 1 0 16:23 pts/1 00:00:00 /bin/sh

(In reply to comment #9)
> $ ps -ef|grep ^portage
> portage 6626 5306 0 16:11 pts/1 00:00:00 [ebuild.sh] <defunct>
> portage 13581 1 0 16:23 pts/1 00:00:00 /bin/sh

It must come from somewhere within the ebuild and/or gcc build system.
I'm not aware of a clean/portable way for portage to clean up an orphaned process such as this. I suppose we could just clean up the ebuild.sh process and leave any orphans running in the background.

@toolchain: Has anyone else experienced a /bin/sh orphan like this?

(In reply to comment #10)
> It must come from somewhere within the ebuild and/or gcc build system. I'm
> not aware of a clean/portable way for portage to clean up an orphaned process
> such as this. I suppose we could just clean up the ebuild.sh process and leave
> any orphans running in the background.

I guess we can put ebuild.sh and all of its subprocesses into a process group, and that should make it easy to clean up any orphans.

Creating a separate login session for ebuild.sh seems somewhat complex, so it might be better to first focus on implementing a fix that ignores orphan processes and simply leaves them running in the background (they might consume a pty device, but that's probably negligible).

Created attachment 199617 [details, diff]
make ebuild.sh clean up orphaned processes
If this patch is saved as /tmp/cleanup_orphans.patch, then it can be applied as follows:
cd /usr/lib/portage
patch -p0 < /tmp/cleanup_orphans.patch
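
The process-group idea discussed above can be illustrated with a short Python sketch. This is not the attached patch and not necessarily how Portage handles it; run_and_reap_group and its grace parameter are hypothetical names chosen for this example. It starts a command in a new session (and therefore a new process group), waits for the main child, and then signals whatever is still left in that group.

```python
import os
import signal
import subprocess
import time


def run_and_reap_group(argv, grace=2.0):
    """Run argv in its own process group, then terminate leftover members."""
    # start_new_session=True makes the child call setsid(), so it becomes the
    # leader of a new session and process group whose id equals its pid.
    child = subprocess.Popen(argv, start_new_session=True)
    pgid = child.pid
    returncode = child.wait()

    # Background processes the child left behind still belong to pgid,
    # even after they have been reparented away from the exited child.
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return returncode  # the group is already empty

    # Give the orphans a moment to exit; signal 0 only probes for existence.
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        try:
            os.killpg(pgid, 0)
        except ProcessLookupError:
            return returncode
        time.sleep(0.05)

    # Anything still present after the grace period is killed outright.
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    return returncode
```

A child started this way no longer receives signals generated by the controlling terminal (Ctrl-C, for example), which is the complication raised in the next comment.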
For now, I've reverted this patch since we're going to need a daemon process in the ebuild's login session in order to pass signals from the controlling terminal to the detached session. A simple fifo-based approach in ebuild.sh does not seem to work since bash's read builtin occasionally loses the fifo data when it fails with 'Interrupted system call'. Maybe a python script will work better for the session leader/daemon.

NOTE: The daemon will also be useful for implementing a fifo-based die helper (to replace the current signal-based approach).

It seems that this one is also blocking Bug #269283 (https://bugs.gentoo.org/show_bug.cgi?id=269283).

This is fixed in git:

http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=9a5f9cf8f6a8ff78cc124c40aaebcedd7be8d059

It just leaves the process(es) running, but at least emerge doesn't hang now.

*** Bug 306265 has been marked as a duplicate of this bug. ***

(In reply to comment #16)
> This is fixed in git:
>
> http://git.overlays.gentoo.org/gitweb/?p=proj/portage.git;a=commit;h=9a5f9cf8f6a8ff78cc124c40aaebcedd7be8d059
>
> It just leaves the process(es) running, but at least emerge doesn't hang now.

I just tested your fix and it works, but only if I _don't_ use MAKEOPTS="-j2". Otherwise it just keeps hanging.

(In reply to comment #18)
> I just tested your fix and it works, but only if I _don't_ use MAKEOPTS="-j2".
>
> Otherwise it just keeps hanging.

Please make sure you have the latest version of portage from git since there were some bug fixes in the ipc code.

If you still have the problem, please check whether `lsof | grep .ipc` shows that emerge is still listening on $PORTAGE_TMPDIR/portage/sys-devel/gcc-*/.ipc_in, which would indicate that the ebuild is hung up.

If .ipc_in does not show up in the lsof output, then it means that emerge itself is hung up.
In this case it's possible to send a SIGUSR1 signal to the emerge process and see where it's hung up, like this:

kill -s SIGUSR1 <emerge pid>

That will bring you to the (Pdb) prompt, where you should issue these two commands:

step
bt

The bt command will display a backtrace that shows where emerge is hung up.

This is in 2.2_rc68, but I'll leave this bug open until it's in an unmasked version.

This is fixed in 2.1.9.

Please file a new bug if it's not working as expected.

(In reply to comment #21)
> This is fixed in 2.1.9.
>
> Please file a new bug if it's not working as expected.

I'm sorry, but here you are: bug 335950.

(In reply to comment #19)
> (In reply to comment #18)
> > I just tested your fix and it works, but only if I _don't_ use
> > MAKEOPTS="-j2".
> >
> > Otherwise it just keeps hanging.
>
> Please make sure you have the latest version of portage from git since there
> were some bug fixes in the ipc code.
>
> If you still have the problem, please check if `lsof | grep .ipc` show emerge
> is still listening on $PORTAGE_TMPDIR/portage/sys-devel/gcc-*/.ipc_in which
> would indicate that the ebuild is hung up.
>
> If .ipc_in does not show in the lsof of output then it means that emerge
> itself is hung up. In this case it's possible to send a SIGUSR1 signal to the
> emerge process and see where it's hung up, like this:
>
> kill -s SIGUSR1 <emerge pid>
>
> That will bring you to the (Pdb) prompt, where you should issue these two
> commands:
>
> step
> bt
>
> The bt command will display a backtrace that shows where emerge is hung up.

Thanks for these debugging hints. I added the output there:
bug 335950 comment 3
bug 335950 comment 4

NOTE: The fix for this bug only works as long as USE=ipc is enabled (it is enabled automatically by IUSE default).
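
The SIGUSR1 technique described in this thread, where a signal makes a running Python process reveal where it is stuck, can be sketched in a few lines. This is an illustration only and not Portage's actual handler; dump_stack and install_stack_dump are hypothetical names. Instead of opening an interactive (Pdb) prompt, this non-interactive variant simply writes the interrupted stack to stderr.

```python
import os
import signal
import sys
import traceback


def dump_stack(signum, frame):
    """Signal handler: write the stack of the interrupted code to stderr."""
    # 'frame' is the frame that was executing when the signal arrived.
    # An interactive variant could call pdb.Pdb().set_trace(frame) here
    # instead (an assumption; the thread does not show Portage's handler).
    sys.stderr.write("Stack at signal %d:\n" % signum)
    sys.stderr.write("".join(traceback.format_stack(frame)))
    sys.stderr.flush()


def install_stack_dump(handler=dump_stack):
    """Register the handler for SIGUSR1 (must be called from the main thread)."""
    signal.signal(signal.SIGUSR1, handler)
```

After install_stack_dump() has run in the target process, `kill -s SIGUSR1 <pid>` from another shell prints the current Python stack without stopping the program.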