Summary: | app-editors/emacs 27.2-r3[imagemagick]: temacs child process is hanging in vfork (seen with CC=clang) | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Xi <davidshen84> |
Component: | Current packages | Assignee: | GNU Emacs project <gnu-emacs> |
Status: | RESOLVED WORKSFORME | ||
Severity: | normal | CC: | dev-portage, flow, sam, sandbox |
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: |
build.log
environment kernel-conf 0001-libsandbox-drop-synchronization-from-vfork-and-fork.patch |
Description
Xi
2021-07-08 23:55:01 UTC
If I executed this command from the source directory, it passes. LC_ALL=C ./temacs -batch -l loadup --temacs=pdump Created attachment 722860 [details]
environment
I found if I go to the source directory and execute "make", the build would succeed and produce a usable emacs program. So, at this point, it is pretty clear the problem is with emerge. I found this post, https://forums.gentoo.org/viewtopic-t-1092666-start-0.html, that talked about namespace and sandbox, so I tried and finally succeeded with disabling all "sandbox" features. sudo FEATURES="-pid-sandbox -network-sandbox -sandbox -usersandbox" emerge -1 emacs So...I guess there's something wrong with my namespace kernel configuration? Created attachment 722881 [details]
kernel-conf
(In reply to Xi from comment #4) > sudo FEATURES="-pid-sandbox -network-sandbox -sandbox -usersandbox" emerge -1 emacs That's a start. Can you disable them one by one, so we see what feature exactly triggers the problem? usersandbox it is So far, I haven't been able to reproduce the problem. CCing Portage team: Any idea what could be the reason for this? In a nutshell, emacs tries to execute "git rev-parse HEAD" (using fork and execve) during build, but it hangs if usersandbox is enabled. (In reply to Ulrich Müller from comment #8) > So far, I haven't been able to reproduce the problem. > > CCing Portage team: Any idea what could be the reason for this? > > In a nutshell, emacs tries to execute "git rev-parse HEAD" (using fork and > execve) during build, but it hangs if usersandbox is enabled. Well, I said it stuck at the message "Waiting for Git...", but based on my investigation, it actually stuck at executing line 588 in the Makefile which is at the root of the source directory. I looked into the Makefile, line 588 is executing pdump and the command matches the command I found through "ps aux|grep emacs". (In reply to Xi from comment #9) > Well, I said it stuck at the message "Waiting for Git...", but based on my > investigation, it actually stuck at executing line 588 in the Makefile which > is at the root of the source directory. > > I looked into the Makefile, line 588 is executing pdump and the command > matches the command I found through "ps aux|grep emacs". Right, and the temacs binary executed from that line in the Makefile runs the following code in loadup.el: https://git.savannah.gnu.org/cgit/emacs.git/tree/lisp/loadup.el?h=emacs-27.2#n399 This (indirectly) calls the emacs-repository-version-git and emacs-repository-branch-git functions: https://git.savannah.gnu.org/cgit/emacs.git/tree/lisp/version.el?h=emacs-27.2#n114 https://git.savannah.gnu.org/cgit/emacs.git/tree/lisp/version.el?h=emacs-27.2#n143 The first command in both of them is to print the message "Waiting for git...", i.e. in a successful build you would see that message twice. That the log doesn't continue after the first message is a strong indication that it hangs in emacs-repository-version-git, most likely when trying to execute this command: (call-process "git" nil '(t nil) nil "rev-parse" "HEAD") So it actually stuck at executing some elisp script. I tried to execute that "call-process" function both as root and portage, and I got a return value of 128 with both accounts. I think the key is still what that "usersandbox" feature doing in this situation. (In reply to Xi from comment #11) > I tried to execute that "call-process" function both as root and portage, > and I got a return value of 128 with both accounts. I think this is correct and as intended. "git rev-parse HEAD" will return with an error status when executed outside a git repository (which is the case here). The point is, it shouldn't get stuck. > I think the key is still what that "usersandbox" feature doing in this > situation. Copying the sys-apps/sandbox maintainers since you seem to have narrowed this down to an issue with usersandbox. Having something smaller that reproduces failure would be nice. I tried and failed the following: # USE='-elogind xml abi_x86_64 acl alsa amd64 cairo dbus elibc_glibc gif gmp gpm gui harfbuzz imagemagick inotify jpeg json kernel_linux lcms libxml2 m17n-lib png ssl svg systemd threads tiff userland_GNU wide-int xft xpm zlib' FEATURES="ccache network-sandbox preserve-libs sandbox userpriv usersandbox" emerge -v1 =emacs-27.2-r3 emacs-27.2-r3 installs without problems. Oh, I think I've triggered it by adding CC=clang: # CC="clang" CXX="clang++" USE='-elogind xml abi_x86_64 acl alsa amd64 cairo dbus elibc_glibc gif gmp gpm gui harfbuzz imagemagick inotify jpeg json kernel_linux lcms libxml2 m17n-lib png ssl svg systemd threads tiff userland_GNU wide-int xft xpm zlib' FEATURES="ccache network-sandbox preserve-libs sandbox userpriv usersandbox" emerge -v1 =emacs-27.2-r3 I'll explore in more detail. I can reproduce the problem with: $ CC="clang" USE='-* gui imagemagick' FEATURES="-* usersandbox" ebuild emacs-27.2-r3.ebuild clean compile Strangely, with -imagemagick the build succeeds. Then again, if I build media-gfx/imagemagick with -openmp then the problem disappears. So we have: emacs[-imagemagick] -- success emacs[+imagemagick] imagemagick[-openmp] -- success emacs[+imagemagick] imagemagick[+openmp] -- failure Recompiling libomp and imagemagick with clang (in order to have a consistent toolchain) didn't make any difference. (In reply to Sergei Trofimovich from comment #15) > Oh, I think I've triggered it by adding CC=clang: > > # CC="clang" CXX="clang++" USE='-elogind xml abi_x86_64 acl alsa amd64 > cairo dbus elibc_glibc gif gmp gpm gui harfbuzz imagemagick inotify jpeg > json kernel_linux lcms libxml2 m17n-lib png ssl svg systemd threads tiff > userland_GNU wide-int xft xpm zlib' FEATURES="ccache network-sandbox > preserve-libs sandbox userpriv usersandbox" emerge -v1 =emacs-27.2-r3 > > I'll explore in more detail. I have CC=clang enabled globally. After I hit this bug. I also tried to set the package.env for emacs to use gcc. But I still observed the same behaviour. I also tried to turn off CCACHE feature. One possibility is that it is not "clang" that causing this issue, but some other packages that compiled "clang" that caused this issue. When attaching gdb to the temacs processes, I see this in the parent: (gdb) bt #0 0x00007ffff79441ee in __libc_read (fd=fd@entry=6, buf=buf@entry=0x7ffffffe8a40, nbytes=nbytes@entry=16384) at ../sysdeps/unix/sysv/linux/read.c:26 #1 0x00000000005068a2 in emacs_intr_read (fd=6, buf=0x7ffffffe8a40, nbyte=16384, interruptible=true) at sysdep.c:2620 #2 emacs_read_quit (fd=fd@entry=6, buf=0x7ffffffe8a40, nbyte=16384) at sysdep.c:2641 #3 0x00000000005b6d5d in call_process (nargs=<optimized out>, nargs@entry=6, args=<optimized out>, args@entry=0x7fffffff93d0, filefd=filefd@entry=5, tempfile_index=<optimized out>, tempfile_index@entry=-1) at callproc.c:764 The child process is more interesting, the backtrace shows that it never returns from vfork(): (gdb) bt #0 futex_wait (private=0, expected=2, futex_word=0x7ffff7f9c340 <lock>) at ../sysdeps/nptl/futex-internal.h:146 #1 __lll_lock_wait (futex=futex@entry=0x7ffff7f9c340 <lock>, private=0) at lowlevellock.c:52 #2 0x00007ffff793d0c3 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7ffff7f9c340 <lock>) at ../nptl/pthread_mutex_lock.c:80 #3 0x00007ffff7f85a6c in sb_lock () at /tmp/portage/sys-apps/sandbox-2.24/work/sandbox-2.24/libsandbox/lock.c:16 #4 0x00007ffff7f846f2 in before_syscall (dirfd=<optimized out>, sb_nr=-4, func=0x7ffff7f8e417 "open_wr", file=0x7ffffffe8540 "/dev/shm/__KMP_REGISTERED_LIB_6807_1000", flags=655554) at /tmp/portage/sys-apps/sandbox-2.24/work/sandbox-2.24/libsandbox/libsandbox.c:1060 #5 0x00007ffff7f89f84 in open_DEFAULT (pathname=pathname@entry=0x7ffffffe8540 "/dev/shm/__KMP_REGISTERED_LIB_6807_1000", flags=flags@entry=655554) at /tmp/portage/sys-apps/sandbox-2.24/work/sandbox-2.24/libsandbox/wrapper-funcs/__wrapper_simple.c:52 #6 0x00007ffff79c5717 in shm_open (name=0x114dc11 "__KMP_REGISTERED_LIB_6807_1000", name@entry=0x114dc10 "/__KMP_REGISTERED_LIB_6807_1000", oflag=655554, oflag@entry=194, mode=mode@entry=438) at ../sysdeps/posix/shm_open.c:44 #7 0x00007ffff773227f in __kmp_register_library_startup() () at /tmp/portage/sys-libs/libomp-12.0.1/work/openmp/runtime/src/kmp_runtime.cpp:6465 #8 0x00007ffff7732644 in __kmp_do_serial_initialize() () at /tmp/portage/sys-libs/libomp-12.0.1/work/openmp/runtime/src/kmp_runtime.cpp:6719 #9 0x00007ffff77334ef in __kmp_serial_initialize() () at /tmp/portage/sys-libs/libomp-12.0.1/work/openmp/runtime/src/kmp_runtime.cpp:6990 #10 0x00007ffff75ad1e3 in __run_fork_handlers (who=who@entry=atfork_run_child, do_locking=do_locking@entry=false) at register-atfork.c:135 #11 0x00007ffff75f24cd in __libc_fork () at ../sysdeps/nptl/fork.c:136 #12 0x00007ffff7f89cf7 in vfork () at /usr/lib64/libsandbox.so #13 0x00000000005b69ad in call_process (nargs=<optimized out>, nargs@entry=6, args=args@entry=0x7fffffff93d0, filefd=filefd@entry=5, tempfile_index=tempfile_index@entry=-1) at callproc.c:613 (In reply to Ulrich Müller from comment #19) > When attaching gdb to the temacs processes, I see this in the parent: > > (gdb) bt > #0 0x00007ffff79441ee in __libc_read (fd=fd@entry=6, > buf=buf@entry=0x7ffffffe8a40, nbytes=nbytes@entry=16384) at > ../sysdeps/unix/sysv/linux/read.c:26 > #1 0x00000000005068a2 in emacs_intr_read (fd=6, buf=0x7ffffffe8a40, > nbyte=16384, interruptible=true) at sysdep.c:2620 > #2 emacs_read_quit (fd=fd@entry=6, buf=0x7ffffffe8a40, nbyte=16384) at > sysdep.c:2641 > #3 0x00000000005b6d5d in call_process (nargs=<optimized out>, > nargs@entry=6, args=<optimized out>, args@entry=0x7fffffff93d0, > filefd=filefd@entry=5, tempfile_index=<optimized out>, > tempfile_index@entry=-1) at callproc.c:764 > > The child process is more interesting, the backtrace shows that it never > returns from vfork(): > > (gdb) bt > #0 futex_wait (private=0, expected=2, futex_word=0x7ffff7f9c340 <lock>) at > ../sysdeps/nptl/futex-internal.h:146 > #1 __lll_lock_wait (futex=futex@entry=0x7ffff7f9c340 <lock>, private=0) at > lowlevellock.c:52 > #2 0x00007ffff793d0c3 in __GI___pthread_mutex_lock > (mutex=mutex@entry=0x7ffff7f9c340 <lock>) at ../nptl/pthread_mutex_lock.c:80 > #3 0x00007ffff7f85a6c in sb_lock () at > /tmp/portage/sys-apps/sandbox-2.24/work/sandbox-2.24/libsandbox/lock.c:16 > #4 0x00007ffff7f846f2 in before_syscall (dirfd=<optimized out>, sb_nr=-4, > func=0x7ffff7f8e417 "open_wr", file=0x7ffffffe8540 > "/dev/shm/__KMP_REGISTERED_LIB_6807_1000", flags=655554) > at > /tmp/portage/sys-apps/sandbox-2.24/work/sandbox-2.24/libsandbox/libsandbox.c: > 1060 > #5 0x00007ffff7f89f84 in open_DEFAULT > (pathname=pathname@entry=0x7ffffffe8540 > "/dev/shm/__KMP_REGISTERED_LIB_6807_1000", flags=flags@entry=655554) > at > /tmp/portage/sys-apps/sandbox-2.24/work/sandbox-2.24/libsandbox/wrapper- > funcs/__wrapper_simple.c:52 > #6 0x00007ffff79c5717 in shm_open (name=0x114dc11 > "__KMP_REGISTERED_LIB_6807_1000", name@entry=0x114dc10 > "/__KMP_REGISTERED_LIB_6807_1000", oflag=655554, > oflag@entry=194, mode=mode@entry=438) at ../sysdeps/posix/shm_open.c:44 > #7 0x00007ffff773227f in __kmp_register_library_startup() () at > /tmp/portage/sys-libs/libomp-12.0.1/work/openmp/runtime/src/kmp_runtime.cpp: > 6465 > #8 0x00007ffff7732644 in __kmp_do_serial_initialize() () at > /tmp/portage/sys-libs/libomp-12.0.1/work/openmp/runtime/src/kmp_runtime.cpp: > 6719 > #9 0x00007ffff77334ef in __kmp_serial_initialize() () at > /tmp/portage/sys-libs/libomp-12.0.1/work/openmp/runtime/src/kmp_runtime.cpp: > 6990 > #10 0x00007ffff75ad1e3 in __run_fork_handlers > (who=who@entry=atfork_run_child, do_locking=do_locking@entry=false) at > register-atfork.c:135 > #11 0x00007ffff75f24cd in __libc_fork () at ../sysdeps/nptl/fork.c:136 > #12 0x00007ffff7f89cf7 in vfork () at /usr/lib64/libsandbox.so > #13 0x00000000005b69ad in call_process (nargs=<optimized out>, > nargs@entry=6, args=args@entry=0x7fffffff93d0, filefd=filefd@entry=5, > tempfile_index=tempfile_index@entry=-1) > at callproc.c:613 sandbox recently started using fork() to implement vfork(): https://gitweb.gentoo.org/proj/sandbox.git/commit/?id=f43378e14396fe5fad05bff13a73483740205881 Mechanically the hangup happens in sandbox init code where libomp library (clang's -fopenmp) atfork handlers get called before sandbox is initialized. That is typical for .init-style constructors. They are also trickiest to get to work for sandbox :) Normally sandbox tries to initialize by grabbing the sandbox global mutex, but it's already held by something. Perhaps a vfork() sandbox wrapper. I'll try to craft smaller example that illustrates deadlock. The following seems to be enough: #include <stdio.h> #include <sys/types.h> #include <unistd.h> #include <sys/wait.h> #include <omp.h> #define N 1000 void f(double A[N], double B[N], double C[N]) __attribute__((noinline)); void f(double A[N], double B[N], double C[N]) { int i; #pragma omp parallel for for (i=0; i<N; i++) { A[i] = B[i] + C[i]; } } int main() { double A[N], B[N], C[N]; f(A, B, C); pid_t pid = vfork(); if (pid == 0) { /* child */ execl("/bin/echo", "/bin/echo", "child", (char*)NULL); } int ws; pid_t p = wait(&ws); printf("parent found child %u/%u\n", pid, p); } $ clang a.c -o a -fopenmp && time ./a child parent found child 1534983/1534983 real 0m0,003s user 0m0,011s sys 0m0,004s $ clang a.c -o a -fopenmp && time sandbox ./a <hung> ^Csandbox:stop caught signal 2 in pid 1537755 Sandboxed process killed by signal: Interrupt real 0m1,608s user 0m2,669s sys 0m0,084s backtrace looks close: (gdb) bt #0 futex_wait (private=0, expected=2, futex_word=0x7fba1b9cf340 <lock>) at ../sysdeps/nptl/futex-internal.h:146 #1 __lll_lock_wait (futex=futex@entry=0x7fba1b9cf340 <lock>, private=0) at lowlevellock.c:52 #2 0x00007fba1b88e0e3 in __GI___pthread_mutex_lock (mutex=0x7fba1b9cf340 <lock>) at ../nptl/pthread_mutex_lock.c:80 #3 0x00007fba1b9b67c2 in before_syscall () at libsandbox.c:1060 #4 0x00007fba1b9bc128 in open_DEFAULT () at wrapper-funcs/__wrapper_simple.c:52 #5 0x00007fba1b6ba8c8 in shm_open (name=0x74fdf1 "__KMP_REGISTERED_LIB_1571248_1000", oflag=655554, mode=438) at ../sysdeps/posix/shm_open.c:44 #6 0x00007fba1b8e61ef in __kmp_register_library_startup() () from /usr/lib64/libomp.so #7 0x00007fba1b8e6744 in __kmp_do_serial_initialize() () from /usr/lib64/libomp.so #8 0x00007fba1b8e762f in __kmp_serial_initialize () from /usr/lib64/libomp.so #9 0x00007fba1b7481eb in __run_fork_handlers (who=who@entry=atfork_run_child, do_locking=do_locking@entry=true) at register-atfork.c:135 #10 0x00007fba1b78da9d in __libc_fork () at ../sysdeps/nptl/fork.c:136 #11 0x00007fba1b9bbe87 in vfork () at wrapper-funcs/__wrapper_simple.c:47 #12 0x0000000000401391 in main () at a.c:25 Created attachment 725644 [details, diff]
0001-libsandbox-drop-synchronization-from-vfork-and-fork.patch
Having thought a bit more about it fork() (and vfork() via fork()) should never require any synchronization. The patch drops any synchronization.
Please give patch a go. You can drop it to /etc/portage/patches/sys-apps/sandbox/ and rebuild sandbox.
(In reply to Sergei Trofimovich from comment #22) > Created attachment 725644 [details, diff] [details, diff] > 0001-libsandbox-drop-synchronization-from-vfork-and-fork.patch > > Having thought a bit more about it fork() (and vfork() via fork()) should > never require any synchronization. The patch drops any synchronization. > > Please give patch a go. You can drop it to > /etc/portage/patches/sys-apps/sandbox/ and rebuild sandbox. Please ignore. It does not prevent other threads from holding a sb_lock() lock at fork() time. Causes deadlocks dev-python/gevent build. Is this still a problem in Emacs 29.2? (In reply to Ulrich Müller from comment #24) > Is this still a problem in Emacs 29.2? No reply. Closing. |