Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 828844 - sys-apps/portage: newconfd/doins hang when running in podman
Summary: sys-apps/portage: newconfd/doins hang when running in podman
Status: RESOLVED FIXED
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core (show other bugs)
Hardware: All Linux
: Normal normal
Assignee: Portage team
URL:
Whiteboard:
Keywords: InVCS
Depends on:
Blocks: 635020
  Show dependency tree
 
Reported: 2021-12-10 16:49 UTC by John Helmert III
Modified: 2022-03-16 13:50 UTC (History)
1 user (show)

See Also:
Package list:
Runtime testing required: ---


Attachments
When copy_file_range copied zero bytes, fall back to sendfile (untested) (file_copy-handle-zero-bytes-copied-by-copy_file_range.patch,1.45 KB, patch)
2021-12-11 20:49 UTC, Zac Medico
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description John Helmert III archtester Gentoo Infrastructure gentoo-dev Security 2021-12-10 16:49:57 UTC
I've lately been experimenting with doing some development in Podman containers, but I keep hitting a hang related to newconfd/doins. The following is in process of building dev-lang/python-3.10.1 but I have hit this hang with other packages.

Running portage processes:

6e4871941b40 / # ps aux | grep portage
root       41010  0.0  0.0   2592  1760 pts/0    S+   15:48   0:00 [dev-lang/python-3.10.1] sandbox /usr/lib/portage/python3.9/ebuild.sh install
root       41011  0.0  0.0  10952  8216 pts/0    S+   15:48   0:00 /bin/bash /usr/lib/portage/python3.9/ebuild.sh install
root       41027  0.0  0.0  11216  7600 pts/0    S+   15:48   0:00 /bin/bash /usr/lib/portage/python3.9/ebuild.sh install
root       46673  0.0  0.0   7524  4892 pts/0    S+   15:49   0:00 /bin/bash /usr/lib/portage/python3.9/ebuild-helpers/newconfd /var/tmp/portage/dev-lang/python-3.10.1/files/pydoc.conf pydoc-3.10
root       46676  0.0  0.0   7524  4620 pts/0    S+   15:49   0:00 /bin/bash /usr/lib/portage/python3.9/ebuild-helpers/doins /var/tmp/portage/dev-lang/python-3.10.1/temp/.newins_tmp_YVWNlp/pydoc-3.10
root       46677 92.1  0.0  24788 19800 pts/0    R+   15:49  28:29 /usr/bin/python3.9 /usr/lib/portage/python3.9/doins.py --preserve_symlinks --helpers_can_die --insoptions=-m0644 --diroptions=-m0755 --distdir=/var/cache/distfiles --enable_copy_xattr --xattr_exclude=btrfs.* security.evm security.ima .security.selinux system.nfs4_acl user.apache_handler .user.Beagle.* user.dublincore.* user.mime_encoding user.xdg.* --helper=doins --dest=/var/tmp/portage/dev-lang/python-3.10.1/image/etc/conf.d/ -- /var/tmp/portage/dev-lang/python-3.10.1/temp/.newins_tmp_YVWNlp/pydoc-3.10

The doins.py process seems to be infinitely looping (calling strace with the PID from the host's namespace as the container doesn't have strace):

~ # strace -p 3891753 2>&1 | head -30
strace: Process 3891753 attached
lseek(3, 0, SEEK_HOLE)                  = 169
copy_file_range(3, [0], 4, [0], 169, 0) = 0
lseek(3, 0, SEEK_DATA)                  = 0
lseek(3, 0, SEEK_HOLE)                  = 169
copy_file_range(3, [0], 4, [0], 169, 0) = 0
lseek(3, 0, SEEK_DATA)                  = 0
lseek(3, 0, SEEK_HOLE)                  = 169
copy_file_range(3, [0], 4, [0], 169, 0) = 0
lseek(3, 0, SEEK_DATA)                  = 0
lseek(3, 0, SEEK_HOLE)                  = 169
copy_file_range(3, [0], 4, [0], 169, 0) = 0
lseek(3, 0, SEEK_DATA)                  = 0
lseek(3, 0, SEEK_HOLE)                  = 169
copy_file_range(3, [0], 4, [0], 169, 0) = 0
lseek(3, 0, SEEK_DATA)                  = 0
lseek(3, 0, SEEK_HOLE)                  = 169                                                                                                                                                                                                                                     copy_file_range(3, [0], 4, [0], 169, 0) = 0                                                                                                                                                                                                                                       lseek(3, 0, SEEK_DATA)                  = 0
lseek(3, 0, SEEK_HOLE)                  = 169
copy_file_range(3, [0], 4, [0], 169, 0) = 0
lseek(3, 0, SEEK_DATA)                  = 0
lseek(3, 0, SEEK_HOLE)                  = 169
copy_file_range(3, [0], 4, [0], 169, 0) = 0
lseek(3, 0, SEEK_DATA)                  = 0
lseek(3, 0, SEEK_HOLE)                  = 169
copy_file_range(3, [0], 4, [0], 169, 0) = 0                                                                                                                                                                                                                                       lseek(3, 0, SEEK_DATA)                  = 0                                                                                                                                                                                                                                       lseek(3, 0, SEEK_HOLE)                  = 169
copy_file_range(3, [0], 4, [0], 169, 0) = 0

Checking the process's file descriptors:

6e4871941b40 / # ls -l /proc/46677/fd
total 0
lr-x------ 1 root root 64 Dec 10 15:56 0 -> /dev/null
lrwx------ 1 root root 64 Dec 10 16:37 1 -> /dev/pts/2                                                                                                                                                                                                                            lrwx------ 1 root root 64 Dec 10 16:37 2 -> /dev/pts/2                                                                                                                                                                                                                            lr-x------ 1 root root 64 Dec 10 16:37 3 -> /var/tmp/portage/dev-lang/python-3.10.1/temp/.newins_tmp_YVWNlp/pydoc-3.10
l-wx------ 1 root root 64 Dec 10 16:37 4 -> /var/tmp/portage/dev-lang/python-3.10.1/image/etc/conf.d/pydoc-3.10

6e4871941b40 / # cat /var/tmp/portage/dev-lang/python-3.10.1/temp/.newins_tmp_YVWNlp/pydoc-3.10
# /etc/init.d/pydoc.conf

# This file contains the configuration for pydoc's internal webserver.

# Default port for Python's pydoc server.                                                                                                                                                                                                                                         @PYDOC_PORT_VARIABLE@="7464"
6e4871941b40 / # cat /var/tmp/portage/dev-lang/python-3.10.1/image/etc/conf.d/pydoc-3.10
[empty]
6e4871941b40 / # du -b /var/tmp/portage/dev-lang/python-3.10.1/temp/.newins_tmp_YVWNlp/pydoc-3.10
169     /var/tmp/portage/dev-lang/python-3.10.1/temp/.newins_tmp_YVWNlp/pydoc-3.10

I don't think there's anything special about the filesystem; it's just ext4 under podman (no SELinux or anything, either). Attaching to the doins.py process with gdb and running py-bt isn't helpful; the plugin can't make sense of the backtrace, but:

(gdb) bt
#0  0x00007f6072296667 in lseek64 () from /lib64/libc.so.6
#1  0x00007f607121b562 in ?? () from /usr/lib/python3.9/site-packages/portage/util/file_copy/reflink_linux.cpython-39-x86_64-linux-gnu.so
#2  0x00007f60724763c4 in ?? () from /usr/lib64/libpython3.9.so.1.0
#3  0x00007f607243b1cd in _PyObject_MakeTpCall () from /usr/lib64/libpython3.9.so.1.0
#4  0x00007f60723fe859 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.9.so.1.0
#5  0x00007f60723f79db in ?? () from /usr/lib64/libpython3.9.so.1.0
#6  0x00007f60723fdd05 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.9.so.1.0
#7  0x00007f60723f79db in ?? () from /usr/lib64/libpython3.9.so.1.0
#8  0x00007f60723fed9b in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.9.so.1.0
#9  0x00007f60723f79db in ?? () from /usr/lib64/libpython3.9.so.1.0
#10 0x00007f60723fed9b in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.9.so.1.0
#11 0x00007f60723f79db in ?? () from /usr/lib64/libpython3.9.so.1.0
#12 0x00007f60723fdd05 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.9.so.1.0
#13 0x00007f60723f79db in ?? () from /usr/lib64/libpython3.9.so.1.0
#14 0x00007f60723fdd05 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.9.so.1.0
#15 0x00007f607250064d in ?? () from /usr/lib64/libpython3.9.so.1.0
#16 0x00007f607250097e in _PyEval_EvalCodeWithName () from /usr/lib64/libpython3.9.so.1.0
#17 0x00007f60725009cb in PyEval_EvalCodeEx () from /usr/lib64/libpython3.9.so.1.0
#18 0x00007f60725009fb in PyEval_EvalCode () from /usr/lib64/libpython3.9.so.1.0
#19 0x00007f607253c993 in ?? () from /usr/lib64/libpython3.9.so.1.0
#20 0x00007f607253cba7 in ?? () from /usr/lib64/libpython3.9.so.1.0
#21 0x00007f607253f6cc in ?? () from /usr/lib64/libpython3.9.so.1.0
#22 0x00007f607253f86b in PyRun_SimpleFileExFlags () from /usr/lib64/libpython3.9.so.1.0
#23 0x00007f607255af17 in Py_RunMain () from /usr/lib64/libpython3.9.so.1.0
#24 0x00007f607255b1ca in Py_BytesMain () from /usr/lib64/libpython3.9.so.1.0
#25 0x00007f60721c61ca in ?? () from /lib64/libc.so.6
#26 0x00007f60721c6278 in __libc_start_main () from /lib64/libc.so.6
#27 0x000055f0dc830081 in _start ()

I'm at a loss for how to debug this further, but this seems to be easy to reproduce so I'm happy to add anything that would help debugging.
Comment 1 John Helmert III archtester Gentoo Infrastructure gentoo-dev Security 2021-12-11 06:41:01 UTC
Hacking in a `-m trace -t` to the Python invocation in /usr/lib/portage/python3.9/ebuild-helpers/doins and running a triggering build eventually hangs here:

 --- modulename: doins, funcname: _is_install_allowed
doins.py(228):         try:
doins.py(229):             dest_lstat = os.lstat(dest)
doins.py(230):         except OSError as e:
doins.py(233):             if e.errno == errno.ENOENT:
doins.py(234):                 return True
doins.py(187):         try:
doins.py(188):             os.unlink(dest)
doins.py(189):         except OSError as e:
doins.py(192):             if e.errno != errno.ENOENT:
doins.py(194):         try:
doins.py(195):             copyfile(source, dest)
 --- modulename: __init__, funcname: _optimized_copyfile
__init__.py(28):     with open(src, "rb", buffering=0) as src_file, open(
__init__.py(29):         dst, "wb", buffering=0
__init__.py(28):     with open(src, "rb", buffering=0) as src_file, open(
__init__.py(30):     ) as dst_file:
__init__.py(31):         _file_copy(src_file.fileno(), dst_file.fileno())
Comment 2 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2021-12-11 06:42:14 UTC
Could you try turning off the native extensions flag on Portage?
Comment 3 John Helmert III archtester Gentoo Infrastructure gentoo-dev Security 2021-12-11 06:50:14 UTC
(In reply to Sam James from comment #2)
> Could you try turning off the native extensions flag on Portage?

This works!
Comment 4 John Helmert III archtester Gentoo Infrastructure gentoo-dev Security 2021-12-11 07:21:23 UTC
Perhaps being triggered by this while loop:

https://github.com/gentoo/portage/blob/master/src/portage_util_file_copy_reflink_linux.c#L246
Comment 5 John Helmert III archtester Gentoo Infrastructure gentoo-dev Security 2021-12-11 07:46:11 UTC
(In reply to John Helmert III from comment #4)
> Perhaps being triggered by this while loop:
> 
> https://github.com/gentoo/portage/blob/master/src/
> portage_util_file_copy_reflink_linux.c#L246

Indeed:

(gdb) bt
#0  0x00007ffb4b8f6667 in lseek64 () from /lib64/libc.so.6
#1  0x00007ffb4a885385 in do_lseek_data (fd_out=4, fd_in=3, off_out=0x7ffd333fca20) at src/portage_util_file_copy_reflink_linux.c:137
#2  0x00007ffb4a8855a6 in _reflink_linux_file_copy (self=<module at remote 0x7ffb4a8ab9a0>, Python Exception <class 'gdb.error'>: There is no member named ob_item.
args=) at src/portage_util_file_copy_reflink_linux.c:247
Comment 6 Zac Medico gentoo-dev 2021-12-11 20:32:00 UTC
(In reply to John Helmert III from comment #0)
> ~ # strace -p 3891753 2>&1 | head -30
> strace: Process 3891753 attached
> lseek(3, 0, SEEK_HOLE)                  = 169
> copy_file_range(3, [0], 4, [0], 169, 0) = 0
> lseek(3, 0, SEEK_DATA)                  = 0
> lseek(3, 0, SEEK_HOLE)                  = 169
> copy_file_range(3, [0], 4, [0], 169, 0) = 0
> lseek(3, 0, SEEK_DATA)                  = 0
> lseek(3, 0, SEEK_HOLE)                  = 169

It looks like copy_file_range copies zero bytes, and then we continuously retry without making any progress because copy_file_range "successfully" copies zero bytes on each try.

> I don't think there's anything special about the filesystem; it's just ext4
> under podman (no SELinux or anything, either).

Is /var/tmp/portage a bind mount to a plain ext4 filesystem? If not, I would expect it to using one of podmans filesystem drivers such as overlayfs or fuse-overlayfs. Anyway, I'll look into handling the case where copy_file_range copies zero bytes somehow (we can fallback to sendfile or a copy with a byte array in this case).
Comment 7 Zac Medico gentoo-dev 2021-12-11 20:49:51 UTC
Created attachment 758621 [details, diff]
When copy_file_range copied zero bytes, fall back to sendfile (untested)

Please test this patch.
Comment 8 John Helmert III archtester Gentoo Infrastructure gentoo-dev Security 2021-12-11 22:32:18 UTC
(In reply to Zac Medico from comment #7)
> Created attachment 758621 [details, diff] [details, diff]
> When copy_file_range copied zero bytes, fall back to sendfile (untested)
> 
> Please test this patch.

This seems to work!

Tested-by: John Helmert III <ajak@gentoo.org>
Comment 9 Larry the Git Cow gentoo-dev 2021-12-12 01:14:54 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=fe2e58325ffd1d4424564998f64bed4cb4ab8ffa

commit fe2e58325ffd1d4424564998f64bed4cb4ab8ffa
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2021-12-11 20:40:04 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2021-12-12 01:14:13 +0000

    file_copy: handle zero bytes copied by copy_file_range (bug 828844)
    
    When copy_file_range copied zero bytes, fall back to sendfile,
    so that we don't call copy_file_range in an infinite loop.
    
    Bug: https://bugs.gentoo.org/828844
    Tested-by: John Helmert III <ajak@gentoo.org>
    Signed-off-by: Zac Medico <zmedico@gentoo.org>

 src/portage_util_file_copy_reflink_linux.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
Comment 10 Larry the Git Cow gentoo-dev 2021-12-12 08:13:40 UTC
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=82fde90c8a9f82d2f5a18424100b0676fcb71f73

commit 82fde90c8a9f82d2f5a18424100b0676fcb71f73
Author:     Sam James <sam@gentoo.org>
AuthorDate: 2021-12-12 08:09:20 +0000
Commit:     Sam James <sam@gentoo.org>
CommitDate: 2021-12-12 08:13:31 +0000

    sys-apps/portage: add 3.0.30
    
    Closes: https://bugs.gentoo.org/828844
    Closes: https://bugs.gentoo.org/828966
    Signed-off-by: Sam James <sam@gentoo.org>

 sys-apps/portage/Manifest              |   1 +
 sys-apps/portage/portage-3.0.30.ebuild | 267 +++++++++++++++++++++++++++++++++
 2 files changed, 268 insertions(+)