Created attachment 559460
emerge --info

This is a weird problem. Recent versions of portage fail to install some packages on 32-bit systems. The problem does not occur consistently, but we have reproduced it on various kernels and with various recent versions of portage. We have not observed it on 64-bit kernels and systems. Likewise, we have not observed it when installing in 32-bit containers on 64-bit kernels (shared kernel).

Using shadow as an example, the emerge ends during the install phase with "doins failed":

    >>> Install shadow-4.6 into /var/tmp/portage/sys-apps/shadow-4.6/image/ category sys-apps
    [...]
    make[1]: Leaving directory '/var/tmp/portage/sys-apps/shadow-4.6/work/shadow-4.6'
    ERROR:root:Failed to copy file: _parsed_options=Namespace(group=-1, mode=384, owner=-1, preserve_timestamps=False), source=b'/var/tmp/portage/sys-apps/shadow-4.6/files/default/useradd', dest_dir=b'/var/tmp/portage/sys-apps/shadow-4.6/image/etc/default'
    Traceback (most recent call last):
      File "/usr/lib/portage/python3.4/doins.py", line 209, in run
        copyfile(source, dest)
      File "/usr/lib/python3.4/site-packages/portage/util/file_copy/__init__.py", line 30, in _optimized_copyfile
        _file_copy(src_file.fileno(), dst_file.fileno())
    OSError: [Errno 117] Structure needs cleaning
     * ERROR: sys-apps/shadow-4.6::gentoo failed (install phase):
     *   doins failed

At the same time, dmesg reports:

    EXT4-fs error (device sda3): ext4_map_blocks:568: inode #1572883: block 1937055779: comm python3.4m: lblock 0 mapped to illegal pblock (length 1)

Once the error has happened on a system, it returns whenever the operation is retried, even after an fsck and a reboot. However, after applying the patch below, installs work without errors.
Ref: https://forums.gentoo.org/viewtopic-p-8292664.html :

"I've opened file /usr/lib/python3.6/site-packages/portage/util/file_copy/__init__.py and replaced

    try:
        from portage.util.file_copy.reflink_linux import file_copy as _file_copy
    except ImportError:
        _file_copy = None

with

    #try:
    #    from portage.util.file_copy.reflink_linux import file_copy as _file_copy
    #except ImportError:
    _file_copy = None

Now I can install packages again."

We believe this is a strange bug that was introduced in some recent version of portage, as it did not happen on identical systems in spring 2018.
There have not been any recent portage changes that would trigger this, but if you've recently enabled USE=native-extensions for portage then that would trigger it.

From your dmesg error, it looks like copy_file_range triggers the errno 117 (EUCLEAN). Hopefully portage can handle copy_file_range EUCLEAN errors like it handles EOPNOTSUPP for bug 641088 here:

https://gitweb.gentoo.org/proj/portage.git/commit/?id=dad9cce8a1e2360e8483e0f78e29e20bd5fdce49
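For illustration only, the kind of handling suggested above could look roughly like the following. This is a minimal sketch and not the actual portage patch: it uses Python's own os.copy_file_range (available from Python 3.8 on Linux) instead of portage's C extension, and the names copy_with_fallback and FALLBACK_ERRNOS are made up for this example.

```python
import errno
import os
import shutil

# Errno values after which the copy is retried with plain read/write.
# EUCLEAN is Linux-specific, so fall back to its usual value (117).
FALLBACK_ERRNOS = {
    errno.EOPNOTSUPP,
    errno.EXDEV,
    errno.ENOSYS,
    getattr(errno, "EUCLEAN", 117),
}


def copy_with_fallback(src_path, dst_path):
    """Copy a file and return the name of the method that completed it."""
    if hasattr(os, "copy_file_range"):
        try:
            with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
                remaining = os.fstat(src.fileno()).st_size
                while remaining > 0:
                    copied = os.copy_file_range(src.fileno(), dst.fileno(), remaining)
                    if copied == 0:
                        break  # no progress; let the plain copy finish the job
                    remaining -= copied
            if remaining == 0:
                return "copy_file_range"
        except OSError as e:
            if e.errno not in FALLBACK_ERRNOS:
                raise
    # Reopening with "wb" truncates any partial output from the attempt above.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return "read_write"
```

Note that falling back on EUCLEAN only hides the symptom from the installer. The kernel still reports corruption in dmesg, so the underlying cause would remain to be investigated.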
We could use an strace log created like this to verify which syscall triggers the EUCLEAN error:

    strace python -c 'from portage.util.file_copy import copyfile; copyfile("/var/tmp/portage/sys-apps/shadow-4.6/files/default/useradd", "/var/tmp/portage/sys-apps/shadow-4.6/image/etc/default")' > strace.log 2>&1
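Once such a log exists, picking out the failing syscall can be scripted. The following is a small sketch that assumes strace's default output format, where a failed call ends in "= -1 ERRNAME (description)". The helper name failed_syscalls is hypothetical.

```python
import re
from collections import Counter

# Matches failed calls in strace's default output, i.e. lines shaped like
#   name(args...) = -1 ERRNAME (description)
# optionally prefixed by a pid when strace follows child processes.
FAILED_CALL = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*\)\s+=\s+-1\s+(E[A-Z0-9]+)\b"
)


def failed_syscalls(lines, errname="EUCLEAN"):
    """Count the syscalls that failed with the given errno name."""
    counts = Counter()
    for line in lines:
        match = FAILED_CALL.match(line)
        if match and match.group(2) == errname:
            counts[match.group(1)] += 1
    return counts
```

Passing an open strace.log file object to failed_syscalls() would return a counter keyed by syscall name, which should show whether copy_file_range is the call returning EUCLEAN.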
Patch posted for review: https://archives.gentoo.org/gentoo-portage-dev/message/297dd66077eb6d9dcb46759ef6c7001c https://github.com/gentoo/portage/pull/392
Created attachment 559582
strace.log

strace.log as result of:

    strace python -c 'from portage.util.file_copy import copyfile; copyfile("/var/tmp/portage/sys-apps/shadow-4.6/files/default/useradd", "/var/tmp/portage/sys-apps/shadow-4.6/image/etc/default")' > strace.log 2>&1
The error in dmesg indicates filesystem corruption. Ext4 doesn't use EUCLEAN directly, but it #defines EFSCORRUPTED as EUCLEAN, so that errno is returned when ext4 detects FS corruption. Filesystem corruption can survive fsck, but since the problem only happens on 32-bit, there might be a kernel bug involved. 32-bit gets a lot less testing, especially when using unusual calls like copy_file_range. Try to reproduce the problem on a new, clean filesystem, and if that also fails, report the problem to the ext4 maintainers.
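To connect the "Errno 117" in the traceback with the EUCLEAN name used here, the number can be looked up from Python. This is just a convenience sketch; describe_errno is a made-up helper name, and both the symbolic name and the message depend on the platform the code runs on.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    return errno.errorcode.get(code, "unknown"), os.strerror(code)


# 117 is the value from the traceback; on common Linux architectures
# this number is named EUCLEAN.
name, message = describe_errno(117)
print(name, "-", message)
```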
OK. Still, there are several peculiarities with this:

* We have reproduced the problem several times lately, on several types of disks, new and old.
* We have reproduced the problem on several kernels, ranging from roughly 4.9.78 to 4.9.135.
* There are other similar reports with quite different kernels, e.g. the forum post with the fix seems to describe the same problem, but on a quite different kernel (3.2.40).
* The problem was not present in spring 2018, with the same kernel.
* Other posts suggest that similar problems could be related to xattrs and/or acl support (e.g. https://forums.gentoo.org/viewtopic-t-1073524-start-0.html ), but we have tested with and without xattrs/acl in various configurations without conclusive success.

And lastly:

* The fix mentioned in the forum post, commenting out those lines of portage code, ALWAYS makes the problem go away.
* Downgrading to portage-2.3.8 also makes the problem go away (but has other issues).

This seems to be the same bug: https://bugs.gentoo.org/672212

May be related: https://forums.gentoo.org/viewtopic-t-1075756.html
(In reply to Zac Medico from comment #1)
> There have not been any recent portage changes that would trigger this, but
> if you've recently enabled USE=native-extensions for portage then that would
> trigger it.
>
> From your dmesg error, it looks like copy_file_range triggers the errno 117
> (EUCLEAN). Hopefully portage can handle copy_file_range EUCLEAN errors like
> it handles EOPNOTSUPP for bug 641088 here:
>
> https://gitweb.gentoo.org/proj/portage.git/commit/?id=dad9cce8a1e2360e8483e0f78e29e20bd5fdce49

I can confirm that unsetting native-extensions also makes the problem go away. I cannot recall that we have touched that use flag manually. However, during the autumn, the profile has been updated from 13 to 17.0.

Is it safe to unset native-extensions for portage permanently? The description says:

"Compiles native "C" extensions (speedups, instead of using python backup code). Currently includes libc-locales. This should only be temporarily disabled for some bootstrapping operations. Cross-compilation is not supported."
(In reply to Tor Rune Skoglund from comment #6)
> This seems to be the same bug: https://bugs.gentoo.org/672212
>
> May be related: https://forums.gentoo.org/viewtopic-t-1075756.html

Those are separate issues because they have different errno values.

(In reply to Tor Rune Skoglund from comment #7)
> Is it safe to unset native-extensions for portage permanently? The
> description says:
>
> "Compiles native "C" extensions (speedups, instead of using python backup
> code). Currently includes libc-locales. This should only be temporarily
> disabled for some bootstrapping operations. Cross-compilation is not
> supported."

Yes, it's safe to disable native-extensions permanently, since there will always be a sane fallback. However, it would be nice to find out the root cause of your problem and have it reported to the kernel developers if appropriate.
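As a rough picture of what "a sane fallback" means for this module, the optional import can be paired with a pure-Python copy. This is a sketch pieced together from the import quoted in the forum post and the call shown in the traceback. The wrapper name _native_copyfile and the selection line at the end are assumptions for illustration, not a copy of portage's source.

```python
import shutil

try:
    # Only importable when portage was built with USE=native-extensions.
    from portage.util.file_copy.reflink_linux import file_copy as _file_copy
except ImportError:
    _file_copy = None


def _native_copyfile(src, dst):
    # Hand raw file descriptors to the C helper, as in the traceback.
    with open(src, "rb", buffering=0) as src_file, \
            open(dst, "wb", buffering=0) as dst_file:
        _file_copy(src_file.fileno(), dst_file.fileno())


# Assumed selection logic: use the pure-Python copy when the extension
# is missing, otherwise the native helper.
copyfile = shutil.copyfile if _file_copy is None else _native_copyfile
```

With USE=-native-extensions the import fails, _file_copy stays None, and copies go through shutil.copyfile, which has the same effect as the manual edit from the forum post.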
It would be interesting to see if the problem is reproducible with the latest LTS kernel, currently 4.19.x.
(In reply to Zac Medico from comment #9)
> It would be interesting to see if the problem is reproducible with the
> latest LTS kernel, currently 4.19.x.

We can probably test that, but it might take some time...
Got the issue again, this time with ZFS 2.1.1 (+ Linux kernel 5.16.0). Not sure what triggered the issue this time, but the fix remains the same: comment out the try/except block at the beginning of portage/util/file_copy/__init__.py to leave only _file_copy = None
oops...! Sorry... messed up with my browser. Just ignore my last comment.
(In reply to Tor Rune Skoglund from comment #10)
> (In reply to Zac Medico from comment #9)
> > It would be interesting to see if the problem is reproducible with the
> > latest LTS kernel, currently 4.19.x.
>
> We can probably test that, but it might take some time...

Closing.