Bug 674332 - portage fails sometimes with "Failed to copy file" on 32-bit systems
Summary: portage fails sometimes with "Failed to copy file" on 32-bit systems
Status: RESOLVED NEEDINFO
Alias: None
Product: Portage Development
Classification: Unclassified
Component: Core
Hardware: x86 Linux
Importance: Normal normal
Assignee: Portage team
URL:
Whiteboard:
Keywords: PATCH
Depends on:
Blocks: 635020
Reported: 2019-01-02 13:17 UTC by Tor Rune Skoglund
Modified: 2023-01-17 02:40 UTC

See Also:
Package list:
Runtime testing required: ---


Attachments
emerge --info (emerge-info.txt,5.76 KB, text/plain)
2019-01-02 13:17 UTC, Tor Rune Skoglund
strace.log (strace.log,42.16 KB, text/x-log)
2019-01-03 09:59 UTC, Tor Rune Skoglund

Description Tor Rune Skoglund 2019-01-02 13:17:54 UTC
Created attachment 559460 [details]
emerge --info

This is a weird problem. Recent versions of portage fail to install some packages on 32-bit systems. The problem is not consistent, but it has been reproduced on various kernels and various recent versions of portage.

We have not observed the same problem on 64-bit kernels and systems. Likewise, we have not observed this problem when installing 32-bit containers on 64-bit kernels (shared kernel).

Emerging shadow as an example here, the emerge ends during install with "doins failed":

>>> Install shadow-4.6 into /var/tmp/portage/sys-apps/shadow-4.6/image/ category sys-apps
[...]
make[1]: Leaving directory '/var/tmp/portage/sys-apps/shadow-4.6/work/shadow-4.6'
ERROR:root:Failed to copy file: _parsed_options=Namespace(group=-1, mode=384, owner=-1, preserve_timestamps=False), source=b'/var/tmp/portage/sys-apps/shadow-4.6/files/default/useradd', dest_dir=b'/var/tmp/portage/sys-apps/shadow-4.6/image/etc/default'
Traceback (most recent call last):
  File "/usr/lib/portage/python3.4/doins.py", line 209, in run
    copyfile(source, dest)
  File "/usr/lib/python3.4/site-packages/portage/util/file_copy/__init__.py", line 30, in _optimized_copyfile
    _file_copy(src_file.fileno(), dst_file.fileno())
OSError: [Errno 117] Structure needs cleaning
 * ERROR: sys-apps/shadow-4.6::gentoo failed (install phase):
 *   doins failed

At the same time, dmesg reports: 

"EXT4-fs error (device sda3): ext4_map_blocks:568: inode #1572883: block 1937055779: comm python3.4m: lblock 0 mapped to illegal pblock (length 1)"

Even after an fsck and a reboot, the same error returns whenever the operation is retried on a system where it has occurred once. However, after applying the patch below, installs work without errors.

Ref: https://forums.gentoo.org/viewtopic-p-8292664.html :

"I've opened file /usr/lib/python3.6/site-packages/portage/util/file_copy/__init__.py and replaced 

try: 
       from portage.util.file_copy.reflink_linux import file_copy as _file_copy 
except ImportError: 
       _file_copy = None 

with 

#try: 
#       from portage.util.file_copy.reflink_linux import file_copy as _file_copy 
#except ImportError: 
_file_copy = None 

Now I can install packages again."

We believe this is a strange bug that was introduced in some recent version of portage, as it did not happen on identical systems in spring 2018.
Comment 1 Zac Medico gentoo-dev 2019-01-02 21:19:36 UTC
There have not been any recent portage changes that would trigger this, but if you've recently enabled USE=native-extensions for portage then that would trigger it.

From your dmesg error, it looks like copy_file_range triggers the errno 117 (EUCLEAN). Hopefully portage can handle copy_file_range EUCLEAN errors like it handles EOPNOTSUPP for bug 641088 here:

https://gitweb.gentoo.org/proj/portage.git/commit/?id=dad9cce8a1e2360e8483e0f78e29e20bd5fdce49
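
As a rough illustration of that idea, here is a sketch in Python rather than in the C extension where Portage's real handling lives. It assumes Python 3.8+ on Linux for `os.copy_file_range`; the names `copyfile_with_fallback`, `_kernel_copy`, and `FALLBACK_ERRNOS` are hypothetical. The fast path is tried first, and only the listed errno values trigger a userspace copy.

```python
import errno
import os
import shutil

# EUCLEAN is Linux-specific; 117 is the value shown in the traceback.
EUCLEAN = getattr(errno, "EUCLEAN", 117)
FALLBACK_ERRNOS = frozenset({errno.EOPNOTSUPP, EUCLEAN})


def _kernel_copy(src_fd, dst_fd, size):
    """Copy `size` bytes in the kernel via copy_file_range."""
    remaining = size
    while remaining > 0:
        copied = os.copy_file_range(src_fd, dst_fd, remaining)
        if copied == 0:  # source ended earlier than expected
            break
        remaining -= copied


def copyfile_with_fallback(src, dst, fast_copy=_kernel_copy):
    """Return "fast" or "fallback" depending on which path copied the data."""
    with open(src, "rb") as src_file, open(dst, "wb") as dst_file:
        size = os.fstat(src_file.fileno()).st_size
        try:
            fast_copy(src_file.fileno(), dst_file.fileno(), size)
            return "fast"
        except AttributeError:
            pass  # os.copy_file_range does not exist on this platform
        except OSError as exc:
            if exc.errno not in FALLBACK_ERRNOS:
                raise  # unrelated failures still propagate
        # Discard any partial output, then copy in userspace.
        src_file.seek(0)
        dst_file.seek(0)
        dst_file.truncate()
        shutil.copyfileobj(src_file, dst_file)
        return "fallback"
```

One trade-off of this design: falling back on EUCLEAN lets the install finish, but it also hides a condition that the kernel reports as filesystem corruption, so logging the original error would probably be wise.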
Comment 2 Zac Medico gentoo-dev 2019-01-02 22:09:19 UTC
We could use an strace log created like this to verify which syscall triggers the EUCLEAN error:

strace python -c 'from portage.util.file_copy import copyfile; copyfile("/var/tmp/portage/sys-apps/shadow-4.6/files/default/useradd", "/var/tmp/portage/sys-apps/shadow-4.6/image/etc/default")' > strace.log 2>&1
Comment 4 Tor Rune Skoglund 2019-01-03 09:59:22 UTC
Created attachment 559582 [details]
strace.log

strace.log as result of:
strace python -c 'from portage.util.file_copy import copyfile; copyfile("/var/tmp/portage/sys-apps/shadow-4.6/files/default/useradd", "/var/tmp/portage/sys-apps/shadow-4.6/image/etc/default")' > strace.log 2>&1
Comment 5 Thomas Lindroth 2019-01-03 11:19:04 UTC
The error in dmesg indicates filesystem corruption. Ext4 doesn't use EUCLEAN directly, but it #defines EFSCORRUPTED as EUCLEAN, so that errno is returned when ext4 detects filesystem corruption.

Filesystem corruption can survive fsck, but since the problem only happens on 32-bit, a kernel bug might be involved. 32-bit gets a lot less testing, especially with unusual calls like copy_file_range. Try to reproduce the problem on a new, clean filesystem, and if that also fails, report the problem to the ext4 maintainers.
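
To check how errno 117 maps to a name and message on a given machine, a small sketch like the following may help. The helper name `describe_errno` is hypothetical; it uses only the standard `errno` and `os` modules.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


# 117 is the value from the traceback; the result is platform-dependent.
name, message = describe_errno(117)
print(name, "-", message)
```

On a Linux system this should report EUCLEAN together with the "Structure needs cleaning" text seen in the traceback; other platforms may not define that code at all.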
Comment 6 Tor Rune Skoglund 2019-01-03 11:51:32 UTC
OK.

Still, there are several peculiarities with this: 

* We have reproduced the problem several times lately, on several types of disks, new and old. 
* We have reproduced the problem on several kernels, ranging from roughly 4.9.78 to 4.9.135.
* There are other similar reports with quite different kernels, e.g. the forum post with the fix seems to describe the same problem, but on a quite different kernel (3.2.40). 
* The problem was not present in spring 2018, with the same kernel.
* Other posts suggest that similar problems could be related to xattrs and/or ACL support (e.g. https://forums.gentoo.org/viewtopic-t-1073524-start-0.html ), but we have tested with and without xattrs/ACLs in various configurations without conclusive results.

And lastly: 

* The fix mentioned in the forum post, commenting out those lines of portage code, ALWAYS makes the problem go away. 
* Downgrading to portage-2.3.8 also makes the problem go away (but has other issues).

This seems to be the same bug: https://bugs.gentoo.org/672212

May be related: https://forums.gentoo.org/viewtopic-t-1075756.html
Comment 7 Tor Rune Skoglund 2019-01-03 12:31:39 UTC
(In reply to Zac Medico from comment #1)
> There have not been any recent portage changes that would trigger this, but
> if you've recently enabled USE=native-extensions for portage then that would
> trigger it.
> 
> From your dmesg error, it looks like copy_file_range triggers the errno 117
> (EUCLEAN). Hopefully portage can handle copy_file_range EUCLEAN errors like
> it handles EOPNOTSUPP for bug 641088 here:
> 
> https://gitweb.gentoo.org/proj/portage.git/commit/?id=dad9cce8a1e2360e8483e0f78e29e20bd5fdce49

I can confirm that unsetting native-extensions also makes the problem go away.

I cannot recall that we have touched that USE flag manually. However, during the autumn, the profile was updated from 13 to 17.0. 

Is it safe to unset native-extensions for portage permanently? The description says: 

"Compiles native "C" extensions (speedups, instead of using python backup code). Currently includes libc-locales. This should only be temporarily disabled for some bootstrapping operations. Cross-compilation is not supported."
Comment 8 Zac Medico gentoo-dev 2019-01-04 00:31:31 UTC
(In reply to Tor Rune Skoglund from comment #6)
> This seems to be the same bug: https://bugs.gentoo.org/672212
> 
> May be related: https://forums.gentoo.org/viewtopic-t-1075756.html

Those are separate issues because they have different errno values.

(In reply to Tor Rune Skoglund from comment #7)
> Is it safe to unset native-extensions for portage permanently? The
> description says: 
> 
> "Compiles native "C" extensions (speedups, instead of using python backup
> code). Currently includes libc-locales. This should only be temporarily
> disabled for some bootstrapping operations. Cross-compilation is not
> supported."

Yes, it's safe to disable native-extensions permanently, since there will always be a sane fallback. However, it would be nice to find out the root cause of your problem and have it reported to the kernel developers if appropriate.
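
To see which path a given installation would take, a small check along these lines could be used. It is a sketch: the helper name `native_file_copy_available` is hypothetical, while the module path and the `file_copy` attribute are taken from the import quoted in the description.

```python
import importlib


def native_file_copy_available():
    """Return True if the compiled reflink_linux helper can be imported."""
    try:
        module = importlib.import_module("portage.util.file_copy.reflink_linux")
    except ImportError:
        # Portage is not importable, or it was built without the extension.
        return False
    return hasattr(module, "file_copy")


print("native file_copy available:", native_file_copy_available())
```

If this reports False after rebuilding portage without native-extensions, the pure-Python copy path is the one in use.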
Comment 9 Zac Medico gentoo-dev 2019-01-04 02:20:54 UTC
It would be interesting to see if the problem is reproducible with the latest LTS kernel, currently 4.19.x.
Comment 10 Tor Rune Skoglund 2019-01-04 12:11:25 UTC
(In reply to Zac Medico from comment #9)
> It would be interesting to see if the problem is reproducible with the
> latest LTS kernel, currently 4.19.x.

We can probably test that, but it might take some time...
Comment 11 Adrien Dessemond 2022-01-12 13:19:53 UTC Comment hidden (obsolete)
Comment 12 Adrien Dessemond 2022-01-12 13:21:42 UTC Comment hidden (obsolete)
Comment 13 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2023-01-17 02:40:49 UTC
(In reply to Tor Rune Skoglund from comment #10)
> (In reply to Zac Medico from comment #9)
> > It would be interesting to see if the problem is reproducible with the
> > latest LTS kernel, currently 4.19.x.
> 
> We can probably test that, but it might take some time...

Closing.