Summary: | sys-apps/portage-2.3.84 file_copy failure (gawk emerge doins failed) ** WORK AROUND ** | ||
---|---|---|---|
Product: | Portage Development | Reporter: | Michael South <msouth> |
Component: | Core | Assignee: | Portage team <dev-portage> |
Status: | RESOLVED OBSOLETE | ||
Severity: | normal | CC: | alex.iris.parker, Etrnls, jstein |
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Bug Depends on: | |||
Bug Blocks: | 635020 | ||
Attachments: |
Build log
Fragment of **successful** emerge strace. |
Description
Michael South
2020-02-22 11:37:51 UTC
Wonderful, wonderful news. (deep sarcasm...) Heisenbug... Same behavior with Portage 2.3.79 and 2.3.89. So, still in Portage 2.3.89, I emerged strace (which worked) and did ~ # strace -o portage-2.3.89-emerge-gawk-strace.txt -f emerge -1 gawk ...and it worked without problems. Race condition? BTW, I previously tried and failed with -j1 (both in command line and in make.conf). Tried `emerge -1 gawk' again (still Portage 2.3.89), and it failed. Tried this in file_copy/__init__.py: with open(src, 'rb', buffering=0) as src_file, \ open(dst, 'wb', buffering=0) as dst_file: + time.sleep(1) _file_copy(src_file.fileno(), dst_file.fileno()) and + time.sleep(1) with open(src, 'rb', buffering=0) as src_file, \ open(dst, 'wb', buffering=0) as dst_file: _file_copy(src_file.fileno(), dst_file.fileno()) Slower in install, but still failed. This failed, too: + with open(src, 'rb', buffering=0) as src_file, \ + open(dst, 'wb', buffering=0) as dst_file: + pass + os.remove(dst) with open(src, 'rb', buffering=0) as src_file, \ open(dst, 'wb', buffering=0) as dst_file: _file_copy(src_file.fileno(), dst_file.fileno()) I'll go ahead an upload the (successful) strace. Created attachment 615090 [details]
Fragment of **successful** emerge strace.
The important part of an strace from a *successful* emerge of gawk. There doesn't appear to be any difference in behavior for config.h (which has been troublesome) and cmd.h (which isn't).
This could be due to a bug in your kernel's ext4 driver. Note that 4.17.0 is a relatively old linux kernel version. You should try upgrading to a newer kernel version in order to see if that corrects the problem. I'm seeing this behavior on Linux 5.7.2 and Portage 2.3.101. Still seeing w/ kernel 5.4.0, Portage 2.3.99-r2. ** WORK AROUND ** Works if I disable native extensions: sys-apps/portage -native-extensions Also, latest failure and successful work-round were using same version of Python, either 3.6 or 3.7. I guess this is some problem using reflinks (which I'm not familiar with), but why for this particular file? I'm trying to make a minimal reproduction of the error, but having no luck. I ran this Python code and it worked: from portage.util.file_copy.reflink_linux import file_copy fin = open(b'config.h', 'rb') fout = open(b'/var/tmp/portage/sys-apps/gawk-5.1.0/image/usr/include/awk/config.h', 'wb') file_copy(fin.fileno(), fout.fileno()) A similar doins failure occurs for me with qtgui. The same workaround works. Sure looks like a lseek() bug... I verified its still happening to me. Re-enabled native extensions, tried with Python 3.6 and 3.7. emerge gawk still fails. I put some printfs into portage_util_file_copy_reflink_linux.c. The important bits are: /* Locate the next hole, so that we know when to * stop copying. There is an implicit hole at the * end of the file. This should never result in ENXIO * after SEEK_DATA has succeeded above. */ printf("{seeking... pre errno=%d, offset_data=%ld}", errno, offset_data); { struct stat s; if (!fstat(fd_in, &s)) { printf("{fstat %d .mode=%d, .nlink=%ld, .size=%ld}", fd_in, s.st_mode, s.st_nlink, s.st_size); } else { printf("{fstat %d failed! errno %d}", fd_in, errno); } } offset_hole = lseek(fd_in, offset_data, SEEK_HOLE); if (offset_hole < 0) { printf("{err ret 4: errno=%d, offset_data=%ld, offset_hole=%ld}", errno, offset_data, offset_hole); return -1; } 'emerge gawk' results in: <try><a>{do_lseek_data}{b offset_data=0}{seeking... pre errno=25, offset_data=0}{fstat 3 .mode=33188, .nlink=1, .size=5589}{ok errno=25}<a ok1>{do_lseek_data}<ok> <try><a>{do_lseek_data}{b offset_data=0}{seeking... pre errno=25, offset_data=0}{fstat 3 .mode=33188, .nlink=1, .size=14348}{err ret 4: errno=2, offset_data=0, offset_hole=-1}<a len -1, err 2><some err 2> ERROR:root:Failed to copy file: _parsed_options=Namespace(group=-1, mode=420, owner=-1, preserve_timestamps=False), source=b'config.h', dest_dir=b'/var/tmp/portage/sys-apps/gawk-5.0.1/image/usr/include/awk' Traceback (most recent call last): File "/usr/lib/portage/python3.7/doins.py", line 209, in run copyfile(source, dest) File "/usr/lib/python3.7/site-packages/portage/util/file_copy/__init__.py", line 30, in _optimized_copyfile _file_copy(src_file.fileno(), dst_file.fileno()) FileNotFoundError: [Errno 2] No such file or directory <try><a>{do_lseek_data}{b offset_data=0}{seeking... pre errno=25, offset_data=0}{fstat 3 .mode=33188, .nlink=1, .size=2355}{ok errno=25}<a ok1>{do_lseek_data}<ok> Each "<try>" line is an attempt to copy one file. The files before and after config.h work. For config.h, the fstat(3, ...) works, but lseek(3, ...) is returning -1 and forcing errno to ENOENT. Found a fix... Adding `fsync(fd_in);` before the `lseek(fd_in, offset_data, SEEK_HOLE);` eliminates the error. I think there's a bug in the kernel where seeking holes on unflushed mmapped pages is causing lseek() to return ENOENT rather than the documented behavior of returning the end-of-file offset. Looking at the kernel sources, in fs/iomap/seek.c, there is a function page_cache_seek_hole_data() which says "within unwritten extents, the page cache determines which parts are holes". If it doesn't find a hole it's returning -ENOENT; I suspect this is what's leaking out through lseek(). Hence the guess at fsync(), which appears to work. As noted in a previous comment, the troublesome config.h file is the only in this directory install which is created during the build process, and so may have unwritten extents. Probably don't want to actually fsync(); a better hack-patch would be to check for ENOENT being returned, and force offset_hole to fstat() st_size. (In reply to Michael South from comment #11) > Probably don't want to actually fsync(); a better hack-patch would be to > check for ENOENT being returned, and force offset_hole to fstat() st_size. This needs to get fixed in the kernel. Please open a bug report at https://bugzilla.kernel.org/ and give us a link to it. I'm having trouble creating a simplified test case. Will try to get a strace of the process which is creating the troublesome config.h file so that I can reproduce the same steps. Also, it's possible there's some kind of race condition in what Portage is doing. So it might be a while before I submit any kernel bugs. You might be able to reproduce it with a shell script that runs in a loop creating a temporary directory, unpacking the tarball, and copying the files to another temporary directory with a command like cp -a --reflink=always --sparse=always. It's really helpful if the script uses a trap to clean up after itself so that you never have to clean up after it manually. For example see the script that I attached in bug 641262 comment 36. It's having trouble with config.h, which is created by the build process, and not the files which are from the tarball. Also, I'm using ext4, which I don't think supports reflinks. Nice suggestion, though. Over time I hit this issue again and again. Tracking misfeature down to ext4 when inline_data feature is in place. -- Testcase (using zram for free space playground): # zramctl -a lzo /dev/zram3 -s 1G # zramctl /dev/zram3 NAME ALGORITHM DISKSIZE DATA COMPR TOTAL STREAMS MOUNTPOINT /dev/zram3 lzo 1G 0B 0B 0B 16 # mkfs.ext4 -O inline_data,^metadata_csum,^has_journal,^resize_inode,^dir_nlink,^huge_file -m 0 /dev/zram3 # mkdir /tmpzram # mount /dev/zram3 /tmpzram # mount | grep /tmpzra /dev/zram3 on /tmpzram type ext4 (rw,relatime) # chmod 1777 /tmpzram # while true; do PORTAGE_TMPDIR=/tmpzram emerge --oneshot --ask=n gawk; done ... ERROR:root:Failed to copy file: _parsed_options=Namespace(group=-1, owner=-1, mode=420, preserve_timestamps=False), source=b'config.h', dest_dir=b'/tmpzram/portage/sys-apps/gawk-5.1.0/image/usr/include/awk' Traceback (most recent call last): ... -- Workaround without disabling native-extensions and/or inline_data feature is mounting ext4 with nodelalloc # umount /tmpzram # mount -o nodelalloc /dev/zram3 /tmpzram # mount | grep /tmpzra /dev/zram3 on /tmpzram type ext4 (rw,relatime,nodelalloc) # while true; do PORTAGE_TMPDIR=/tmpzram emerge --oneshot --ask=n gawk; done no more strange doins.py failures used kernel version is 5.14.14 These native-extension bugs tend to get reused by people hitting completely new issues. Please report this bug to the kernel folks if it's still happening. |