Hello, I've been asked by ryao to post a Gentoo bug for this issue, which was reported by myself and a few others on the ZFS on Linux GitHub.
It appears that there may be an issue with reflinks during package installation, which can lead to empty/sparse files being written when the package is installed. In my personal experience this occurs very randomly; however, other reporters can trigger it more frequently.
Reproducible: Sometimes
Steps to Reproduce:
1. Install a Gentoo system on ZFS with / and /var/tmp/portage on separate datasets
2. Install/upgrade packages
Actual Results:
Things break because files wind up being filled with null bytes
Expected Results:
Regular system operation
Users have reported that this appeared in the 0.7.x releases of ZFS. I'm unsure whether there were any recent changes to portage which may exacerbate this issue.
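To check whether a system already has damaged files of the kind described in the report, something like the following could be used. It is only a rough sketch: the helper names `is_all_null` and `find_null_files` are made up for illustration, and it only flags files that consist entirely of null bytes, so a file that is only partially zeroed would not be caught.

```python
import os


def is_all_null(path, block=1 << 20):
    """Return True if the file is non-empty and every byte is zero."""
    if os.path.getsize(path) == 0:
        return False  # empty files are not counted as damaged
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block)
            if not chunk:
                return True  # reached EOF without seeing a nonzero byte
            if chunk != bytes(len(chunk)):
                return False


def find_null_files(root):
    """Yield regular files under root that contain only null bytes."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                if is_all_null(path):
                    yield path
            except OSError:
                continue  # unreadable file, skip it
```

Usage would be along the lines of `for p in find_null_files("/usr/lib64"): print(p)`; the directory is just an example.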
Please strace it, in order to check whether it's using copy_file_range or sendfile. For example, see bug 621994, comment #5.
Looks like sendfile, last few lines...
strace python -c 'from portage.util.file_copy import copyfile; copyfile("/var/tmp/portage/test.file", "/usr/lib64/test.file")' &> strace.log
> lseek(4, 0, SEEK_DATA) = 0
> lseek(4, 0, SEEK_HOLE) = 6
> lseek(4, 0, SEEK_SET) = 0
> sendfile(5, 4, [0] => [6], 6) = 6
> lseek(4, 6, SEEK_DATA) = -1 ENXIO (No such device or address)
> lseek(4, 0, SEEK_END) = 6
> ftruncate(5, 6) = 0
> close(5)
> close(4)
> rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fbc1a40eea0}, {sa_handler=0x7fbc1a76cf80, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fbc1a40eea0}, 8) = 0
> close(3)
> exit_group(0)
We haven't had any issue involving sendfile with other filesystems, and the sendfile code is probably used by more users than anything else, since copy_file_range will return EXDEV for the most common configurations. I'm guessing that there's a bug somewhere in the ZFS implementation of sendfile.
It might be worth looking into how ZFS sendfile handles the *offset parameter in the event of an EINTR error (triggered by SIGSTOP/SIGCONT). If it increments *offset without actually copying the bytes, then that would cause the output file to contain null bytes for the range that was supposed to have been copied.
Looking at the sendfile documentation, it looks like portage's interpretation of the *offset parameter is incorrect, since that variable represents the input file offset rather than the output file offset. I'll fix it to use sendfile's return value to measure bytes written to the output file.
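To illustrate the distinction, here is a minimal Python sketch of a copy loop that measures progress by sendfile's return value. It is not the portage patch (portage's native extension is written in C); `copy_with_sendfile` and its `chunk` parameter are names invented for this example. Note that Python's `os.sendfile` takes the input offset by value, whereas the C call takes a pointer and updates it.

```python
import os


def copy_with_sendfile(src_path, dst_path, chunk=1 << 20):
    """Copy src to dst, counting progress by sendfile's return value."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        fd_in, fd_out = src.fileno(), dst.fileno()
        size = os.fstat(fd_in).st_size
        copied = 0  # bytes confirmed written to the output file
        while copied < size:
            # The third argument is the read position in the INPUT file.
            # The output file is written at its own current position.
            sent = os.sendfile(fd_out, fd_in, copied, min(chunk, size - copied))
            if sent == 0:
                break  # input ended earlier than fstat reported
            copied += sent  # advance only by what was actually written
        return copied
```

Because the output is written sequentially from position 0, `copied` is both the next input offset and the number of bytes known to be in the output file.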
You can try this patch from bug 635126: https://patch-diff.githubusercontent.com/raw/gentoo/portage/pull/223.patch Save that patch in /etc/portage/patches/sys-apps/portage/, and make sure you build portage with USE=native-extensions enabled.
A couple of the reporters said this didn't seem to help for them. I can't tell either way. Will keep testing.
I have updated the patch to also use lseek to ensure that the output file is positioned at the correct offset.
I've updated the patch again, so that it no longer relies on the file offset of fd_in (eliminating some lseek calls).
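For readers following along, the sequence of calls seen in the strace earlier in this bug (SEEK_DATA, SEEK_HOLE, sendfile, ftruncate) can be sketched in Python roughly as below. This is an illustration under assumptions, not the code from the patch: `sparse_copy` is a made-up name, the output position is set explicitly with lseek, and the input offset is passed to sendfile explicitly instead of relying on the file position of fd_in.

```python
import errno
import os


def sparse_copy(src_path, dst_path):
    """Copy only the data extents of src, leaving holes unwritten in dst."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        fd_in, fd_out = src.fileno(), dst.fileno()
        size = os.fstat(fd_in).st_size
        pos = 0
        while pos < size:
            try:
                data = os.lseek(fd_in, pos, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:
                    break  # nothing but a trailing hole remains
                raise
            hole = os.lseek(fd_in, data, os.SEEK_HOLE)
            # Put the output at the same offset as the data extent.
            os.lseek(fd_out, data, os.SEEK_SET)
            offset = data
            while offset < hole:
                # Explicit input offset; progress counted by return value.
                sent = os.sendfile(fd_out, fd_in, offset, hole - offset)
                if sent == 0:
                    break
                offset += sent
            pos = hole
        # Extend the output to full length so a trailing hole is preserved.
        os.ftruncate(fd_out, size)
```

On a filesystem without hole support, lseek reports the whole file as a single data extent, so the loop degrades to a plain copy.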
Unfortunately I'm still getting sparse files with v5 of Zac Medico's patch (/var/tmp/portage on separate ZFS dataset).
(In reply to George Diamantopoulos from comment #10)
> Unfortunately I'm still getting sparse files with v5 of Zac Medico's patch
> (/var/tmp/portage on separate ZFS dataset).
A problem in the lseek SEEK_DATA/SEEK_HOLE implementation might cause this.
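One way to probe that suspicion is to compare what lseek reports as holes against what read actually returns: a region reported as a hole should always read back as zeros. The sketch below does that. The names `data_extents` and `misreported_holes` are invented for this example, and whether such a check would catch the behaviour seen on ZFS probably depends on timing (for example, how soon after the write it runs), so treat it as a diagnostic idea, not a confirmed reproducer.

```python
import errno
import os


def data_extents(fd):
    """Return [(start, end), ...] for regions that lseek reports as data."""
    size = os.fstat(fd).st_size
    extents, pos = [], 0
    while pos < size:
        try:
            start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:
                break  # no more data after pos
            raise
        end = os.lseek(fd, start, os.SEEK_HOLE)
        extents.append((start, end))
        pos = end
    return extents


def misreported_holes(path, block=1 << 16):
    """Return offsets of blocks reported as holes that hold nonzero bytes."""
    bad = []
    with open(path, "rb") as f:
        fd = f.fileno()
        size = os.fstat(fd).st_size
        pos = 0
        # The (size, size) sentinel makes the loop check a trailing hole too.
        for start, end in data_extents(fd) + [(size, size)]:
            while pos < start:  # [pos, start) was reported as a hole
                chunk = os.pread(fd, min(block, start - pos), pos)
                if not chunk:
                    break
                if chunk != bytes(len(chunk)):
                    bad.append(pos)
                pos += len(chunk)
            pos = end
    return bad
```

An empty list from `misreported_holes` means the hole map and the file contents agreed at the time of the check.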
It does appear to be a ZFS issue after all; there's a patch that fixes this for me at https://github.com/zfsonlinux/zfs/issues/3125#issuecomment-339747744 (applies cleanly as a user patch against zfs-kmod-0.7.2).
I think the ZFS folks have a patch that should be sufficient in conjunction with the patch you're proposing for portage.
https://github.com/zfsonlinux/zfs/commit/454365bbaacc153f98d2a3adaf33b13a6183d45d has been merged to the ZOL code base, so we should be good now.
Hi, did this patch ever get added to portage?
The portage patch for bug 635126 involving sendfile was released in portage-2.3.13. It doesn't look like there's been a zfs-kmod version bump including the patch from https://github.com/zfsonlinux/zfs/issues/3125.
Fair enough, thanks for the confirmation.
Possible repeat on zfs 0.8.0: https://github.com/zfsonlinux/zfs/issues/8816
Disabling native-extensions seems to fix the issue.
I haven't had the issue on my systems, but I haven't upgraded to kernel 5.x.
The bug has been referenced in the following commit(s):
https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=abf3bbd8488b7eb5177cac9898ccc8ea6d963429
commit abf3bbd8488b7eb5177cac9898ccc8ea6d963429
Author:     Georgy Yakovlev <gyakovlev@gentoo.org>
AuthorDate: 2019-05-29 22:12:55 +0000
Commit:     Georgy Yakovlev <gyakovlev@gentoo.org>
CommitDate: 2019-05-29 22:14:29 +0000
    sys-fs/zfs-kmod: revbump 0.8.0 with critical patches
    Issue: https://github.com/zfsonlinux/zfs/issues/8816
    Issue: https://github.com/zfsonlinux/zfs/issues/8778
    Bug: https://bugs.gentoo.org/635002
    Package-Manager: Portage-2.3.67, Repoman-2.3.12
    Signed-off-by: Georgy Yakovlev <gyakovlev@gentoo.org>
 ....0_Fix_integer_overflow_in_get_next_chunk.patch | 32 ++++
 .../zfs-kmod/files/0.8.0_revert_Report_holes.patch | 53 ++++++
 sys-fs/zfs-kmod/zfs-kmod-0.8.0-r1.ebuild | 178 +++++++++++++++++++++
 3 files changed, 263 insertions(+)
We no longer have an affected zfs version in the tree.