635002 – sys-fs/zfs-kmod: ZFS on Linux reflink potentially causes empty/sparse files to be installed

Bug 635002 - sys-fs/zfs-kmod: ZFS on Linux reflink potentially causes empty/sparse files to be installed

Summary: sys-fs/zfs-kmod: ZFS on Linux reflink potentially causes empty/sparse files t...

Status:	RESOLVED FIXED

Alias:	None

Product:	Portage Development
Classification:	Unclassified
Component:	Core (show other bugs)
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Richard Yao (RETIRED)

URL:	https://github.com/zfsonlinux/zfs/iss...
Whiteboard:
Keywords:

Depends on:
Blocks:	635020
	Show dependency tree

Reported:	2017-10-21 15:50 UTC by Chris Henhawke
Modified:	2020-01-17 04:44 UTC (History)
CC List:	3 users (show)

See Also:	635126
Package list:
Runtime testing required:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Chris Henhawke 2017-10-21 15:50:50 UTC

Hello, I've been asked by ryao to post a gentoo bug for this issue which was reported by myself and a few others on the ZFS on Linux github.  It appears that there may be an issue with reflinks during package installation, which can lead to empty/sparse files being written when the package is installed.  My personal experience is that this can occur very randomly, however other reporters can trigger this more frequently.

Reproducible: Sometimes

Steps to Reproduce:
1. Install a gentoo system on ZFS with / and /var/tmp/portage on separate datasets
2. Install/upgrade packages

Actual Results:  
Things break because files wind up being filled with null bytes

Expected Results:  
Regular system operation

Users have reported that this appeared in the 0.7.x releases of ZFS, unsure if there were any changes to portage recently which may exacerbate this issue.

Comment 1 Zac Medico gentoo-dev

2017-10-21 20:48:07 UTC

Please strace it, in order to check whether it's using copy_file_range or sendfile. For example, see bug 621994, comment #5.

Comment 2 Chris Henhawke 2017-10-22 00:21:09 UTC

Looks like sendfile, last few lines...

strace python -c 'from portage.util.file_copy import copyfile; copyfile("/var/tmp/portage/test.file", "/usr/lib64/test.file")' &> strace.log

> lseek(4, 0, SEEK_DATA)                  = 0
> lseek(4, 0, SEEK_HOLE)                  = 6
> lseek(4, 0, SEEK_SET)                   = 0
> sendfile(5, 4, [0] => [6], 6)           = 6
> lseek(4, 6, SEEK_DATA)                  = -1 ENXIO (No such device or address)
> lseek(4, 0, SEEK_END)                   = 6
> ftruncate(5, 6)                         = 0
> close(5)
> close(4)
> rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fbc1a40eea0}, {sa_handler=0x7fbc1a76cf80, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fbc1a40eea0}, 8) = 0
> close(3)
> exit_group(0)

Comment 3 Zac Medico gentoo-dev

2017-10-23 00:54:47 UTC

We haven't had any issue involving sendfile with other filesystems, and the sendfile code is probably used by more users than anything else, since copy_file_range will return EXDEV for the most common configurations.

I'm guessing that there's a bug somewhere in the ZFS implementation of sendfile.

Comment 4 Zac Medico gentoo-dev

2017-10-23 01:06:53 UTC

It might be worth looking into how ZFS sendfile handles the *offset parameter in the even of an EINTR error (triggered by SIGSTOP/SIGCONT). If it increments *offset without actually copying the bytes, then that would cause the output file to contain null bytes for the range that was supposed to have been copied.

Comment 5 Zac Medico gentoo-dev

2017-10-23 01:18:42 UTC

Looking and the sendfile documentation, it looks like portage's interpretation of the *offset parameter is incorrect, since that variable represents the input file offset rather than the output file offset. I'll fix it to use sendfile
s return value to measure bytes written to the output file.

Comment 6 Zac Medico gentoo-dev

2017-10-23 02:24:42 UTC

You can try this patch from bug 635126:

https://patch-diff.githubusercontent.com/raw/gentoo/portage/pull/223.patch

Save that patch in /etc/portage/patches/sys-apps/portage/, and make sure you build portage with USE=native-extensions enabled.

Comment 7 Chris Henhawke 2017-10-23 23:58:49 UTC

A couple of the reporters said this didn't seem to help for them.  I can't tell either way.  Will keep testing.

Comment 8 Zac Medico gentoo-dev

2017-10-24 10:03:52 UTC

I have updated the patch to also use lseek to ensure that the output file is positioned at the correct offset.

Comment 9 Zac Medico gentoo-dev

2017-10-24 18:43:59 UTC

I've update the patch again, so that it no longer relies on the file offset of fd_in (eliminating some lseek calls).

Comment 10 George Diamantopoulos 2017-10-26 12:25:18 UTC

Unfortunately I'm still getting sparse files with v5 of Zac Medico's patch (/var/tmp/portage on separate ZFS dataset).

Comment 11 Zac Medico gentoo-dev

2017-10-27 20:53:34 UTC

(In reply to George Diamantopoulos from comment #10)
> Unfortunately I'm still getting sparse files with v5 of Zac Medico's patch
> (/var/tmp/portage on separate ZFS dataset).

A problem in the lseek SEEK_DATA/SEEK_HOLE implementation might cause this.

Comment 12 George Diamantopoulos 2017-11-01 12:48:33 UTC

I does appear to be a ZFS issue after all, there's a patch that fixes this for me at https://github.com/zfsonlinux/zfs/issues/3125#issuecomment-339747744 (applies cleanly as a user patch against zfs-kmod-0.7.2).

Comment 13 Chris Henhawke 2017-11-02 13:07:45 UTC

I think the ZFS folks have a patch that should be sufficient in conjunction with the patch you're proposing for portage.

Comment 14 Chris Henhawke 2017-11-15 18:22:33 UTC

https://github.com/zfsonlinux/zfs/commit/454365bbaacc153f98d2a3adaf33b13a6183d45d has been merged to the ZOL code base, we should be good now.

Comment 15 Chris Henhawke 2017-12-18 05:59:20 UTC

Hi, did this patch ever get added to portage?

Comment 16 Zac Medico gentoo-dev

2017-12-18 06:21:03 UTC

The portage patch for bug 635126 involving sendfile was released in portage-2.3.13.

It doesn't look like there's been an zfs-kmod version bump including the patch from https://github.com/zfsonlinux/zfs/issues/3125.

Comment 17 Chris Henhawke 2017-12-18 06:48:17 UTC

Fair enough, thanks for the confirmation.

Comment 18 Chris Henhawke 2019-05-26 06:17:50 UTC

Possible repeat on zfs 0.8.0 https://github.com/zfsonlinux/zfs/issues/8816

Disabling native-extensions seems to fix the issue

I haven't had the issue on my systems, but I haven't upgraded to kernel 5.x

Comment 19 Larry the Git Cow gentoo-dev

2019-05-29 22:15:03 UTC

The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=abf3bbd8488b7eb5177cac9898ccc8ea6d963429

commit abf3bbd8488b7eb5177cac9898ccc8ea6d963429
Author:     Georgy Yakovlev <gyakovlev@gentoo.org>
AuthorDate: 2019-05-29 22:12:55 +0000
Commit:     Georgy Yakovlev <gyakovlev@gentoo.org>
CommitDate: 2019-05-29 22:14:29 +0000

    sys-fs/zfs-kmod: revbump 0.8.0 with critical patches
    
    Issue: https://github.com/zfsonlinux/zfs/issues/8816
    Issue: https://github.com/zfsonlinux/zfs/issues/8778
    Bug: https://bugs.gentoo.org/635002
    Package-Manager: Portage-2.3.67, Repoman-2.3.12
    Signed-off-by: Georgy Yakovlev <gyakovlev@gentoo.org>

 ....0_Fix_integer_overflow_in_get_next_chunk.patch |  32 ++++
 .../zfs-kmod/files/0.8.0_revert_Report_holes.patch |  53 ++++++
 sys-fs/zfs-kmod/zfs-kmod-0.8.0-r1.ebuild           | 178 +++++++++++++++++++++
 3 files changed, 263 insertions(+)

Comment 20 Georgy Yakovlev archtester

2019-06-15 08:58:31 UTC

We no longer have affected zfs version in the tree.