Bug 918688 - sys-kernel/genkernel-4.3.8: Resume from disk corrupts sys-fs/zfs pool
Summary: sys-kernel/genkernel-4.3.8: Resume from disk corrupts sys-fs/zfs pool
Status: UNCONFIRMED
Alias: None
Product: Gentoo Hosted Projects
Classification: Unclassified
Component: genkernel
Hardware: All Linux
Importance: Normal critical
Assignee: Gentoo Genkernel Maintainers
URL:
Whiteboard:
Keywords: PATCH
Depends on:
Blocks:
 
Reported: 2023-11-28 09:51 UTC by anatol.rosch
Modified: 2024-01-08 17:16 UTC
CC List: 3 users

See Also:
Package list:
Runtime testing required: ---


Attachments
safe-zpool-import.patch (file_918688.txt,730 bytes, patch)
2023-11-28 09:52 UTC, anatol.rosch

Description anatol.rosch 2023-11-28 09:51:29 UTC
An imported ZFS pool must never be modified outside its current session. If a pool has been imported (opened), it must be exported before it can be safely accessed again.

Hibernating a system does not export a pool; it remains imported and must not be modified outside the hibernated system.

Modifying metadata on a pool that has not been exported (closed) will likely corrupt it. This causes data loss.

1. Set up a machine with a ZFS pool.
2. Hibernate it.
3. Boot another OS and modify the pool (even a clean import/export is sufficient).
4. Attempt to resume the original system.

The resumed system will expect the pool to be intact exactly as when hibernated; if a pool was imported from the outside, it becomes corrupt.

genkernel's initrd does exactly the above. start_volumes imports all ZFS pools before it checks whether we are resuming or not. Importing a pool prior to resume, and then resuming the session where it was already imported, will corrupt it.

start_volumes runs multiple blocks for different volume backends; the ZFS-related block is here: https://gitweb.gentoo.org/proj/genkernel.git/tree/defaults/initrd.scripts#n1726 . The entire ZFS block must run after the resume check. The attached patch does not change the logic of volume init; it only splits the ZFS block out into a separate function, start_zfs. start_volumes runs as before, minus the ZFS part, and start_zfs runs after do_resume in https://gitweb.gentoo.org/proj/genkernel.git/tree/defaults/linuxrc#n728 .
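To illustrate the intended ordering, here is a minimal sketch in POSIX sh. It is not the attached patch and not genkernel's actual code: the three functions are stubs that only record the order in which they are called, and boot_sequence, log_call, CALL_LOG and RESUME_IMAGE_FOUND are hypothetical names introduced for this example. In a real initramfs a successful resume never returns to the script; the stub models that with a non-zero return.

```shell
#!/bin/sh
# Sketch only: stubs standing in for genkernel's functions.
# They record call order instead of touching any volumes.

CALL_LOG=""

# Hypothetical helper: append a function name to the call log.
log_call() {
    CALL_LOG="${CALL_LOG:+${CALL_LOG} }$1"
}

# Stub: non-ZFS volume backends only (the ZFS block has been split out).
start_volumes() {
    log_call start_volumes
}

# Stub: RESUME_IMAGE_FOUND is a hypothetical switch for this sketch.
# On a real system a successful resume hands control to the hibernated
# kernel and never returns; a non-zero return models that here.
do_resume() {
    log_call do_resume
    if [ "${RESUME_IMAGE_FOUND:-0}" = "1" ]; then
        return 1
    fi
    return 0
}

# Stub: the ZFS block (pool import), now a separate function.
start_zfs() {
    log_call start_zfs
}

# Hypothetical driver showing the proposed order:
# pools are imported only once resume has been ruled out.
boot_sequence() {
    CALL_LOG=""
    start_volumes
    do_resume || return 0
    start_zfs
}

boot_sequence
echo "$CALL_LOG"   # prints: start_volumes do_resume start_zfs
```

With RESUME_IMAGE_FOUND=1 the log ends at do_resume, which is the point of the change: on the resume path no pool is imported by the initrd.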

The same issue has been reported before, twice, in

https://bugs.gentoo.org/577484
https://bugs.gentoo.org/827281

and seemingly abandoned after a lengthy (and largely off-topic) discussion.

Upstream is well aware of this:

https://github.com/openzfs/zfs/issues/12842
https://github.com/openzfs/zfs/issues/14118

Other distributions have identified and fixed the same issue in their init scripts:

https://github.com/NixOS/nixpkgs/pull/208037

I'd like to point out that this issue has nothing to do with swap on zvol or other exotic cases; this is purely an incorrect sequence of steps in genkernel's volume init script, and the use case is supported by upstream if done right.

Reproducible: Sometimes
Comment 1 anatol.rosch 2023-11-28 09:52:45 UTC
Created attachment 875866
safe-zpool-import.patch