Bug 577484 - sys-kernel/genkernel-3.4.52.4: Resume from disk corrupts ZFS pool
Summary: sys-kernel/genkernel-3.4.52.4: Resume from disk corrupts ZFS pool
Status: CONFIRMED
Alias: None
Product: Gentoo Hosted Projects
Classification: Unclassified
Component: genkernel
Hardware: All Linux
Importance: Normal normal
Assignee: Gentoo Genkernel Maintainers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-03-15 15:30 UTC by Harry
Modified: 2020-03-29 12:07 UTC
3 users

See Also:
Package list:
Runtime testing required: ---



Description Harry 2016-03-15 15:30:57 UTC
The initramfs generated by genkernel can corrupt a zpool when using hibernate/resume.

My Setup:

Boot partition
Swap partition, encrypted with luks.
Root partition, encrypted with luks. It consists of one zpool with a rootfs volume and various other volumes.

When booting the system, the order of the relevant commands (linuxrc and initrd.scripts) is as follows:

1 LuksOpen the root partition

2 import the zpool with this command: /sbin/zpool import -N ${ZPOOL_FORCE} "${ZFS_POOL}" (That means, the zpool/rootfs is not mounted, only the pool is imported at that moment.)

3 LuksOpen the swap partition

4 If hibernation image is found
  4a system is resumed from swap.
else
  4b mount zpool/rootfs
  5 normal system boot
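The ordering above can be sketched as a small shell fragment. This is not the actual genkernel code; the device names and the stub functions standing in for cryptsetup/zpool are illustrative assumptions, kept only to make the ordering visible:

```shell
# Sketch of the boot ordering described in steps 1-4 above.
# Stub functions stand in for the real cryptsetup/zpool calls;
# device and pool names are made up for illustration.

luks_open()    { echo "luksOpen $1"; }         # stands in for cryptsetup luksOpen
zpool_import() { echo "zpool import -N $1"; }  # pool imported, nothing mounted
resume_found() { false; }                      # pretend no hibernation image exists

luks_open /dev/sda3           # step 1: unlock the root partition
zpool_import rpool            # step 2: pool now open read/write!
luks_open /dev/sda2           # step 3: unlock swap (may wait long for a passphrase)
if resume_found; then
    echo "resume from swap"   # step 4a: RAM state now disagrees with the disk
else
    echo "mount rpool/rootfs" # step 4b: normal boot continues
fi
```

The hazard is visible in the structure itself: step 2 happens unconditionally, before the script even knows whether step 4a will restore an older in-memory view of the same pool.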

The problem is step 2: importing a ZFS pool enables full read/write mode, even if nothing is mounted. Although user space cannot access any data on the pool and therefore cannot write any data, ZFS itself will write data while doing housekeeping, for example rolling back unfinished transactions or correcting errors found by scrubbing.

That means in the timespan between steps 2 and 4a, ZFS might have changed the pool data in various ways. As step 3 is most likely waiting for a passphrase to be entered, this timespan can be quite long.

After the system has been resumed in step 4a, the data on disk is no longer consistent with the filesystem state in RAM. This obviously must never happen, as it will lead to data corruption sooner or later.

On my system I applied a quick and dirty patch to test the correct boot sequence: I deactivated step 2 and replaced 4b with a normal zpool import, which mounts all volumes, including rootfs. For my system that's okay, but genkernel should be changed by someone who has an overview of the whole system and can fix it in a more general way. I think swapping steps 3/4a with 2 should be enough, but that was too complicated for me to do without producing nasty side effects.
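The reordering suggested here can be sketched the same way: attempt the resume before the pool is ever imported, so no ZFS housekeeping writes can happen underneath a hibernation image. Again, the stub functions and device names are illustrative assumptions, not genkernel code:

```shell
# Sketch of the suggested reordering: swap is unlocked and the
# resume attempted BEFORE the pool is imported. Stubs and device
# names are illustrative.

luks_open()    { echo "luksOpen $1"; }
zpool_import() { echo "zpool import -N $1"; }
resume_found() { false; }                     # pretend no hibernation image

luks_open /dev/sda2            # unlock swap first
if resume_found; then
    echo "resume from swap"    # pool untouched: on-disk state still matches RAM
fi
# no image found: fall through to a normal boot
luks_open /dev/sda3
zpool_import rpool
echo "mount rpool/rootfs"
```

With this ordering, the pool import only ever runs on the normal-boot path, so the window in which ZFS could write under a pending resume image disappears entirely.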

Have a nice day
Harry



Reproducible: Always
Comment 1 Richard Yao (RETIRED) gentoo-dev 2016-03-15 19:49:04 UTC
Your analysis is largely correct. Unfortunately, the hibernation code is protected by GPL symbol exports, so I am not able to hook into it to make the resume process sane (particularly for swap on a zvol). What usually happens is the system crashes. The crash happens so soon that data is not lost due to the ability to roll back to a previous transaction group commit, but having to rollback is a bad situation.

Your suggestion to swap step 3/4a with 2 should be enough to handle this. It would break any expectation of resuming from hibernation on a zvol, but unfortunately, that is unlikely to work under the current restrictions on non-GPL kernel modules, so I will go that route in the near future.
Comment 2 Richard Yao (RETIRED) gentoo-dev 2016-03-15 21:42:10 UTC
After giving this some more thought, I think my initial response was wrong. You are right about the writes occurring, but how writes to ZFS work and the way suspend-to-disk/resume works should mean that the exact scenario you describe does not happen, although asynchronous destroy + your scenario could cause something like it.

First, ZFS does not roll back unfinished transactions. If a transaction does not finish the two-stage transaction commit, ZFS does not recognize it as ever having occurred. Barring abnormal circumstances, the two-stage transaction commit occurs only every 5 seconds by default. Any changes made during that window that are not committed to the ZIL are lost should the transaction commit not complete. There is also a window of about 2 transaction group commits during which ZFS makes certain that it does not reuse freed space. Furthermore, the two-stage transaction commit involves updating 4 labels on each disk, each of which contains an uberblock history. It is treated as a circular buffer and is only read on a non-verbatim import operation (genkernel doing these at every boot is a bug that I need to fix). In addition, ZIL records are not processed unless a mount operation is done (and they are presently processed synchronously with it).
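The 5-second default mentioned above is, on ZFS on Linux, the `zfs_txg_timeout` module parameter, which is visible under sysfs when the module is loaded. A small sketch for checking it; the fallback default of 5 is an assumption for machines where the ZFS module is not loaded:

```shell
# Read the transaction group commit interval (seconds) from the
# ZFS on Linux module parameter; fall back to the documented
# default of 5 if the module is not loaded (assumption).
txg_timeout=5
param=/sys/module/zfs/parameters/zfs_txg_timeout
if [ -r "$param" ]; then
    txg_timeout=$(cat "$param")
fi
echo "transaction group commit interval: ${txg_timeout}s"
```

This interval bounds how much uncommitted change can be lost, and how frequently the housekeeping writes discussed in this comment reach the disk.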

Any changes ordinarily made during the resume process (such as dirtying the disk labels or processing a scrub) will not matter, because not many transaction group commits will occur and the restored state's references to on-disk structures will still be valid. In that case, things would seem to be fine. The only changes are the label update, scrubs resuming (with minimal changes made every 5 seconds by default) and asynchronous destroy. The only case where we risk invalidating the old state's references to on-disk structures is asynchronous destroy. In that case, asynchronous destroy will block the import process until it finishes, allowing an unbounded number of transaction group commits to occur. When the old state is restored, the on-disk references could potentially no longer be valid; the asynchronous destroy would resume and presumably fail very quickly (most likely before the next transaction group commit). Some minor self-healing might occur that could theoretically (although with a low probability, due to ditto blocks) cause a situation where the state prior to the boot is no longer completely undamaged either.

There is another case to consider, which is what happens if the worker threads are frozen while waiting on an I/O completion that is lost. In that case, the pool will hang on resume and you will need to do a regular boot. No amount of patching genkernel will help here, but on the bright side, no data written to stable storage would be lost. This is theoretically possible because the ZoL project complies with mainline Linux's GPL symbol export restrictions, and the symbols needed to stop I/O at the right time are GPL-exported. It is not possible to fix this without violating the GPL symbol export restrictions. Whether or not this can occur needs additional analysis of the kernel's suspend-to-disk process, but I am certain that the safety mechanisms in ZFS meant to protect against this are disabled by the kernel's GPL symbol export restrictions, so this remains an open question.

Going back to the asynchronous destroy case, it is unlikely there would be pool damage after doing a normal boot, but unlikely is not a good guarantee. Moving zpool import after 4a as you suggested should fix that (unless the swap block device is a zvol), by virtue of avoiding any risk that an asynchronous destroy operation occurs before we restore the previous state (at which point, the initramfs should no longer be executing). There are some other non-resume cases that need special consideration (e.g. encrypted swap on a zvol) to avoid breaking things when the patch is written, but that should be doable.

Since this is rather improbable and I earmarked my day to work on the kernel module and userland code packaging, I am going to consider this to be a medium priority issue. The earliest I will be able to work on this is tomorrow, so I am not marking this IN_PROGRESS just yet.
Comment 3 Harry 2016-03-16 13:51:36 UTC
Wow, that's quite a lot of information. And some of it is quite outside my knowledge base...

You think that normally the ZFS pool survives unharmed and just the resumed Linux crashes. I hope you are right.

But I can assure you it crashes. That's the reason I found this bug. But I have no precise knowledge about the side effects that might corrupt the ZFS pool.

Therefore my thoughts were much simpler:
Someone (in this case the initramfs system) is writing data to a disk, and then a system is resumed that uses that same disk. This is obviously not a good idea.

Then I thought about async destroy, scrubbing, resilvering and maybe some other nice ZFS features I do not know about, and I felt a little nervous. ;-)

> Any changes ordinarily made during the resume process (such as dirtying the disk labels or processing a scrub) will not matter because the restored state's references to on-disk structures will still be valid because not many transaction group commits will occur.

Even if I did a "zfs destroy <1 TB volume>" just before hibernation and then waited one hour at the cryptsetup prompt before entering the passphrase to unlock swap?

And I have noticed in the past that after a system crash, a "zpool import" sometimes took well over 10 minutes before it began mounting volumes. This was without the "-N" option, however.

> In that case, asynchronous destroy will block the import process until it finishes, allowing an unbound number of transaction group commits to occur.

You mean the async destroy of the 1 TB will delay the reboot? So my example of waiting one hour at the cryptsetup prompt is irrelevant, as cryptsetup will be started only after the destroy has finished?

> There is another case to consider, which is what happens if the worker threads are frozen while waiting on an I/O completion that is lost. In that case, the pool will hang on resume and you will need to do a regular boot. No amount of patching genkernel will help in this case, but on the bright side, no data written to stable storage would be lost.

If I understand you right, that means hibernate/resume will currently not work reliably when using ZFS. That's bad. Can you estimate how often this bug will trigger?

Of course genkernel cannot fix that. It's a kernel or ZFS module bug.

> Moving zpool import after 4a as you suggested should fix that

I wonder if it also makes sense to move step 1 behind step 4. So, first ask for the passphrase for swap. If the system can be resumed, there is no need to LuksOpen the rootfs.

This clearly breaks the scenario I normally use: a swap partition as a logical volume in the same volume group as the rootfs. There you LuksOpen the volume group and resume from swap without any additional passphrase.

If the two cases could be differentiated clearly, the user would only need to enter one passphrase for a hibernated system.

> unless the swap block device is a zvol

But swap as a zvol would result in the same sequence:

1 LuksOpen the root partition/zpool
2 zpool import (possibly blocking until volume destroy has been finished)
4a Resume from zvol

And this sequence cannot be changed, as the pool has to be imported to get access to the swap zvol.

So in my opinion, if you put swap on a zvol in the same pool where the rootfs resides, you must not use that swap for hibernate/resume.