| Field | Value |
|---|---|
| Summary | app-emulation/qemu-6.1.0 causes I/O errors in VMs leading to data corruption |
| Product | Gentoo Linux |
| Component | Current packages |
| Status | RESOLVED FIXED |
| Severity | critical |
| Priority | Normal |
| Version | unspecified |
| Hardware | All |
| OS | Linux |
| Reporter | soundbastlerlive |
| Assignee | Matthias Maier <tamiko> |
| CC | ajak, dan, hydrapolic, jasmin+gentoo, maracay, pageexec, sam, virtualization, zlogene |
| Keywords | PMASKED |
| Whiteboard | |
| Package list | |
| Runtime testing required | --- |
| Attachments | qemu-6.1.0-data-corruption.patch |
Description
soundbastlerlive
2021-09-29 17:07:28 UTC
After upgrading around 10 Gentoo hosts from qemu-6.0.0-r53 to 6.1.0, most VMs (around 85 of 100; our VMs with PostgreSQL have a 100% chance of hitting this) develop I/O errors after a few minutes, causing crashes and data corruption. The VMs are stored on ZFS volumes. Downgrading to qemu-6.0.0-r53 instantly fixes this. It happens on completely different hardware (quad-core Xeons to 32-core Epyc 2).

I guess this is an upstream bug?

---

(In reply to soundbastlerlive from comment #0)
> I guess this is an upstream bug?

Yes, could you report it upstream? Thanks for letting us know, this is nasty.

I've had a look, and neither Debian nor Fedora seem to be carrying any patches for something like this, which is a sign it may not have been fixed upstream yet.

---

See https://gitlab.com/qemu-project/qemu/-/issues/649

Command to reproduce:

```sh
# root.img and swap.img are ZFS volumes, but I don't think that's relevant
qemu-system-x86_64 -name template5 -smp cores=4,threads=1,sockets=1 -k de -m 6G \
  -vnc 0.0.0.0:1,lossy=on \
  -net nic,macaddr=DE:AD:BE:EF:05:01,model=virtio,netdev=net0 \
  -netdev tap,id=net0,ifname=tap0,script=no,downscript=no \
  -enable-kvm -machine q35,accel=kvm -cpu host \
  -curses -display none -vga virtio -daemonize -rtc base=localtime \
  -kernel /mnt/data1/vm/kernel-5.14.8-gentoo -append 'root=/dev/vda' \
  -drive aio=io_uring,media=disk,discard=on,cache=writeback,if=virtio,format=raw,file=root.img,index=0,throttling.iops-read=125000,throttling.iops-write=25000,throttling.bps-read=536870912,throttling.bps-write=104857600 \
  -drive aio=io_uring,media=disk,discard=on,cache=unsafe,if=virtio,format=raw,file=swap.img,index=1,throttling.iops-read=200000,throttling.bps-read=838860800,throttling.iops-write=50000,throttling.bps-write=209715200 \
  --device virtio-balloon -watchdog i6300esb -watchdog-action reset \
  -chroot /mnt/data1/vm/_TEMPLATE -runas kvm -pidfile _TEMPLATE.pid \
  -monitor unix:monitor,server=on,wait=off \
  -chardev socket,path=guestagent,server=on,wait=off,id=qga0 \
  -device virtio-serial -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0
```

---

The bug has been referenced in the following commit(s):
https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=1721f84b8e877c2ffce831c621a7e2bcbfa4343a

```
commit 1721f84b8e877c2ffce831c621a7e2bcbfa4343a
Author:     John Helmert III <ajak@gentoo.org>
AuthorDate: 2021-09-29 21:56:35 +0000
Commit:     John Helmert III <ajak@gentoo.org>
CommitDate: 2021-09-29 22:03:06 +0000

    profiles: mask qemu-6.1.0 for data corruption bug

    Bug: https://bugs.gentoo.org/815379
    Signed-off-by: John Helmert III <ajak@gentoo.org>

 profiles/package.mask | 4 ++++
 1 file changed, 4 insertions(+)
```

---

VM snapshots via the QMP API are broken too:

```json
{"return": [
  {"current-progress": 1, "status": "concluded", "total-progress": 1,
   "type": "snapshot-delete", "id": "vmdel-deb10-2021-09-23-10-51-08-YeULbBBM"},
  {"current-progress": 1, "status": "concluded", "total-progress": 1,
   "type": "snapshot-delete", "id": "vmdel-deb10-2021-09-23-10-51-07-bhPI17Ev"},
  {"current-progress": 1, "status": "concluded", "total-progress": 1,
   "type": "snapshot-save", "id": "vmsave-deb10-2021-09-23-10-51-04-pK5ompz1",
   "error": "Error while writing VM state: Unknown error -1"}
]}
```

---

I have no such problems: 8 VMs with moderate I/O on a software RAID1 on 2 NVMe drives. Uh, forgot to mention: ext4.

---

Maybe it has to do with either io_uring (see the parameters above) or the fact that it's on ZFS volumes?

---

@soundbastlerlive@gmx.at, did you try another aio value already? Just to rule out io_uring.
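
For reference, ruling out io_uring only requires changing the aio= property of each -drive option in the reproducer above. A minimal sketch of the two alternative backends (throttling options omitted for brevity, file names carried over from the reproducer; note that QEMU's aio=native requires cache.direct=on, so the cache mode has to change from writeback to e.g. none):

```sh
# Default thread-pool AIO instead of io_uring:
-drive aio=threads,media=disk,discard=on,cache=writeback,if=virtio,format=raw,file=root.img,index=0

# Linux AIO backend; needs O_DIRECT, hence cache=none instead of cache=writeback:
-drive aio=native,media=disk,discard=on,cache=none,if=virtio,format=raw,file=root.img,index=0
```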

---

I retried qemu-6.1.0-r1 on a low-impact host with only a few less important VMs. So far no issues, but I'll retry the other original hosts after some time has passed without issues. Maybe a kernel/ZFS update since back then fixed it, or it may still appear on the originally affected hosts.

---

Unfortunately still broken, but it maybe happens slightly less often :( Tested all versions of the last weeks, up to:

- host kernel: 5.10.82
- VM kernel: 5.15.5
- qemu-6.1.0 vs qemu-6.0.1
- zfs-2.1.1-r5
- sys-fs/zfs-kmod-2.1.1-r4

---

I have not delved into either issue, but I wonder if this is similar to bug #815469 and qemu is triggering the same issue as cp from newer coreutils.

---

(In reply to Daniel M. Weeks from comment #12)
> I have not delved into either issue but I wonder if this is similar to
> #815469 and qemu is triggering the same issue as cp from newer coreutils.

I was wondering about this as well. Could anyone confirm this issue still occurs with the latest ~arch ZFS?

---

As stated, I'm still having the issue with the latest ebuilds (zfs-2.1.1-r5, zfs-kmod-2.1.1-r4). AFAIK the coreutils incompatibility is not fixed, which is why v9 is still downgraded by zfs.

---

Created attachment 757131: qemu-6.1.0-data-corruption.patch

Has anyone tried the patch in the upstream issue?
https://gitlab.com/qemu-project/qemu/-/issues/649#note_749175547

---

The bug has been referenced in the following commit(s):
https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=aa5822554186cff978a26ec01ebe4a0806ad0ed4

```
commit aa5822554186cff978a26ec01ebe4a0806ad0ed4
Author:     John Helmert III <ajak@gentoo.org>
AuthorDate: 2021-12-08 01:19:11 +0000
Commit:     John Helmert III <ajak@gentoo.org>
CommitDate: 2021-12-08 01:21:11 +0000

    app-emulation/qemu: add potential patch for data corruption bug

    Bug: https://bugs.gentoo.org/815379
    Signed-off-by: John Helmert III <ajak@gentoo.org>

 .../qemu/files/qemu-6.1.0-data-corruption.patch | 114 +++
 app-emulation/qemu/qemu-6.1.0-r2.ebuild         | 913 +++++++++++++++++++++
 2 files changed, 1027 insertions(+)
```

---

Please test -r2. Let's let it gestate a few days, and if no problems come up, we can proceed soon :)

---

6.1.0-r2 has been working fine for a few days on ~100 VMs across 3 servers now. 6.2.0 supposedly has this patch as well, so an update would be nice. Thanks to everyone involved!

---

I still have corruptions: 6.2.0-r3, kernel 5.16.5, btrfs filesystem, raw image with nocow.

- io_uring + virtio = I/O error
- io_uring + sata = I/O error after host reboot
- native + virtio = I/O error
- native + sata = OK

---

(In reply to Viktor Kuzmin from comment #19)
> I still have corruptions:
> [...]
> native + sata = OK

I think you may need to file a new bug, both in Gentoo and upstream. Thank you.
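
Viktor's exact command line is not shown in the thread, so the following is only a sketch of what the four combinations above could look like, for anyone reproducing this in a new bug. The q35 machine type, raw format, and file name are assumptions carried over from the reproducer earlier in this report; on a q35 machine, attaching a drive via -device ide-hd places it on the emulated AHCI/SATA controller, and aio=native again requires cache.direct=on:

```sh
# io_uring + virtio-blk
-drive aio=io_uring,if=virtio,format=raw,file=root.img,cache=writeback

# io_uring + SATA (AHCI on q35)
-drive aio=io_uring,if=none,id=disk0,format=raw,file=root.img,cache=writeback \
-device ide-hd,drive=disk0

# native (Linux AIO) + virtio-blk; needs O_DIRECT, hence cache=none
-drive aio=native,if=virtio,format=raw,file=root.img,cache=none

# native + SATA
-drive aio=native,if=none,id=disk0,format=raw,file=root.img,cache=none \
-device ide-hd,drive=disk0
```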