After upgrading around 10 Gentoo hosts from qemu-6.0.0-r53 to 6.1.0, most VMs (around 85 of 100; our VMs with PostgreSQL have a 100% chance of hitting this) will have I/O errors after some time (a few minutes), causing crashes and data corruption.
The VMs are stored on ZFS volumes.
Downgrading to qemu-6.0.0-r53 instantly fixes this.
Happens on completely different hardware (quad-core Xeons to 32C Epyc2).

I guess this is an upstream bug?

Reproducible: Always

Steps to Reproduce:
1. upgrade to qemu-6.1.0
2. keep Gentoo VMs on ZFS running for some time
3. VMs will have I/O errors

Actual Results:
[ 1503.559878] blk_update_request: I/O error, dev vda, sector 23056464 op 0x1:(WRITE) flags 0x4800 phys_seg 254 prio class 0
[ 1503.559881] blk_update_request: I/O error, dev vda, sector 23058496 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
[ 1503.559883] EXT4-fs warning (device vda): ext4_end_bio:342: I/O error 10 writing to inode 656425 starting block 2882314)
[ 1503.559963] blk_update_request: I/O error, dev vda, sector 23058512 op 0x1:(WRITE) flags 0x4800 phys_seg 254 prio class 0
[ 1503.559965] blk_update_request: I/O error, dev vda, sector 23060544 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
[ 1503.559966] EXT4-fs warning (device vda): ext4_end_bio:342: I/O error 10 writing to inode 656425 starting block 2882570)
[ 1503.560033] blk_update_request: I/O error, dev vda, sector 23060560 op 0x1:(WRITE) flags 0x4800 phys_seg 254 prio class 0
[ 1503.560035] blk_update_request: I/O error, dev vda, sector 23062600 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
[ 1503.560036] EXT4-fs warning (device vda): ext4_end_bio:342: I/O error 10 writing to inode 656425 starting block 2882827)
[ 1503.560100] blk_update_request: I/O error, dev vda, sector 23062616 op 0x1:(WRITE) flags 0x4800 phys_seg 254 prio class 0
[ 1503.560102] blk_update_request: I/O error, dev vda, sector 23064664 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
[ 1503.560103] EXT4-fs warning (device vda): ext4_end_bio:342: I/O error 10 writing to inode 656425 starting block 2883086)
[ 1503.560167] blk_update_request: I/O error, dev vda, sector 23064688 op 0x1:(WRITE) flags 0x4800 phys_seg 250 prio class 0
[ 1503.560168] EXT4-fs warning (device vda): ext4_end_bio:342: I/O error 10 writing to inode 656425 starting block 2883418)
[ 1503.560237] EXT4-fs warning (device vda): ext4_end_bio:342: I/O error 10 writing to inode 656425 starting block 2883584)

# emerge -pO qemu

These are the packages that would be merged, in order:

[binary  U ] app-emulation/qemu-6.1.0-1::gentoo [6.0.0-r53::gentoo] USE="aio bzip2 caps curl doc fdt filecaps fuse io-uring jemalloc jpeg lzo ncurses oss pin-upstream-blobs png seccomp ssh udev vhost-net vnc xattr zstd -accessibility -alsa -capstone -debug -glusterfs -gnutls -gtk -infiniband -iscsi -jack -multipath -nfs -nls -numa -opengl -plugins -pulseaudio -python -rbd -sasl -sdl -sdl-image (-selinux) -slirp -smartcard -snappy -spice -static -static-user -systemtap -test -usb -usbredir -vde -vhost-user-fs -virgl -virtfs -vte -xen -xfs" PYTHON_TARGETS="python3_8 python3_9 -python3_10" QEMU_SOFTMMU_TARGETS="x86_64 -aarch64 -alpha -arm -avr -cris -hppa -i386 -m68k -microblaze -microblazeel -mips -mips64 -mips64el -mipsel -nios2 -or1k -ppc -ppc64 -riscv32 -riscv64 -rx -s390x -sh4 -sh4eb -sparc -sparc64 -tricore -xtensa -xtensaeb (-lm32%) (-moxie%) (-unicore32%)" QEMU_USER_TARGETS="-aarch64 -aarch64_be -alpha -arm -armeb -cris -hexagon -hppa -i386 -m68k -microblaze -microblazeel -mips -mips64 -mips64el -mipsel -mipsn32 -mipsn32el -nios2 -or1k -ppc -ppc64 -ppc64abi32 -ppc64le -riscv32 -riscv64 -s390x -sh4 -sh4eb -sparc -sparc32plus -sparc64 -x86_64 -xtensa -xtensaeb"

0 KiB
(In reply to soundbastlerlive from comment #0)
> after upgrading around 10 gentoo hosts from qemu-6.0.0-r53 to 6.1.0 most VMs
> (around 85 of 100, our VMs with PostgreSQL have 100% chance of hitting this)
> after some time (few minutes) will have I/O Errors, causing crashes and data
> corruption.
> The VMs are stored on ZFS volumes.
> Downgrading to qemu-6.0.0-r53 instantly fixes this.
> Happens on completely different hardware (quad core Xeons to 32C Epyc2).
>
> I guess this is an upstream bug?

Yes, could you report it upstream? Thanks for letting us know, this is nasty.
I've had a look and neither Debian nor Fedora seem to be including any patches for something like this, which is a sign it may not have been fixed upstream yet.
See https://gitlab.com/qemu-project/qemu/-/issues/649

Command to reproduce:

# root.img and swap.img are ZFS volumes, but I don't think that's relevant
qemu-system-x86_64 -name template5 \
  -smp cores=4,threads=1,sockets=1 -k de -m 6G \
  -vnc 0.0.0.0:1,lossy=on \
  -net nic,macaddr=DE:AD:BE:EF:05:01,model=virtio,netdev=net0 \
  -netdev tap,id=net0,ifname=tap0,script=no,downscript=no \
  -enable-kvm -machine q35,accel=kvm -cpu host \
  -curses -display none -vga virtio -daemonize -rtc base=localtime \
  -kernel /mnt/data1/vm/kernel-5.14.8-gentoo -append 'root=/dev/vda' \
  -drive aio=io_uring,media=disk,discard=on,cache=writeback,if=virtio,format=raw,file=root.img,index=0,throttling.iops-read=125000,throttling.iops-write=25000,throttling.bps-read=536870912,throttling.bps-write=104857600 \
  -drive aio=io_uring,media=disk,discard=on,cache=unsafe,if=virtio,format=raw,file=swap.img,index=1,throttling.iops-read=200000,throttling.bps-read=838860800,throttling.iops-write=50000,throttling.bps-write=209715200 \
  --device virtio-balloon \
  -watchdog i6300esb -watchdog-action reset \
  -chroot /mnt/data1/vm/_TEMPLATE -runas kvm -pidfile _TEMPLATE.pid \
  -monitor unix:monitor,server=on,wait=off \
  -chardev socket,path=guestagent,server=on,wait=off,id=qga0 \
  -device virtio-serial -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=1721f84b8e877c2ffce831c621a7e2bcbfa4343a

commit 1721f84b8e877c2ffce831c621a7e2bcbfa4343a
Author:     John Helmert III <ajak@gentoo.org>
AuthorDate: 2021-09-29 21:56:35 +0000
Commit:     John Helmert III <ajak@gentoo.org>
CommitDate: 2021-09-29 22:03:06 +0000

    profiles: mask qemu-6.1.0 for data corruption bug

    Bug: https://bugs.gentoo.org/815379
    Signed-off-by: John Helmert III <ajak@gentoo.org>

 profiles/package.mask | 4 ++++
 1 file changed, 4 insertions(+)
VM snapshots via the QMP API are broken too:

{"return": [
  {"current-progress": 1, "status": "concluded", "total-progress": 1, "type": "snapshot-delete", "id": "vmdel-deb10-2021-09-23-10-51-08-YeULbBBM"},
  {"current-progress": 1, "status": "concluded", "total-progress": 1, "type": "snapshot-delete", "id": "vmdel-deb10-2021-09-23-10-51-07-bhPI17Ev"},
  {"current-progress": 1, "status": "concluded", "total-progress": 1, "type": "snapshot-save", "id": "vmsave-deb10-2021-09-23-10-51-04-pK5ompz1", "error": "Error while writing VM state: Unknown error -1"}
]}
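For reference, a snapshot-save job like the failing one above is started with roughly the QMP exchange below. This is only a sketch: the QMP socket path and the "vda0" node name are assumptions for illustration, not taken from the actual setup.

```shell
# Sketch: QMP commands behind a snapshot-save job.
# Socket path (/run/qemu/deb10.qmp) and node name ("vda0") are hypothetical.
qmp_cmds() {
  # The capabilities handshake must be the first command on any QMP connection.
  printf '%s\n' '{"execute": "qmp_capabilities"}'
  # snapshot-save runs as a background job; its result shows up in query-jobs,
  # which is where the "Unknown error -1" above was reported.
  printf '%s\n' '{"execute": "snapshot-save", "arguments": {"job-id": "vmsave-test", "tag": "test-snap", "vmstate": "vda0", "devices": ["vda0"]}}'
  printf '%s\n' '{"execute": "query-jobs"}'
}

# Against a VM started with -qmp unix:/run/qemu/deb10.qmp,server=on,wait=off:
# qmp_cmds | socat - UNIX-CONNECT:/run/qemu/deb10.qmp
qmp_cmds
```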
I have no such problems: 8 VMs with moderate I/O on a software RAID1 on two NVMe drives.
Uh, forgot to mention: ext4.
Maybe it has to do either with io_uring (see the parameters) or with the fact that the images are on ZFS volumes?
@soundbastlerlive@gmx.at did you try another aio value already? Just to rule out io_uring.
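Only the aio= property of each -drive needs to change to test that. A minimal sketch of the swap, using the root drive definition from the reproduction command (throttling options trimmed here for brevity; valid aio= values are threads, native, and io_uring):

```shell
# Sketch: flipping the -drive definition from io_uring to Linux-native AIO
# to rule io_uring in or out. Assumes the only relevant change is aio=;
# the drive string is abbreviated from the reproduction command.
drive='aio=io_uring,media=disk,discard=on,cache=writeback,if=virtio,format=raw,file=root.img'
drive_native=$(printf '%s' "$drive" | sed 's/aio=io_uring/aio=native/')
echo "$drive_native"
```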
I retried qemu-6.1.0-r1 on a low-impact host with only a few less important VMs. So far no issues; once some more time has passed without problems, I'll retry the originally affected hosts. Maybe a kernel/ZFS update since then fixed it, or it may still appear on the originally affected hosts.
Unfortunately still broken, but it maybe happens slightly less often :(

Tested all versions of the last weeks, up to:
host kernel: 5.10.82
VM kernel: 5.15.5
qemu-6.1.0 vs qemu-6.0.1
zfs-2.1.1-r5
sys-fs/zfs-kmod-2.1.1-r4
I have not delved into either issue, but I wonder if this is similar to bug #815469, with qemu triggering the same issue as cp from newer coreutils.
(In reply to Daniel M. Weeks from comment #12)
> I have not delved into either issue but I wonder if this is similar to
> #815469 and qemu is triggering the same issue as cp from newer coreutils.

I was wondering about this as well. Could anyone confirm this issue still occurs with the latest ~arch ZFS?
As stated, I'm still having the issue with the latest ebuilds:
zfs-2.1.1-r5
zfs-kmod-2.1.1-r4

AFAIK the coreutils incompatibility is not fixed, which is why coreutils v9 is still downgraded on systems with ZFS.
Created attachment 757131 [details, diff]
qemu-6.1.0-data-corruption.patch

Has anyone tried the patch in the upstream issue?
https://gitlab.com/qemu-project/qemu/-/issues/649#note_749175547
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=aa5822554186cff978a26ec01ebe4a0806ad0ed4

commit aa5822554186cff978a26ec01ebe4a0806ad0ed4
Author:     John Helmert III <ajak@gentoo.org>
AuthorDate: 2021-12-08 01:19:11 +0000
Commit:     John Helmert III <ajak@gentoo.org>
CommitDate: 2021-12-08 01:21:11 +0000

    app-emulation/qemu: add potential patch for data corruption bug

    Bug: https://bugs.gentoo.org/815379
    Signed-off-by: John Helmert III <ajak@gentoo.org>

 .../qemu/files/qemu-6.1.0-data-corruption.patch | 114 +++
 app-emulation/qemu/qemu-6.1.0-r2.ebuild         | 913 +++++++++++++++++++++
 2 files changed, 1027 insertions(+)
Please test -r2. Let's let it gestate a few days, and if no problems come up, we can proceed soon :)
6.1.0-r2 has been working fine for a few days now, across ~100 VMs on 3 servers. 6.2.0 supposedly includes this patch as well, so an update would be nice. Thanks to everyone involved!
I still have corruptions:

6.2.0-r3, kernel 5.16.5, btrfs filesystem, raw image with nocow

io_uring + virtio = I/O error
io_uring + sata   = I/O error after host reboot
native   + virtio = I/O error
native   + sata   = OK
(In reply to Viktor Kuzmin from comment #19)
> I still have corruptions:
>
> 6.2.0-r3, kernel 5.16.5, btrfs fs, raw image with nocow
>
> io_uring + virtio = i/o error
> io_uring + sata = i/o error after host reboot
> native + virtio = i/o error
> native + sata = OK

I think you may need to file a new bug, both in Gentoo and upstream. Thank you.