321005 – qemu-kvm (0.12.4) + virtio disk corrupts large volumes (>1TB)

Status:	VERIFIED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Highest critical
Assignee:	Gentoo QEMU Project

URL:	https://bugs.launchpad.net/ubuntu/+so...
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2010-05-22 09:23 UTC by masc
Modified:	2010-10-13 17:17 UTC
CC List:	0 users

See Also:
Package list:
Runtime testing required:	---

Description masc 2010-05-22 09:23:20 UTC

See URL; this has been around for a while and persists in 0.12.4.
There's a patch to fix it (http://marc.info/?l=qemu-devel&m=127436114712437); it would be good to see it cherry-picked for 0.12.4-r1.

Reproducible: Always

Comment 1 Stefan Behte (RETIRED) gentoo-dev

2010-05-22 10:50:49 UTC

Indeed.

Comment 2 Doug Goldstein (RETIRED) gentoo-dev

2010-06-15 18:40:31 UTC

Thanks. Fixed in 0.12.4-r1

Comment 3 masc 2010-06-16 09:19:23 UTC

Verified by filling an ext4-formatted 1.5T drive with rsync and running fsck afterwards; no errors.
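
For anyone wanting a quicker spot check than a full rsync plus fsck, a write-then-read-back probe is one option. The following is only a minimal sketch and not the procedure used in this comment: `pattern`, `write_probes`, `verify_probes` and the 4096-byte `BLOCK` size are all hypothetical names and choices made for illustration, and the demo runs against a small scratch file rather than a real virtio disk.

```python
import hashlib
import os
import tempfile

BLOCK = 4096  # bytes written per probe; an arbitrary choice for this sketch


def pattern(offset: int) -> bytes:
    """Deterministic block contents derived from the byte offset."""
    seed = hashlib.sha256(str(offset).encode()).digest()  # 32 bytes
    return seed * (BLOCK // len(seed))


def write_probes(path: str, offsets: list[int]) -> None:
    """Write one recognisable block at each offset and flush it to the device."""
    with open(path, "r+b") as f:
        for off in offsets:
            f.seek(off)
            f.write(pattern(off))
        f.flush()
        os.fsync(f.fileno())


def verify_probes(path: str, offsets: list[int]) -> list[int]:
    """Return the offsets whose contents differ from what pattern() produces."""
    bad = []
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            if f.read(BLOCK) != pattern(off):
                bad.append(off)
    return bad


# Demo on a small scratch file (hypothetical stand-in for a test volume).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.truncate(64 * BLOCK)
    scratch = tmp.name

probe_offsets = [0, 8 * BLOCK, 63 * BLOCK]
write_probes(scratch, probe_offsets)
print("mismatched offsets:", verify_probes(scratch, probe_offsets))
os.remove(scratch)
```

On a real volume the probe offsets would be chosen on both sides of the 1TB mark, and a read-back done right after the write may be served from the page cache, so the check is more meaningful after the guest has been rebooted. This destroys data at the probed offsets, so it is only suitable for a scratch disk.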

Comment 4 masc 2010-07-25 11:52:11 UTC

This bug has been reintroduced in -r2 and -r3!
I had to downgrade to -r1 to prevent (more) data corruption.

mkfs.ext4 on 1.5TB drive:

[   37.641845] Buffer I/O error on device vdg, logical block 27791617
[   37.641846] lost page write due to I/O error on vdg
[   37.641851] Buffer I/O error on device vdg, logical block 27791618
[   37.641852] lost page write due to I/O error on vdg
[   37.641854] Buffer I/O error on device vdg, logical block 27791619
[   37.641856] lost page write due to I/O error on vdg
[   37.641858] Buffer I/O error on device vdg, logical block 27791620
[   37.641859] lost page write due to I/O error on vdg
[   37.641861] Buffer I/O error on device vdg, logical block 27791621
[   37.641862] lost page write due to I/O error on vdg
[   37.641865] Buffer I/O error on device vdg, logical block 27791622
[   37.641866] lost page write due to I/O error on vdg
[   37.641868] Buffer I/O error on device vdg, logical block 27791623
[   37.641870] lost page write due to I/O error on vdg
[   37.641872] Buffer I/O error on device vdg, logical block 27791624
[   37.641873] lost page write due to I/O error on vdg
[   37.641875] Buffer I/O error on device vdg, logical block 27791625
[   37.641877] lost page write due to I/O error on vdg
[   37.641879] Buffer I/O error on device vdg, logical block 27791626
[   37.641880] lost page write due to I/O error on vdg
[   37.641949] end_request: I/O error, dev vdg, sector 222333944
[   37.642022] end_request: I/O error, dev vdg, sector 2930270088
[   37.642030] end_request: I/O error, dev vdg, sector 2930270152
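
The sector numbers in the end_request lines can be converted to byte offsets to see where on the volume the errors land. The sketch below assumes the 512-byte sector unit the Linux block layer uses in these messages. The comparison against the signed 32-bit limit is only an illustrative threshold (2^31 sectors of 512 bytes is exactly 1 TiB); it is not a cause confirmed anywhere in this report. `describe_sector` is a hypothetical helper.

```python
SECTOR_SIZE = 512        # bytes per sector in kernel block-layer messages
INT32_MAX = 2**31 - 1    # largest value a signed 32-bit integer can hold
TIB = 2**40              # bytes in one TiB


def describe_sector(sector: int) -> dict:
    """Translate a sector number into a byte offset and flag large values."""
    offset = sector * SECTOR_SIZE
    return {
        "sector": sector,
        "byte_offset": offset,
        "offset_tib": offset / TIB,
        "exceeds_int32": sector > INT32_MAX,
    }


# Sector numbers copied from the end_request lines above.
failing_sectors = [222333944, 2930270088, 2930270152]

for s in failing_sectors:
    info = describe_sector(s)
    print(
        f"sector {info['sector']:>10}  "
        f"offset {info['offset_tib']:.3f} TiB  "
        f"exceeds int32: {info['exceeds_int32']}"
    )
```

Two of the three failing sectors lie beyond the 1 TiB mark, while the first one is well below it, so errors on this volume are not confined to the region above 1TB.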

Comment 5 Doug Goldstein (RETIRED) gentoo-dev

2010-07-27 23:40:21 UTC

The patch is included in the patch ball in -r2 and -r3. It's also included in 0.12.5.

Please retest.

Comment 6 masc 2010-07-28 09:13:26 UTC

The behaviour does not occur with 0.12.5 (at least not with mkfs.ext4; I will do more thorough tests next weekend).

So far it seems that -r2 introduced a change which interferes with the virtio fix. It may be worth investigating so that bad surprises can be prevented in the future.

With this bug, simply booting the VM, without explicitly writing to a drive, is enough for data loss/corruption to occur.

Comment 7 masc 2010-07-28 22:30:52 UTC

0.12.5 is perfectly fine, verified as in #3

Comment 8 Lionel Bouton 2010-10-12 20:26:11 UTC

It seems there's a corner case left. I had these problems on 4 physical hosts (and commented on the corresponding sf.net bug as "gyver"). I migrated 3 of the 4 hosts to 0.12.5-r1, which fixed the problems and allowed us to use virtio instead of emulated PIIX. I just tried to migrate the 4th one, and the upgrade did not resolve the read errors in virtio block mode.

I have 3 VMs on this 4th host: 2 are x86, 1 is x86_64. All of them fail to boot with 0.12.5-r1, reporting read errors on /dev/vda. Reconfiguring them to use IDE works (but errors are reported during boot and the guest kernels switch to PIO after resetting the ide0 interface).
Booting all these VMs works with 0.11.1-r1.

Two details that might help:
1/
I use DRBD devices for all my virtual disks (on all 4 physical hosts).

2/
The "failing" host has different hardware: the underlying storage is based on a hardware RAID controller, a 3ware 8006-2LP with two SATA disks in RAID-1 mode (all other hosts have plain AHCI SATA controllers and use software RAID). Currently the controller is rebuilding the array after we replaced a failing disk with a brand new one (since there was downtime for maintenance anyway, I used the opportunity to upgrade qemu-kvm). Although there's no read error on the physical host as far as its kernel is concerned, read performance is suffering: 5MB/s at most with a dd if=/dev/vda ...

Comment 9 masc 2010-10-12 20:52:20 UTC

(In reply to comment #8)
> Reconfiguring them to use IDE
> works (but there are errors reported during the boot and the guest kernels
> switches to PIO after resetting the ide0 interface).

If the behaviour also occurs with IDE, this might be a different problem.
I'd suggest taking this to qemu-kvm's (new) bugtracker https://bugs.launchpad.net/qemu or discussing it in http://forums.gentoo.org/ first.

Comment 10 Lionel Bouton 2010-10-13 17:17:41 UTC

(In reply to comment #9)
> I'd suggest to take this to qemu-kvm's (new) bugtracker
> https://bugs.launchpad.net/qemu or discuss in http://forums.gentoo.org/ first.

Done so:
https://bugs.launchpad.net/qemu/+bug/660060