Bug 940751 - sys-kernel/gentoo-sources-6.11.0 - app-emulation/qemu stuck in D state on mdraid-lvm-loopdev stack
Summary: sys-kernel/gentoo-sources-6.11.0 - app-emulation/qemu stuck in D state on mdr...
Status: RESOLVED NEEDINFO
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: AMD64 Linux
Importance: Normal normal
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-10-04 11:46 UTC by Daniel Rozsnyo
Modified: 2024-10-21 12:02 UTC
CC List: 2 users

See Also:
Package list:
Runtime testing required: ---


Attachments
sysrq - blocked calls on stuck qemu-loop-lvm-md stack (sysrq-qemu-loop-lvm-md.txt,5.86 KB, text/plain)
2024-10-04 11:46 UTC, Daniel Rozsnyo
sysrq - blocked calls on stuck qemu + two lvs (sysrq-lvs-twice.txt,9.57 KB, text/plain)
2024-10-04 11:47 UTC, Daniel Rozsnyo

Description Daniel Rozsnyo 2024-10-04 11:46:39 UTC
Created attachment 904902 [details]
sysrq - blocked calls on stuck qemu-loop-lvm-md stack

Hello, I am running a WinXP VM on my box, and after upgrading to kernel 6.11.0 it started misbehaving. The previous kernels were 6.6.xx and 6.1.xx.

The layer stack is this:

- gentoo-sources-6.11.0
- mdraid - RAID 1 mirror of an HDD+SSD (the HDD is flagged as write-mostly)
- lvm - to partition the above md device
- loopdev - to realign the unaligned winxp partition
- app-emulation/qemu-9.0.2-r2
- WinXP guest

Once QEMU is stuck, RDP no longer works, and neither does the SPICE console or the monitoring socket. Furthermore, LVM management commands such as lvs get stuck in D state as well.

To me it looks like a deadlock in this kernel version, related to how my VM block-device stack is built. The LVM volume holds an MBR-partitioned disk with C: starting at +1MiB (sector 2048, aligned). That does not boot natively, so I made a loop device that includes the 63 sectors preceding the start of that first partition. QEMU is then booted with an unaligned drive but an aligned partition (win-win situation):

losetup -o $(( 512 * ( 2048 - 63 ) )) "$DRIVE_BOOT" "$DRIVE_FILE"
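For reference, the offset arithmetic in the losetup call above can be spelled out as a minimal sketch. The variable names here are illustrative and not from the original setup; only the numbers (512, 2048, 63) come from the report.

```shell
#!/bin/sh
# Illustrative variable names; only the numeric values come from the report.
SECTOR_SIZE=512     # bytes per sector
PART_START=2048     # first sector of the C: partition (1 MiB aligned)
LEGACY_GAP=63       # sectors the guest expects in front of the partition

# The loop device begins 63 sectors before the partition start, so the
# partition shows up at sector 63 when seen through the loop device.
LOOP_START=$(( PART_START - LEGACY_GAP ))
OFFSET_BYTES=$(( SECTOR_SIZE * LOOP_START ))

echo "loop device starts at sector $LOOP_START (byte offset $OFFSET_BYTES)"
```

This prints a start sector of 1985 and a byte offset of 1016320, which is the value handed to losetup -o.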

So far I have added aio=threads to the QEMU drive configuration to rule out issues with the other async APIs, but it gets stuck with this setting as well.

Yesterday I had to enable CONFIG_MAGIC_SYSRQ so that I could issue: echo w >/proc/sysrq-trigger, whose output is attached. A second dump, taken after running lvs twice (lvs & and then lvs), is attached as well.

Is there any way to release such locks? This even prevents a graceful reboot: I have to manually stop the other VMs, stop NFS, unmount whatever can be unmounted, and then press the hard reset button. reboot/poweroff does nothing while those few processes are in D state.

After killing the two lvs processes that were in S/S+ state, they went into D state as well (strace showed an async API being used inside lvs):

# ps auxf | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      7358  0.0  0.0      0     0 ?        D    11:32   0:00  \_ [kworker/u16:0+loop0]
root      9444  0.0  0.0      0     0 ?        D    11:52   0:00  \_ [kworker/u16:4+flush-251:4]
root     21331  0.0  0.0   6552  2132 pts/1    S+   13:44   0:00  |                       \_ grep --color=auto D
root     19518  0.0  0.0      0     0 pts/3    D    13:24   0:00                          \_ [lvs]
root     19967  0.0  0.0      0     0 pts/3    D+   13:29   0:00                          \_ [lvs]
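Note that grep D matches any line containing a capital D, including the header and the grep process itself. A narrower listing could be sketched as below; the helper name filter_dstate is hypothetical, and the wchan column (the kernel function a task is sleeping in) is assumed to be available from the installed ps (procps) and may show "-" on some systems.

```shell
#!/bin/sh
# Hypothetical helper: keep the header line and every row whose second
# column (STAT) starts with "D", i.e. uninterruptible sleep.
# Expects input columns in the order: PID STAT WCHAN COMMAND
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# List all processes with their state and wait channel, then filter.
ps -eo pid,stat,wchan:32,comm | filter_dstate
```

Matching on the STAT column alone catches both D and D+ entries without picking up unrelated lines.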
Comment 1 Daniel Rozsnyo 2024-10-04 11:47:43 UTC
Created attachment 904903 [details]
sysrq - blocked calls on stuck qemu + two lvs
Comment 2 Mike Pagano gentoo-dev 2024-10-07 11:47:10 UTC
A couple of options:

1. Try the latest 6.11.X kernel.
2. Try a 6.12-rcX kernel, though it's a little early in that series.
3. Do a git bisect to determine whether an offending commit in the kernel can be identified.