Summary: | sys-fs/lvm2-2.02.28-r2 snapshotting hangs under load | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Christopher Head <bugs> |
Component: | [OLD] Core system | Assignee: | Robin Johnson <robbat2> |
Status: | VERIFIED FIXED | ||
Severity: | critical | CC: | base-system, cardoe, rocket |
Priority: | High | ||
Version: | unspecified | ||
Hardware: | x86 | ||
OS: | Linux | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: | kernel configuration |
Description
Christopher Head
2007-12-10 19:30:19 UTC
Created attachment 138195
kernel configuration
Folks, could you create yourself an lvm-bugs or whatnot alias? This gets to be a huge PITA to assign.

Sounds like your hardware doesn't support enough IO for what you are doing. Your snapshot is also too small for the amount of writing you are doing - it should be large enough to handle the quantity of writes expected in the lifetime of the snapshot.

I cannot reproduce this on my serious LVM box (which runs LVM snapshots every 4 hours for backups): 16-disk 10k RPM SAS, 16 GB RAM, 2x dual-core Opteron 2214 HE (presently on 2.6.19.7, uptime 90 days). The snapshotting is done on the MySQL datadir, and it takes a full minute with the amount of write IO that is seen by that LV.

To test whether it's an IO problem:
Create a system with TWO sets of disks. The first is the system set that you are running off. The second is the disk(s) for the test - absolutely nothing else should be on them, and they should not even be mounted until the test.
Make VG 'vgtest' consisting of the test disks.
Make LV 'foo' for your test.
Open two windows for monitoring. In the first, run top. In the second, run 'iostat -x -t 5 -m DEV' (with some set of DEV that are the real devices for vgtest).
Run the test, and keep an eye on iowait CPU, system CPU, and the disk utilization.

I tried on a smaller box of mine as well: 2x 750 GB in software RAID1, quad G5, 12 GB of RAM. System CPU was 12%, iowait was 30%, disk utilization was 5%. This was a 4-way box, so that implies a single-CPU box does not have enough CPU to run the test with my RAID1 disks.

I'm afraid I don't have a system with a second disk to test with. However, I do have some news: I've reduced the "steps to reproduce" to a much simpler and more deterministic sequence:
1. lvcreate -L256M -nfoo vg0
2. lvcreate -s -L64M -nsnap vg0/foo
3. dd if=/dev/zero of=/dev/vg0/foo bs=1M count=256

A short way into the dd, syslog gets "device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.". Perfect! This is precisely what I expected.
Here's the problem: the dd process also hangs, in state D+ (according to ps aux), aka "uninterruptible sleep". As the state suggests, it is unkillable. This is not so perfect: while I expected the snapshot to stop working once it runs out of space, it doesn't seem at all sensible for processes accessing the *original volume* to start hanging. Everything I've ever read about LVM suggests that the snapshot will start returning errors when you try to read from it, but the original volume will not be affected.

In your new instructions, are you running #2 and #3 in parallel or in series? I do get the kernel message you mention, but my boxes never lock up (again, they are all SMP with plenty of disk etc).
[835297.942101] device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.

In series. No parallel operations this time. Also, new news: I haven't done extensive enough testing to be sure, but I think the problem may be gone as of 2.6.23. I just upgraded and can't reproduce.

Can't reproduce any more under 2.6.23 on a bunch of machines. Thanks for your time!
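
For anyone retesting on another kernel, the hang described above (dd stuck in state D+) can be checked for with a small filter over ps output. This is only a sketch and not part of the original report: the function name list_d_state is made up for illustration, and it assumes a procps-style ps that accepts -eo pid=,stat=,comm=.

```shell
# list_d_state (hypothetical helper name): print "PID COMMAND" for every
# process whose state starts with D (uninterruptible sleep).
# Expects "PID STAT COMMAND" lines on stdin.
list_d_state() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# Usage on a live system (assumes procps ps):
#   ps -eo pid=,stat=,comm= | list_d_state
```

Only the first word of the command name is printed, which is enough to spot a stuck dd. Processes pass through state D briefly during ordinary IO, so one that shows up on every check is the one that matters.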