I have written a script that should back up all KVM guests live at night. For this I create an external snapshot and copy the base image to a backup store with cp, preserving sparseness. After that I commit the changed data from the snapshot back into the base image and, if that succeeded, remove the snapshot.

Unfortunately this does not work, neither with the stable nor with the unstable ebuild-built versions. During the blockcommit the machine suddenly dies. I could not find out whether the kernel killed it or whether it segfaulted. All I got in journald is:

Aug 22 05:47:17 mon unknown[514]: <audit-2501> pid=514 uid=0 auid=4294967295 ses=4294967295 msg='virt=kvm resrc=cgroup reason=allow vm="mx.roessner-net.de" uuid=b942baf4-de75-7086-854d-bfc542b4ec6d cgroup="/sys/fs/cgroup/devices/machine.slice/machine-qemu\x2dmx.roessner\x2dnet.de.scope/" class=path path="/var/lib/libvirt/images/mx.roessner-net.de.img" rdev=? acl=rw exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=? res=success'
Aug 22 05:47:19 mon qemu-system-x86[1084]: <audit-1701> auid=4294967295 uid=77 gid=77 ses=4294967295 pid=1084 comm="qemu-system-x86" exe="/usr/bin/qemu-system-x86_64" sig=6
Aug 22 05:47:19 mon kernel: grsec: denied resource overstep by requesting 4096 for RLIMIT_CORE against limit 0 for /usr/bin/qemu-system-x86_64[qemu-system-x86:1084] uid/euid:77/77 gid/egid:77/77, parent /usr/lib64/systemd/systemd[systemd:1] uid/euid:0/0 gid/egid:0/0
Aug 22 05:47:19 mon libvirtd[514]: libvirt version: 1.2.18
Aug 22 05:47:19 mon libvirtd[514]: internal error: End of file from monitor
Aug 22 05:47:20 mon lldpd[636]: error while receiving frame on vnet3: Network is down
Aug 22 05:47:20 mon kernel: br0: port 4(vnet3) entered disabled state
Aug 22 05:47:20 mon kernel: device vnet3 left promiscuous mode
Aug 22 05:47:20 mon kernel: br0: port 4(vnet3) entered disabled state
Aug 22 05:47:20 mon unknown: <audit-1700> dev=vnet3 prom=0 old_prom=256 auid=4294967295 uid=77 gid=77 ses=4294967295
Aug 22 05:47:20 mon systemd-machined[899]: Machine qemu-mx.roessner-net.de terminated.
Aug 22 05:47:21 mon unknown[514]: <audit-2500> pid=514 uid=0 auid=4294967295 ses=4294967295 msg='virt=kvm op=stop reason=failed vm="mx.roessner-net.de" uuid=b942baf4-de75-7086-854d-bfc542b4ec6d vm-pid=-1 exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=? res=success'

So you can also see that this is on Gentoo hardened (grsec enabled, no RBAC in use) with systemd. I see sig=6, SIGABRT, but who raised it, and why?

Interestingly, while I copy the image away, free memory shrinks steadily until all 48GB of RAM are used as cache. I know that this would normally be okay, but some days ago the same virtual machine died totally unexpectedly (before any backup script even existed), and the physical server had also cached all memory; Zabbix even triggered an action and told me the machine was running out of memory. So my best guess is that these belong together. When the machine crashed several days ago, the server had 24GB of RAM. I thought that was too little and doubled the memory. Now I have the same situation, and the machine crashes again.

So this bug might be difficult to track down. It could be the kernel, libvirt, qemu, or some combination of them. My best guess is qemu. Why? Maybe some bad memory allocation?

As a current workaround I flush cached memory twice a day with:

echo 1 > /proc/sys/vm/drop_caches
echo 2 > /proc/sys/vm/drop_caches
echo 3 > /proc/sys/vm/drop_caches

With this bug open, I do not have a working live backup solution.

Reproducible: Always

Steps to Reproduce:
1. Use Gentoo hardened stable kernel
2. Latest libvirt
3. Latest qemu
4. Create a script as attached
5. Create some test guests.
6. Run the script

Actual Results:
It MAY happen that blockcommit dies. Some guests work, some don't. Whether it works is completely random: you cannot say that one particular guest always fails to back up. Sometimes it works.

Expected Results:
Working live backup.
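Regarding the cache observation above: a minimal sketch, not part of the attached script, that may help tell reclaimable page cache apart from real memory pressure. It only reads /proc/meminfo. The function name cache_summary is made up for illustration, and it assumes a kernel that reports MemAvailable (3.14 or newer). Page cache filled by a large cp is normally reclaimable, so a low MemAvailable value is probably the more telling number than a high Cached value.

```shell
# cache_summary [MEMINFO_FILE]
# Hypothetical helper: print total, cached and available memory in MiB.
# Reads /proc/meminfo unless another file is given (values there are in kB).
cache_summary() {
    awk '
        $1 == "MemTotal:"     { total  = $2 }
        $1 == "MemAvailable:" { avail  = $2 }
        $1 == "Cached:"       { cached = $2 }
        END {
            printf "total=%d MiB cached=%d MiB available=%d MiB\n",
                   total / 1024, cached / 1024, avail / 1024
        }
    ' "${1:-/proc/meminfo}"
}
```

Running cache_summary before and after the copy step would show whether only the Cached figure grows or whether MemAvailable really drops.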
emerge --info hardened-sources libvirt qemu

Portage 2.2.20.1 (python 2.7.9-final-0, hardened/linux/amd64/no-multilib, gcc-4.8.4, glibc-2.20-r2, 4.1.4-hardened x86_64)
=================================================================
System Settings
=================================================================
System uname: Linux-4.1.4-hardened-x86_64-Intel-R-_Xeon-R-_CPU_L5520_@_2.27GHz-with-gentoo-2.2
KiB Mem: 49453536 total, 36320580 free
KiB Swap: 2097148 total, 2092188 free
Timestamp of repository gentoo: Fri, 21 Aug 2015 21:15:01 +0000
sh bash 4.3_p39
ld GNU ld (Gentoo 2.24 p1.4) 2.24
ccache version 3.1.9 [enabled]
app-shells/bash: 4.3_p39::gentoo
dev-lang/perl: 5.20.2::gentoo
dev-lang/python: 2.7.9-r1::gentoo, 3.4.1::gentoo
dev-util/ccache: 3.1.9-r4::gentoo
dev-util/cmake: 3.2.2::gentoo
dev-util/pkgconfig: 0.28-r2::gentoo
sys-apps/baselayout: 2.2::gentoo
sys-apps/openrc: 0.17::gentoo
sys-apps/sandbox: 2.6-r1::gentoo
sys-devel/autoconf: 2.69::gentoo
sys-devel/automake: 1.15::gentoo
sys-devel/binutils: 2.24-r3::gentoo
sys-devel/gcc: 4.8.4::gentoo
sys-devel/gcc-config: 1.7.3::gentoo
sys-devel/libtool: 2.4.6::gentoo
sys-devel/make: 4.1-r1::gentoo
sys-kernel/linux-headers: 3.18::gentoo (virtual/os-headers)
sys-libs/glibc: 2.20-r2::gentoo
Repositories:

gentoo
    location: /usr/portage
    sync-type: rsync
    sync-uri: rsync://rsync.europe.gentoo.org/gentoo-portage
    priority: -1000

x-portage
    location: /usr/local/portage
    masters: gentoo
    priority: 0

ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="* -@EULA"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/easy-rsa /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/php/apache2-php5.6/ext-active/ /etc/php/cgi-php5.6/ext-active/ /etc/php/cli-php5.6/ext-active/ /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-O2 -pipe"
DISTDIR="/usr/portage/distfiles"
EMERGE_DEFAULT_OPTS="--keep-going --with-bdeps=y --binpkg-respect-use=y --binpkg-changed-deps=y --usepkg=y --rebuilt-binaries=y --rebuilt-binaries-timestamp=20140405050000"
FCFLAGS="-O2 -pipe"
FEATURES="assume-digests binpkg-logs ccache compressdebug config-protect-if-modified distlocks ebuild-locks fixlafiles merge-sync news parallel-fetch preserve-libs protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync xattr"
FFLAGS="-O2 -pipe"
GENTOO_MIRRORS="http://de-mirror.org/gentoo/ rsync://de-mirror.org/gentoo/"
LANG="en_US.utf8"
LC_ALL="en_US.UTF-8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
MAKEOPTS="-j17"
PKGDIR="/export/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
USE="acl adns aio amd64 bacula-clientonly bacula-console bash-completion berkdb bindist btrfs bzip2 caps cli cracklib crypt curl cxx device-mapper dri gdbm hardened iconv ipv6 justify logrotate loop-aes lzo mmap mmx mmxext modules ncurses nls nptl nscd ntp openmp openssl pam pax_kernel pcre pie readline seccomp session sse sse2 ssl ssp systemd tcpd threads unicode urandom vim-syntax xattr xtpax zlib"
ABI_X86="64"
ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci"
APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias"
CALLIGRA_FEATURES="kexi words flow plan sheets stage tables krita karbon braindump author"
CAMERAS="ptp2"
COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog aggregation cgroups contextswitch cpu cpufreq curl curl_json curl_xml disk email entropy ethstat exec filecount fscache hddtemp ipmi iptables logfile multimeter netlink network nfs nginx ntpd numa openvpn ping postgresql processes protocols python sensors snmp uptime users uuid"
CPU_FLAGS_X86="mmx sse sse2"
ELIBC="glibc"
GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ublox ubx"
INPUT_DEVICES="keyboard mouse evdev"
KERNEL="linux"
LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text"
LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer"
LINGUAS="de en"
NGINX_MODULES_HTTP="access auth_basic autoindex browser charset dav empty_gif fastcgi geo gzip headers_more limit_conn limit_req map memcached proxy referer rewrite scgi spdy split_clients ssi upstream_ip_hash userid uwsgi"
OFFICE_IMPLEMENTATION="libreoffice"
PHP_TARGETS="php5-6"
PYTHON_SINGLE_TARGET="python2_7"
PYTHON_TARGETS="python2_7 python3_4"
QEMU_SOFTMMU_TARGETS="x86_64 i386"
QEMU_USER_TARGETS="x86_64 i386"
RUBY_TARGETS="ruby19 ruby20"
USERLAND="GNU"
VIDEO_CARDS="fbdev glint intel mach64 mga nouveau nv r128 radeon savage sis tdfx trident vesa via vmware dummy v4l"
XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset: CPPFLAGS, CTARGET, INSTALL_MASK, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, USE_PYTHON

=================================================================
Package Settings
=================================================================

sys-kernel/hardened-sources-4.0.8::gentoo was built with the following:
USE="symlink -build -deblob"

sys-kernel/hardened-sources-4.1.4::gentoo was built with the following:
USE="symlink -build -deblob"

app-emulation/libvirt-1.2.18-r1::gentoo was built with the following:
USE="audit caps fuse iscsi libvirtd lvm lxc macvtap nfs nls numa parted pcap qemu sasl systemd udev vepa -apparmor -avahi -firewalld -glusterfs -openvz -phyp -policykit -rbd (-selinux) -uml -virt-network -virtualbox (-wireshark-plugins) -xen"

app-emulation/qemu-2.4.0::gentoo was built with the following:
USE="aio caps curl fdt filecaps jpeg lzo ncurses nls numa pin-upstream-blobs png python sasl seccomp spice threads tls uuid vhost-net vnc xattr -accessibility -alsa -bluetooth -debug -glusterfs -gtk -gtk2 -infiniband -iscsi -nfs -opengl -pulseaudio -rbd -sdl -sdl2 (-selinux) -smartcard -snappy -ssh -static -static-softmmu -static-user -systemtap -tci -test -usb -usbredir -vde -virtfs -vte -xen -xfs"
PYTHON_TARGETS="python2_7"
QEMU_SOFTMMU_TARGETS="i386 x86_64 -aarch64 (-alpha) (-arm) -cris -lm32 (-m68k) -microblaze -microblazeel (-mips) -mips64 -mips64el -mipsel -moxie -or32 (-ppc) (-ppc64) -ppcemb -s390x -sh4 -sh4eb (-sparc) -sparc64 -unicore32 -xtensa -xtensaeb"
QEMU_USER_TARGETS="i386 x86_64 -aarch64 (-alpha) (-arm) -armeb -cris (-m68k) -microblaze -microblazeel (-mips) -mips64 -mips64el -mipsel -mipsn32 -mipsn32el -or32 (-ppc) (-ppc64) -ppc64abi32 -s390x -sh4 -sh4eb (-sparc) -sparc32plus -sparc64 -unicore32"

>>> Attempting to run pkg_info() for 'app-emulation/qemu-2.4.0'
Using:
  app-emulation/spice-protocol-0.12.3
  sys-firmware/ipxe-1.0.0_p20130925
  sys-firmware/seabios-1.7.5
    USE=binary
  sys-firmware/vgabios-0.7a
Created attachment 409828: backup-qemu-live.sh

This script backs up KVM virtual machines to a different location.
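For readers without access to the attachment, the sequence described in this report can be sketched roughly as follows. This is not the attached backup-qemu-live.sh: the function names run and backup_guest, the snapshot name backup-snapshot and all arguments are placeholders chosen for illustration. With DRY_RUN=1 (the default in this sketch) the commands are only printed, not executed.

```shell
# Print the command when DRY_RUN=1 (default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# backup_guest DOMAIN DISK BASE_IMAGE SNAPSHOT_FILE BACKUP_DIR
# Hypothetical helper following the steps described in this report.
backup_guest() {
    local domain=$1 disk=$2 base=$3 snap=$4 dest=$5
    local copy_rc=0

    # 1. External disk-only snapshot: guest writes now go to $snap,
    #    so the base image is no longer modified.
    run virsh snapshot-create-as "$domain" backup-snapshot \
        --disk-only --atomic --no-metadata \
        --diskspec "$disk,file=$snap" || return 1

    # 2. Copy the base image, keeping it sparse. Remember a failure,
    #    but still merge the snapshot back afterwards.
    run cp --sparse=always "$base" "$dest/" || copy_rc=$?

    # 3. Commit the overlay into the base image and pivot back to it.
    #    This is the step during which the guest died in this report.
    run virsh blockcommit "$domain" "$disk" \
        --active --pivot --wait --verbose || return 1

    # 4. Only reached if the commit succeeded: remove the overlay.
    run rm -f "$snap"

    return "$copy_rc"
}
```

A real run for the guest from this report might look like `DRY_RUN=0 backup_guest mx.roessner-net.de vda /var/lib/libvirt/images/mx.roessner-net.de.img /var/backups/snapshots/backup-snapshot-mx.roessner-net.de.qcow2 /path/to/backup-store`, where the backup store path is a placeholder.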
Created attachment 409830: Current kernel configuration

This is my current kernel configuration.
Created attachment 409840: libvirt XML

XML description of the VM that died last night.
I am just wondering whether the kernel option numa_balancing=1 could lead to such problems...
I did some long-running tests today with grsecurity completely disabled. The problem occurred again, so this seems to be an issue with qemu.
More information, qemu aborts the machine as you can see in the logs:

root@mon ~ # virsh blockcommit mx.roessner-net.de vda --wait --active --verbose --pivot
Block commit: [ 85 %]error: failed to query job for disk vda
error: Unable to read from monitor: Connection reset by peer

2015-08-23 16:20:28.495+0000: starting up libvirt version: 1.2.18, qemu version: 2.4.0
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-system-x86_64 -name mx.roessner-net.de -S -machine pc-i440fx-2.1,accel=kvm,usb=off -cpu qemu64,+kvm_pv_eoi -m 4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid b942baf4-de75-7086-854d-bfc542b4ec6d -no-user-config -nodefaults -device sga -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/mx.roessner-net.de.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-shutdown -boot order=cd,menu=on,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x8 -drive file=/var/backups/snapshots/backup-snapshot-mx.roessner-net.de.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=writeback -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=22,id=hostnet0,vhost=on,vhostfd=23 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=54:52:00:5f:78:c0,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/mx.roessner-net.de.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -vnc 127.0.0.1:1 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device i6300esb,id=watchdog0,bus=pci.0,addr=0x7 -watchdog-action reset -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x6 -msg timestamp=on
char device redirected to /dev/pts/1 (label charserial0)
Co-routine re-entered recursively
2015-08-23 16:27:05.697+0000: shutting down
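In the log above, the abort reason is the line right before the final "shutting down" marker. Below is a small sketch for pulling that line out of a per-domain qemu log. The function name last_abort_reason is hypothetical, and /var/log/libvirt/qemu/<domain>.log is assumed as the log location (the usual libvirt default, which may differ on other setups).

```shell
# last_abort_reason LOGFILE
# Hypothetical helper: print the line that immediately precedes the
# last "shutting down" marker in a libvirt per-domain qemu log.
last_abort_reason() {
    awk '
        /: shutting down/ { reason = prev }   # remember the line before the marker
        { prev = $0 }                         # track the previous line
        END { if (reason != "") print reason }
    ' "$1"
}
```

For the guest in this report the call would be `last_abort_reason /var/log/libvirt/qemu/mx.roessner-net.de.log`, assuming the default log path.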
I opened a bug report upstream: https://bugs.launchpad.net/qemu/+bug/1488901
The bug was fixed upstream:

commit e424aff5f307227b1c2512bbb8ece891bb895cef
Author: Kevin Wolf <kwolf@redhat.com>
Date: Thu Aug 13 10:41:50 2015 +0200

mirror: Fix coroutine reentrance

I have tested the master branch for more than 24h now and I can confirm the problem is gone.
The fix is included in the 2.4.0-r1 bump: http://gitweb.gentoo.org/repo/gentoo.git/commit/?id=fec667228a95981586716b7d25004c4d706943e2