Following an upgrade from 6.0.0-r3 to 6.2.0-r2, /etc/init.d/libvirt-guests failed to start any virtual machines at boot, despite having stopped four virtual machines during shutdown.

I found this commit:
https://gitweb.gentoo.org/repo/gentoo.git/commit/app-emulation/libvirt?id=ca0a61eed33d17d0bd434ea5ad5c7bf2f891621c

It changed the way the libvirtd daemon is started. As a result (on my system), /etc/init.d/libvirtd completes before the libvirtd daemon is ready to receive connections. libvirt-guests (the next service to be started) then reports that it is starting four blank VMs (i.e. the names are displayed as blanks).

I made this change to /etc/init.d/libvirtd:

From:
start_stop_daemon_args="-b --env KRB5_KTNAME=/etc/libvirt/krb5.tab"
To:
start_stop_daemon_args="-b -w 3000 --env KRB5_KTNAME=/etc/libvirt/krb5.tab"

The 3-second wait this introduces is enough (on my system) for libvirt-guests to find a responding libvirtd when it runs. However, this does not feel like a robust solution. It would be better for /etc/init.d/libvirtd to wait until it can connect to the service it started before exiting, but I do not know how to achieve that. Alternatively, libvirt-guests could be changed to retry the libvirtd connection a set number of times, but again I don't know how best to achieve that.
I've investigated this a bit further, and I am sorry that I missed the code at the start of the libvirt-guests start() function:

    for uri in ${LIBVIRT_URIS}; do
        do_virsh "${uri}" connect
        if [ $? -ne 0 ]; then
            eerror "Failed to connect to '${uri}'. Domains may not start."
        fi
    done

The bug I am reporting is in this code. As I see it, there are two things wrong:

1) The function do_virsh returns a $? value of zero even when a connection fails. (I tested this in a little script of my own.) Thus the test always passes.

2) As per the main bug report, libvirtd is not ready for connections when its start script exits. Thus, at least on my system, if the above test worked, I would always get the eerror message on bootup, and the domains would not start. (Note that this works fine if libvirtd is already running, as is the case when you just stop and start the domains while the system remains up.)

I would thus propose that the test code above be changed to use the command:

    local ruri=
    ruri=$(do_virsh "${hvuri}" uri)

and then to check that ${ruri} and ${hvuri} are equal. If they are not equal, the connection failed.

However, this also needs to take account of the boot startup problem (i.e. it will always show failure on my system). I would propose that an additional variable be added to the conf.d file: LIBVIRT_CONWAIT - the number of seconds to wait for a connection to become active. The startup check can then count down from this value (with a sleep 1 between attempts), stopping on the first successful connection for each uri.

My shell script skills are rubbish, and I don't know how to create patches. However, if this approach is acceptable to the devs, I am willing to put some time in to try.
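The proposed wait loop could look roughly like the sketch below. This is not the attached patch: the helper name wait_for_connection and its internal variable names are made up for illustration, and the commented usage assumes that calling 'virsh -c URI connect' directly gives a meaningful exit status.

```shell
# wait_for_connection MAX_TRIES COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND until it succeeds, sleeping one second between attempts.
# Returns 0 on the first success, 1 once MAX_TRIES attempts have failed.
# At least one attempt is always made, so MAX_TRIES of 0 or 1 means "no wait".
wait_for_connection() {
    wfc_tries="$1"
    shift
    wfc_count=0
    while :; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        wfc_count=$((wfc_count + 1))
        if [ "${wfc_count}" -ge "${wfc_tries}" ]; then
            return 1
        fi
        sleep 1
    done
}

# Assumed usage inside start(); LIBVIRT_CONWAIT would be set in conf.d:
# for uri in ${LIBVIRT_URIS}; do
#     wait_for_connection "${LIBVIRT_CONWAIT:-1}" virsh -c "${uri}" connect ||
#         eerror "Failed to connect to '${uri}'. Domains may not start."
# done
```

Passing the probe as a command keeps the loop independent of how the connection is tested, so the same helper would work with either the 'virsh uri' comparison or a plain 'virsh connect'.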
Created attachment 640570
libvirt-guests patch to wait for connections

Patches init.d and conf.d for libvirt-guests. Adds a configurable wait loop for successful connection on startup.
Test results on my system:

1) When libvirtd running:

ian2 ~ # /etc/init.d/libvirt-guests start
* Checking connection to qemu:///system ...
* Connection to qemu:///system OK
* Starting libvirt networks ... [ ok ]
* Starting libvirt domains ...
* gentoo-dns1
* gentoo-bubble-pnp
* ian
* gentoo-dns2 [ ok ]
ian2 ~ #

2) Simulated boot by flushing caches:

ian2 ~ # /etc/init.d/virtlogd stop
* Stopping libvirtd ... [ ok ]
* Stopping virtlogd ... [ ok ]
ian2 ~ #sync; echo 1 > /proc/sys/vm/drop_caches
ian2 ~ # /etc/init.d/libvirt-guests start
* Starting virtlogd ... [ ok ]
* Starting libvirtd ... [ ok ]
* Checking libvirtd connection to qemu:///system ... .
* Conection to qemu:///system OK
* Starting libvirt networks ... [ ok ]
* Starting libvirt domains ...
* gentoo-dns1
* gentoo-bubble-pnp
* ian
* gentoo-dns2 [ ok ]
ian2 ~ #

Note the single dot above, indicating that a one-second wait was required.

I commend the patch for consideration.
Created attachment 640702
libvirt-guests patch to wait for connections - correction

I have updated the previous patch to use 'virsh connect' directly, and test $?. I thought this was better since I have no way of proving that my method with 'virsh uri' will work for every type of connection. This patch does solve the issue of do_virsh always returning zero (because of the final 'head -n -1' in the command pipe). I will post my test results in a follow-up comment.
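For anyone wondering why the pipe hides the failure: in a shell without 'pipefail', the exit status of a pipeline is that of its last command. The sketch below shows the effect with a stand-in function (failing_probe is made up for this example and is not part of libvirt-guests), and uses 'head -n 1' rather than the script's 'head -n -1' only to stay portable.

```shell
# Stand-in for a virsh call that prints an error and fails.
failing_probe() {
    echo "error: failed to connect to the hypervisor"
    return 1
}

# Piped: $? reflects 'head', which succeeds, so the failure is lost.
failing_probe | head -n 1 >/dev/null
piped_status=$?

# Direct: the status of failing_probe itself is captured.
direct_status=0
failing_probe >/dev/null || direct_status=$?

echo "piped=${piped_status} direct=${direct_status}"
```

Run under plain sh or bash with default options, this prints piped=0 direct=1, which matches the behaviour seen with do_virsh.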
Test results on my system:

I set LIBVIRT_CONWAIT=10 in /etc/conf.d/libvirt-guests

1) With /etc/init.d/libvirtd stopped and removed from the run level. Expected result is a failure to connect after 10 attempts:

ian2 ~ # rc-update del libvirtd
* service libvirtd removed from runlevel default
ian2 ~ # /etc/init.d/virtlogd stop
* Stopping libvirtd ... [ ok ]
* Stopping virtlogd ... [ ok ]
ian2 ~ # /etc/init.d/libvirt-guests start
* Checking connection to qemu:///system ... ..........
* Failed to connect to 'qemu:///system'. Domains may not start.
* Starting libvirt networks ... [ ok ]
* Starting libvirt domains ...
*
*
*
* [ ok ]
ian2 ~ #

2) With /etc/init.d/libvirtd stopped, but added to the run level. Drop caches to simulate the delay experienced at startup. Expected result is at least one failed connection before a successful connection:

ian2 ~ # rc-update add libvirtd default
* service libvirtd added to runlevel default
ian2 ~ # sync; echo 1 > /proc/sys/vm/drop_caches
ian2 ~ # /etc/init.d/libvirt-guests zap
* Manually resetting libvirt-guests to stopped state
ian2 ~ # /etc/init.d/libvirt-guests start
* Starting virtlogd ... [ ok ]
* Starting libvirtd ... [ ok ]
* Checking connection to qemu:///system ... .
* Connection to qemu:///system OK
* Starting libvirt networks ... [ ok ]
* Starting libvirt domains ...
* gentoo-dns1
* gentoo-bubble-pnp
* ian
* gentoo-dns2 [ ok ]
ian2 ~ #

3) Restart after a stop with libvirtd running. Expected result is no wait on connection:

ian2 ~ # /etc/init.d/libvirt-guests start
* Checking connection to qemu:///system ...
* Connection to qemu:///system OK
* Starting libvirt networks ... [ ok ]
* Starting libvirt domains ...
* gentoo-dns1
* gentoo-bubble-pnp
* ian
* gentoo-dns2 [ ok ]
ian2 ~ #

My conclusion is that this patch keeps the functionality (using 'virsh connect') of the original script, but fixes some things:
- It now tests a valid $? value which is not always zero
- It allows a user to configure a wait time should they experience a timing issue on startup (as I did)
- It allows a user to configure no delay if they know there will not be a need for one.

I commend this patch for consideration.
*** Bug 736609 has been marked as a duplicate of this bug. ***
(From #736609: Georgy Yakovlev from comment #1)
> ewaitfile in the initscript start_post() usually helps to wait for
> socket/pidfile availability, it makes initscript return after file is
> available and has a timeout option.
>
> you can define it in confd file as a workaround, but ideally it should be a
> part of initscript itself of course.
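A rough sketch of the ewaitfile workaround described in the quote, as it might appear in /etc/conf.d/libvirtd. The socket path and the 10-second timeout are assumptions, not values taken from this bug, and the variable name LIBVIRTD_SOCKET is made up for the example. Check where libvirtd actually creates its socket on your system before using anything like this.

```shell
# ASSUMPTION: path of the libvirtd control socket; adjust for your system.
LIBVIRTD_SOCKET="/run/libvirt/libvirt-sock"

start_post() {
    # ewaitfile is an OpenRC helper: ewaitfile <timeout-seconds> <file>...
    # It returns once the file exists, or fails when the timeout expires.
    ewaitfile 10 "${LIBVIRTD_SOCKET}"
}
```

This only waits for the socket file to appear, so it is a weaker check than actually connecting with virsh. An init script that already defines its own start_post() would need the call added there instead.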