Lately my Xen machine hangs on reboot. For some reason network connections are no longer possible at that point, and when this happens the machine does not reboot cleanly. The last lines on my screen are:

 * Remounting remaining filesystems read-only ...
 * Remounting / ... [ ok ]
*hang*

Would it be possible to add something that forces the system to reset after, say, 5 minutes?

Reproducible: Sometimes
emerge --info and emerge -vp sys-apps/baselayout, please
paludis 0.26.1

Paludis build information:
    Compiler:
        CXX: x86_64-pc-linux-gnu-g++ 4.2.3 (Gentoo 4.2.3 p1.0)
        CXXFLAGS: -march=nocona -O2 -pipe
        LDFLAGS:
        DATE: 2008-05-07T12:57:27+0200
    Libraries:
        C++ Library: GNU libstdc++ 20080201
    Reduced Privs:
        reduced_uid: 101
        reduced_uid->name: paludisbuild
        reduced_uid->dir: /dev/null
        reduced_gid: 440
        reduced_gid->name: paludisbuild
    Paths:
        DATADIR: /usr/share
        LIBDIR: /usr/lib64
        LIBEXECDIR: /usr/libexec
        SYSCONFDIR: /etc
        PYTHONINSTALLDIR:
        RUBYINSTALLDIR:

Repository virtuals:
    format: virtuals

Repository installed-virtuals:
    format: installed_virtuals
    root: /

Repository gentoo:
    format: ebuild
    location: /usr/portage
    append_repository_name_to_write_cache: true
    binary_destination: false
    binary_keywords:
    binary_uri_prefix:
    builddir: /var/tmp/paludis
    cache: /usr/portage/metadata/cache
    distdir: /usr/portage/distfiles
    eapi_when_unknown: 0
    eapi_when_unspecified: 0
    eclassdirs: /usr/portage/eclass
    ignore_deprecated_profiles: false
    layout: traditional
    names_cache: /var/empty
    newsdir: /usr/portage/metadata/news
    profile_eapi: 0
    profiles: /usr/portage/profiles/default-linux/amd64/2007.0
    securitydir: /usr/portage/metadata/glsa
    setsdir: /usr/portage/sets
    sync: rsync://rsync.gentoo.org/gentoo-portage
    sync_options:
    use_manifest: use
    write_cache: /var/empty

Package information:
paludis@1211656388: [QA version_spec.too_long] In program paludis --info:
  ... When fetching versions of 'sys-apps/baselayout' in installed:
  ... When loading package names from '/var/db/pkg' in category 'sys-apps':
  ... When parsing package dep spec '=sys-apps/net-tools-1.60_p20071202044231-r1':
  ... When parsing version spec '1.60_p20071202044231-r1':
  ... Number part '20071202044231' exceeds 8 digit limit permitted by the Package Manager Specification (Paludis supports arbitrary lengths, but other package managers do not)
    app-admin/eselect-compiler: (none)
    app-shells/bash: 3.2_p39
    dev-java/java-config: (none)
    dev-lang/python: 2.4.3-r4 2.5.2-r4
    dev-python/pycrypto: 2.0.1-r6
    dev-util/ccache: (none)
    dev-util/confcache: (none)
    sys-apps/baselayout: 2.0.0
    sys-apps/openrc: 0.2.4-r1
    sys-apps/sandbox: 1.2.18.1-r2
    sys-devel/autoconf: 2.13 2.62
    sys-devel/automake: 1.10.1-r1 1.5 1.7.9-r1 1.9.6-r2
    sys-devel/binutils: 2.18-r1
    sys-devel/gcc-config: 1.4.0-r4
    sys-devel/libtool: 1.5.26
    virtual/os-headers: 2.6.25-r3 (for sys-kernel/linux-headers::installed)

I tracked it down to Xen, in combination with libvirtd and domains running from iSCSI. Basically, the iSCSI connection to the domains is killed before the domains have really shut down. I think it would be good for cases like this to provide a hang timer in openrc, so that the system is restarted after a specific timeout.
Not easily - it's the umount command that's hanging. To replicate, mount an NFS share and then unplug the cable. Now try to unmount it and watch it hang. If someone can tell me how to get it to return sensibly, I'm all ears. Lazy umount doesn't work here.
(In reply to comment #3)
> Not easily - it's the umount command that's hanging.
> To replicate, mount a NFS share and then unplug the cable.
> Now try to unmount it. Watch it hang. If someone can tell me how to get it to
> return sensibly I'm all ears. Lazy umount doesn't work here.

In my opinion it should get a timeout, after which it just reboots, non-gracefully.
the great thing about timeout values is that no timeout is good enough for everyone

nfs is a good example of too much swing space to try and satisfy everyone

personally i mount all my stuff with like nolock,intr,soft because it's mostly read-only mounts

in order for the umount/shutdown process to be nice, we would background the umount and have the init.d script monitor that and input from the user (and optionally a timeout)
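The approach described above (background the umount, monitor it, optionally give up after a timeout) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name run_with_timeout, its variable names, and the exit code 124 are chosen here and are not part of OpenRC, and the sketch leaves out the "input from the user" part. A process stuck in uninterruptible sleep, as umount on a dead NFS server may be, cannot be killed even with SIGKILL, so the sketch abandons the command instead of waiting for it to die.

```shell
#!/bin/sh
# run_with_timeout SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND in the background and polls it once
# per second. Returns the command's own exit status if it finishes in
# time, or 124 if the limit is reached first.
# Note: POSIX sh has no local variables, so the rwt_* names are global.
run_with_timeout() {
    rwt_limit=$1
    shift

    "$@" &
    rwt_pid=$!
    rwt_waited=0

    # kill -0 sends no signal; it only checks that the process still exists.
    while kill -0 "$rwt_pid" 2>/dev/null; do
        if [ "$rwt_waited" -ge "$rwt_limit" ]; then
            # Try to kill it, but do not wait for it: a process in
            # uninterruptible sleep would make the wait hang as well.
            kill -9 "$rwt_pid" 2>/dev/null
            return 124
        fi
        sleep 1
        rwt_waited=$((rwt_waited + 1))
    done

    # The command finished on its own; collect and return its status.
    wait "$rwt_pid"
}
```

A caller could then write something like run_with_timeout 300 umount /mnt/nfs (the mount point is a placeholder) and treat a return value of 124 as "gave up". What to do at that point, and which timeout suits everyone, is exactly the open question in this comment.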
(In reply to comment #5)
> the great thing about timeout values is that no timeout is good enough for
> everyone

True.

> nfs is a good example of too much swing space to try and satisfy everyone
>
> personally i mount all my stuff with like nolock,intr,soft because it's mostly
> read-only mounts
>
> in order for the umount/shutdown process to be nice, we would background the
> umount and have the init.d script monitor that and input from the user (and
> optionally a timeout)

Isn't it possible to do something like a 'parallel' shutdown, or is this what is actually happening?

The reason we are not all pressing the 'restart' button is clear: we all want gracefully synced hard drives and proper shutdown scenarios. But in corporate or hosting scenarios it would be bizarre to have no timeout at all, and to actually need to power-cycle the machine because connectivity is lost during the shutdown process.
The only problem is that you're saying it's hanging in the code which we run when *everything* else has been shut down. Even if we had a watchdog daemon, it would get killed by the killall5 command.

For reference, here's the code *after* the command you posted was successful.

    if [ ${unmounted} -ne 0 ]; then
        [ -x /sbin/sulogin ] && sulogin -t 10 /dev/console
        exit 1
    fi

    # Load the final script - not needed on BSD so they should not exist
    [ -e /etc/init.d/"$1".sh ] && . /etc/init.d/"$1".sh

    # Always exit 0 here
    exit 0

So could you put set -x just above that code block so we can see specifically where it is hanging?
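To illustrate what the requested trace would look like, here is a minimal sketch. The function final_steps is a hypothetical stand-in for the end of the halt script, not the actual OpenRC code. It only shows that with set -x the shell prints each command to stderr before running it, so the last line on the console identifies the command that never returned.

```shell
#!/bin/sh
# Hypothetical stand-in for the final steps of the halt script.
# $1 plays the role of ${unmounted}: 0 means everything was unmounted.
final_steps() {
    unmounted=$1

    # From here on, every command is echoed to stderr before it runs.
    set -x

    if [ "${unmounted}" -ne 0 ]; then
        echo "would start sulogin here"
        set +x
        return 1
    fi

    echo "would source the final script here"

    # Turn tracing off again.
    set +x
    return 0
}

final_steps 0
```

The exact trace prefix and quoting differ between shells, but in each case the test on ${unmounted} and the echo appear in the trace in the order they execute.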
I have the same problem: the system does not power off. At first I thought it was an NFS problem; it was not. Then I thought it was an mdadm problem, but I am not sure. Most of the time I have one unused hard disk in standby mode, and perhaps that is the problem; again, I am not sure.

I tried the "set -x" command in /etc/init.d/halt.sh, placed before the section as you described. The result (I cannot copy and paste, sorry): after the " Remounting / ... [ ok ]" output, some output appears from the /etc/init.d/shutdown.sh script. This script decides to use the command "/sbin/halt -dp".

Then I tried the following: in /etc/init.d/shutdown.sh I changed opts="-d" (at the beginning of the file) to opts="-npf". The parameters are:

-n = no sync, includes -d
-f = force
-p = poweroff

I hope it helps. The poweroff problem is really scary because it is hard to test; I do not want to reboot my machine 1000 times for nothing.
The hang did not disappear :-( Is there a solution? Where exactly is the problem?
Maybe related, maybe not: the last line I see is mount-ro trying to remount / read-only. After that, nothing happens. If I change rc_parallel="YES" to rc_parallel="NO", it does output an error about a file in /etc/init.d (cannot write, because it is already mounted ro), but then it reboots. I tested this behaviour with openrc-0.4.1. The missing reboot after mount-ro tries to remount / ro also happens with openrc-0.4.0.
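For anyone wanting to try the same workaround, the change is a single shell-style variable assignment. The sketch below assumes the setting lives in /etc/rc.conf; the comment above does not say which file was edited, so check where your OpenRC version reads it from.

```shell
# Assumed location: /etc/rc.conf
# Start and stop services one at a time instead of in parallel.
rc_parallel="NO"
```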
It is unrelated to http://bugs.gentoo.org/show_bug.cgi?id=252380. Still, both would require something like 'we waited long enough, now do a hard reboot'.
Please re-test with OpenRC 0.4.3, which should address your issue.
(In reply to comment #12)
> Please re-test with OpenRC 0.4.3, which should address your issue.

No feedback, closing as fixed.