223399 – sys-apps/openrc - remounting / r/o should time out

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] baselayout
Hardware:	All Linux

Importance:	High enhancement
Assignee:	Gentoo's Team for Core System packages

Reported:	2008-05-24 01:23 UTC by Stefan de Konink
Modified:	2009-12-19 18:09 UTC
CC List:	5 users

Description Stefan de Konink 2008-05-24 01:23:15 UTC

For one reason or another, my Xen machine has lately been hanging on reboot. For some reason network connections are no longer possible, and when this happens the machine does not 'reboot' cleanly.

The last lines on my screen are:

* Remounting remaining filesystems read-only ...
*   Remounting / ...   [ ok ]

*hang*


Would it be possible to add something that forces the system to reset after, let's say, 5 minutes?

Reproducible: Sometimes

Comment 1 Jeroen Roovers (RETIRED) gentoo-dev 2008-05-24 16:33:17 UTC

Please post the output of emerge --info and emerge -vp sys-apps/baselayout.

Comment 2 Stefan de Konink 2008-05-24 19:15:14 UTC

paludis 0.26.1
Paludis build information:
    Compiler:
        CXX:                   x86_64-pc-linux-gnu-g++ 4.2.3 (Gentoo 4.2.3 p1.0)
        CXXFLAGS:              -march=nocona -O2 -pipe
        LDFLAGS:               
        DATE:                  2008-05-07T12:57:27+0200

    Libraries:
        C++ Library:           GNU libstdc++ 20080201

    Reduced Privs:
        reduced_uid:           101
        reduced_uid->name:     paludisbuild
        reduced_uid->dir:      /dev/null
        reduced_gid:           440
        reduced_gid->name:     paludisbuild

    Paths:
        DATADIR:               /usr/share
        LIBDIR:                /usr/lib64
        LIBEXECDIR:            /usr/libexec
        SYSCONFDIR:            /etc
        PYTHONINSTALLDIR:      
        RUBYINSTALLDIR:        

Repository virtuals:
    format:                    virtuals

Repository installed-virtuals:
    format:                    installed_virtuals
    root:                      /

Repository gentoo:
    format:                    ebuild
    location:                  /usr/portage
    append_repository_name_to_write_cache: true
    binary_destination:        false
    binary_keywords:           
    binary_uri_prefix:         
    builddir:                  /var/tmp/paludis
    cache:                     /usr/portage/metadata/cache
    distdir:                   /usr/portage/distfiles
    eapi_when_unknown:         0
    eapi_when_unspecified:     0
    eclassdirs:                /usr/portage/eclass
    ignore_deprecated_profiles: false
    layout:                    traditional
    names_cache:               /var/empty
    newsdir:                   /usr/portage/metadata/news
    profile_eapi:              0
    profiles:                  /usr/portage/profiles/default-linux/amd64/2007.0
    securitydir:               /usr/portage/metadata/glsa
    setsdir:                   /usr/portage/sets
    sync:                      rsync://rsync.gentoo.org/gentoo-portage
    sync_options:              
    use_manifest:              use
    write_cache:               /var/empty
    Package information:
paludis@1211656388: [QA version_spec.too_long] In program paludis --info:
  ... When fetching versions of 'sys-apps/baselayout' in installed:
  ... When loading package names from '/var/db/pkg' in category 'sys-apps':
  ... When parsing package dep spec '=sys-apps/net-tools-1.60_p20071202044231-r1':
  ... When parsing version spec '1.60_p20071202044231-r1':
  ... Number part '20071202044231' exceeds 8 digit limit permitted by the Package Manager Specification (Paludis supports arbitrary lengths, but other package managers do not)
        app-admin/eselect-compiler: (none)
        app-shells/bash:       3.2_p39
        dev-java/java-config:  (none)
        dev-lang/python:       2.4.3-r4 2.5.2-r4
        dev-python/pycrypto:   2.0.1-r6
        dev-util/ccache:       (none)
        dev-util/confcache:    (none)
        sys-apps/baselayout:   2.0.0
        sys-apps/openrc:       0.2.4-r1
        sys-apps/sandbox:      1.2.18.1-r2
        sys-devel/autoconf:    2.13 2.62
        sys-devel/automake:    1.10.1-r1 1.5 1.7.9-r1 1.9.6-r2
        sys-devel/binutils:    2.18-r1
        sys-devel/gcc-config:  1.4.0-r4
        sys-devel/libtool:     1.5.26
        virtual/os-headers:    2.6.25-r3 (for sys-kernel/linux-headers::installed)


I tracked it down to Xen, in relation to libvirtd and domains running from iSCSI. Basically, the iSCSI connection to the domains is killed before the domains have really shut down. I think it would be a good thing in cases like this to provide a hang timer in openrc, so that the system is restarted after a specific timeout.

Comment 3 Roy Marples 2008-05-29 13:43:03 UTC

Not easily - it's the umount command that's hanging.
To replicate, mount a NFS share and then unplug the cable.
Now try to unmount it. Watch it hang. If someone can tell me how to get it to return sensibly I'm all ears. Lazy umount doesn't work here.

Comment 4 Stefan de Konink 2008-05-29 14:31:41 UTC

(In reply to comment #3)
> Not easily - it's the umount command that's hanging.
> To replicate, mount a NFS share and then unplug the cable.
> Now try to unmount it. Watch it hang. If someone can tell me how to get it to
> return sensibly I'm all ears. Lazy umount doesn't work here.

In my opinion it should get a timeout, after which it just reboots, non-gracefully.

Comment 5 SpanKY gentoo-dev 2008-06-01 02:12:28 UTC

the great thing about timeout values is that no timeout is good enough for everyone

nfs is a good example of too much swing space to try and satisfy everyone

personally i mount all my stuff with like nolock,intr,soft because it's mostly read-only mounts

in order for the umount/shutdown process to be nice, we would background the umount and have the init.d script monitor that and input from the user (and optionally a timeout)
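
The idea in comment 5 (background the umount and let the script monitor it, optionally with a timeout) could be sketched roughly as below. This is only an illustration, not OpenRC code: run_with_timeout is a hypothetical helper name, and the return value 124 on timeout is an arbitrary choice. Note also that a umount blocked in uninterruptible sleep may not react to the signal at all; in that case the sketch merely stops waiting so the caller can decide what to do next.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenRC): run a command in the background
# and stop waiting for it after a given number of seconds.
# Usage: run_with_timeout SECONDS COMMAND [ARGS...]
run_with_timeout() {
    rwt_limit=$1
    shift
    "$@" &                      # start the command in the background
    rwt_pid=$!
    rwt_elapsed=0
    # Poll once per second while the background process is still alive.
    while kill -0 "$rwt_pid" 2>/dev/null; do
        if [ "$rwt_elapsed" -ge "$rwt_limit" ]; then
            # Give up: ask the process to terminate and report a timeout.
            # A process stuck in uninterruptible sleep may ignore this.
            kill -TERM "$rwt_pid" 2>/dev/null
            return 124
        fi
        sleep 1
        rwt_elapsed=$((rwt_elapsed + 1))
    done
    # The command finished on its own; hand back its exit status.
    wait "$rwt_pid"
}
```

A caller might then use something like "run_with_timeout 60 umount /mnt/example || echo 'umount did not finish in time'", where /mnt/example and the 60 second limit are placeholders.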

Comment 6 Stefan de Konink 2008-06-01 02:17:02 UTC

(In reply to comment #5)
> the great thing about timeout values is that no timeout is good enough for
> everyone

True.

> nfs is a good example of too much swing space to try and satisfy everyone
> 
> personally i mount all my stuff with like nolock,intr,soft because it's mostly
> read-only mounts
> 
> in order for the umount/shutdown process to be nice, we would background the
> umount and have the init.d script monitor that and input from the user (and
> optionally a timeout)

Isn't it possible to do something like a 'parallel' shutdown, or is this what is actually happening? The reason that we are not all pressing the 'restart' button is clear: we all want gracefully synced hard drives and proper shutdown scenarios.

But in corporate or hosting scenarios it would be bizarre not to have a timeout at all, and to actually need to power-cycle the machine because connectivity was lost during the shutdown process.

Comment 7 Roy Marples 2008-06-01 06:30:09 UTC

The only problem is that you're saying it's hanging in the code which we run when *everything* else has been shut down. Even if we had a watchdog daemon, it would get killed by the killall5 command. For reference, here's the code that runs *after* the command you posted has succeeded.

if [ ${unmounted} -ne 0 ]; then
        [ -x /sbin/sulogin ] && sulogin -t 10 /dev/console
        exit 1
fi
# Load the final script - not needed on BSD so they should not exist
[ -e /etc/init.d/"$1".sh ] && . /etc/init.d/"$1".sh
# Always exit 0 here
exit 0

So could you put
set -x
just above that code block so we can see specifically where it is hanging?

Comment 8 Mario Bachmann 2008-08-15 14:58:35 UTC

i have the same problem. the system does not power off.
first i thought it was an nfs problem. it was not. then i thought it was an mdadm problem, but i am not sure. most of the time i have one unused hard disk in standby mode. perhaps that is the problem; i am not sure.

i added the "set -x" command to /etc/init.d/halt.sh (just before that section, as you described).

result (i can not copy&paste, sorry):
after the " Remounting / ...   [ ok ]" output, some output appears from the /etc/init.d/shutdown.sh script. this script decides to use the command "/sbin/halt -dp"

now i tried the following. in /etc/init.d/shutdown.sh i changed
opts="-d" (at the beginning of the file) to opts="-npf".
the parameters are:
-n = no sync, includes -d
-f = force
-p = poweroff

i hope it helps.

the poweroff problem is really scary because it is hard to test. i do not want to reboot my machine 1000 times for nothing.

Comment 9 Mario Bachmann 2008-08-19 16:17:26 UTC

the hang did not go away :-(

is there a solution? where exactly is the problem?

Comment 10 Thomas Sachau gentoo-dev 2008-12-26 15:20:02 UTC

Maybe related, maybe not:

The last line i see is mount-ro trying to remount / read only. After that, nothing happens. If i change rc_parallel="YES" to ="NO", it does output an error about a file in /etc/init.d (cannot write, because already mounted ro), but then reboots.
Tested this behaviour with openrc-0.4.1.
No reboot after mount-ro trying to remount / ro also happens with openrc-0.4.0

Comment 11 Stefan de Konink 2008-12-26 15:36:44 UTC

It is unrelated to http://bugs.gentoo.org/show_bug.cgi?id=252380; still, both would require something like 'we waited long enough, now do a hard reboot'.

Comment 12 Doug Goldstein (RETIRED) gentoo-dev 2009-02-11 14:39:49 UTC

Please re-test with OpenRC 0.4.3, which should address your issue.

Comment 13 Jeremy Olexa (darkside) (RETIRED) archtester 2009-12-19 18:09:43 UTC

(In reply to comment #12)
> Please re-test with OpenRC 0.4.3, which should address your issue.
> 

No feedback, closing as fixed.