259183 – shutdown gentoo leaves out dead ssh connections

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] baselayout
Hardware:	All Linux

Importance:	High normal
Assignee:	OpenRC Team

URL:
Whiteboard:
Keywords:

Duplicates (2):	367553 406169
Depends on:
Blocks:	439098

Reported:	2009-02-16 06:51 UTC by Jerry Fleming
Modified:	2013-10-22 11:08 UTC
CC List:	11 users

See Also:	367553
Package list:
Runtime testing required:	---

Description Jerry Fleming 2009-02-16 06:51:20 UTC

When a Gentoo box is brought down, it changes to runlevel 0, stopping all daemons started in the other runlevels. This includes sshd, if it is running. But stopping sshd does not terminate existing connections: the connection processes are reparented to init and are later closed without notifying the other end, leaving dead connections behind. This is a problem when the shutdown command is issued with a delay to give all processes enough time to save their state and quit gracefully, because ssh clients are never given that time.

I see this problem on x86 and amd64 machines, so it probably affects all platforms.

Reproducible: Always

Steps to Reproduce:
1. Log in to a Gentoo box over ssh.
2. Shut down the server.
3. The connection on the client is still there, but dead.



Expected Results:  
The connection on the client should be terminated, returning the user to the shell from which the connection was made.

killall5(8) should be called during shutdown to stop the remaining processes, from /etc/init.d/halt.sh or a similar place.

Comment 1 Jeff Wallace 2009-04-26 18:16:40 UTC

This bug is also present on Arch Linux
http://bbs.archlinux.org/viewtopic.php?pid=543532

Comment 2 Roy Marples 2009-04-26 18:44:03 UTC

OpenRC has the killprocs init script which does just this.

Comment 3 Jeff Wallace 2009-04-26 18:51:31 UTC

http://roy.marples.name/projects/openrc/browser/trunk/init.d/killprocs.in ?

Comment 4 Roy Marples 2009-04-26 18:56:01 UTC

Yes, that one

Comment 5 Jeff Wallace 2009-04-26 19:16:26 UTC

I'm not sure how it is set up in Gentoo, but in Arch Linux that is already called.

/etc/rc.shutdown:
...
# Terminate all processes
stat_busy "Sending SIGTERM To Processes"
/sbin/killall5 -15 &> /dev/null
/bin/sleep 5
stat_done

stat_busy "Sending SIGKILL To Processes"
/sbin/killall5 -9 &> /dev/null
/bin/sleep 1
stat_done
...

Still have the same problem.

PS. Sorry for filling up the Gentoo bug tracker with Arch Linux stuff...

Comment 6 Roy Marples 2009-04-27 09:25:28 UTC

This is because the network is shut down before killprocs is called.
I've just committed an update to OpenRC svn which will prevent the network script from being stopped on runlevel change by default.
The nostop keyword will need to be added to other network-related scripts, such as dhcpcd, but that may not be a good default.

Comment 7 Chris Smith 2010-06-25 11:33:05 UTC

And the fix is?

Comment 8 Doug Goldstein (RETIRED) gentoo-dev 2010-06-25 19:11:06 UTC

Honestly I see this issue with nearly every distro I work with and I work with A LOT of distros. It's just a fact of life; use [enter] ~ . and be done with it.

Comment 9 SpanKY gentoo-dev 2010-07-31 03:33:10 UTC

ssh isnt special.  the same could happen with any client/server.  but i dont believe there is a way to sanely detect "this is a network process" and kill it before taking down the network.

this cannot be added to the `sshd` script because having `/etc/init.d/sshd stop` take down clients is wrong.

this cannot be added to the net.* scripts both because it cant be detected sanely and even if it could, it too would be wrong.

i dont see this as any sort of bug worth "fixing" as the vast majority of cases are valid in that the processes shouldnt be killed.  and when they arent, it isnt that big of a deal at all.  hit enter, then ~, then ., and be done with it.

Comment 10 SpanKY gentoo-dev 2012-03-01 03:59:37 UTC

*** Bug 406169 has been marked as a duplicate of this bug. ***

Comment 11 Richard Yao (RETIRED) gentoo-dev 2012-03-01 04:25:28 UTC

(In reply to comment #9)
> ssh isnt special.  the same could happen with any client/server.  but i dont
> believe there is a way to sanely detect "this is a network process" and kill
> it before taking down the network.
> 
> this cannot be added to the `sshd` script because having `/etc/init.d/sshd
> stop` take down clients is wrong.
> 
> this cannot be added to the net.* scripts both because it cant be detected
> sanely and even if it could, it too would be wrong.
> 
> i dont see this as any sort of bug worth "fixing" as the vast majority of
> cases are valid in that the processes shouldnt be killed.  and when they
> arent, it isnt that big of a deal at all.  hit enter, then ~, then ., and be
> done with it.

Could we check in the net.* scripts if any programs have connections over a given interface and issue kill commands to them?

Comment 12 Richard Yao (RETIRED) gentoo-dev 2012-03-01 05:00:18 UTC

(In reply to comment #11)
> Could we check in the net.* scripts if any programs have connections over a
> given interface and issue kill commands to them?

I looked into this in a bit more detail. Adding the following line to /etc/init.d/net.lo's stop() function seems to do the trick, provided that all of the other net.* scripts are symbolically linked to it:

for i in `ifconfig ${IFACE} | grep 'inet ' | awk '{ print $2}' | sed 's/addr://'`; do /bin/kill "$(lsof -iTCP@$i -Fp | cut -c2-)"; done;

This introduces a dependency on lsof and needs to be tweaked to work for IPv6, but it appears to work on my system.

Comment 13 Richard Yao (RETIRED) gentoo-dev 2012-03-01 05:25:59 UTC

radhermit in #gentoo-dev suggested the following improvement, which eliminates the use of grep and sed:

for i in `ifconfig ${IFACE} | awk '/inet / {gsub(/addr:/,"");print $2}'`; do /bin/kill "$(lsof -iTCP@$i -Fp | cut -c2-)"; done;
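
A rough sketch of the same idea, split into a parsing step that can be tested on its own and a kill loop. It assumes ifconfig prints either the older "inet addr:A.B.C.D" form or the newer "inet A.B.C.D" form, and that lsof's -Fp output marks each PID line with a leading "p". The function names extract_inet_addrs and kill_tcp_users are hypothetical and not part of OpenRC or any net.* script. Unlike the one-liners above, each PID is handed to kill as its own argument, so several matching processes do not end up inside one quoted string.

```shell
#!/bin/sh

# Read ifconfig-style output on stdin and print one IPv4 address per line.
# "inet6" lines are skipped because the pattern requires "inet" plus a space.
extract_inet_addrs() {
    awk '/inet / { sub(/^addr:/, "", $2); print $2 }'
}

# Send SIGTERM to every process holding a TCP socket on an address of
# interface $1. Needs root and lsof; IPv4 only, as in the original comment.
kill_tcp_users() {
    for addr in $(ifconfig "$1" | extract_inet_addrs); do
        # Keep only the "p<PID>" lines and strip the leading "p".
        for pid in $(lsof -iTCP@"$addr" -Fp 2>/dev/null | sed -n 's/^p//p'); do
            kill -TERM "$pid"
        done
    done
}
```

Something like "kill_tcp_users eth0" would be the call from a stop() function; the objection raised elsewhere in this bug (a plain restart of the interface would also kill these processes) applies to this sketch as well.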

Comment 14 SpanKY gentoo-dev 2012-04-09 23:46:44 UTC

(In reply to comment #12)

i'm not sure that's acceptable either.  i can run `/etc/init.d/net.lo restart` on a remote box right now and not worry about my stuff getting punted.

Comment 15 Richard Yao (RETIRED) gentoo-dev 2012-04-10 03:44:19 UTC

(In reply to comment #14)
> (In reply to comment #12)
> 
> i'm not sure that's acceptable either.  i can run `/etc/init.d/net.lo
> restart` on a remote box right now and not worry about my stuff getting
> punted.

What if this were added to the shutdown runlevel, so that it only occurs when the system is actually preparing to halt/restart?

Comment 16 SpanKY gentoo-dev 2012-04-21 17:29:23 UTC

(In reply to comment #15)

might work.  feel free to post a PoC ;).

Comment 17 DNAspark99 2013-02-08 22:19:07 UTC

Hello, 

I've been aware of this issue for a long time. It seems to be gentoo-specific. I was discussing it on the forums and experimenting with various suggestions and methods. It seems the best place for this is actually in the /etc/init.d/sshd script itself. 
A simple check of the current runlevel, placed within the stop() block of the script, will cleanly close any active connections only during system shutdown.  

I do hope this can be resolved once and for all.  

 http://forums.gentoo.org/viewtopic-p-7242058.html#7242058


if [ "$RC_RUNLEVEL" = shutdown ]; then 
   ps auxw | grep sshd\: | grep -v grep | awk '{print $2}' | xargs kill -s 15 
fi
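
For illustration only, the ps | grep | awk pipeline above can be written so that the matching step is testable in isolation. This sketch assumes that per-session sshd processes show a title beginning with "sshd:" (for example "sshd: user@pts/0"), which is what the snippet above relies on too. Some newer OpenSSH releases appear to title the listening process with an "sshd:" prefix as well, so lines containing "[listener]" are skipped to be safe. The function names are hypothetical.

```shell
#!/bin/sh

# Read "PID ARGS..." lines on stdin and print the PIDs of sshd session
# processes, i.e. those whose first argument word is exactly "sshd:".
sshd_session_pids() {
    awk '$2 == "sshd:" && $0 !~ /\[listener\]/ { print $1 }'
}

# Terminate session processes, but only when OpenRC reports the
# shutdown runlevel; a plain "sshd stop" leaves clients alone.
stop_sessions_on_shutdown() {
    [ "${RC_RUNLEVEL}" = "shutdown" ] || return 0
    for pid in $(ps -e -o pid= -o args= | sshd_session_pids); do
        kill -TERM "$pid"
    done
}
```

Matching on the second field avoids the "grep -v grep" step, since nothing in the pipeline has a title that starts with "sshd:".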

Comment 18 William Hubbs gentoo-dev 2013-02-08 22:53:01 UTC

Due to comment #9 as well as the additional comments in the forum thread, I am not comfortable with the fix in comment #17.

However, if you use newnet, we don't attempt to bring down the interfaces when we shut down, so why do we attempt to stop them with oldnet?

Here is something else to test instead of the suggestions in comment #17 or comment #15.

Add the following line at the very top of the stop function in net.lo; this will make the interfaces under oldnet behave like they do in newnet in this respect. They will not go down when the system is going down.

yesno $RC_GOINGDOWN && return 0

How do things behave if you do this?

Comment 19 DNAspark99 2013-02-08 23:48:37 UTC

Thanks for the response. I'll try to clarify a bit here. 

You say Comment #9 is correct? Let's examine that, shall we?:

> i dont see this as any sort of bug worth "fixing" as the vast majority of 
> cases are valid in that the processes shouldnt be killed.  and when they
> arent, it isnt that big of a deal at all.  hit enter, then ~, then ., and be
> done with it.


I fully understand and appreciate the need to preserve active connections during a process shutdown - such as during an upgrade of ssh. Absolutely correct. 

However, when the system is going down for shutdown or reboot - the parent sshd process, the associated login shell, and everything under it... are all ultimately terminated anyways, and the client has absolutely NO HOPE of recovering the connection. It's gone for good. 

When the system comes back up, you need to start a new connection anyways. Right? (Interestingly, once sshd comes back up, THAT's when a hung client will realize what's happened and finally wake-up and recognize there's been a real disconnect.) 
 
Now, nearly every other distro I've come across, *performs the courtesy* of closing any active ssh connections. 

Gentoo leaves them hanging. Why? 

What possible benefit is there to leaving otherwise *unrecoverable* ssh client connections in a hung state? 


Now I've tried your suggestion of putting "yesno $RC_GOINGDOWN && return 0" at the top of the stop block in net.lo, and it does initially appear that the interface is kept up a bit longer during shutdown, but it's still leaving ssh clients hanging. No change there, sorry.
 

Ultimately, it's of little consequence where the fix fits in - be it my suggested change to the sshd init script - or something elsewhere altogether. 

I just think it's wrong to label this as 'no big deal' and 'WONTFIX'. It's clearly an ongoing gentoo-specific issue, going on 10+ years now... but the problem is minor! The fix is easy! I've pointed the way!

Unfortunately, that's all I can do.

Comment 20 William Hubbs gentoo-dev 2013-02-09 06:39:04 UTC

(In reply to comment #19)
> Thanks for the response. I'll try to clarify a bit here. 
> 
> You say Comment #9 is correct? Let's examine that, shall we?:

Sure, but you picked the wrong part of the comment:

(In reply to comment #9)
> ssh isnt special.  the same could happen with any client/server.  but i dont
> believe there is a way to sanely detect "this is a network process" and kill
> it before taking down the network.
> 
> this cannot be added to the `sshd` script because having `/etc/init.d/sshd
> stop` take down clients is wrong.
> 
> this cannot be added to the net.* scripts both because it cant be detected
> sanely and even if it could, it too would be wrong.

I'm attempting to propose what I think is a better solution than trying to kill the processes the way you are suggesting, because I don't agree that a solution specific to sshd is a good one, and I also am not comfortable adding code to the net.lo script to attempt to kill network processes.

I was able to reproduce your issue by connecting to a system using ssh then issuing a shutdown command. I did see what you see with ssh not disconnecting when the system was shut down.

Then, I added this line to the top of net.lo's stop() function, which is taken from the stop() function in the network script in newnet.

yesno ${shutdown_network:-YES} && yesno $RC_GOINGDOWN && return 0

When I added this line then executed the shutdown again, I was disconnected from the system before it went down instead of waiting until it came back up.

This is the result you are looking for. Correct?

Can you please verify that this works for you as well by adding this code to the top of your stop() function in net.lo then shutting down. If this works for you, you should be disconnected on shutdown instead of waiting for the system to come back up.

Thanks,

William
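
To make the behaviour of that guard line concrete, here is a small self-contained sketch. The yesno function below is a simplified stand-in written for this example; OpenRC ships its own yesno helper, which is assumed to accept more spellings than this one. The echo line is a placeholder for the real interface teardown.

```shell
#!/bin/sh

# Simplified stand-in for OpenRC's yesno helper (assumption: the real
# helper recognises more values than the few handled here).
yesno() {
    case "$1" in
        [Yy][Ee][Ss]|[Tt][Rr][Uu][Ee]|[Oo][Nn]|1) return 0 ;;
        *) return 1 ;;
    esac
}

stop() {
    # Return early, leaving the interface up, when shutdown_network is
    # unset or yes-like AND the system is going down.
    yesno "${shutdown_network:-YES}" && yesno "${RC_GOINGDOWN}" && return 0

    # Placeholder for the normal teardown path.
    echo "bringing interface down"
}
```

With RC_GOINGDOWN=YES the function returns before the teardown; with RC_GOINGDOWN=NO (an ordinary stop or restart), or with shutdown_network=NO, the teardown path still runs.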

Comment 21 DNAspark99 2013-02-09 08:45:33 UTC

> added this line to the top of net.lo's stop() function, which is taken from 
> the stop() function in the network script in newnet.
>
> yesno ${shutdown_network:-YES} && yesno $RC_GOINGDOWN && return 0


Well, yes, that does work with my system based on stage3-amd64-20121210, which is great going forward!

An older system (that's been kept up to date), however, must be missing something...

Comment 22 William Hubbs gentoo-dev 2013-02-11 15:41:35 UTC

I added commit 1280b97 to OpenRC.
This means network interfaces will no longer come down by default, so
unless you change this, you should not see this issue.

Comment 23 William Hubbs gentoo-dev 2013-02-11 15:42:36 UTC

This will be part of OpenRC-0.12.

Comment 24 Joakim Tjernlund 2013-04-19 07:36:44 UTC

(In reply to comment #21)
> > added this line to the top of net.lo's stop() function, which is taken from 
> > the stop() function in the network script in newnet.
> >
> > yesno ${shutdown_network:-YES} && yesno $RC_GOINGDOWN && return 0
> 
> 
> Well, yes that does work with my system based of stage3-amd64-20121210,
> which is great going forward! 
> 
> And older system (that's been kept up to date), however, must be missing
> something...

Hi DNAspark99

Did you figure out why your older system didn't work with
 yesno ${shutdown_network:-YES} && yesno $RC_GOINGDOWN && return 0 
?
I too have an older system that has been kept up to date, and this
doesn't work for me either.

Comment 25 Joakim Tjernlund 2013-04-19 13:34:31 UTC

(In reply to comment #23)
> This will be part of OpenRc-0.12.

(In reply to comment #20)
> (In reply to comment #19)
> > Thanks for the response. I'll try to clarify a bit here. 
> > 
> > You say Comment #9 is correct? Let's examine that, shall we?:
> 
> Sure, but you picked the wrong part of the comment:
> 
> (In reply to comment #9)
> > ssh isnt special.  the same could happen with any client/server.  but i dont
> > believe there is a way to sanely detect "this is a network process" and kill
> > it before taking down the network.
> > 
> > this cannot be added to the `sshd` script because having `/etc/init.d/sshd
> > stop` take down clients is wrong.
> > 
> > this cannot be added to the net.* scripts both because it cant be detected
> > sanely and even if it could, it too would be wrong.
> 
> I'm attempting to propose what I think is a better solution than trying to
> kill the processes the way you are suggesting, because I don't agree that a
> solution specific to sshd is a good one, and I also am not comfortable
> adding code to the net.lo script to attempt to kill network processes.
> 
> I was able to reproduce your issue by connecting to a system using ssh then
> issuing a shutdown command. I did see what you see with ssh not
> disconnecting when the system was shut down.
> 
> Then, I added this line to the top of net.lo's stop() function, which is
> taken from the stop() function in the network script in newnet.
> 
> yesno ${shutdown_network:-YES} && yesno $RC_GOINGDOWN && return 0

This does not work on my system as I use DHCP. By the time /etc/init.d/killprocs
is executed, my IP address on eth0 is gone (eth0 is still up and running, though).

Adding this to /etc/init.d/sshd, stop():
	if [[ "$RC_RUNLEVEL" = shutdown ]] ; then
	    pkill -f "sshd:"
	fi
does the trick though.

Comment 26 Joakim Tjernlund 2013-04-19 13:57:55 UTC

(In reply to comment #25)
> (In reply to comment #23)
> > This will be part of OpenRc-0.12.
> 
> (In reply to comment #20)
> > (In reply to comment #19)
> > > Thanks for the response. I'll try to clarify a bit here. 
> > > 
> > > You say Comment #9 is correct? Let's examine that, shall we?:
> > 
> > Sure, but you picked the wrong part of the comment:
> > 
> > (In reply to comment #9)
> > > ssh isnt special.  the same could happen with any client/server.  but i dont
> > > believe there is a way to sanely detect "this is a network process" and kill
> > > it before taking down the network.
> > > 
> > > this cannot be added to the `sshd` script because having `/etc/init.d/sshd
> > > stop` take down clients is wrong.
> > > 
> > > this cannot be added to the net.* scripts both because it cant be detected
> > > sanely and even if it could, it too would be wrong.
> > 
> > I'm attempting to propose what I think is a better solution than trying to
> > kill the processes the way you are suggesting, because I don't agree that a
> > solution specific to sshd is a good one, and I also am not comfortable
> > adding code to the net.lo script to attempt to kill network processes.
> > 
> > I was able to reproduce your issue by connecting to a system using ssh then
> > issuing a shutdown command. I did see what you see with ssh not
> > disconnecting when the system was shut down.
> > 
> > Then, I added this line to the top of net.lo's stop() function, which is
> > taken from the stop() function in the network script in newnet.
> > 
> > yesno ${shutdown_network:-YES} && yesno $RC_GOINGDOWN && return 0
> 
> This does not work on my system as I use dhcp. By the time
> /etc/init.d/killprocs
> is executed, my IP address on eth0 is gone(eth0 is still up and running
> though)
> 
> Adding this to /etc/init.d/sshd, stop():
> 	if [[ "$RC_RUNLEVEL" = shutdown ]] ; then
> 	    pkill -f "sshd:"
> 	fi
> does the trick though.

hmm, a bit cleaner might be:
  [[ "$RC_RUNLEVEL" = shutdown ]] && pkill -P `cat "${SSHD_PIDFILE}"`
just before stopping the parent. That way you don't kill any stray sshds that
were started by other means.
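
A slightly more defensive sketch of the pidfile variant, for illustration. It checks that the PID file is readable and holds a plain number before handing it to pkill -P. The function names and the fallback path /run/sshd.pid are assumptions made for this example; the real script gets SSHD_PIDFILE from its own configuration. Sessions that outlived an earlier restart of the listener are no longer its children, so this approach would not reach them.

```shell
#!/bin/sh

# Print the PID stored in file $1; fail if the file is unreadable or the
# first line is not a plain decimal number.
read_pidfile() {
    [ -r "$1" ] || return 1
    pid=$(head -n 1 "$1" | tr -d '[:space:]')
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac
    printf '%s\n' "$pid"
}

# On shutdown only, send SIGTERM to the direct children of the sshd
# listener. Meant to run before the listener itself is stopped.
kill_sshd_children() {
    [ "${RC_RUNLEVEL}" = "shutdown" ] || return 0
    # /run/sshd.pid is a hypothetical default for this sketch.
    parent=$(read_pidfile "${SSHD_PIDFILE:-/run/sshd.pid}") || return 0
    pkill -TERM -P "$parent"
    return 0   # pkill exits non-zero when nothing matched; not an error here
}
```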

Comment 27 DNAspark99 2013-04-23 16:52:50 UTC

In the end, I went with this (complete stop block for clarity):

stop() {
        if [ "${RC_CMD}" = "restart" ] ; then
                checkconfig || return 1
        fi

        ebegin "Stopping ${SVCNAME}"
        start-stop-daemon --stop --exec "${SSHD_BINARY}" \
            --pidfile "${SSHD_PIDFILE}" --quiet
        eend $?

        # Only on halt/reboot: also terminate the per-session sshd processes
        if [ "${RC_RUNLEVEL}" = "shutdown" ]; then
                SSH_CLIENT_PIDS="$(pgrep -f 'sshd:')"
                if [ -n "${SSH_CLIENT_PIDS}" ] ; then
                    # Left unquoted on purpose: one argument per PID
                    kill -TERM ${SSH_CLIENT_PIDS}
                fi
        fi
}

Comment 28 William Hubbs gentoo-dev 2013-04-23 18:22:57 UTC

(In reply to comment #27)
> in the end, I went with this (complete stop block for clarity):

Please do not do this. Once OpenRC-0.12 is released, you will need to go back to the stock sshd init script. I have been testing here with git OpenRC and the connections are brought down fine with the stock script.

> stop() { 
>         if [ "${RC_CMD}" = "restart" ] ; then 
>                 checkconfig || return 1 
>         fi 
> 
>         ebegin "Stopping ${SVCNAME}" 
>         start-stop-daemon --stop --exec "${SSHD_BINARY}" \ 
>             --pidfile "${SSHD_PIDFILE}" --quiet 
>         eend $? 
> 
>         if [ "${RC_RUNLEVEL}" = "shutdown" ]; then 
>                 SSH_CLIENT_PIDS="$(pgrep -f 'sshd:')" 
>                 if [[ -n ${SSH_CLIENT_PIDS} ]] ; then 
>                     kill -TERM ${SSH_CLIENT_PIDS} 
>                 fi 
>         fi 
> }

Comment 29 DNAspark99 2013-04-23 18:30:32 UTC

(In reply to comment #28)
> (In reply to comment #27)
> > in the end, I went with this (complete stop block for clarity):
> 
> Please do not do this. Once OpenRC-0.12 is released, you will need to go
> back to the stock sshd init script. I have been testing here with git OpenRC
> and the connections are brought down fine with the stock script.
> 

Yes yes, this is certainly not required on newer/up-to-date systems.
However, for older systems, where it is undesirable to update the whole system, the changes to the sshd init script are the quick & dirty fix that works.

http://forums.gentoo.org/viewtopic-t-950496-highlight-.html

Comment 30 SpanKY gentoo-dev 2013-04-27 09:08:47 UTC

*** Bug 367553 has been marked as a duplicate of this bug. ***

Comment 31 William Hubbs gentoo-dev 2013-08-14 06:45:41 UTC

This is part of the netifrc network scripts, which will be pulled in as
a separate package when you update to OpenRC-0.12.

Comment 32 Thomas Deutschmann (RETIRED) gentoo-dev 2013-08-16 14:31:43 UTC

Hi,

I am not sure if this is working in openrc-0.12 as expected:

I established an SSH connection via PuTTY to a system with openrc-0.12.

I now restart the system from a *local* shell.

I'll get the message

  "The system is going down for reboot NOW!4 (tty1) (Fri Aug 16 16:11:34 2013):"

in PuTTY, but the SSH connection won't be terminated.

When the system comes back online again (the system restarts very fast, <30secs), I'll hear the *pling* sound from PuTTY saying my connection is dead, now.

I expected to hear that *pling* sound while the system is shutting down.

I see the same when I ssh from another Gentoo box into that system just before I initiate the restart. This connection will hang until the system comes back online (or a normal timeout happens). Then the (old) ssh connection will die with "Write failed: Broken pipe".

Comment 33 Alexander Vershilov (RETIRED) gentoo-dev 2013-08-20 20:16:25 UTC

There is no final solution yet.

Comment 34 Pierre Ozoux 2013-09-02 14:33:13 UTC

Hi,

I'm working on making a template available for packer to build a Gentoo VM [0]. I also encountered this bug. [1]

I'm wondering if you found a solution, and if yes, do you have an idea when it will be released?

If not, what is missing, and how can I help?

Thanks!

[0] : https://github.com/pierreozoux/packer-warehouse/
[1] : https://github.com/mitchellh/packer/issues/354

Comment 35 Thomas Deutschmann (RETIRED) gentoo-dev 2013-09-02 14:41:05 UTC

Hi,

You should not see this behavior when using any kind of static IP setup. Also, Gentoo only leaves dead connections when using dhcpcd, Gentoo's default DHCP client (i.e. when you use net-misc/dhcp, you should not see this at all).

But this is already "fixed" by upstream: http://roy.marples.name/projects/dhcpcd/changeset/f87ced10d4316cdf60dd2c0f1b38cc825e845c64

We are currently waiting for a new release...


If you experience the problem with a static IP setup, please tell us...

Comment 36 William Hubbs gentoo-dev 2013-09-27 05:42:54 UTC

This is fixed in dhcpcd-6.1.
Please re-open if it is still an issue after you upgrade to this version
of dhcpcd.
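
For anyone checking whether a box already has the fixed dhcpcd, a minimal version comparison could look like the sketch below. The helper name version_ge is hypothetical, and the commented usage assumes that "dhcpcd --version" prints something like "dhcpcd X.Y.Z" on its first line.

```shell
#!/bin/sh

# Succeed when dotted version $1 is greater than or equal to $2.
# Components are compared numerically; missing components count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            xi = (i <= n) ? x[i] + 0 : 0
            yi = (i <= m) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Example use (output format of --version is an assumption):
#   installed=$(dhcpcd --version | awk 'NR == 1 { print $2 }')
#   version_ge "$installed" 6.1 || echo "dhcpcd is older than 6.1"
```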