Bug 556456 - sys-apps/rng-tools-5: /etc/init.d/rngd will not start when used with TrueRNG
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: Göktürk Yüksek
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-08-01 17:31 UTC by John Bowler
Modified: 2015-10-08 05:18 UTC

See Also:
Package list:
Runtime testing required: ---


Attachments
emerge --info sys-apps/rng-tools (rng-tools.info,5.85 KB, text/plain)
2015-08-06 15:23 UTC, John Bowler
strace log from rngd (rngd.strace,3.98 KB, text/plain)
2015-08-13 23:44 UTC, John Bowler
strace of parent rngd process in failing case (rngd.strace.27512,4.75 KB, text/plain)
2015-08-14 00:36 UTC, John Bowler
child strace from rngd in failing case (rngd.strace.27516,16.11 KB, text/plain)
2015-08-14 00:37 UTC, John Bowler
Patch to add O_NOCTTY when opening the RNG device (rng-tools-5-add-O_NOCTTY.patch,422 bytes, patch)
2015-08-14 05:32 UTC, John Bowler

Description John Bowler 2015-08-01 17:31:22 UTC
TrueRNG is a USB hardware random number generator implemented as a USB CDC modem; so it looks like a TTY to the OS.  rng-tools-5 works fine with it when it is supplied as an --rng-device and run from the command line, however when run from start-stop-daemon (the default with /etc/init.d/rngd) it (apparently) exits without warning during startup.  There is a partial analysis of the problem here:

https://forums.gentoo.org/viewtopic-t-992872-start-0.html

What seems to be happening is that start-stop-daemon closes streams or does stuff so that when rngd opens /dev/TrueRNG (typically a link to /dev/ttyACM0) the device becomes the rngd controlling terminal.  Subsequently (this is guesswork) rngd does something to its controlling terminal which causes rngd to exit.

As a workaround, changing $command_args to use '--foreground' and adding '--background --make-pidfile' to start-stop-daemon allows /etc/init.d/rngd to work and the random number generator seems to be working correctly, even though rngd has ttyACM0 as its controlling terminal.  Attaching to rngd using gdb suggests everything is ok.

As a probably better workaround, I can skip start-stop-daemon completely by adding a 'start' function to /etc/init.d/rngd; one which just does:

start(){
   ebegin "Starting TrueRNG rngd"
   $command $command_args
   eend $?
}

(I.e. no other changes from the original /etc/init.d/rngd.)  In this case the controlling terminal is '?' and the 'ps j' output looks far more normal.

Looking at the source it seems that the problem may simply be the call to daemon(3); my guess is that start-stop-daemon has already done this and /dev/TrueRNG, which is opened *before* the daemon call, gets associated with file descriptor 0 then closed by the daemon call, but then I don't see why it works for things other than TrueRNG (or, maybe, it doesn't?)
Comment 1 Göktürk Yüksek archtester gentoo-dev 2015-08-06 04:39:58 UTC
Can I have the relevant 'emerge --info sys-apps/rng-tools' information? I'm mostly curious about uname, profile and openrc details.
Comment 2 Doug Goldstein (RETIRED) gentoo-dev 2015-08-06 13:06:54 UTC
I would run strace and get a full output of all the syscalls that are being called and we can see what's going wrong from there.
Comment 3 John Bowler 2015-08-06 15:23:47 UTC
Created attachment 408414 [details]
emerge --info sys-apps/rng-tools
Comment 4 John Bowler 2015-08-06 15:26:20 UTC
The emerge --info output is attached, however the kernel being used when I reported the bug was an earlier minor revision of 4.1 (4.1.3 I believe.)  Here's a summary of the information:

Portage 2.2.20 (python 3.4.3-final-0, default/linux/amd64/13.0/desktop/kde, gcc-5.1.0, glibc-2.21-r1, 4.1.4-gentoo x86_64)
System uname: Linux-4.1.4-gentoo-x86_64-Intel-R-_Core-TM-_i7-3770_CPU_@_3.40GHz-with-gentoo-2.2
sys-apps/openrc:          0.17::gentoo
Comment 5 Göktürk Yüksek archtester gentoo-dev 2015-08-13 17:49:40 UTC
I don't have TrueRNG hardware, so it's harder for me to reproduce the issue.

However, I tried the following: used tty0tty, a null modem emulator, to make two ttys talk to each other. Then using cat, I fed data from /dev/urandom to one tty and configured rngd (by editing /etc/conf.d/rngd) to read from the other tty. With the default rngd init script, I could not reproduce this bug. rngd did start normally and I observed an increase in the available entropy by checking the value of /proc/sys/kernel/random/entropy_avail.
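
For anyone repeating this check, the entropy estimate can be sampled a few times to see whether it rises once rngd is running. The sketch below is only an illustration in Python; the function names and the sampling interval are assumptions made for this example, and only the /proc path comes from the comment above.

```python
import time
from pathlib import Path

# Path taken from the comment above; everything else here is illustrative.
ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")


def read_entropy(path=ENTROPY_AVAIL):
    """Return the kernel's current entropy estimate (in bits) as an int."""
    return int(Path(path).read_text().strip())


def sample_entropy(samples=5, interval=1.0, path=ENTROPY_AVAIL):
    """Read the estimate `samples` times, sleeping `interval` seconds between reads."""
    readings = []
    for i in range(samples):
        if i:
            time.sleep(interval)
        readings.append(read_entropy(path))
    return readings
```

Calling `sample_entropy()` before and after starting rngd gives two short series to compare; a series that stays flat or drains suggests the daemon is not feeding the pool.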

I looked at the code of start-stop-daemon and it seems that unless '--background' is requested, it doesn't do anything with the file descriptors. Even in the case that '--background' is requested, it simply dup2()s the stdin, stdout, and stderr to /dev/null.

Likewise, the man page for daemon(3) suggests that it does not close any file descriptors, it merely redirects the streams to /dev/null.
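
To make the distinction concrete, here is a rough Python sketch of "redirect the standard streams to /dev/null without closing anything else". It is not the start-stop-daemon or glibc code, and the function name is invented for this example; it only illustrates that a descriptor opened earlier (such as the RNG device) is left alone by this kind of redirection.

```python
import os


def redirect_stdio_to_devnull():
    """Point fds 0, 1 and 2 at /dev/null; every other open fd is untouched."""
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        os.dup2(devnull, fd)
    # Only close the temporary descriptor if it is not one of the standard three.
    if devnull > 2:
        os.close(devnull)
```

A descriptor such as fd 3 opened before the call remains valid afterwards, which is consistent with the RNG device staying open across daemon().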

The confusion about the controlling terminal might have been caused by a call to setsid(2) in start-stop-daemon, which rngd itself doesn't perform.

I will continue to investigate this but most of the debugging is likely to fall on you. Like cardoe suggested, perhaps you can provide an strace log to us?
Comment 6 John Bowler 2015-08-13 23:44:54 UTC
Created attachment 408960 [details]
strace log from rngd

This is the log produced by replacing /usr/sbin/rngd with this script in /etc/init.d/rngd:

#!/bin/sh
exec /usr/bin/strace -o/root/rngd.strace /usr/sbin/rngd "$@" 2>/root/rngd.log >&2
Comment 7 John Bowler 2015-08-13 23:46:16 UTC
The setsid explains the controlling terminal.  I can reproduce the bug by replacing command=/usr/sbin/rngd with command=/root/rngd where the latter is this shell script:

#!/bin/sh
exec /usr/sbin/rngd "$@" 2>/root/rngd.log >&2

I see the same error (rngd exits before the start-stop-daemon wait has expired so there is no pid file.)  /root/rngd.log is created but is empty.  With strace however:

#!/bin/sh
exec /usr/bin/strace -o/root/rngd.strace /usr/sbin/rngd "$@" 2>/root/rngd.log >&2

The whole shebang just works; I get a running rngd:

hippopopus ~ # /etc/init.d/rngd start
 * Starting rngd ...                                                                                                                                                                  [ ok ]
hippopopus ~ # cat /var/run/rngd.pid
26587
hippopopus ~ # ps ww 26587
  PID TTY      STAT   TIME COMMAND
26587 ?        Ss     0:00 /usr/sbin/rngd --pid-file /var/run/rngd.pid --background --random-step 64 --no-tpm=1 --no-drng=1 --fill-watermark 2048 --rng-device /dev/TrueRNG

and notice that the TTY is '?'.

The strace, of course, exits after the daemon call; see the previously attached log file (rngd.strace).

I'll see if I can come up with a script where strace produces the same failure.
Comment 8 John Bowler 2015-08-13 23:54:23 UTC
It's enough to make the thing work if I just remove the 'exec' in the shell script:

#!/bin/sh
/usr/sbin/rngd "$@"

So the failure *only* occurs when the original (pre-fork) rngd process is itself a process group leader.

I think it is down to the POSIX rules for the controlling terminal; the process group leader is dying (the original rngd), it has a controlling terminal and the child gets the expected HUP.  If there was no controlling terminal there would be no HUP.  I think; too long since I debugged BSD/POSIX 1003.1 issues like this.
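
A small Python sketch of why the 'exec' could matter, assuming (as noted earlier in this thread) that start-stop-daemon calls setsid() before running the command. With 'exec', rngd replaces the process that called setsid(), so it is the session leader; without 'exec', rngd is a child of the shell and is not. The function name is made up for this illustration.

```python
import os


def session_leader_flags():
    """Return (child_is_leader, grandchild_is_leader).

    The child calls setsid() (standing in for start-stop-daemon plus an
    exec'd rngd); the grandchild stands in for rngd started without exec.
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        try:
            os.close(r)
            os.setsid()  # become leader of a new session
            gpid = os.fork()
            if gpid == 0:  # grandchild: same session, but not its leader
                try:
                    os.write(w, b"1" if os.getsid(0) == os.getpid() else b"0")
                finally:
                    os._exit(0)
            os.waitpid(gpid, 0)
            os.write(w, b"1" if os.getsid(0) == os.getpid() else b"0")
        finally:
            os._exit(0)
    os.close(w)
    data = b""
    while chunk := os.read(r, 16):
        data += chunk
    os.close(r)
    os.waitpid(pid, 0)
    # The grandchild writes first, the child writes after waiting for it.
    return data[1:2] == b"1", data[0:1] == b"1"
```

Only the process that called setsid() reports itself as session leader, which matches the observation that the wrapper without 'exec' avoids the failure.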
Comment 9 John Bowler 2015-08-14 00:23:09 UTC
As a more refined hypothesis, I guess daemon(3) (being BSD) does call setpgrp(), and that is going to create a problem reading from the controlling terminal (if there is one) because rngd (the child) will now block when it tries to read from it.

Here's what POSIX 1003.1 has to say (section 7.1.1.3), my comments in []:

"If a session leader has no controlling terminal, and opens a terminal device file that is not already associated with a session without using the O_NOCTTY option, it is implementation-defined [i.e. BSD and SysV differed in behavior] whether the terminal becomes the controlling terminal of the session leader."

So this is the case with a direct exec of rngd from start-stop-daemon and we can reasonably guess that Linux is behaving like SysV.  It's also the case with my --foreground workaround, but in that case daemon() is not called and the process group is *not* changed.  POSIX then goes on to say:

"If a process which is not a session leader opens a terminal file, or the O_NOCTTY option is used on open(), that terminal shall not become the controlling terminal of the calling process."

So this is what my strace (etc) shell script caused to happen.
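
The two rules quoted above can be observed on Linux with a pseudo-terminal standing in for /dev/ttyACM0. This is a hedged Python sketch, not rngd code: the function name is invented here, the behaviour shown is what I would expect on Linux specifically (POSIX leaves it implementation-defined), and it assumes /dev/tty and /dev/pts are available.

```python
import os


def acquires_controlling_tty(extra_flags):
    """Return True if a new session leader that opens a pty slave with
    O_RDWR | extra_flags ends up with a controlling terminal."""
    master_fd, slave_fd = os.openpty()
    slave_name = os.ttyname(slave_fd)
    r, w = os.pipe()

    pid = os.fork()
    if pid == 0:  # child
        try:
            os.close(r)
            os.setsid()  # new session, so no controlling terminal yet
            os.open(slave_name, os.O_RDWR | extra_flags)
            try:
                # /dev/tty can only be opened by a process that has a
                # controlling terminal.
                os.close(os.open("/dev/tty", os.O_RDWR))
                os.write(w, b"1")
            except OSError:
                os.write(w, b"0")
        finally:
            os._exit(0)

    os.close(w)
    answer = os.read(r, 1)
    os.close(r)
    os.waitpid(pid, 0)
    os.close(master_fd)
    os.close(slave_fd)
    return answer == b"1"
```

On a Linux system I would expect `acquires_controlling_tty(0)` to be true and `acquires_controlling_tty(os.O_NOCTTY)` to be false, which is the difference between the failing and working cases described here.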

Then POSIX states (I've omitted some stuff that I don't think is relevant):

"When a controlling process terminates, the controlling terminal is disassociated from the current session, allowing it to be acquired by a new session leader.  Subsequent access to the terminal by other processes in the earlier session may be denied, with attempts to access the terminal treated as if modem disconnect had been sensed."

That seems to fit the observed behavior; rngd is ignoring HUP (I checked) but nevertheless it can't read from /dev/TrueRNG ever again.

I'll try strace -D -f and attach a log; I should be able to repro the problem with those options (I'm not familiar with strace, yet.)
Comment 10 John Bowler 2015-08-14 00:36:26 UTC
Created attachment 408962 [details]
strace of parent rngd process in failing case

Parent strace from:

#!/bin/sh
exec /usr/bin/strace -D -ff -ttt -o/root/rngd.strace /usr/sbin/rngd "$@"
#/usr/sbin/rngd "$@"
Comment 11 John Bowler 2015-08-14 00:37:19 UTC
Created attachment 408964 [details]
child strace from rngd in failing case

from:

#!/bin/sh
exec /usr/bin/strace -D -ff -ttt -o/root/rngd.strace /usr/sbin/rngd "$@"
#/usr/sbin/rngd "$@"
Comment 12 John Bowler 2015-08-14 00:55:51 UTC
So...  In the failing case:

1) rngd is the session leader, it opens /dev/TrueRNG *without* the O_NOCTTY flag and it gets it as the controlling terminal.
2) rngd then forks, at time 1439512108.737364
3) The child calls daemon() which, very first thing, calls setsid() at time 1439512108.737441
4) The *PARENT* rngd exits at time 1439512108.737434 with strace logging the final exit at time 1439512108.740669 (so the kernel closed the controlling terminal somewhere between these times.)
NOTE: this is a multi-cpu machine ;-)
5) The child calls read(3[==/dev/TrueRNG]) first at time 1439512108.737772
6) Remember: strace gets told the parent exited at 1439512108.740669
7) The child persists and calls read(3) *last* at time 1439512108.741423

Looks like a kernel bug to me.  All the same it's also a horrible race and there is an rngd bug there too; it shouldn't be opening the RNG device *without* O_NOCTTY (but, there again, who would have expected it to be a terminal?)

Because parent and child are probably on different CPUs I can believe that the strace times for the last read and the exit, which differ by under 1ms, aren't relevant, however I suspect the kernel caches the fact that the child isn't permitted to read from the terminal and hasn't got round to updating it (it doesn't have to be synchronous with the close of fd 3 in the parent.)
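
For reference, the intervals between the strace timestamps in the timeline above can be worked out exactly. This is just arithmetic on the numbers already listed, sketched in Python; the event labels and the helper name are made up for this example.

```python
from decimal import Decimal

# Timestamps copied from the numbered timeline (seconds since the epoch).
EVENTS = {
    "fork": Decimal("1439512108.737364"),
    "parent exits": Decimal("1439512108.737434"),
    "child setsid": Decimal("1439512108.737441"),
    "child first read": Decimal("1439512108.737772"),
    "strace logs parent exit": Decimal("1439512108.740669"),
    "child last read": Decimal("1439512108.741423"),
}


def microseconds_between(earlier, later):
    """Exact difference between two listed events, in microseconds."""
    return int((EVENTS[later] - EVENTS[earlier]) * 1_000_000)


for a, b in [("parent exits", "child setsid"),
             ("child first read", "child last read"),
             ("strace logs parent exit", "child last read")]:
    print(f"{a} -> {b}: {microseconds_between(a, b)} us")
```

The last pair comes to 754 microseconds, which is the "under 1ms" gap between the logged exit and the final read; the child keeps reading for about 3.65 ms in total.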

The real fix is for rngd to open the device O_NOCTTY (I think ;-)
Comment 13 John Bowler 2015-08-14 05:32:08 UTC
Created attachment 408966 [details, diff]
Patch to add O_NOCTTY when opening the RNG device

This patch stops the open of the RNG device causing it to become the controlling terminal.  It prevents the observed problem and is probably the best fix, assuming the previous analysis is correct.
Comment 14 Ian Delaney (RETIRED) gentoo-dev 2015-08-18 01:15:12 UTC
I get notification already via the proxy-maint alias.
Comment 15 John Bowler 2015-08-18 01:30:14 UTC
I should add that I think this is a POSIX gray area; the child inherited a controlling terminal from its parent, changed its session ID (so became its own session master) then tried to read from the inherited controlling terminal.

If it had tried to open it itself it would probably (given the timings) have been able to do so and then been able to read.  It's entirely believable that Linus might have interpreted POSIX to mean that, in this case, a read from the inherited stream [3] should fail whereas a read from the new stream should succeed, even though they point to the same device.  That's consistent with the inverse behavior when a file is opened O_CREAT with mode 0; the opener can still write to the FD even though nothing else (even the same process) can open the file for write.
Comment 16 Göktürk Yüksek archtester gentoo-dev 2015-08-25 04:43:58 UTC
I believe the culprit is start-stop-daemon: it calls setsid() before execve() in the child process[1]. That should detach the process from the controlling terminal. Per POSIX, after setsid() the process shall have no controlling terminal[2]. Later on rngd opens TrueRNG as we can see in your strace logs. Without the 'O_NOCTTY' flag '/dev/TrueRNG' becomes the new controlling terminal. Then rngd goes on to call daemon(), which calls another setsid()[3]. Before the daemon() call, it makes an attempt to read from the entropy source, so I'm not sure if the second setsid() is relevant or not.

Regardless, I think open() needs to be called with 'O_NOCTTY' by rngd. I can't really be sure if that's the solution to the problem or not. I'll try to reproduce the problem using a serial port connection and confirm the fix.

[1] https://gitweb.gentoo.org/proj/openrc.git/tree/src/rc/start-stop-daemon.c?h=0.17&id=0c2e4eb3cd7935d375b74099a3a9a5fe519e6cab#n1281
[2] http://pubs.opengroup.org/onlinepubs/9699919799/functions/setsid.html
[3] https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=misc/daemon.c;hb=2d02fd07371bcd492c320cec649c6265787d794a
Comment 17 Göktürk Yüksek archtester gentoo-dev 2015-09-13 20:41:44 UTC
Pull request submitted: https://github.com/gentoo/gentoo/pull/92
Comment 18 Göktürk Yüksek archtester gentoo-dev 2015-09-14 18:23:11 UTC
Based on the comments of mgorny, a new pull request has been submitted: https://github.com/gentoo/gentoo/pull/95