Today, after upgrading to sys-apps/watchdog-5.14 and running /etc/init.d/watchdog restart, my machine rebooted multiple times. My /etc/watchdog.conf looked like this:

    max-load-1 = 512
    min-memory = 1
    allocatable-memory = 1
    watchdog-device = /dev/watchdog
    watchdog-timeout = 1000
    interval = 400
    realtime = yes
    priority = 1
    pidfile = /var/run/sshd.pid

It seems that the new watchdog-5.14 does not set watchdog-timeout to the requested 1000s, but instead caps it at a maximum of 254s, as can be seen here (luckily I have an ipmi watchdog):

    >ipmitool mc watchdog get
    Watchdog Timer Use:     SMS/OS (0x44)
    Watchdog Timer Is:      Started/Running
    Watchdog Timer Actions: Power Cycle (0x03)
    Pre-timeout interval:   0 seconds
    Timer Expiration Flags: 0x10
    Initial Countdown:      254 sec
    Present Countdown:      252 sec

It definitely worked with 1000s on watchdog-5.13-r1. A temporary workaround is to set interval to less than about half of 254s, so I have it at 100s now.
the change that added the limit of 254 happened here: http://sourceforge.net/p/watchdog/code/ci/12583e81eaa093dc1224df08c7de62541142c6c2/ although it's confusingly (wrongly?) listed as a "readability" commit.
later on, the limit was raised to 600 seconds: http://sourceforge.net/p/watchdog/code/ci/1eee507a1fb7eb6a13a11816ed999b0271f3c613/
either way, it's clearly wrong to silently do a min() on the timeout and leave the interval and other settings unadjusted. i've reported this upstream, so let's see what they have to say.
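To make the failure mode concrete, here is a minimal Python sketch of the behaviour described above. It is only a model, not the upstream C code: the names `clamp_timeout` and `survives` are hypothetical, and the 254s cap is the value observed in this report.

```python
# Illustrative model only; the real daemon is written in C and its
# internals are not reproduced here.
MAX_TIMEOUT = 254  # cap observed with watchdog-5.14, per this report


def clamp_timeout(requested: int, cap: int = MAX_TIMEOUT) -> int:
    """Return the timeout that reaches the device after a silent min()."""
    return min(requested, cap)


def survives(interval: int, effective_timeout: int) -> bool:
    """The timer must be reset before it expires, so interval < timeout."""
    return interval < effective_timeout


requested_timeout = 1000
interval = 400

effective = clamp_timeout(requested_timeout)
print(effective)                               # 254
print(survives(interval, requested_timeout))   # True: the config is consistent as written
print(survives(interval, effective))           # False: the clamped timer expires first
```

With the values from the report, the configuration is self-consistent, but the clamped timeout is shorter than the interval, so the hardware timer fires before the daemon writes to the device again.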
Let's see what they do. IMHO they should drop the max(). It seems it is not clear to them why someone would use large values of timeout and interval, so I'll try to explain my point of view; maybe it helps:
1. I don't want a small watchdog-timeout. I don't care if the server is down for 15-30 minutes; I can wait. What is important is that in the meantime I have a chance to see something on the KVM console. The default 60s gives me almost no time to do so, even if I realize immediately that the server is dead (which is seldom the case).
2. If watchdog-timeout is 1000s, I feel that the default interval of 1s is just a waste of electric power. Better to let the CPUs sleep. In theory it would be enough to set it to something like 998s, but in practice I have seen that a single interval is sometimes missed (e.g. under large load?), and then you get an unwanted reboot. So I ended up with a very stable formula: interval = watchdog-timeout/2.5
I meant "..drop the min()" of course :)
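The interval = watchdog-timeout/2.5 rule from the comment above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: `safe_interval` is a hypothetical helper, not part of the watchdog package.

```python
def safe_interval(timeout_s: int, margin: float = 2.5) -> int:
    """Interval derived from the timeout, rounded down to whole seconds.

    With margin = 2.5, two consecutive intervals add up to 0.8 * timeout,
    so one missed keep-alive still leaves time before the timer expires.
    """
    return max(1, int(timeout_s / margin))


print(safe_interval(1000))  # 400
print(safe_interval(254))   # 101
```

The second value matches the workaround reported earlier in this thread: with the timeout capped at 254s, an interval of about 100s keeps the same safety margin.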
the latest upstream git repo has removed the max limit, if you want to give it a try. the change looks pretty straightforward to me.