There are at least one race condition between the default crontab and /usr/sbin/run-crons. This comes from the find ${LOCKDIR} -name cron.${BASE} -exec rm {} \;, if the file gets rm'ed by one of the other entries in /etc/crontab between it being found and -exec then I get bogus emails (500 machines mailing me with this is irritating to say the least). Additionally there are other problems. For example, say I need to modify my system time back over the hour, then run-crons will rm that file and re-run those crons. This isn't a problem in most cases but it is deffinately bad. A much simpler, and better solution imho is to use the run-parts program as already suggested in two other bugs regarding run-crons. This is as simple as replacing 0 * * * * root rm -f /var/spool/cron/lastrun/cron.hourly 1 3 * * * root rm -f /var/spool/cron/lastrun/cron.daily 15 4 * * 6 root rm -f /var/spool/cron/lastrun/cron.weekly 30 5 1 * * root rm -f /var/spool/cron/lastrun/cron.monthly */10 * * * * root test -x /usr/sbin/run-crons && /usr/sbin/run-crons in /etc/crontab with 0 * * * * root /bin/run-parts /etc/cron.hourly 1 3 * * * root /bin/run-parts /etc/cron.daily 15 4 * * 6 root /bin/run-parts /etc/cron.weekly 30 5 1 * * root /bin/run-parts /etc/cron.monthly I suppose this lacks some locking, but even that is not foolproof in the existing run-crons, the trap line trap "rm -f ${LOCKFILE}" 0 1 2 3 15 can be replaced with trap "rm -f ${LOCKFILE}" EXIT to effectively cover all cases. One can also use the dotlockfile utility to create lock files: LOCKFILE=${LOCKDIR}/lock dotlockfile -l -r 2 -p "${LOCKFILE}" || exit $? trap "rm -f ${LOCKFILE}" EXIT This still has a small race-condition where the lock file can be created and we get killed, but it really is miniscule. This is what -p is for though, it does the pid-checking thing. Reproducible: Always Steps to Reproduce:
that would mean that every cron package would require debian-utils ... and as we've already seen, run-parts imposes obscure naming limitations
Fair enough. Then at least add a &>/dev/null onto the end of the find command to supress the obvious errors. Also, to emulate the not running crons twice effect only delete files that are more than three hours in the future. Why three hours? Well, from cron(8): "Local time changes of less than three hours, such as those caused by the start or end of Daylight Saving Time, are handled specially. This only applies to jobs that run at a specific time and jobs that are run with a granularity greater than one hour. Jobs that run more frequently are scheduled normally." It then proceeds to explain what happens with forward and backword movements as well as adjustments greater than three hours. As for "every cron would require debianutils", the following is provided by "qpkg -q debianutils": DEPENDED ON BY: app-portage/gentoolkit-0.2.0 sys-apps/portage-2.0.51.19 So I don't think that is a problem. As for the obscure naming restrictions: files that end in a tilde: These are generated by some editors as backup files. .bak: same thing. .dpkg-*: These are debian spesific and I can't see any reason why we should ever encounter them .pre_fcopy and .notslocate: not sure what these are for but I really can't see how we will ever encounter them. .disabled: this makes sense to ignore. As for the fact that they should be executable, that imho makes a great deal of sense. Anyhow, just a suggestion. Something to chew on.
read the run-parts manpage by default it will only respect files that match ^[a-zA-Z0-9_-]$ ... that means it ignores files that have a . or a space in them ... i'm not aware of cron enforcing such restrictions
Ooh ok. I must've been skimming again. That is not good. Ok. The question then becomes what effects does that race cause, well, there are three situations (taking the hourly crons as an example): 1. rm's the lockfile before find finds it. 2. rm's the lockfile after find found it but before -exec kicks in. 3. rm's the lockfile after find rm'ed it. No 1 is the desired behaviour. However, as I can note from some of the mails that get sent to me by hourly crons they sometimes runs on the hour, and sometimes 10 minutes after the hour. So rather than fix the race condition, here is an alternative that avoids it. Don't rm the files when the cron is intended to be run, rm it one minute before. For example, rm the file for hourly crons 1 minute before the hour, as in: 59 * * * * root rm -f /var/spool/cron/lastrun/cron.hourly Since run-crons is only run every 10 minutes it will be run on the hour, and since there is a minute between the two being executed there is no way (except perhaps on an extremely busy machine that takes more than a minute to schedule a task) that these two will ever interfere. I do note that daily and weekly already has something to this effect, and on monthly 10 minutes isn't going to make that big a difference, but I'd suggest changing the 30 for monthly to 29 in any case. Wasn't the granularity of run-crons /15 at some point?
Not fixed at all, but seemingly no one is going to actually do something about this, in spite of some dead-simple solutions provided.
closing the bug wont make it anymore "fixed"
You're absolutely correct, it won't. However, it's been open for three years. So that really suggests a WONTFIX.
i will put in this for all system crontabs installed by the various crons... --- crontab 26 Aug 2007 20:03:07 -0000 1.2 +++ crontab 13 Apr 2009 20:37:32 -0000 @@ -11,8 +11,8 @@ HOME=/ # check scripts in cron.hourly, cron.daily, cron.weekly and cron.monthly -0 * * * * root rm -f /var/spool/cron/lastrun/cron.hourly -1 3 * * * root rm -f /var/spool/cron/lastrun/cron.daily -15 4 * * 6 root rm -f /var/spool/cron/lastrun/cron.weekly -30 5 1 * * root rm -f /var/spool/cron/lastrun/cron.monthly +59 * * * * root rm -f /var/spool/cron/lastrun/cron.hourly +9 3 * * * root rm -f /var/spool/cron/lastrun/cron.daily +19 4 * * 6 root rm -f /var/spool/cron/lastrun/cron.weekly +29 5 1 * * root rm -f /var/spool/cron/lastrun/cron.monthly */10 * * * * root test -x /usr/sbin/run-crons && /usr/sbin/run-crons that should make the mentioned race conditions a lot less likely. objections?
No objections from my side. Chances that the rm isn't scheduled in a minute is next to none (and if it doesn't get scheduled you've got much bigger problems than missing a cron).
might as well do it right the first time ... as such, getting rid of the notion of "file exists and thus is a lock" is where we should start. there is a `flock` utility we should leverage which is installed by util-linux and that gives us perfect control against races. and it pushes the stale lock problem off into the kernel -- when the process dies/exits, the kernel releases the file lock.
changed default system crontab for vixie-cron, bcron, fcron, dcron and cronie. spanky: looking forward to the flock solution, though.
(In reply to comment #10) > as such, getting rid of the notion of "file exists and thus is a lock" is where > we should start. there is a `flock` utility we should leverage which is > installed by util-linux and that gives us perfect control against races. and > it pushes the stale lock problem off into the kernel -- when the process > dies/exits, the kernel releases the file lock. No, the file in this case is not a lock file, it's an indicator of when last the crons has run. I originally recommended a run-parts solution, but alas, there were two GOOD reasons for NOT using it. One is the naming limitations, and the other is described in the next paragraph. The idea with the file is that we "touch" the file whenever the crons is run, and rm it at certain times, allowing the fact that it doesn't exist to allow the crons to run again. This has the advantage that if cron misses a run (due to not being running or the server being off) it'll run at a later stage (ok, the rm didn't run either in this case, but this is what lines 65 thru 87 in /usr/sbin/run-crons takes care of). It's that piece of code that was racing with the rm's at fixed times. No amount of flock will stop this race, all it'll do is clear out the error message as produced by line 86 of the script (already taken care of with &>/dev/null, could also simply have passed -f to the rm). What Thilo has done is the correct solution, imho. I'd recommend resolving this bug at this time.
jaco - thanks for sharing your thoughts. your analysis is spot on. hence i am resolving this bug at this time. the flock solution is still needed though, due to all the other races in run-crons. see also bug #169449. thanks for your contribution.
Thilo: your changes to dcron broke it. dcron does not support the random extensions that vcron and such have implemented; it only supports POSIX. http://sources.gentoo.org/sys-process/dcron/files/crontab?r1=1.2&r2=1.3