Sometimes when I send mail to users on my server, I receive an error message saying the mail cannot be delivered because the user is over quota, although quotas are not activated and the user does receive the mail.
Any idea why?
Portage 2.0.50-r8 (default-x86-1.4, gcc-3.3.3, glibc-188.8.131.5240420-r0, 2.6.7-gentoo-r8)
System uname: 2.6.7-gentoo-r8 i686 Pentium III (Coppermine)
Gentoo Base System version 1.5.1
CFLAGS="-march=pentium3 -O3 -pipe -fomit-frame-pointer -fprefetch-loop-arrays -ffast-math -fforce-addr -falign-functions=4 -mfpmath=sse"
CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3/share/config /usr/share/config /var/qmail/alias /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-march=pentium3 -O3 -pipe -fomit-frame-pointer -fprefetch-loop-arrays -ffast-math -fforce-addr -falign-functions=4 -mfpmath=sse"
FEATURES="autoaddcvs ccache fixpackages sandbox sfperms userpriv usersandbox"
USE="acl adns apache2 berkdb crypt curl fam flash gd gdbm gif gpm imap innodb java jpeg ldap libg++ libwww maildir mcal memlimit ncurses nls noauthcram nptl oss pam pdflib pg-hier pg-intdatetime pg-vacuumdelay pic png python readline samba slang slp spell sse ssl tcpd tiff truetype x86 xml xml2 zlib"
mail-mta/qmail-1.03-r15 +noauthcram -notlsbeforeauth +ssl
C'mon ppl... please, look at this bug.
I'd suggest you go and look at the qmail source yourself, find the snippet of code that generates that message, and see why it treats your users differently.
Unfortunately I can't figure out what is wrong with it... and it seems qmail is unconditionally patched with maildir++ support.
I didn't find a way to control the maildir quota also... :(
I got the same problem today. Visually inspecting the maildir++ patch, I think I found a problem, although this might not be the cause of the misbehavior.
It's maildirquota.c, and is the classic "forgot to add one for the null character":
static int doaddquota(const char *dir, int maildirsize_fd,
const char *quota_type, long maildirsize_size, int maildirsize_cnt,
struct stat stat_buf;
} u; /* Scrooge */
struct iovec iov;
struct iovec *p;
if ( maildirsize_fd < 0)
+--- no room for terminating '\0' character
if (!newname2) return (-1);
strcat(strcpy(newname2, dir), "/maildirfolder");
+--- writes '\0' outside the buffer
if (stat(newname2, &u.stat_buf) == 0)
strcat(strcpy(newname2, dir), "/..");
n=doaddquota(newname2, maildirsize_fd, quota_type,
sizeof("1234") == 5 fyi
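To spell out the sizeof point: the allocation line is missing from the excerpt above, so this is only a sketch under the assumption that the buffer is sized as strlen(dir) + sizeof("/maildirfolder"). The helper name join_maildirfolder() is made up for illustration and is not part of the patch. If the allocation really looks like this, the terminating '\0' fits, because sizeof on a string literal already counts it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not taken from the maildir++ patch:
   builds "<dir>/maildirfolder" in a freshly allocated buffer. */
char *join_maildirfolder(const char *dir)
{
    /* sizeof on a string literal includes the terminating '\0',
       so sizeof("/maildirfolder") is 15 while strlen() of it is 14. */
    size_t need = strlen(dir) + sizeof("/maildirfolder");
    char *buf = malloc(need);

    if (!buf) return NULL;
    strcat(strcpy(buf, dir), "/maildirfolder");
    assert(strlen(buf) + 1 == need);   /* the '\0' fits exactly */
    return buf;                        /* caller must free() */
}
```

The "/.." case is shorter than "/maildirfolder", so reusing the same buffer for it is fine under the same assumption.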
Yip, I thought that was off by 1 too until I read closely, but the maildir++ patch is definitely the culprit. Afaik this is only used in conjunction with vpopmail ...? Could we therefore not use the vpopmail USE flag to decide whether the patch needs to be applied or not? Otherwise we need to fix the bug, which I suspect is an uninitialised variable of some sort.
There are several other things besides vpopmail that do use the maildir++ quotas - hence we need to keep the patch there.
Looking at the patch myself, I can't find the problem, beyond suspecting that if the maildir was created by hand, then the quota files may not be in place.
(it should always be made with maildirmake).
I'm still getting the same error. Maybe it's because I have maildrop delivering mail into the same maildir as qmail.
Slightly OT, but I think there should be a "more vanilla" option for qmail, because code like this patch is breaking Bernstein's rule of avoiding the C library as unsafe. I say "more vanilla" because maybe you'd include a minor patch here and there for the stuff like synchronous I/O that I vaguely remember from installing vanilla qmail on Slackware boxes.
Correction: I seem to have getmail, not maildrop, delivering mail to the same maildir as qmail. I think maybe it used to be maildrop but then getmail started supporting maildirs. Maildrop's manpage says something about quotas but if getmail does not support them....
More feedback: It's not the maildir++ patch. The maildir++ patch merely aggravates the real issue, which I'm still trying to locate. I've tracked the underlying issue to qmail-local.c in the maildir_child() function. The reason for getting user_over_quota is that the maildir++ patch alters maildir() (also in qmail-local.c) to "bounce" the message with a temporary error, even though it claims it was delivered. In this way no mail is lost, but the recipient of the new message gets confused and it irritates everyone. The message is not delivered. Backing out maildir++ causes maildir() to return with 111, causing a temporary error and re-delivery. The possible list of underlying problems seems to be:
1) It cannot chdir() to the maildir.
- this is unlikely, as the mail would not be delivered in duplicate then.
2) The maildir_getquota() and/or user_over_maildirquote() functions are bust.
- unlikely, as the real problem still occurs when maildir++ is backed out.
3) There is a loop that generates the tmp/ filename. If this executes 3
times with no success.
- I doubt this is it, as the message is at this point not yet written to
disc and as such won't result in a duplicate message.
4) If it cannot create the temporary file.
- same argument.
5) The alarm() goes off. Unlikely - it's set to 19 hours, and I get duplicates
about every 5 mins if/when I get them.
6) Any of the code that writes to the tmp/ file fails. Unlikely.
7) or most likely imho the following snippet of code:
if (link(fntmptph,fnnewtph) == -1) goto fail;
if ((fd = open(fnnewtph, O_RDONLY)) < 0 ||
fsync(fd) < 0 || close(fd) < 0) goto fail;
that imho contains a race condition. If the link succeeds, then the
message is delivered. The open is there to double check that the link
did in fact succeed, but if courier-imap has since moved the file to cur/
then the second if will fail, causing temporary delivery failure.
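The race in 7) can be reproduced in a single process. This is only a sketch, not qmail code: the rename() stands in for courier-imap moving the message to cur/, all three names live in one scratch directory under /tmp instead of a real maildir, and the function name is made up.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical demo. Returns the errno from the check open() that follows
   link(), 0 if that open() succeeded, or -1 if the setup itself failed
   (scratch files are not cleaned up on setup failure). */
int open_after_move_errno(void)
{
    char base[] = "/tmp/mdraceXXXXXX";
    char tmpf[64], newf[64], curf[64];
    int fd, n, err = 0;

    if (!mkdtemp(base)) return -1;
    n = snprintf(tmpf, sizeof tmpf, "%s/tmp.msg", base);
    assert(n > 0 && (size_t)n < sizeof tmpf);    /* path was not truncated */
    snprintf(newf, sizeof newf, "%s/new.msg", base);
    snprintf(curf, sizeof curf, "%s/cur.msg", base);

    fd = open(tmpf, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd == -1 || close(fd) == -1) return -1;

    if (link(tmpf, newf) == -1) return -1;       /* message is delivered here */
    if (rename(newf, curf) == -1) return -1;     /* stand-in for the IMAP server */

    fd = open(newf, O_RDONLY);                   /* the check added after link() */
    if (fd == -1) err = errno; else close(fd);

    unlink(tmpf);
    unlink(curf);
    rmdir(base);
    return err;
}
```

On a normal local filesystem the open() fails with ENOENT even though the link() succeeded, which is exactly the situation that sends the delivery down the fail path.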
Right, posted too soon. The second of those statements is added by the qmail-link-sync patch. This was introduced at -r10 with no reference to it in the Changelog.
Is there a reason for having this patch in there? The ebuild says something about lack of synchronous link(), but I cannot see how that if statement "fixes" this particular problem. From the man page:
On NFS file systems, the return code may be wrong in case the NFS
server performs the link creation and dies before it can say so. Use
stat(2) to find out if the link got created.
This does not, however, account for the fact that the file might be moved by the time the check is performed. Looking at the link-sync patch, it in fact introduces various race conditions all over the place. In addition, it should be using stat() as explained in the man page (as above).
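For what it's worth, here is one way a stat()-based check could look. It is an assumption on my part, not something the patch or qmail does: it stats the tmp/ name, which the delivering process still owns, and looks at the hard link count. A rename of the new/ name by another process keeps the inode, so the count stays at 2. The function names and the scratch paths are made up.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* 1 if tmp_path has at least one extra hard link, 0 if not,
   -1 if stat() itself fails. */
int link_confirmed(const char *tmp_path)
{
    struct stat st;

    assert(tmp_path != NULL);
    if (stat(tmp_path, &st) == -1) return -1;
    return st.st_nlink >= 2;
}

/* Hypothetical demo: link, let "someone else" move the new name, then check. */
int demo_check_after_move(void)
{
    char base[] = "/tmp/mdstatXXXXXX";
    char tmpf[64], newf[64], curf[64];
    int fd, ok;

    if (!mkdtemp(base)) return -1;
    snprintf(tmpf, sizeof tmpf, "%s/tmp.msg", base);
    snprintf(newf, sizeof newf, "%s/new.msg", base);
    snprintf(curf, sizeof curf, "%s/cur.msg", base);

    fd = open(tmpf, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd == -1 || close(fd) == -1) return -1;
    if (link(tmpf, newf) == -1 || rename(newf, curf) == -1) return -1;

    ok = link_confirmed(tmpf);      /* still sees the second link */

    unlink(tmpf);
    unlink(curf);
    rmdir(base);
    return ok;
}
```

This still has a hole: if the reader deletes the message outright before the check runs, the count drops back to 1. So it narrows the race rather than closing it.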
I'm backing out link-sync now. I'll report back later.
Ok, it's been two working days now plus a weekend and still no more duplicate messages, so I can say with a very high degree of confidence that the link-sync patch was the culprit. Could we please have it backed out?
Right, no comment or feedback yet. I've been running for ages now with the patch backed out with no problems - the only one being a backlog of VERP-alised mailman bounces which don't get handled by the qmail2mailman script, but that is the topic of another bug.
I repeat - is there *any* reason to have link-sync patch in qmail? If not, *please* back it out - it's causing trouble.
"A safe filesystem for the queue. qmail's reliability guarantee requires that the queue reside on a filesystem with traditional BSD FFS semantics. Most modern local filesystems meet these requirements with one important exception: the link() system call is often asynchronous--meaning that the results of the link() operation might not have been written to disk when the link() call returns. Bruce Guenter's syncdir library can be used to work around this problem. See syncdir in the Related Packages appendix for more information."
A good discussion about it is here:
-r10 of qmail predates my maintainership of Qmail.
The patch is briefly described in the ChangeLog as well.
01 Feb 2003; Nick Hadaway <firstname.lastname@example.org> qmail-1.03-r10.ebuild,
files/digest-qmail-1.03-r10, files/tls.patch, files/tls.txt :
..., ext/reiserfs non-synchronus link() fix, ...
syncdir is planned as the replacement for the link-sync patch, but that is some time away (we've been working on -r16 as time permits since last august).
I've found yet another thing that causes the message.
It seems that if I have the folder open in Thunderbird, the message appears.
This only happens if I am using courier-imapd with IMAP_ENHANCEDIDLE=1 and famd
It does happen without the link-sync patch
Yes, probably because only mozilla mail/thunderbird knows how to use the imap IDLE command properly. I have however seen one or two other clients cause the same thing.
The reason why you would see it more often with famd enabled is the faster reaction time on courier-imapd's side. It actually moves the file before the open() call that follows the link().
As to still seeing duplicate messages after link-sync is backed out, I've yet to see another duplicate message. I've seen a few seeming duplicates, but looking at the headers these were caused by an exim mail server earlier in the transport process. If you can offer a better explanation as to why there can still be duplicate messages after the link-sync patch has been removed, please do so.
As an explanation of why one would like to use the enhanced IDLE, well, it dropped the load average on my mail server from 0.35 to 0.1 - a huge improvement.
Is there a way to stop the erroneous User over quota messages from being sent to the sender when I use the enhanced IDLE functionality?
Yea, back out the link-sync patch. Do this by commenting out the appropriate line in the ebuild and then remerge qmail.
I think I took out the line that downloads the patch the first time around and not the one about patching qmail.
It works nicely without that patch.
I haven't seen a User Over Quota in a while, but maybe that's due to upgrades in getmail. I've got both getmail and Gentoo qmail putting mail in the same maildir.
I am mildly amused this issue has arisen, because I used vanilla qmail for a few years before using the Gentoo ebuild, and I used the syncdir library. Why do it any differently? Using a patch to qmail instead of an established general purpose library probably was a bad idea, and this is empirical evidence of it. It's no longer just a matter of opinion. :)
I note that this patch has still not been backed out in -r15. We've explained
now that it's problematic, we've illustrated the race condition, and still the
I just can't see any reason why anybody would want the patch? It is one big bug
to put it nicely, and having to manually back out a patch every time qmail gets
a new revision in portage (luckily not often) is annoying to say the least, and
what about other users that won't be able to track the problem?
Additionally the error message (User over quota) is misleading to say the least!
I beg again: Please back it out!
Isn't it possible to fix the patch?
No. The patch _is_ the problem. All it does is add three open() statements to
qmail to open the file that was just link()'ed, and then close() it again
after fsync()'ing it. All this supposedly forces all modifications to the
filesystem to be flushed to disk.
There is however a race condition between the link() and open() calls. If
another process unlink()'s it then the open() fails, causing qmail to think that
there is a temporary error, causing the message to be re-delivered. The only
way it will get unlink()'ed is if another process has already link()'ed it to
the cur/ or tmp/ directory (assuming all processes are playing by the rules).
If you really want to fix it, then remove the die() parts of it. Oops, but that
is the whole point of the patch - to catch, uhrm, no wait, I don't know what the
point of this patch is. I can't deduce it. It doesn't add _anything_ to qmail
but extra code to make it slower. Yes, _three_ extra system calls per message
Consider this, we _know_ the file was linked and written to tmp/ right? Ok, so
it is already on disk (unless it's in a buffer). So we want to move it to new/
where our mail client (or imap/pop3 server) can get it. So first we link() it
to new/. At this point in time it's (presumably) in both new/ and tmp/, so we
unlink it from tmp/ and it's delivered.
At this point in time it's not yet removed from the qmail queue! This only
happens once the delivery process returns success. So permitting that the
kernel doesn't switch write's to disk around (this is what the journal is for,
amongst others) the unlink() from the queue will also only happen after the
link() to new/ has happened.
The only thing that could possibly improve the situation is performing an fsync()
on the tmp/ file just before closing it after writing its contents. Oh wait.
This is already in the vanilla qmail (line 126 thru 128 of qmail-local.c).
Right, so at this point in time (at the risk of repeating myself), we know that
tmp/?? is committed to disk. Now the sequence of system calls (probably missing
quite a few) will be something like:
Right, there are a few unlinks from the queue, but these don't really worry me.
Once the file has been sync()'ed to tmp/ we're good. From there the order of
the link() and unlink() system calls ensure that the message will not get lost.
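The ordering argument above can be written down as a short sketch. This is not the qmail source: deliver() and its parameters are made-up names, and the real maildir_child() does more (filename generation, alarm(), buffered writes). It only shows the order write, fsync(), close(), link(), unlink().

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch of the delivery order discussed above.
   Returns 0 on success, -1 on any failure. */
int deliver(const char *tmp_path, const char *new_path,
            const char *msg, size_t len)
{
    int fd;

    assert(tmp_path && new_path && msg);

    fd = open(tmp_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd == -1) return -1;

    if (write(fd, msg, len) != (ssize_t)len) goto fail;
    if (fsync(fd) == -1) goto fail;        /* contents are on disk from here on */
    if (close(fd) == -1) { fd = -1; goto fail; }
    fd = -1;

    /* The message becomes visible under new/ before the tmp/ name goes away,
       so at every moment at least one name refers to it. */
    if (link(tmp_path, new_path) == -1) goto fail;
    unlink(tmp_path);
    return 0;

fail:
    if (fd != -1) close(fd);
    unlink(tmp_path);
    return -1;
}
```

Only after a function like this reports success would the caller remove the message from the queue, which is the ordering the argument relies on.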
The critical function that explains _why_ we get "user over quota" is the
maildir() function. You will note that it fork()'s - this is in case of lockup,
so that the child process will get kill()'ed by the SIGALRM set in maildir_child()
without taking down the entire delivery process. If anything happens other than
the process exiting normally with exit code 0, we have a temporary error. Now one of
the other patches adds quota support, which adds to the switch() at the bottom
of maildir(), and the exit codes used by the quota checking system and link-sync
happen to be the same. So one can avoid the "User over quota" message and
simply get "Temporary error on maildir delivery. (#4.3.0)", but this will be
equally vague since delivery has actually succeeded! Right, so without user
quotas we would have received the vague error; with them we get the wrong error.
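A sketch of the exit code collision described above. The numeric values and the enum names are assumptions for illustration only; the real ones are in the patched qmail-local.c. The point is just that when the "goto fail" path and the quota check exit with the same code, the parent's switch() cannot tell them apart.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical child exit codes, chosen only to show the collision. */
enum {
    CHILD_OK         = 0,
    CHILD_OVER_QUOTA = 1,   /* assumed: exit code used by the quota check */
    CHILD_FAIL       = 1,   /* assumed: exit code used by the "goto fail" path */
    CHILD_NO_CHDIR   = 2
};

/* What the parent reports for a given child exit code. */
const char *parent_message(int exitcode)
{
    assert(exitcode >= 0);
    switch (exitcode) {
    case CHILD_OK:         return "delivered";
    case CHILD_OVER_QUOTA: return "User over quota";
    case CHILD_NO_CHDIR:   return "Unable to chdir to maildir";
    default:               return "Temporary error on maildir delivery. (#4.3.0)";
    }
}
```

With these assumed values, parent_message(CHILD_FAIL) comes out as "User over quota", while any code without its own case falls through to the generic temporary error.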
Thanks for the very detailed description of the problem. I'll look into it this
week (have other stuff to do today).
I've commented out the link-sync- and famd-patches. The famd patch modified code
from the link-sync patch and did nothing else. Can you test the newest -r16
I just typed an entire response for you in Afrikaans before realising I better
The famd-notify patch just ignores the error that would have caused the problem
by only reporting error _if_ we actually managed to open the file. Will
probably achieve the same thing but as explained earlier the link-sync patch
isn't really required in the first place.
I've been running without the link-sync patch since about a week after it first
appeared in stable (in -r13 iirc) without any problems, so I reckon you can consider
it tested. If not, I could possibly merge -r16 on my production server to see
what happens, but imho running unstable on production servers is in general not
a good idea. Non-production servers aren't going to put enough pressure on the
system to properly test it.
Yeah, I would have had problems fully understanding Afrikaans. It's similar to
German in some parts, but not enough for me to understand. :-)
About putting qmail-1.03-r16 on production servers: I'm running it on at least
two servers I call "productive" and on two others for non-productive (private)
purposes. Because I noticed fewer problems with -r16 than with -r15
(especially on non-x86 architectures, which make up two of the four servers), I
consider it already stable.
It would be nice if you could at least give it a try and, after verifying that the
problem doesn't exist any more, please close this bug.
Jaco: while I could have read the Afrikaans, it's been ~6 years since I moved out
of South Africa, so my translation of it would be a little rusty ;-).
It's gone. My users and myself are happy once more.
Thanks for testing.
Pleasure. As for closing it as requested in comment 27 - I don't seem to
have that option (I'm not the original reporter, I'm merely on the CC list). I
only have "Leave as CLOSED TEST-REQUESTED".
Yes, I've closed the bug. There was a "Mark bug as CLOSED" option before.