OK, I think this is big enough for a bug of its own. When pbs_server starts, it tries to read the .keep file because it thinks it is a queue file. This shouldn't be a problem -- the server fails with a 'que_recov: read error', which _should_ just be an annoying error message. However, I had intermittent pbs_server crashes when I was modifying my queues (mainly when trying to delete recovered jobs with qdel). A search through the code seems to indicate that there's an error in que_recov: when it allocates memory to load the queue data, it adds that structure to a linked list, but on error it just frees the structure and leaves the linked list pointing at invalid memory. I've sent a message upstream to find out if this is actually a bug.
Created attachment 98250: Patch to fix que_recov
The following patch fixed the issue on my system -- I changed the direct free() calls in que_recov to que_free() (the counterpart of que_alloc(), which is how the structure is created).
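For anyone following along, here is a minimal sketch of the failure mode and the que_free() style fix described above. It is not the torque source: the `demo_` names, the structure layout, and the `read_ok` flag are all hypothetical stand-ins, assuming only what this bug says (the allocator links the new structure into a list, so the error path has to unlink it before freeing).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the server's queue structure. */
struct demo_queue {
    struct demo_queue *prev;
    struct demo_queue *next;
    char name[16];
};

/* Global list of all known queues (illustrative only). */
static struct demo_queue *demo_all_queues = NULL;

/* Allocate a queue and link it at the head of the global list. */
struct demo_queue *demo_que_alloc(const char *name)
{
    struct demo_queue *q = calloc(1, sizeof(*q));
    if (q == NULL)
        return NULL;
    strncpy(q->name, name, sizeof(q->name) - 1);
    q->next = demo_all_queues;
    if (demo_all_queues != NULL)
        demo_all_queues->prev = q;
    demo_all_queues = q;
    return q;
}

/* Counterpart of demo_que_alloc(): unlink from the list first, then free. */
void demo_que_free(struct demo_queue *q)
{
    assert(q != NULL);
    if (q->prev != NULL)
        q->prev->next = q->next;
    else
        demo_all_queues = q->next;
    if (q->next != NULL)
        q->next->prev = q->prev;
    free(q);
}

/* Recovery sketch: read_ok models whether reading the queue file worked. */
struct demo_queue *demo_que_recov(const char *name, int read_ok)
{
    struct demo_queue *q = demo_que_alloc(name);
    if (q == NULL)
        return NULL;
    if (!read_ok) {
        /* A bare free(q) here would leave the list pointing at freed memory. */
        demo_que_free(q);
        return NULL;
    }
    return q;
}

/* Walk the list and count entries; only safe if every node is still valid. */
int demo_queue_count(void)
{
    int n = 0;
    for (struct demo_queue *q = demo_all_queues; q != NULL; q = q->next)
        n++;
    return n;
}
```

With the unlinking free in place, a failed recovery followed by a successful one leaves the list containing only valid entries, so a later walk over the queues no longer touches freed memory.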
Please reopen when you get a resolution from the upstream guys. I try to avoid adding patches to our stuff until upstream's looked them over, in cases where upstream is responsive.
Created attachment 98277: upstream patch to skip dot-filenames
Upstream hasn't responded on the memory freeing issue yet, but they did patch the load function to skip dot-filenames (as I mentioned in bug 149226), which is what caused the issue for me in the first place. I attached the patch they provided for convenience.
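To illustrate the idea of skipping dot-filenames, here is a rough sketch. It is not the upstream patch itself: `demo_skip_entry` and `demo_count_queue_files` are hypothetical names, and the sketch only assumes that the loader iterates over a directory with the standard POSIX `opendir`/`readdir` calls and hands each entry on to queue recovery.

```c
#include <dirent.h>
#include <stddef.h>

/* Return 1 if a directory entry name should be skipped, 0 otherwise. */
int demo_skip_entry(const char *name)
{
    return name == NULL || name[0] == '.';
}

/*
 * Count the entries in dir_path that would be handed on to queue recovery.
 * Returns -1 if the directory cannot be opened.
 */
int demo_count_queue_files(const char *dir_path)
{
    DIR *dir = opendir(dir_path);
    struct dirent *ent;
    int n = 0;

    if (dir == NULL)
        return -1;

    while ((ent = readdir(dir)) != NULL) {
        /* Skips ".", "..", and hidden files such as ".keep". */
        if (demo_skip_entry(ent->d_name))
            continue;
        n++;
    }

    closedir(dir);
    return n;
}
```

A single leading-dot check covers the `.` and `..` entries as well as the .keep file, so the loader never tries to parse any of them as queue data.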
Ian, thanks for your quick response and effort on this! For reference, http://www.clusterresources.com/pipermail/torquedev/2006-September/000314.html is the post. I added the dot-file skip patch to 2.1.2-r2 since the problem's likely to affect many people.
The 'Patch to fix que_recov' above has been added upstream now too. http://www.clusterresources.com/pipermail/torquedev/2006-September/000316.html I guess we can close this one?
Is there any reason for us to add the original patch, now that we've got the second one?
Depends -- if que_recov() fails on one queue file (the queue file couldn't be read, or the file descriptor is bad) but then succeeds on another, memory corruption will probably occur and pbs_server will crash. But, as mentioned upstream, if this happens then you have bigger problems. And with the second patch (skipping dot-filenames) already applied, there isn't any reason for loading a queue file to fail, assuming nobody tampers with the contents of server_priv/queues.
I suppose we might as well add it, but I won't do a revision bump this time.
Applied the que_recov patch as well. Thanks!