Bug 149327 - sys-cluster/torque-2.1.2-r1 may crash due to que_recov reading .keep
Summary: sys-cluster/torque-2.1.2-r1 may crash due to que_recov reading .keep
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Unspecified
Hardware: All Linux
Importance: High normal
Assignee: Gentoo Cluster Team
URL:
Whiteboard:
Keywords: Inclusion
Depends on:
Blocks:
 
Reported: 2006-09-27 10:58 UTC by Ian Stakenvicius
Modified: 2010-09-10 19:00 UTC
0 users

See Also:
Package list:
Runtime testing required: ---


Attachments
Patch to fix que_recov (torque-queue-recov-freemem-fix.patch,579 bytes, text/plain)
2006-09-27 11:02 UTC, Ian Stakenvicius
upstream patch to skip dot-filenames (torque-skip-dotnames.patch,262 bytes, text/plain)
2006-09-27 22:24 UTC, Ian Stakenvicius

Description Ian Stakenvicius 2006-09-27 10:58:03 UTC
OK, I think this is big enough for a bug of its own. When pbs_server starts, it tries to read the .keep file because it thinks it is a queue file. This shouldn't be a problem: the server will report a 'que_recov: read error', which _should_ just be an annoying error message.

However, I had intermittent pbs_server crashes when I was modifying my queues (mainly when trying to delete recovered jobs with qdel).

A search through the code seems to indicate an error in que_recov: when it allocates memory to load the queue data, it adds that structure to a linked list, but on error it just frees the structure and leaves the linked list pointing at invalid memory.

I've sent a message upstream to find out if this is actually a bug.
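
To make the suspected failure mode concrete, here is a minimal sketch in C. It is not TORQUE code: the `demo_queue` structure, the `demo_all_queues` list head, and every `demo_*` function are hypothetical stand-ins, and the only thing taken from the report is the pattern (allocation links the record into a list, and the error path frees it without unlinking).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queue record; NOT TORQUE's real structure. */
struct demo_queue {
    struct demo_queue *next;
};

/* Hypothetical head of the global list of known queues. */
static struct demo_queue *demo_all_queues = NULL;

/* Allocate a record and link it into the global list right away,
 * which is the behaviour the report attributes to que_alloc(). */
struct demo_queue *demo_que_alloc(void)
{
    struct demo_queue *q = calloc(1, sizeof *q);
    if (q == NULL)
        return NULL;
    q->next = demo_all_queues;
    demo_all_queues = q;
    return q;
}

/* Return 1 if q is reachable from the list head, 0 otherwise. */
int demo_is_linked(const struct demo_queue *q)
{
    const struct demo_queue *p;
    for (p = demo_all_queues; p != NULL; p = p->next)
        if (p == q)
            return 1;
    return 0;
}

/* Simulated recovery that hits a read error. It returns the record
 * WITHOUT freeing it so the caller can inspect the list state.
 * The pattern described in the report would call free(q) here,
 * while demo_all_queues still points at q. */
struct demo_queue *demo_recov_read_error(void)
{
    struct demo_queue *q = demo_que_alloc();
    if (q == NULL)
        return NULL;
    /* ... pretend reading the queue file failed at this point ... */
    assert(demo_is_linked(q)); /* the record is already in the list */
    return q;
}
```

The sketch deliberately stops short of calling `free()` on the linked record, since touching the list afterwards would be undefined behaviour. That later traversal is the kind of access that could explain the intermittent crashes.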
Comment 1 Ian Stakenvicius 2006-09-27 11:02:04 UTC
Created attachment 98250
Patch to fix que_recov

The following patch fixed the issue on my system: I changed the direct free() calls in que_recov to que_free() (the counterpart of que_alloc(), which is how the structure is created).
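
As a rough illustration of why a dedicated free routine helps, the sketch below pairs an allocator that links a record into a list with a counterpart that unlinks it before freeing. This is not the attached patch and not TORQUE's implementation: `fix_queue`, `fix_all_queues`, and the `fix_*` functions are hypothetical names, and the real que_free() may do more than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queue record; NOT TORQUE's real structure. */
struct fix_queue {
    struct fix_queue *next;
};

/* Hypothetical head of the global list of known queues. */
static struct fix_queue *fix_all_queues = NULL;

/* Allocate a record and push it onto the front of the list. */
struct fix_queue *fix_que_alloc(void)
{
    struct fix_queue *q = calloc(1, sizeof *q);
    if (q == NULL)
        return NULL;
    q->next = fix_all_queues;
    fix_all_queues = q;
    return q;
}

/* Counterpart of fix_que_alloc(): unlink the record from the list
 * first, then release its memory, so no list pointer dangles. */
void fix_que_free(struct fix_queue *q)
{
    struct fix_queue **link;

    assert(q != NULL);
    for (link = &fix_all_queues; *link != NULL; link = &(*link)->next) {
        if (*link == q) {
            *link = q->next; /* bypass q in the chain */
            break;
        }
    }
    free(q);
}

/* Count the records currently reachable from the list head. */
size_t fix_list_length(void)
{
    size_t n = 0;
    const struct fix_queue *p;
    for (p = fix_all_queues; p != NULL; p = p->next)
        n++;
    return n;
}
```

With this pairing, an error path that calls the free counterpart instead of plain `free()` leaves the list consistent, so a later successful load or a traversal only sees live records.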
Comment 2 Donnie Berkholz (RETIRED) gentoo-dev 2006-09-27 11:21:16 UTC
Please reopen when you get a resolution from the upstream guys. I try to avoid adding patches to our stuff until upstream's looked them over, in cases where upstream is responsive.
Comment 3 Ian Stakenvicius 2006-09-27 22:24:10 UTC
Created attachment 98277
upstream patch to skip dot-filenames

Upstream hasn't responded on the memory-freeing issue yet, but they did patch the load function to skip dot-filenames (as I mentioned in bug 149226), which is what caused the issue for me in the first place.

I attached the patch they provided for convenience.
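
For readers without the attachment, the general idea of skipping dot-filenames during a directory scan can be sketched as follows. This is an assumption-laden illustration, not the upstream patch: `is_dot_name` and `count_candidate_queue_files` are hypothetical helpers, and the sketch only counts entries rather than loading queues.

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>

/* Return nonzero if the name starts with '.', which covers ".", "..",
 * and placeholder files such as ".keep". */
int is_dot_name(const char *name)
{
    assert(name != NULL);
    return name[0] == '.';
}

/* Count the entries in dir_path that a loader would treat as queue
 * files once dot-filenames are skipped. Returns -1 if the directory
 * cannot be opened. Hypothetical helper for illustration only. */
int count_candidate_queue_files(const char *dir_path)
{
    DIR *dir = opendir(dir_path);
    struct dirent *entry;
    int count = 0;

    if (dir == NULL)
        return -1;
    while ((entry = readdir(dir)) != NULL) {
        if (is_dot_name(entry->d_name))
            continue; /* never hand .keep and friends to the loader */
        count++;
    }
    closedir(dir);
    return count;
}
```

Filtering at the scan step means the recovery routine is never asked to parse `.keep`, so the read error and the faulty cleanup path after it are not reached in the common case.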
Comment 4 Donnie Berkholz (RETIRED) gentoo-dev 2006-09-27 22:48:01 UTC
Ian, thanks for your quick response and effort on this! For reference, http://www.clusterresources.com/pipermail/torquedev/2006-September/000314.html is the post. I added the dot-file skip patch to 2.1.2-r2 since the problem's likely to affect many people.
Comment 5 Ian Stakenvicius 2006-09-28 06:14:26 UTC
The 'Patch to fix que_recov' above has been added upstream now too.
http://www.clusterresources.com/pipermail/torquedev/2006-September/000316.html

I guess we can close this one?
Comment 6 Donnie Berkholz (RETIRED) gentoo-dev 2006-09-28 08:05:46 UTC
Is there any reason for us to add the original patch, now that we've got the second one?
Comment 7 Ian Stakenvicius 2006-09-28 08:14:29 UTC
Depends -- if que_recov() fails on one queue file (couldn't read the queue file, file descriptor is bad) but then succeeds on another, then memory corruption will probably occur (and the pbs_server will crash).  But, as mentioned upstream, if this happens then you have bigger problems.  

And with the second patch (skipping dot-filenames) already applied, there isn't any reason for loading a queue file to fail (assuming nobody plays with the contents of server_priv/queues).
Comment 8 Donnie Berkholz (RETIRED) gentoo-dev 2006-09-28 08:17:08 UTC
I suppose we might as well add it, but I won't do a revision bump this time.
Comment 9 Donnie Berkholz (RETIRED) gentoo-dev 2006-10-10 18:16:56 UTC
Applied the que_recov patch as well. Thanks!