Bug#: 149327
Status: RESOLVED
Resolution: FIXED
Assigned To: Gentoo Linux High-Performance Clustering Team <hp-cluster@gentoo.org>
Reporter: Ian Stakenvicius <ian@syndicated-productions.com>

Attachments:
Filename | Description | Type | Creator | Created | Size
torque-queue-recov-freemem-fix.patch | Patch to fix que_recov | text/plain | Ian Stakenvicius | 2006-09-27 11:02 0000 | 579 bytes
torque-skip-dotnames.patch | upstream patch to skip dot-filenames | text/plain | Ian Stakenvicius | 2006-09-27 22:24 0000 | 262 bytes



Description:   Opened: 2006-09-27 10:58 0000
OK, I think this is big enough for a bug of its own.  When pbs_server starts, it
tries to read the .keep file because it thinks it is a queue file.  This
shouldn't be a problem -- the server will error with a 'que_recov: read error',
which _should_ just be an annoying error message.

However, I had intermittent crashes in pbs_server when I was modifying my
queues (mainly when trying to delete recovered jobs with qdel).

A search through the code seems to indicate that there's an error in que_recov:
when it allocates memory to load the queue data, it adds that structure to a
linked list, but on error it just frees the structure and leaves the linked
list pointing at invalid memory.
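
To make the suspected failure mode concrete, here is a minimal sketch of it. It is not TORQUE's code: the struct, the function names, and the small static pool (used in place of malloc()/free() so the stale link can be inspected without undefined behaviour) are all assumptions made for illustration.

```c
#include <stddef.h>

#define POOL_SIZE 4

/* Hypothetical queue record; a static pool stands in for malloc()/free()
 * so the stale link can be observed safely. Not TORQUE's real struct. */
struct qrec {
    struct qrec *next;
    int in_use;
};

static struct qrec pool[POOL_SIZE];
static struct qrec *queue_list = NULL;   /* list of all known queues */

/* Models the allocation step: take a free slot and link it into the list. */
struct qrec *model_alloc(void) {
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            pool[i].next = queue_list;
            queue_list = &pool[i];
            return &pool[i];
        }
    }
    return NULL;
}

/* Models a bare free(): the slot is released but never unlinked. */
void model_bare_free(struct qrec *q) {
    q->in_use = 0;
}

/* Models the error path described above: allocate, fail to read, free. */
void model_recover_with_read_error(void) {
    struct qrec *q = model_alloc();
    if (q != NULL)
        model_bare_free(q);
}

/* Counts list entries that still point at released slots. */
size_t count_stale_links(void) {
    size_t stale = 0;
    for (struct qrec *q = queue_list; q != NULL; q = q->next)
        if (!q->in_use)
            stale++;
    return stale;
}
```

After one call to model_recover_with_read_error(), count_stale_links() reports one entry that the list still references even though its slot has been released. In this model a second allocation would reuse that slot while the list still links to it, so the sketch is meant to be run for a single failure only.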

I've sent a message upstream to find out if this is actually a bug.

------- Comment #1 From Ian Stakenvicius 2006-09-27 11:02:04 0000 -------
Created an attachment (id=98250) [edit]
Patch to fix que_recov

The attached patch fixed the issue on my system -- I changed the direct free()
calls in que_recov to que_free() (the counterpart of que_alloc(), which is how
the structure is created).
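
The idea behind the change can be sketched as follows. This is an illustration under assumptions, not the attached patch: the sketch_* names and the struct are hypothetical stand-ins, and the only point shown is that the release routine undoes the list insertion performed by the allocation routine before the memory is freed.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a queue record; not TORQUE's real struct. */
struct sketch_queue {
    struct sketch_queue *next;
    int loaded;
};

static struct sketch_queue *sketch_queues = NULL;  /* list of all queues */

/* Allocation counterpart: create a record and link it at the list head. */
struct sketch_queue *sketch_que_alloc(void) {
    struct sketch_queue *q = calloc(1, sizeof *q);
    if (q == NULL)
        return NULL;
    q->next = sketch_queues;
    sketch_queues = q;
    return q;
}

/* Release counterpart: unlink the record first, then free it. */
void sketch_que_free(struct sketch_queue *q) {
    if (q == NULL)
        return;
    struct sketch_queue **link = &sketch_queues;
    while (*link != NULL && *link != q)
        link = &(*link)->next;
    if (*link == q)
        *link = q->next;
    free(q);
}

/* Simulated recovery: read_ok == 0 models the 'read error' path. */
struct sketch_queue *sketch_que_recov(int read_ok) {
    struct sketch_queue *q = sketch_que_alloc();
    if (q == NULL)
        return NULL;
    if (!read_ok) {
        /* A bare free(q) here would leave the list pointing at q. */
        sketch_que_free(q);
        return NULL;
    }
    q->loaded = 1;
    return q;
}

/* Number of records currently linked into the list. */
size_t sketch_que_count(void) {
    size_t n = 0;
    for (struct sketch_queue *q = sketch_queues; q != NULL; q = q->next)
        n++;
    return n;
}
```

With this pairing, a failed sketch_que_recov(0) leaves the list exactly as it was, so a later successful recovery links into a list that holds only live records.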

------- Comment #2 From Donnie Berkholz 2006-09-27 11:21:16 0000 -------
Please reopen when you get a resolution from the upstream guys. I try to avoid
adding patches to our stuff until upstream's looked them over, in cases where
upstream is responsive.

------- Comment #3 From Ian Stakenvicius 2006-09-27 22:24:10 0000 -------
Created an attachment (id=98277) [edit]
upstream patch to skip dot-filenames

Upstream hasn't responded on the memory-freeing issue yet, but they did patch
the load function to skip dot-filenames (as I mentioned in bug 149226), which
is what caused the issue for me in the first place.
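
For readers without the attachment at hand, the dot-filename check amounts to something like the sketch below. This is not the upstream patch itself; is_dot_name() and select_queue_files() are hypothetical helpers, and the name list is passed in as an array rather than read from server_priv/queues.

```c
#include <stddef.h>

/* Returns 1 for names such as ".keep", "." and "..", 0 otherwise. */
int is_dot_name(const char *name) {
    return name != NULL && name[0] == '.';
}

/* Copies into out[] the names worth loading as queue files and returns
 * how many were kept. out[] must have room for n entries. */
size_t select_queue_files(const char *const names[], size_t n,
                          const char *out[]) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (names[i] == NULL || is_dot_name(names[i]))
            continue;               /* skip .keep and other dot entries */
        out[kept++] = names[i];
    }
    return kept;
}
```

Given the names ".", "..", ".keep", "batch" and "long.q", only "batch" and "long.q" are kept, so the .keep file never reaches the queue recovery code.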

I attached the patch they provided for convenience.

------- Comment #4 From Donnie Berkholz 2006-09-27 22:48:01 0000 -------
Ian, thanks for your quick response and effort on this! For reference,
http://www.clusterresources.com/pipermail/torquedev/2006-September/000314.html
is the post. I added the dot-file skip patch to 2.1.2-r2 since the problem's
likely to affect many people.

------- Comment #5 From Ian Stakenvicius 2006-09-28 06:14:26 0000 -------
The 'Patch to fix que_recov' above has been added upstream now too.
http://www.clusterresources.com/pipermail/torquedev/2006-September/000316.html

I guess we can close this one?

------- Comment #6 From Donnie Berkholz 2006-09-28 08:05:46 0000 -------
Is there any reason for us to add the original patch, now that we've got the
second one?

------- Comment #7 From Ian Stakenvicius 2006-09-28 08:14:29 0000 -------
Depends -- if que_recov() fails on one queue file (couldn't read the queue
file, file descriptor is bad) but then succeeds on another, then memory
corruption will probably occur (and the pbs_server will crash).  But, as
mentioned upstream, if this happens then you have bigger problems.  

And with the second patch (skipping dot-filenames) already applied, there isn't
any reason that loading a queue file would fail (assuming nobody plays with
the contents of server_priv/queues).

------- Comment #8 From Donnie Berkholz 2006-09-28 08:17:08 0000 -------
I suppose we might as well add it, but I won't do a revision bump this time.

------- Comment #9 From Donnie Berkholz 2006-10-10 18:16:56 0000 -------
Applied the que_recov patch as well. Thanks!
