Bug 295408 - sys-cluster/torque server name detection should be done in pkg_setup()
Summary: sys-cluster/torque server name detection should be done in pkg_setup()
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: New packages
Hardware: All Linux
Importance: High normal
Assignee: Justin Bronder (RETIRED)
Reported: 2009-12-02 09:37 UTC by Martin Mokrejš
Modified: 2010-03-02 02:03 UTC

Description Martin Mokrejš 2009-12-02 09:37:07 UTC
I did not check the ebuild code in much detail, but it looks like on a running and configured system the ebuild fails to determine the server_name and thus ends up writing the current (node) hostname into the file. This happens on stable amd64.
Comment 1 Jeroen Roovers (RETIRED) gentoo-dev 2009-12-03 18:17:42 UTC
It sounds like you want torque to interpret and set PBS_SERVER_NAME at install time instead of configure time (how it's done now).

Setting that variable to build the package would require building the same package for each node in turn with the variables set accordingly during each build run, whereas you would ideally want to set the variable and use it at install time, so that each node is uniquely configured with its own PBS_SERVER_NAME variable. Right?
Comment 2 Martin Mokrejš 2009-12-03 19:08:19 UTC
I do not think I care whether it is the configure or install step, really. Somehow, recompilation (actually a downgrade from 2.4.2-beta to 2.3.6) rewrote my /var/spool/torque/server_name and replaced the proper hostname on each node with "localhost". I think the logic in the ebuild is breaking, though I do not know why. But, importantly, I think etc-update should have asked me whether I want to change the file. That did not happen, so I think that is another issue, at least with the 2.3.6 ebuild. The nodes return correct names from the hostname(1) and domainname(1) commands.

Commenting on your second paragraph: I configure, compile and install on each node separately, although it is wasteful. I do not mind whether the server hostname is determined a bit earlier or later. At the moment it is not recognized at all, so "localhost" kicks in.
Comment 3 Justin Bronder (RETIRED) gentoo-dev 2009-12-03 19:40:41 UTC
(In reply to comment #2)
> I do not think I care whether it is the configure or install step, really. Somehow,
> recompilation (actually a downgrade from 2.4.2-beta to 2.3.6)

Can you reproduce this using only versions that are in the portage tree?
Comment 4 Justin Bronder (RETIRED) gentoo-dev 2009-12-03 21:05:25 UTC
The detection for the server_name does suck when PBS_SERVER_NAME is not defined and should be moved to pkg_setup().  However, /var/spool/torque/server_name is config protected, so I suspect you lost the file to whatever method you used to install torque-2.4.2-beta.

mejis openmpi # etc-update
Scanning Configuration files...
The following is the list of files which need updating, each
configuration file is followed by a list of possible replacement files.
1) /var/spool/torque/server_name (1)
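
For illustration, the order of checks that a pkg_setup()-time detection could follow is sketched below. This is a minimal sketch under stated assumptions, not the code that was committed to the ebuild: the function name detect_server_name and the fallback to hostname are hypothetical, and only PBS_SERVER_NAME and /var/spool/torque/server_name come from this report.

```shell
# Hypothetical helper: prints the server name an ebuild could settle on.
# $1 is the torque spool directory (defaults to the path from this bug).
detect_server_name() {
	local home="${1:-/var/spool/torque}"
	local file="${ROOT:-}${home}/server_name"

	if [ -n "${PBS_SERVER_NAME:-}" ]; then
		# 1. An explicit setting in the environment wins.
		printf '%s\n' "${PBS_SERVER_NAME}"
	elif [ -s "${file}" ]; then
		# 2. Otherwise keep what the live system already uses.
		head -n 1 "${file}"
	else
		# 3. Last resort (assumption): this host's own name.
		hostname -f 2>/dev/null || hostname
	fi
}
```

Run from pkg_setup(), a check like this looks at the live filesystem before anything is built, so an existing server_name on a configured node would be preferred over the node's own hostname.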
Comment 5 Martin Mokrejš 2009-12-03 22:38:59 UTC
I do not know why, but /var/spool/torque/._cfg0000_server_name was NOT created while installing 2.3.6. Then I unmasked 2.3.7 for amd64 and installed that; still no such file was reported by etc-update. Finally, to return to a stable version, I downgraded to 2.3.6. And here we go:

Showing differences between /var/spool/torque/server_name and /var/spool/torque/._cfg0000_server_name
--- /var/spool/torque/server_name       2009-12-01 15:25:44.000000000 +0100
+++ /var/spool/torque/._cfg0000_server_name     2009-12-03 23:32:12.000000000 +0100
@@ -1 +1 @@
-nfssrv.cluster.local
+node001.cluster.local


Sorry, I do not know why I started to mention the "localhost" value. It places the current hostname into the file. Maybe it has to do with the USE=server flag, which is perhaps NOT expected to be set on the nodes? But I do have it set, because otherwise many packages will not provide some daemons I want to have.


# emerge -pv =sys-cluster/torque-2.3.6

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild   R   ] sys-cluster/torque-2.3.6  USE="crypt server -cpusets -doc -syslog -tk" 0 kB

# emerge -pv =sys-cluster/torque-2.3.7

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild     U ] sys-cluster/torque-2.3.7 [2.3.6] USE="crypt server syslog* -cpusets -doc -drmaa% -tk" 0 kB

#
Comment 6 Justin Bronder (RETIRED) gentoo-dev 2009-12-03 22:45:14 UTC
(In reply to comment #5)
> I do not know why, but /var/spool/torque/._cfg0000_server_name was NOT
> created while installing 2.3.6. Then I unmasked 2.3.7 for amd64 and installed
> that; still no such file was reported by etc-update. Finally, to return to a
> stable version, I downgraded to 2.3.6. And here we go:

It's not going to be created if there are no differences between the files, which there shouldn't be. Looks like this is working fine during remerge and upgrade.
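
The rule can be pictured with a simplified model of a config-protected merge. This is only a sketch and not Portage's actual code: the function name protected_merge is hypothetical, and the fixed 0000 counter is a simplification (Portage picks the next free number).

```shell
# Hypothetical model of merging one config-protected file.
# $1 = freshly built file, $2 = destination on the live system.
protected_merge() {
	local new="$1" dest="$2"
	local dir base
	dir=$(dirname "$dest")
	base=$(basename "$dest")

	if [ ! -e "$dest" ]; then
		# Nothing to protect yet: install directly.
		cp "$new" "$dest"
	elif cmp -s "$new" "$dest"; then
		# Identical content: no ._cfg file, nothing for etc-update.
		:
	else
		# Content differs: leave dest alone and stage the new copy.
		cp "$new" "$dir/._cfg0000_$base"
	fi
}
```

Under this model, a ._cfg0000_server_name file appears only when the build produced a server_name that differs from the installed one, which is what the diff in comment #5 shows.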

> 
> Showing differences between /var/spool/torque/server_name and
> /var/spool/torque/._cfg0000_server_name
> --- /var/spool/torque/server_name       2009-12-01 15:25:44.000000000 +0100
> +++ /var/spool/torque/._cfg0000_server_name     2009-12-03 23:32:12.000000000 +0100
> @@ -1 +1 @@
> -nfssrv.cluster.local
> +node001.cluster.local

Yeah, this shouldn't be happening on downgrade.  I'll look further into it.

Comment 7 Justin Bronder (RETIRED) gentoo-dev 2010-03-02 02:03:11 UTC
Server name detection was moved in 2.3.7-r1.  Look for 2.3.10 to be marked stable with this fix.