I did not check the ebuild code in much detail, but it looks like on a running and configured system the ebuild fails to determine the server_name and thus ends up writing the current node's hostname into the file. This happens on stable amd64.
It sounds like you want torque to interpret and set PBS_SERVER_NAME at install time instead of configure time (how it's done now). Setting that variable to build the package would require building the same package for each node in turn with the variables set accordingly during each build run, whereas you would ideally want to set the variable and use it at install time, so that each node is uniquely configured with its own PBS_SERVER_NAME variable. Right?
I do not really mind whether it happens at the configure or the install step. Somehow, recompilation (actually a downgrade from 2.4.2-beta to 2.3.6) rewrote my /var/spool/torque/server_name and replaced the proper hostname on each node with "localhost". I think the logic in the ebuild is breaking, but I do not know why. More importantly, I think etc-update should have asked me whether I want to change the file. That did not happen, so I think that is another issue, at least with the 2.3.6 ebuild. The nodes return correct names for the hostname(1) and domainname(1) commands.

Regarding your second paragraph: I configure, compile, and install on each node separately, although it is wasteful. I do not mind whether the server hostname is determined a bit earlier or later. At the moment it is not recognized at all, so "localhost" kicks in.
(In reply to comment #2)
> I do not think I bother whether configure or install step, really. Somehow,
> recompilation (actually downgrade from 2.4.2-beta to 2.3.6)

Can you reproduce this using only versions that are in the portage tree?
The detection for the server_name does suck when PBS_SERVER_NAME is not defined and should be moved to pkg_setup(). However, /var/spool/torque/server_name is config protected, so I suspect you lost the file to whatever method you used to install torque-2.4.2-beta.

mejis openmpi # etc-update
Scanning Configuration files...
The following is the list of files which need updating, each
configuration file is followed by a list of possible replacement files.
1) /var/spool/torque/server_name (1)
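A minimal sketch of what a more careful detection could look like, assuming a fallback order that this thread only hints at: an explicitly set PBS_SERVER_NAME wins, then the name already stored in the existing server_name file, and only then the local hostname. The function name detect_server_name and the exact precedence are hypothetical and are not taken from the torque ebuild.

```shell
#!/bin/sh
# Hypothetical helper, NOT the actual torque ebuild code.
# Assumed precedence (for illustration only):
#   1. PBS_SERVER_NAME, if the administrator set it
#   2. the name already stored in the existing server_name file
#   3. the local host name as a last resort
detect_server_name() {
    server_name_file="${1:-/var/spool/torque/server_name}"

    if [ -n "${PBS_SERVER_NAME:-}" ]; then
        printf '%s\n' "${PBS_SERVER_NAME}"
    elif [ -s "${server_name_file}" ]; then
        # Keep whatever was configured earlier instead of overwriting it.
        head -n 1 "${server_name_file}"
    else
        # Fall back to the fully qualified name, or the short name if -f fails.
        hostname -f 2>/dev/null || hostname
    fi
}
```

With a check like this, a compute node whose server_name already contains nfssrv.cluster.local would keep that value on a remerge, and the node's own hostname would only be used on a fresh install with nothing else to go on.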
I do not know why, but /var/spool/torque/._cfg0000_server_name was NOT created while installing 2.3.6. Then I unmasked 2.3.7 for amd64 and installed that, and etc-update still did not report this file. Finally, to return to a stable version, I downgraded to 2.3.6. And here we go:

Showing differences between /var/spool/torque/server_name and /var/spool/torque/._cfg0000_server_name
--- /var/spool/torque/server_name 2009-12-01 15:25:44.000000000 +0100
+++ /var/spool/torque/._cfg0000_server_name 2009-12-03 23:32:12.000000000 +0100
@@ -1 +1 @@
-nfssrv.cluster.local
+node001.cluster.local

Sorry, I do not know why I started to mention the "localhost" value. It places the current hostname into the file. Maybe it has to do with the USE=server flag, which is perhaps NOT expected to be set on the nodes? But I do have it set, because otherwise many packages will not provide some daemons I want to have.

# emerge -pv =sys-cluster/torque-2.3.6

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild R ] sys-cluster/torque-2.3.6 USE="crypt server -cpusets -doc -syslog -tk" 0 kB

# emerge -pv =sys-cluster/torque-2.3.7

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild U ] sys-cluster/torque-2.3.7 [2.3.6] USE="crypt server syslog* -cpusets -doc -drmaa% -tk" 0 kB
(In reply to comment #5)
> I do not know why, but the /var/spool/torque/._cfg0000_server_name was NOT
> created while installing 2.3.6. Then I unmasked 2.3.7 for amd64, installed
> that, still not this file reported by etc-update. Finally, to return to a
> stable version i downgraded to 2.3.6. And, here we go:

It's not going to be created if there are no differences between the files, which there shouldn't be. Looks like this is working fine during remerge and upgrade.

> Showing differences between /var/spool/torque/server_name and
> /var/spool/torque/._cfg0000_server_name
> --- /var/spool/torque/server_name 2009-12-01 15:25:44.000000000 +0100
> +++ /var/spool/torque/._cfg0000_server_name 2009-12-03 23:32:12.000000000
> +0100
> @@ -1 +1 @@
> -nfssrv.cluster.local
> +node001.cluster.local

Yeah, this shouldn't be happening on downgrade. I'll look further into it.
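The rule described in this comment (a ._cfg0000_ file only appears when the newly installed file differs from the one on disk) can be illustrated with a small emulation. This is a simplified sketch, not Portage code: the function name install_protected is hypothetical, and the counter is fixed at 0000 here, whereas a real merge can use higher numbers.

```shell
#!/bin/sh
# Hypothetical emulation of config protection, NOT Portage's implementation.
# Usage: install_protected NEW_FILE TARGET
install_protected() {
    new_file="$1"
    target="$2"
    dir=$(dirname "${target}")
    base=$(basename "${target}")

    if [ ! -e "${target}" ]; then
        # Nothing to protect yet: install the file directly.
        cp "${new_file}" "${target}"
    elif cmp -s "${new_file}" "${target}"; then
        # Identical content: no ._cfg file is written, etc-update stays quiet.
        :
    else
        # Content differs: leave the target alone and queue the new version.
        cp "${new_file}" "${dir}/._cfg0000_${base}"
    fi
}
```

In the case reported above, the packaged file contained node001.cluster.local while the file on disk contained nfssrv.cluster.local, so the third branch applies and etc-update offers the diff.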
Server name detection was moved in 2.3.7-r1. Look for 2.3.10 to be marked stable with this fix.