Hi there,
I have a reproducible bug that has already occurred on two of my x86 servers: /etc/init.d/bootmisc, as found in sys-apps/baselayout-1.12.1, hangs on boot at line 115 while doing
    chown 0:0 /tmp/.{ICE,X11}-unix
It's possible to continue the boot process by pressing <ctrl>-c, but that shouldn't be a permanent solution.
I just wonder whether this line is really necessary. The X11 directories are freshly created on wipe as well as on clean, and during the boot process this is obviously done by root, thus by uid:gid 0:0. So why the chown at all?
Looking forward to being enlightened,
Torsten
if a simple chown is hanging, you've got bigger problems
the chown/chmod are sanity checks to make sure we have correct permissions, just like the comment above the code says
i'd suggest you drop the redirect to /dev/null and see if you get any errors ... if not, try adding `strace` before the chown call and see wtf is going on
Thanks for the tip, strace indeed revealed the problem, and it's not a trivial one:
We use an LDAP directory coupled with a Kerberos V password database to authenticate "normal" users against Linux. The nsswitch.conf thus looks as follows:
    passwd: compat files ldap
    shadow: compat files ldap
    group:  compat files ldap
In addition, pam_ldap and pam_krb5 are installed and called by /etc/pam.d/system-auth.
strace shows that chown 0:0 looks for the LDAP entries, but since no network interfaces are up, it tries indefinitely to resolve the hostname of the LDAP server.
A trivial solution seems to be to replace
    chown 0:0
with
    chown root:root
Don't ask me why chown reacts so allergically to numeric IDs. Anyway, chown root:root seems pretty valid to me (a standard Linux system without a user "root" wouldn't be able to run anything, let alone X11), so can it go into baselayout, please?
Torsten
we use 0:0 because of stupid *BSD systems that map the root group to gid 10 and the wheel group to gid 0
seems odd that numeric ids are looked up in the database though ... is there a case where this would make sense (i cant think of one) ?
Ouch, *BSD - yes, I know the problem with groups. I have two FreeBSD servers running myself.
I also don't know why numeric IDs are looked up and named ones are not. I guess there's something wrong with the way nss_ldap handles numeric IDs. There was a revision bump lately, maybe the error lies there. The question is whether it depends on one of the patches applied in the nss_ldap ebuild, or whether the good folks at PADL Software built the sh*t directly into their software.
I will examine that further on Monday.
I have narrowed the problem down somewhat. The standard bootmisc works smoothly up to
    sys-auth/nss_ldap-2.39-r1
The next version in portage is
    sys-auth/nss_ldap-2.49
and this one, as well as all following ones, tries to resolve the numeric IDs. My knowledge of C is insufficient to see what led to that change, so perhaps someone else should have a look...
np, we have a dev who "loves" this package ;)
vapier: I'm looking at it, but could you also please look at coreutils for chown? By default it does a getpwnam/getgrnam on the arguments that are passed in. If this fails (returning NULL instead of a pointer to a struct), it tries to convert the argument to a numeric value. I'm interested in whether it has always behaved this way.
The getpwnam/getgrnam would always go to your configured NSS source, so this may be a repeat of the lookup delay issues.
Torsten: could you please try nss_ldap-250-r1 and see EXACTLY how long the delays are? (250-r1 changes the timeout behavior of nss_ldap on purpose.) In 249, it shouldn't actually be a hang, but a very long timeout (nearly 5 minutes for each lookup).
Ok, boys, I measured the timeouts. They appear to be exactly 30 seconds with 250-r1. That's indeed not a very long time compared to BIOS POST, SCSI detection, etc...
Anyway, there must be a way to keep nss_ldap from even looking when none of the network devices is up yet (which is definitely the case at that early boot stage).
ok, if they are 30 seconds with 250-r1, then they were definitely significantly longer on older versions.
spanky: in chown from coreutils, could the logic possibly be changed to check whether the passed-in value is numeric first, instead of trying to look it up and only converting to numeric after that fails? The file you'd need to change is ${S}/lib/userparse.c
Torsten: the problem is that you can't differentiate between the remote LDAP server being totally down and the local network being down. Both cases result in effectively the same error being returned to Linux.
*** Bug 142626 has been marked as a duplicate of this bug. ***
base-system: please read the summary below, and fix coreutils asap.
I was asked why this doesn't seem to behave. Here's a short summary of what happens:
1. user calls 'chown 0:0 foo'
2. chown splits this into two STRINGS, user="0", group="0"
3. chown (via some code in the lib/ portion of coreutils) does getpwnam("0"), getgrnam("0").
4. this causes NSS to go and look for a user and group with a NAME of "0". Notice: not a number of zero, but a string name of "0".
5. NSS checks files for a user/group named "0". Finds nothing.
6. NSS checks ldap for a user/group named "0". LONG delay happens here because the LDAP server (if local) is not yet started or (if remote) is not yet accessible (networking isn't up).
7. chown code decides that if nothing was found so far, try to convert it to a number. This succeeds, and the chown is actually done at this point.
#7 needs to move way up, to realize that the input is a numeric value, not a string, and should not be looked up as a name.
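For illustration, the lookup order in steps 3 to 7 can be sketched in Python. This is a simplified model and not the coreutils code: the source list and its contents are made up, "compat" is folded into "files", and the names nss_lookup and resolve_uid are hypothetical.

```python
def nss_lookup(name, sources, trace):
    """Ask each (label, table) source in order and stop at the first hit."""
    for label, table in sources:
        trace.append((label, name))      # record which source was consulted
        if name in table:
            return table[name]
    raise KeyError(name)                 # no source knows this name


def resolve_uid(spec, sources, trace):
    """Name lookup first, numeric conversion only as a fallback."""
    try:
        return nss_lookup(spec, sources, trace)   # steps 3-6: treated as a NAME
    except KeyError:
        return int(spec)                          # step 7: parsed as a number


# Hypothetical database contents: "root" lives in files, ldap knows nothing.
sources = [("files", {"root": 0}), ("ldap", {})]

trace_numeric = []
uid_numeric = resolve_uid("0", sources, trace_numeric)

trace_root = []
uid_root = resolve_uid("root", sources, trace_root)
```

In this model the operand "0" is asked of both files and ldap before the numeric fallback kicks in, while "root" is answered by files and never reaches ldap. That would be consistent with chown root:root not hanging when the LDAP server is unreachable.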
(In reply to comment #11)
I agree with your conclusion, but upgrading nss_ldap didn't bring any improvement :(
As you asked me to upgrade to nss_ldap-250-r1 and time the exact delay until the ldap request timed out, I can confirm that the delay for the lookup is ... longer than 30 minutes. That is far more than the 30 seconds expected. Tired of waiting for a response that never came, I finally stopped the process. So chown never returned, and you will certainly be disappointed by the following result:
    + mkdir -p /tmp/.ICE-unix /tmp/.X11-unix
    + date
    jeu aoû 3 21:10:54 MEST 2006
    + chown 0:0 /tmp/.ICE-unix /tmp/.X11-unix
    Ctrl+c (30 minutes is really really time consuming :))
Moreover, I didn't set idle_timelimit in /etc/ldap.conf and simply left it at its default value (presumably 3600 seconds).
Jj
(In reply to comment #11)
It's late, and it's time for me to go to bed. Tomorrow is another working day ;)
I forgot to add the timings you mentioned, so I added them in another test, but without any more success.
    # cat /etc/ldap.conf
    ...
    nss_reconnect_tries 4          # number of times to double the sleep time
    nss_reconnect_sleeptime 1      # initial sleep value
    nss_reconnect_maxsleeptime 16  # max sleep value to cap at
    nss_reconnect_maxconntries 2   # how many tries before sleeping
    # This leads to a delay of 15 seconds (1+2+4+8=15)
After replacing chown 0:0 with chown root:root, bootmisc doesn't lock up anymore. You were definitely right.
Good night
Jj
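The 15 seconds quoted in that config comment is a capped doubling sum. A small sketch of the arithmetic, assuming the sleep starts at nss_reconnect_sleeptime, doubles after each try, and is capped at nss_reconnect_maxsleeptime. It models the comment in ldap.conf, not necessarily what nss_ldap does internally, and total_sleep is a made-up helper name.

```python
def total_sleep(tries, initial, cap):
    """Sum the sleeps of a doubling backoff, capping each sleep at `cap`."""
    total = 0
    sleep = initial
    for _ in range(tries):
        total += min(sleep, cap)   # never sleep longer than the cap
        sleep *= 2                 # double the sleep for the next try
    return total


# Values from the ldap.conf excerpt: 4 tries, start at 1 s, cap at 16 s.
delay = total_sleep(tries=4, initial=1, cap=16)   # 1 + 2 + 4 + 8
```

With these values the cap of 16 is never reached, so it would only matter if nss_reconnect_tries were raised.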
*** Bug 142790 has been marked as a duplicate of this bug. ***
sorry, but this is by design and is required by spec:
http://www.opengroup.org/onlinepubs/009695399/utilities/chown.html
OPERANDS
The following operands shall be supported:
owner[:group]
    A user ID and optional group ID to be assigned to file. The owner portion of this operand shall be a user name from the user database or a numeric user ID. Either specifies a user ID which shall be given to each file named by one of the file operands. If a numeric owner operand exists in the user database as a user name, the user ID number associated with that user name shall be used as the user ID. Similarly, if the group portion of this operand is present, it shall be a group name from the group database or a numeric group ID. Either specifies a group ID which shall be given to each file. If a numeric group operand exists in the group database as a group name, the group ID number associated with that group name shall be used as the group ID.
what this means is that if you have "0" as a username, then the uid associated with that username will be utilized rather than the numeric uid 0
so add this to the end of your /etc/passwd:
    0:x:3456:3456::/:/bin/false
then run:
    touch foo
    chown 0 foo
    stat -c%u foo
notice how the output is uid 3456, not uid 0
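The precedence rule quoted from the spec can be written down as a tiny Python model. The dictionary stands in for the user database, and spec_owner_to_uid is a hypothetical name; this is a sketch of the rule as quoted, not of any real chown implementation.

```python
def spec_owner_to_uid(operand, user_db):
    """A user NAME match wins; only otherwise is the operand a numeric uid."""
    if operand in user_db:
        return user_db[operand]   # "0" exists as a user name: use its uid
    return int(operand)           # no such name: treat the operand as a number


# Same situation as the /etc/passwd line above: a user literally named "0".
with_user_zero = spec_owner_to_uid("0", {"root": 0, "0": 3456})

# Ordinary system without such a user.
without_user_zero = spec_owner_to_uid("0", {"root": 0})
```

The first call yields 3456 and the second yields 0, which mirrors the stat output described above.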
Spanky: a LOT of other stuff in the system forbids numeric values as usernames.
    # useradd -u 3456 -g 100 -s /bin/false 0
    useradd: invalid user name '0'
(add it manually now instead)
    # echo "0:x:3456:100:testcase:/tmp:/bin/false" >>/etc/passwd
(now show how getent is broken)
    # getent passwd 0
    root:x:0:0:root:/root:/bin/bash
The one alternative to not fixing this is to write a service that rotates the correct nsswitch into place at the correct time, which isn't an easy task. (uberlord tried a few variations on that idea, i know.)
A different alternative would be to find a chown-like tool that can explicitly be told that its input is a numeric uid/gid and should not be looked up otherwise.
> a LOT of other stuff in the system forbids numeric values as usernames.
what's your point ? chown has a spec that is accepted by everyone, it is certainly not our place to go changing that behavior
is said behavior stupid ? certainly is imho, but it's in the spec, thus it will always retain that behavior until POSIX/IEEE/whoever changes their mind
> A different alternative would be to find a chown-like tool that can explicitly
> be told that its input is a numeric uid/gid and should not be looked up
> otherwise.
what i was thinking of was asking the coreutils guys what they thought of a flag to chown/chgrp that explicitly forces numeric ids to not be looked up ... like a -n flag or something
+1 on the -n numeric flag. I'll even code it if they like the idea.
or change bootmisc to 'use net'
'use net' doesn't solve it for those with a local LDAP server, and it also conflicts with runlevels, since bootmisc is in boot and net is in default.
fixed in svn by dropping the chown as it is just a sanity check
this will cause problems for people who run the `mkdir` as a non-root user, but then again in that case the `chown` would have failed anyways as non-root users cannot chown to 0:0
Created attachment 93767
coreutils-numeric.patch
Adds -n and --numeric to chown and chgrp. I tried to change as little code as possible. I'm sure you won't like it, Spanky.
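To make the idea behind the flag concrete, here is a rough Python model of what a "numeric only" switch would change. It is not the attached patch and not coreutils code: owner_to_uid, recording_lookup and the numeric parameter are hypothetical names, and the lookup is a stand-in that pretends no NSS source knows the name.

```python
def owner_to_uid(operand, lookup, numeric=False):
    """Resolve an owner operand; with numeric=True the database is skipped."""
    if numeric:
        # Flag set: the operand must be a plain number, no name lookup at all.
        if not (operand.isascii() and operand.isdigit()):
            raise ValueError(f"not a numeric id: {operand!r}")
        return int(operand)
    try:
        return lookup(operand)    # default: name lookup first
    except KeyError:
        return int(operand)       # then numeric fallback


calls = []

def recording_lookup(name):
    """Stand-in for the user database that records every query it receives."""
    calls.append(name)
    raise KeyError(name)          # pretend no source knows this name


default_uid = owner_to_uid("0", recording_lookup)
calls_default = list(calls)

calls.clear()
numeric_uid = owner_to_uid("0", recording_lookup, numeric=True)
calls_numeric = list(calls)
```

Both calls end up with uid 0, but only the default path queries the database. With the flag set the lookup is never invoked, so an unreachable LDAP server could not stall it.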