477804 – user.eclass: enewuser fails on parallel jobs

Bug 477804 - user.eclass: enewuser fails on parallel jobs

Summary: user.eclass: enewuser fails on parallel jobs

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Eclasses (show other bugs)
Hardware:	All Linux

Importance:	Low normal
Assignee:	Gentoo's Team for Core System packages

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-07-22 20:20 UTC by Rick Farina (Zero_Chaos)
Modified:	2022-12-07 00:01 UTC (History)
CC List:	4 users (show)

See Also:
Package list:
Runtime testing required:	---

Attachments
possible fix (parallel_linux_useradd.txt,1.52 KB, patch) 2013-07-23 04:43 UTC, Rick Farina (Zero_Chaos)	Details \| Diff
potential fix that preserves an ebuild's default if possible (file_477804.txt,725 bytes, patch) 2013-08-15 18:34 UTC, Doug Goldstein (RETIRED)	Details \| Diff
Don't search for an unused UID on Linux, but still allow suggested UIDs from ebuilds (file_477804.txt,1.53 KB, patch) 2013-09-01 09:34 UTC, dwfreed	Details \| Diff
Don't search for an unused UID/GID on Linux, but still allow suggested UIDs/GIDs from ebuilds (file_477804.txt,3.57 KB, patch) 2013-12-19 06:04 UTC, dwfreed	Details \| Diff
Show Obsolete (3) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Rick Farina (Zero_Chaos) gentoo-dev

2013-07-22 20:20:31 UTC

I just hit the oddest error:

 * Adding group 'dnsmasq' to your system ...
 *  - Groupid: next available
 * Adding user 'dnsmasq' to your system ...
 *  - Userid: 118
 *  - Shell: /sbin/nologin
 *  - Home: /dev/null
 *  - Groups: dnsmasq
 *  - GECOS: added by portage for dnsmasq
useradd: UID 118 is not unique
 * ERROR: net-dns/dnsmasq-2.66 failed (setup phase):
 *   (no error message)


When I checked userid 118 was already assigned to radvd.  the only scenario I can piece together (since both eclass and ebuild haven't been changed in a while and look sane) is that portage ran pkg_setup on both at the same time and collided.

I am currently building with --jobs=16 and FEATURES="-ebuild-locks parallel-install".  I've never seen any collisions like this (and I've been running like this for a few weeks now with multiple catalyst runs a day) so it is a fairly difficult thing to trigger.

Anyone care to take a stab at resolving this?

Comment 1 Mike Gilbert gentoo-dev

2013-07-22 20:55:54 UTC

enewuser is looping through uids to find the next one available. Why don't we just call useradd without the -u option and let it do the proper locking?

Comment 2 Rick Farina (Zero_Chaos) gentoo-dev

2013-07-23 03:21:09 UTC

Running useradd in linux automatically finds the next available uid, and -r is passed already so we get 100-999.  All that really needs to be done for linux to work properly is to drop the searching part and just run useradd instead.

The quickest and easiest fix for the major of our users (linux users) would be to conditional out this block of code so linux users don't run it at all:

        # handle uid
        local euid=$1; shift
        if [[ -n ${euid} && ${euid} != -1 ]] ; then
                if [[ ${euid} -gt 0 ]] ; then
                        if [[ -n $(egetent passwd ${euid}) ]] ; then
                                euid="next"
                        fi
                else
                        eerror "Userid given but is not greater than 0 !"
                        die "${euid} is not a valid UID"
                fi
        else
                euid="next"
        fi
        if [[ ${euid} == "next" ]] ; then
                for ((euid = 101; euid <= 999; euid++)); do
                        [[ -z $(egetent passwd ${euid}) ]] && break
                done
        fi
        opts+=( -u ${euid} )

Comment 3 Rick Farina (Zero_Chaos) gentoo-dev

2013-07-23 04:02:11 UTC

*-darwin*) cannot automatically assign uid

*-freebsd*|*-dragonfly*) can automatically assign uid but it's >1000 which would change the behavior of the eclass.

*-netbsd*) not tested
*-openbsd*) not tested


*) in this case linux, when testing the uid is automatically picked between 100 and 999 thanks to the -r flag.  It is recommended to either drop -u uid from the default case, or much more safely add a new "-linux*" that skips the uid selection stuff and avoids the issue entirely.  I'll work up a patch and we can see if there is a sane way to do this.

Comment 4 Rick Farina (Zero_Chaos) gentoo-dev

2013-07-23 04:43:43 UTC

Created attachment 353992 [details, diff]
possible fix

Comment 5 Rick Farina (Zero_Chaos) gentoo-dev

2013-08-15 17:14:35 UTC

Because this is a base system package I am waiting more than the normal 2 weeks but please note, I intend to commit this fix next week unless there are technical comments pointing out an issue with that.

Comment 6 Doug Goldstein (RETIRED) gentoo-dev

2013-08-15 18:20:15 UTC

(In reply to Rick Farina (Zero_Chaos) from comment #5)
> Because this is a base system package I am waiting more than the normal 2
> weeks but please note, I intend to commit this fix next week unless there
> are technical comments pointing out an issue with that.

If a package supplies a uid, with this change it will never be used as the default. This change will result in the next available uid always being used, which is not always desirable.

So NACK this change.

Comment 7 Doug Goldstein (RETIRED) gentoo-dev

2013-08-15 18:34:32 UTC

Created attachment 356106 [details, diff]
potential fix that preserves an ebuild's default if possible

Comment 8 Doug Goldstein (RETIRED) gentoo-dev

2013-08-15 18:36:53 UTC

Comment on attachment 356106 [details, diff]
potential fix that preserves an ebuild's default if possible

Eh now that I think about it, you still have a race so this isn't good. Realistically you need to attempt to add it, if it fails then add it without the uid set on Linux.

Comment 9 Rick Farina (Zero_Chaos) gentoo-dev

2013-08-15 18:45:04 UTC

> If a package supplies a uid, with this change it will never be used as the
> default. This change will result in the next available uid always being
> used, which is not always desirable.
> 
> So NACK this change.

I don't see what you are talking about:

                opts+=( -u ${euid} )
                einfo " - Userid: ${euid}"

I changed the default to not specify a uid, I didn't remove the ability to specify uid unless there was an unintentional screw up.  Can you please point out what I've done wrong?

Comment 10 Rick Farina (Zero_Chaos) gentoo-dev

2013-08-18 04:58:52 UTC

Comment on attachment 353992 [details, diff]
possible fix

After confirming that the uide setting functions work fine, I'm unobsoleting my fix.

Comment 11 Rick Farina (Zero_Chaos) gentoo-dev

2013-08-18 04:59:40 UTC

Comment on attachment 356106 [details, diff]
potential fix that preserves an ebuild's default if possible

As noted by comment #8 from original author, this doesn't solve the issue.

Comment 12 Rick Farina (Zero_Chaos) gentoo-dev

2013-08-18 05:02:57 UTC

I would really like this to go into the tree, I can't be the only one doing autobuilds that gets hit by this.  I know several other developers with systems equal to or better than mine.  Yes it's a tough race to hit, but the patch works fine, tested and confirmed, and seems to fix the race for me.  I would still like to have this committed next week as there are currently no valid complaints afaict.

Comment 13 dwfreed 2013-09-01 06:33:54 UTC

Comment on attachment 353992 [details, diff]
possible fix

> +		opts+=( -u ${euid} )
> +		einfo " - Userid: ${euid}"

This only applies if CHOST != linux, making the arguments raised by comment #6 still valid.  I see an easy way to correct this, and will follow up with a better patch.

Comment 14 dwfreed 2013-09-01 09:34:23 UTC

Created attachment 357554 [details, diff]
Don't search for an unused UID on Linux, but still allow suggested UIDs from ebuilds

This is an updated version of attachment #353992 [details, diff].  In addition to the intentions of the original patch, it brings back the functionality of suggesting a UID to use in ebuilds for Linux hosts.  This should address all qualms raised by comment #6 in an effective manner.  It also greatly condenses and removes code duplication of both the original code and the previous patch.

However, it does not solve the specific race condition raised in comment #8.  All of the ebuilds in the tree specify a UID for standardization or hysterical reasons, and there are few chances of collisions.  The only chances come from packages that use UIDs just above 100, and these are high enough to be unlikely to cause a collision, and are even less likely to trigger the race condition and cause the build to fail.  Because this race condition is extremely difficult to hit, and correcting the race condition isn't going to be pretty, at least from my quick look at the enewuser function while working on this fix, I don't believe it's worth the effort to investigate at the present time.  We'll cross that bridge if we get there.

Please note that I have not had an opportunity to test this code.  While it is theoretically sound, it may contain practical bugs.  I offer up Zero_Chaos to test it, since he has a suitable system for doing so.

Comment 15 Rick Farina (Zero_Chaos) gentoo-dev

2013-10-09 04:22:57 UTC

Perhaps a similar issue?

 * Adding group 'fcron' to your system ...
 *  - Groupid: next available
groupadd: cannot lock /etc/group; try again later.
 * ERROR: sys-process/fcron-3.1.1::gentoo failed (setup phase):


The question is, is there a sane way to fix this or is the best we can do just call it until it works?

One the bright side, the useradd doesn't seem to have issues after the patch from dwfreed.

Comment 16 Mike Gilbert gentoo-dev

2013-10-09 04:47:48 UTC

(In reply to Rick Farina (Zero_Chaos) from comment #15)

Looking through the shadow code, it looks like both useradd and groupadd ultimately call lckpwdf(3) to obtain a lock. If a lock is not obtained in 15 seconds, lckpwdf times out and you would get that "cannot lock" message.

I have to wonder how you could be generating enough I/O to delay useradd or groupadd by that amount of time.

Comment 17 Rick Farina (Zero_Chaos) gentoo-dev

2013-10-09 05:07:26 UTC

(In reply to Mike Gilbert from comment #16)

> I have to wonder how you could be generating enough I/O to delay useradd or
> groupadd by that amount of time.

SATA1 hdd, running two catalyst instances in parallel with MAKEOPTS="-j48 -l32" and EMERGE_DEFAULT_OPTS="--jobs=64 --load-average=32"

I am not even close to cpu bound, but my IO is off the charts pegged all the time and everything lives in IOwait

Comment 18 dwfreed 2013-12-19 06:04:50 UTC

Created attachment 365648 [details, diff]
Don't search for an unused UID/GID on Linux, but still allow suggested UIDs/GIDs from ebuilds

This extends attachment #357554 [details, diff] to also apply to groups, and moves the ID handling code into a separate function to reduce code duplication.  While here, I noticed that in enewgroup, CHOST matching openbsd resulted in groupadd being called with -r, which is an undefined flag in OpenBSD, so I adjusted the case statement to group OpenBSD with NetBSD to fix that.

To extend the list of CHOSTs that can automatically pick an unused ID for both users and groups out of the system range, you simply need to modify this one function to match more CHOSTs, and perhaps make small tweaks to how the CHOST-specific useradd- and groupadd-alike commands are called to take advantage of this.

Please note that I have not had an opportunity to test this code.  While it is theoretically sound, it may contain practical bugs.  Somebody using each of the Gentoo Prefix environments supported should probably give it a test to ensure it still Does the Right Thing(TM).

Comment 19 Rick Farina (Zero_Chaos) gentoo-dev

2013-12-19 18:31:56 UTC

Comment on attachment 353992 [details, diff]
possible fix

marking my original fix obsolete as we have better ones on the bug

Comment 20 Mike Gilbert gentoo-dev

2022-12-07 00:01:24 UTC

user.eclass has been removed from the gentoo repo.