232990 – 2.2_rc2 --jobs breaks PMS's invariants

Summary: 2.2_rc2 --jobs breaks PMS's invariants

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	Portage Development
Classification:	Unclassified
Component:	Core
Hardware:	All Linux

Importance:	High normal
Assignee:	Portage team

Reported:	2008-07-26 10:40 UTC by Łukasz Michalik
Modified:	2008-09-07 16:22 UTC
CC List:	4 users

Description Łukasz Michalik 2008-07-26 10:40:06 UTC

PMS 11.4 states that there shall be no variance between pkg_setup and pkg_postinst; however, --jobs allows variance by performing merges of other packages between those phases.

Steps to reproduce:
Prepare ebuilds a, b and c:

a/a-1.ebuild:
SLOT="0"
KEYWORDS="x86"
IUSE=""
pkg_setup() { echo `date` "pkg_setup done for ${PN}" ; }
src_compile() { sleep 10; }
pkg_preinst() { echo `date` "pkg_preinst done for ${PN}" ; }
pkg_postinst() { echo `date` "pkg_postinst entered for ${PN}" ; while true ; do echo "postinstall for ${PN}" ; sleep 30 ; done ; }

b/b-1.ebuild:
SLOT="0"
KEYWORDS="x86"
IUSE=""
pkg_setup() { sleep 10; echo `date` "pkg_setup done for ${PN}"; }
src_compile() { while true ; do echo "src_compile for ${PN}" ; sleep 10 ; done ; }

Package c should DEPEND on both of the above.

Now run `emerge c -a --jobs=2' and observe the build logs. Cutting out irrelevant text and sorting chronologically, one obtains:

Sat Jul 26 14:21:57 CEST 2008 pkg_setup done for a
Sat Jul 26 14:22:06 CEST 2008 pkg_setup done for b
Sat Jul 26 14:22:12 CEST 2008 pkg_preinst done for a
Sat Jul 26 14:22:13 CEST 2008 pkg_postinst entered for a

Comment 1 Zac Medico gentoo-dev 2008-07-26 11:50:07 UTC

(In reply to comment #0)
> PMS 11.4 states that there shall be no variance between pkg_setup and
> pkg_postinst; however, --jobs allows variance by performing merges of
> other packages between those phases.

Is it really necessary for the rule to be so strict?

Comment 2 Ciaran McCreesh 2008-07-26 12:32:20 UTC

(In reply to comment #1)
> Is it really necessary for the rule to be so strict?

Yes, unfortunately it is.

The way around it is to build binary packages in parallel, ensuring that only one pkg_setup is running at a time, and then install them linearly (rerunning pkg_setup as you would for binaries). That way none of the existing behaviours ebuilds expect are changed, and the only things that break are those that already break for binary packages.
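
A minimal sketch of that two-stage approach, for illustration only. The functions build_setup, compile and install_binpkg are hypothetical stand-ins and not part of Portage or PMS; the sketch assumes that "only one pkg_setup at a time" can be modelled by running it in the foreground before each background build.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: every function here is a stand-in, not a Portage API.

work=$(mktemp -d)
log="$work/phases.log"

build_setup()    { echo "build:   pkg_setup $1" >> "$log"; }
compile()        { : > "$work/$1.binpkg"; }   # pretend to compile and package
install_binpkg() {
    # pkg_setup is rerun at install time, as for any binary package.
    for phase in pkg_setup pkg_preinst pkg_postinst; do
        echo "install: $phase $1" >> "$log"
    done
}

# Stage 1: pkg_setup runs in the foreground, so only one is ever active;
# the compile step is sent to the background and overlaps with other builds.
for pkg in a b; do
    build_setup "$pkg"
    compile "$pkg" &
done
wait

# Stage 2: installs are strictly sequential, so no other package is merged
# between one package's pkg_setup and its pkg_postinst.
for pkg in a b; do
    install_binpkg "$pkg"
done

cat "$log"
```

In the resulting log, the three install-time phases of a appear as one uninterrupted run before any install-time phase of b, which is the property the rule is after.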

Comment 3 Zac Medico gentoo-dev 2008-07-26 12:45:56 UTC

Can't you build something like qt and gtk in parallel and not have to worry about it?

Comment 4 Stephen Bennett 2008-07-26 13:12:56 UTC

There are cases when you can do this and nothing too bad will happen, but unfortunately there is no reliable way to know what those cases are. In the general case, then, one has to assume that you can't.

Comment 5 Zac Medico gentoo-dev 2008-07-26 14:23:18 UTC

The algorithm that portage currently uses is to traverse the subgraph of deep dependencies of a given package and build it in parallel only when there are no merges scheduled within that subgraph. How about if we amend PMS 11.4 to allow for this?
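
As a rough illustration of that rule (one possible reading of the description above, not Portage's actual code), the check might look like the sketch below. The dependency graph, the merge_scheduled table and the function can_build_in_parallel are all made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (bash 4+ for associative arrays); not Portage's code.

# Direct dependencies of each package (made-up graph).
declare -A deps=(
    [c]="a b"
    [a]="libfoo"
    [b]=""
    [libfoo]=""
)

# Packages whose merge is scheduled but has not happened yet.
declare -A merge_scheduled=( [libfoo]=1 )

# Return 0 if nothing in the deep dependency subgraph of $1 is awaiting a
# merge, i.e. $1 may start building alongside other jobs under this rule.
can_build_in_parallel() {
    local -A seen=()
    local -a queue=(${deps[$1]:-})
    local i=0 pkg
    while (( i < ${#queue[@]} )); do
        pkg=${queue[i]}
        i=$((i + 1))
        [[ -n ${seen[$pkg]:-} ]] && continue
        seen[$pkg]=1
        [[ -n ${merge_scheduled[$pkg]:-} ]] && return 1
        queue+=(${deps[$pkg]:-})
    done
    return 0
}

for pkg in a b c; do
    if can_build_in_parallel "$pkg"; then
        echo "$pkg: may build now"
    else
        echo "$pkg: must wait for a scheduled merge"
    fi
done
```

With this graph, b may build immediately, while a and c must wait because libfoo, which is in both of their subgraphs, still has a merge scheduled. The check consults declared dependencies only.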

Comment 6 Ciaran McCreesh 2008-07-26 16:23:53 UTC

No go. Consider, for example, two plugins for the same program. They aren't interdependent in any way, yet there's still a parallelism constraint upon them.

PMS is worded that way because it's the laxest set of rules possible that don't impose changes upon ebuild behaviour under parallelism. Any ebuild that breaks under the rules in PMS will also break when used as a binary package; this isn't the case if you start introducing changes to / between pkg_setup and pkg_preinst.

Comment 7 Zac Medico gentoo-dev 2008-08-04 19:25:18 UTC

I think the way portage does it is fine.

Comment 8 Ciaran McCreesh 2008-08-04 19:30:46 UTC

Package a reads something in from ROOT in pkg_setup and sticks a modified version back to ROOT in post_preinst. Package b reads the same thing in from ROOT in pkg_setup and sticks a different modified version back in pkg_postinst. Previously this would work, even with binary packages. Now it won't.

For examples of 'reads something in', consider things calling 'eselect opengl'.
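
The interleaving can be sketched as follows. This is a simplified, hypothetical model: the file etc/selected stands in for whatever shared state lives under ROOT, both packages write back in pkg_postinst, and the phases are replayed in a fixed order instead of racing for real so that the outcome is repeatable.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: "$ROOT/etc/selected" stands in for any shared state
# under ROOT; the phase functions are simplified models, not real ebuilds.

ROOT=$(mktemp -d)
state=$(mktemp -d)      # per-package scratch space, outside ROOT
mkdir -p "$ROOT/etc"

# pkg_setup remembers what the package saw; pkg_postinst writes that
# remembered value back with the package's own change appended.
pkg_setup()    { cp "$ROOT/etc/selected" "$state/$1.seen"; }
pkg_postinst() { echo "$(cat "$state/$1.seen")+$1" > "$ROOT/etc/selected"; }

# Replay "phase package" pairs in the given order against a fresh file,
# then print the final contents.
run() {
    echo "original" > "$ROOT/etc/selected"
    while [ $# -gt 0 ]; do
        "$1" "$2"
        shift 2
    done
    cat "$ROOT/etc/selected"
}

serial=$(run pkg_setup a pkg_postinst a pkg_setup b pkg_postinst b)
interleaved=$(run pkg_setup a pkg_setup b pkg_postinst a pkg_postinst b)

echo "serial:      $serial"        # original+a+b
echo "interleaved: $interleaved"   # original+b
```

In the interleaved order, b's pkg_setup reads the state before a has written its change, so b's pkg_postinst overwrites it and a's modification is lost.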

Comment 9 Zac Medico gentoo-dev 2008-08-05 03:12:41 UTC

We haven't had any reports of problems caused by portage's current behavior.

Comment 10 Ciaran McCreesh 2008-08-05 03:19:33 UTC

So? People haven't happened to have run something that, on their particular system under whatever load it is under, triggers the race yet. Or if they have, they've rerun the build and seen it mysteriously work, and gone no further.

We're talking race conditions here, which means an annoying source of inconsistent, difficult to reproduce, very weird breakages. You don't fix them when someone comes across a consistent, reproducible problem. You design so they can't happen.

Comment 11 Zac Medico gentoo-dev 2008-08-05 03:36:08 UTC

Well, unless this is observable in practice then I don't care.

Comment 12 Ciaran McCreesh 2008-08-05 03:37:37 UTC

It is. It's observable in practice by very weird, very hard to reproduce bugs.

Comment 13 Zac Medico gentoo-dev 2008-08-05 03:47:48 UTC

I believe that the risk is negligible.

Comment 14 Ciaran McCreesh 2008-08-05 03:54:46 UTC

Know all those parallel make bugs that keep on biting people? Know how much of a pain in the arse they are? Exactly the same issue, except that the impact of a failure can be a lot worse than a broken build.

Comment 15 Zac Medico gentoo-dev 2008-08-05 04:02:16 UTC

I still haven't seen any hard evidence to support your claims.

Comment 16 Ciaran McCreesh 2008-08-05 04:07:31 UTC

Comment #8 contains a full description of how there's a race condition. Which part of it don't you understand?

Comment 17 Zac Medico gentoo-dev 2008-08-05 04:14:18 UTC

Like I already said, I believe the risk is negligible.

Comment 18 Ciaran McCreesh 2008-08-05 04:21:18 UTC

You can believe whatever you want, but that doesn't make it true. Or do you believe that praying that no-one will get hit by the bugs that you know exist is the correct way to design software?

In any case, please update the documentation to say "There is a chance that using --jobs will completely break your system beyond any possibility of repair. The Portage authors believe it is a very small chance, and that everyone is lucky, so this probably won't happen."

Comment 19 Zac Medico gentoo-dev 2008-08-05 04:33:24 UTC

You can believe whatever you want but that doesn't make it true.

Comment 20 Ciaran McCreesh 2008-08-05 04:37:23 UTC

I already explained to you how it's broken. You agree that it's broken, but think that the chances of anyone actually noticing the breakage are very small. So why not document that it's broken but that you think people will probably get away with it?

Or better yet, why not just fix the thing? It's a simple change.

Comment 21 Kevin Bowling 2008-09-07 06:00:50 UTC

Zac, are you serious?

Ciaran points out a valid case where Portage can fail. Just because you don't want to fix the bug or don't know how, however small the chance of triggering it, doesn't mean it doesn't exist.

Worksforme is an improper resolution.

Comment 22 Zac Medico gentoo-dev 2008-09-07 06:16:31 UTC

I contend that any problems can and should be fixed at the ebuild level.

Comment 23 Ciaran McCreesh 2008-09-07 13:10:03 UTC

So how does an ebuild specify that its pkg_setup has to be run invariantly with pkg_preinst, and who is going to go through and add that to every ebuild that hasn't been proven to be safe?

Comment 24 Zac Medico gentoo-dev 2008-09-07 15:53:11 UTC

The invariance is a natural consequence of dependency handling as mentioned in comment #5. If a relevant variance occurs then it is due to an unspecified dependency. If and when such a case is discovered, an ebuild or one of the ebuilds in the subgraph of its deep dependencies needs to be updated to specify the missing dependency.

Note that problems with invariance can still occur due to unspecified dependencies, even if PMS 11.4 is strictly adhered to. So, adhering to PMS 11.4 doesn't really solve the root problem. It would only serve to hide some cases of variance and not others.

In practice, the current approach used by portage (comment #5) has proven to be quite reliable. Given that there may still be variance issues even if PMS 11.4 is strictly adhered to, I see no reason not to continue using the approach that portage currently uses.

Comment 25 Ciaran McCreesh 2008-09-07 16:00:47 UTC

We established in comment #6 that what you said in comment #5 is nonsense. And we're not aiming for 'quite reliable' here, we're aiming for 'doesn't break things'. Whilst sticking to what PMS allows doesn't quite guarantee that, not sticking to PMS is a lot worse.

Comment 26 Zac Medico gentoo-dev 2008-09-07 16:21:03 UTC

(In reply to comment #25)
> We established in comment #6 that what you said in comment #5 is nonsense.

What you've established in comment #6 is that some ebuilds may interact with the live file system in a way that is not compatible with the approach mentioned in comment #5. I'd like to see some specific examples of this so that we can make an attempt to fix them. If they can't be fixed for some reason, then we may have to restrict parallelization for those specific ebuilds. It doesn't seem worthwhile to adhere to PMS 11.4 when, as established in comment #24, it's not a cure-all as it only serves as a workaround for some poorly behaved ebuilds and won't help for some other poorly behaved ebuilds.

(In reply to comment #25)
> And
> we're not aiming for 'quite reliable' here, we're aiming for 'doesn't break
> things'. Whilst sticking to what PMS allows doesn't quite guarantee that, not
> sticking to PMS is a lot worse.

In practice the current approach used by portage has been shown to be 'a lot worse'. I appreciate that you're aiming to avoid breakage but I think you've missed the mark and chosen a sub-optimal solution.

Comment 27 Zac Medico gentoo-dev 2008-09-07 16:22:54 UTC

(In reply to comment #26)
> In practice the current approach used by portage has been shown to be 'a lot
> worse'. I appreciate that you're aiming to avoid breakage but I think you've
> missed the mark and chosen a sub-optimal solution.

s/has been/has not been/