Bug 296710

Summary:	equery called in global scope of ebuild triggers fork bomb
Product:	Portage Development	Reporter:	Till Korten <webmaster>
Component:	Core - Ebuild Support	Assignee:	Portage team <dev-portage>
Status:	RESOLVED FIXED
Severity:	critical	CC:	che
Priority:	High	Keywords:	InVCS
Version:	unspecified
Hardware:	x86
OS:	Linux
Whiteboard:
Package list:		Runtime testing required:	---
Bug Depends on:
Bug Blocks:	288499
Attachments:	output of emerge --info; the troublesome ebuild; file required for the ebuild (5 attachments); here is the manifest with which the problem occurs

Description Till Korten 2009-12-13 11:47:01 UTC

I run "emerge --sync && emerge -uDN world" every two days via cron.
Approximately once a week (usually on Saturday or Sunday), portage eats up all memory while calculating dependencies.
This crashes the whole machine, requiring a hard reset!

I have narrowed down the problem to one ebuild which I wrote myself (see attachment).
Please note that I do not want to debug that particular ebuild here. However, I think that no ebuild should be able to cause portage to crash the whole machine, so I am reporting the bug here.

Some more details:
After the machine crashes, trying to emerge world or that particular ebuild, or trying to digest that ebuild, will again cause portage to eat up memory very quickly (~20 MB/s) and eventually crash the machine unless the process is aborted.

Once I delete the ebuild's Manifest and do a digest (ebuild xxx.ebuild digest), everything is fine for another week, until the problem occurs again.

Reproducible: Sometimes

Steps to Reproduce:
1. Digest that troublesome ebuild
2. Emerge the ebuild
3. Run emerge --sync && emerge -uDN world every two days
4. Wait one week

Actual Results:  
portage eats up all memory until the machine crashes

Expected Results:  
portage should realize that something is wrong with the ebuild, throw an error and abort without crashing the machine.

emerge --info is attached

Comment 1 Till Korten 2009-12-13 11:52:02 UTC

Created attachment 212861 [details]
output of emerge --info

Comment 2 Till Korten 2009-12-13 12:00:44 UTC

Created attachment 212863 [details]
the troublesome ebuild

Please note again, I don't want to debug that ebuild here. This ebuild is merely to demonstrate the problem. However, it will of course probably be necessary to figure out what is wrong with that ebuild in order to fix portage's reaction to that error.

Comment 3 Till Korten 2009-12-13 12:01:09 UTC

Created attachment 212864 [details]
file required for the ebuild

Comment 4 Till Korten 2009-12-13 12:01:25 UTC

Created attachment 212865 [details]
file required for the ebuild

Comment 5 Till Korten 2009-12-13 12:02:11 UTC

Created attachment 212867 [details]
file required for the ebuild

Comment 6 Till Korten 2009-12-13 12:02:27 UTC

Created attachment 212868 [details]
file required for the ebuild

Comment 7 Till Korten 2009-12-13 12:02:41 UTC

Created attachment 212870 [details]
file required for the ebuild

Comment 8 Till Korten 2009-12-13 12:04:37 UTC

p.s. The ebuild was in my portage overlay (/usr/local/portage/app-backup/backuppc)

Comment 9 Zac Medico gentoo-dev 2009-12-13 23:13:16 UTC

If you can reproduce the problem, please try to get a backtrace with the SIGUSR1 procedure described in bug 266853, comment #3.

Comment 10 Till Korten 2009-12-14 07:15:37 UTC

this time it happened trying to digest the ebuild, so I killed the ebuild.sh command as described

however, the result was not very informative (as far as I can tell):
Sandboxed process killed by signal: User defined signal 1
close failed in file object destructor:
Error in sys.excepthook:

Original exception was:
close failed in file object destructor:
Error in sys.excepthook:

the last three lines were repeated several times (20 x or so)...

Comment 11 Till Korten 2009-12-14 07:20:41 UTC

now, I tried again and this time killall did not help, I had to do ^C and that gave me a traceback:

Sandboxed process killed by signal: User defined signal 1
^CTraceback (most recent call last):
  File "/usr/bin/equery", line 25, in <module>
    import gentoolkit
  File "/usr/lib/gentoolkit/pym/gentoolkit/__init__.py", line 46, in <module>
    settings = portage.config(clone=portage.settings)
  File "/usr/lib/portage/pym/portage/__init__.py", line 1223, in __init__
    if clone:
  File "/usr/lib/portage/pym/portage/proxy/objectproxy.py", line 66, in __nonzero__
    return bool(object.__getattribute__(self, '_get_target')())
  File "/usr/lib/portage/pym/portage/__init__.py", line 8254, in _get_target
    init_legacy_globals()
  File "/usr/lib/portage/pym/portage/__init__.py", line 8339, in init_legacy_globals
    db = create_trees(**kwargs)
  File "/usr/lib/portage/pym/portage/__init__.py", line 8216, in create_trees
    config_incrementals=portage.const.INCREMENTALS)
  File "/usr/lib/portage/pym/portage/__init__.py", line 1503, in __init__
    expand=expand_map)
  File "/usr/lib/portage/pym/portage/util.py", line 483, in getconfig
    mykeys[key] = varexpand(val, expand_map)
  File "/usr/lib/portage/pym/portage/util.py", line 512, in varexpand
    while (pos<len(mystring)):
KeyboardInterrupt
close failed in file object destructor:
Error in sys.excepthook:

Comment 12 Till Korten 2009-12-14 07:40:36 UTC

In top I see a lot of python 2.6 processes popping up, each taking about 20MB memory and 2.5 seconds processing time.

and I just remembered that the problem first appeared after the upgrade to python 2.6.

I hope this helps...

Comment 13 Zac Medico gentoo-dev 2009-12-14 07:41:52 UTC

(In reply to comment #10)
> this time it happened trying to digest the ebuild, so I killed the ebuild.sh
> command as described

The SIGUSR1 thing only works for the python ebuild program. If the problem is in ebuild.sh then --debug option is going to be more helpful since that will give us a trace of what's happening in the bash interpreter.

(In reply to comment #11)
>   File "/usr/lib/portage/pym/portage/__init__.py", line 1503, in __init__
>     expand=expand_map)
>   File "/usr/lib/portage/pym/portage/util.py", line 483, in getconfig
>     mykeys[key] = varexpand(val, expand_map)
>   File "/usr/lib/portage/pym/portage/util.py", line 512, in varexpand
>     while (pos<len(mystring)):

Here it was reading the content of /etc/profile.env. I checked that loop and it doesn't seem possible for the loop to be infinite. Is there anything strange about your /etc/profile.env? Is its size reasonable? Sometimes filesystem corruption can make files appear to be abnormally large and cause problems for programs that try to read their entire content.

Comment 14 Till Korten 2009-12-14 07:42:59 UTC

Created attachment 212961 [details]
here is the manifest with which the problem occurs

Comment 15 Till Korten 2009-12-14 08:02:51 UTC

Unfortunately, I was too quick and deleted the Manifest.
After I delete the Manifest, the digest works. The Manifest hasn't changed, though (I compared it with the version I uploaded). I also tried to change the timestamp of the Manifest to an older time using touch, but that did not have an effect, either.

Could it be that portage caches the ebuilds somewhere and the problem is in the cached ebuild?

Comment 16 Till Korten 2009-12-14 08:06:27 UTC

my /etc/profile.env is 1.6kB... (as reported by both ls and stat). this should be correct, right?

Since the problem seemed to occur every week, especially on the weekends, I am going to check whether there is something in my crontab that could cause a race condition...

Comment 17 Zac Medico gentoo-dev 2009-12-14 08:13:07 UTC

(In reply to comment #16)
> my /etc/profile.env is 1.6kB... (as reported by both ls and stat). this should
> be correct, right?

That's normal.

Comment 18 Till Korten 2009-12-14 08:58:05 UTC

I checked the crontab, but there is nothing happening within 2 hours of the emerge --sync && emerge -uDN world

so the checking of dependencies should be long done before any other process is started...

Comment 19 Zac Medico gentoo-dev 2009-12-15 07:04:21 UTC

(In reply to comment #12)
> In top I see a lot of python 2.6 processes popping up, each taking about 20MB
> memory and 2.5 seconds processing time.

That might be FEATURES=parallel-fetch. They shouldn't be a problem because only one is running at a given time.

OTOH, if you have lots of identical processes that are spawned simultaneously then you may be experiencing some kind of accidental forkbomb.

Comment 20 Till Korten 2009-12-15 14:38:46 UTC

> 
> That might be FEATURES=parallel-fetch. They shouldn't be a problem because only
> one is running at a given time.
> 

No, this was during "ebuild xxx.ebuild digest" of a single ebuild (the files of which were already downloaded on the system before), so definitely no parallel fetching!!!

Comment 21 Fabian Groffen gentoo-dev 2009-12-15 14:44:55 UTC

can it be that an endless loop occurs because equery is run in global scope requesting the slot of the same package, causing the metadata to be generated for the package, running equery in global scope, etc etc?
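
As a rough illustration of the loop suggested here, the following toy shell model mimics the chain "generate metadata, run the query tool from global scope, generate metadata again". The function names, the counter and the depth cap are all hypothetical stand-ins; none of this is Portage or gentoolkit code, and the cap exists only so that the demo terminates.

```shell
# Toy model of the suspected loop. All names are hypothetical stand-ins,
# not Portage or gentoolkit internals.
MAX_DEPTH=5   # artificial cap so this demo terminates; the real loop has none
spawned=0     # how many nested "query" processes would have been started

generate_metadata() {
    # $1 is the current nesting depth.
    if [ "$1" -ge "$MAX_DEPTH" ]; then
        return 0
    fi
    # Sourcing the ebuild runs its global-scope code, which calls the query tool.
    spawned=$(( spawned + 1 ))
    query_tool "$(( $1 + 1 ))"
}

query_tool() {
    # The query tool needs the package's metadata, so it asks for it again.
    generate_metadata "$1"
}

generate_metadata 0
echo "nested queries started before the cap: ${spawned}"
```

Without the cap, every level would start another process that never finishes, which would match the growing pile of python processes described in comment #12.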

Comment 22 Till Korten 2009-12-15 16:31:38 UTC

(In reply to comment #21)
> can it be that an endless loop occurs because equery is run in global scope
> requesting the slot of the same package, causing the metadata to be generated
> for the package, running equery in global scope, etc etc?
> 
Something like that. However, as described, this does not happen every time, especially not right after creating the Manifest or installing the ebuild. At some point, though, something outside of the ebuild gets screwed up. After this event (which happens approximately every week on my machine), every time ebuild.sh or emerge touches the ebuild, the memory leak occurs.

I suspected that the part that gets screwed up is this file:
/var/cache/edb/dep/usr/local/portage/app-backup/backuppc-3.1.0

And lo and behold, deleting that file invariably triggers the problem.
Also, deleting the last line of that file, "_mtime_=1254015152",
or any of its characters will trigger the problem.

so I suspect that the following things happen: at some point the file 
/var/cache/edb/dep/usr/local/portage/app-backup/backuppc-3.1.0
gets truncated, corrupted or deleted (or not recreated fast enough). 

Then some bug in the ebuild triggers a loop (maybe the suggested equery loop?) which eats up all memory and crashes the system.

so as a fix, I suggest some consistency check, that checks if the cache file is correct or recreates it.
maybe the check could be if the _mtime_= line is there AND if a newline is at the end of the file.

Comment 23 Zac Medico gentoo-dev 2009-12-16 03:49:16 UTC

Calling equery in global scope like that can't be supported. It's impossible.

Comment 24 Zac Medico gentoo-dev 2009-12-16 03:54:16 UTC

What you are doing with equery looks strange. Anyway, you might be able to use the portageq command to do what you want without triggering a recursive forkbomb.

Comment 25 Zac Medico gentoo-dev 2009-12-16 04:03:15 UTC

Actually, we can fix this by using a lockfile. So then you'll get a deadlock instead of a forkbomb.
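
A minimal sketch of the lock idea, under stated assumptions: it uses mkdir as an atomic lock and, unlike the blocking lockfile described in this comment, it refuses a nested run instead of waiting, so the result is an error rather than a deadlock. The lock path and the function name are hypothetical and do not correspond to anything in Portage.

```shell
# Hypothetical lock location; a real tool would pick a per-package path.
LOCKDIR="${TMPDIR:-/tmp}/metadata-demo.$$.lock"

generate_metadata() {
    # mkdir either creates the directory or fails, atomically.
    if ! mkdir "$LOCKDIR" 2>/dev/null; then
        echo "metadata generation already in progress, refusing to nest" >&2
        return 1
    fi
    status=0
    # Run the work passed as arguments; a nested call would end up here.
    "$@" || status=$?
    rmdir "$LOCKDIR"
    return "$status"
}

# The outer run holds the lock, so the nested run is refused.
if ! generate_metadata generate_metadata true; then
    echo "nested run was refused"
fi
```

Failing fast is a design choice of this sketch only; a blocking lock, as proposed in this comment, would make the nested process wait forever on its own ancestor.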

Comment 26 Zac Medico gentoo-dev 2009-12-16 04:05:10 UTC

(In reply to comment #25)
> Actually, we can fix this by using a lockfile. So then you'll get a deadlock
> instead of a forkbomb.

That said, it's probably not worth the trouble since it's very unlikely that anyone else will ever trigger this sort of thing.

Comment 27 Till Korten 2009-12-16 08:02:15 UTC

(In reply to comment #26)
> (In reply to comment #25)
> > Actually, we can fix this by using a lockfile. So then you'll get a deadlock
> > instead of a forkbomb.
> 
> That said, it's probably not worth the trouble since it's very unlikely that
> anyone else will ever trigger this sort of thing.
> 

Actually, I think this is a big security issue, since one single (purposefully or accidentally) tainted ebuild could cause EVERY gentoo machine out there to crash!

That said, what would be the correct way to make the ebuild install into the same slot as the previous installation?

Comment 28 Zac Medico gentoo-dev 2009-12-16 08:55:51 UTC

(In reply to comment #27)
> Actually, I think this is a big security issue, since one single (purposefully
> or accidentally) tainted ebuild could cause EVERY gentoo machine out there to
> crash!

Fork bombs are a well known issue that can happen whenever any kind of code is executed. If somebody is able to taint your code then they can do practically anything they want, and it's your responsibility to avoid tainted code. The only thing portage can do to help you in this area would be to check digital signatures in order to establish trust in the code being executed.

> That said, what would be the correct way to make the ebuild install into the
> same slot as the previous installation?

Set SLOT=0 inside the ebuild and leave it that way. The SLOT is supposed to remain constant for a given ebuild since the value is cached (among other reasons). We do have some "multislot" ebuilds that modify SLOT in unsanctioned ways, but it causes problems and therefore isn't recommended. We may add official support for dynamic SLOTs in some future EAPI (see bug 174407), but we haven't done it yet.

Comment 29 Till Korten 2009-12-16 10:15:01 UTC

> Fork bombs are a well known issue that can happen whenever any kind of code is
> executed. If somebody is able to taint your code then they can do practically
> anything they want, and it's your responsibility to avoid tainted code. The
> only thing portage can do to help you in this area would be to check digital
> signatures in order to establish trust in the code being executed.
> 
That makes total sense...

Do I understand this issue correctly? Will any use of equery within an ebuild cause a forkbomb, or is it just the particular way I used it?

To me it does not seem so completely absurd, to use equery within an ebuild to find out how a previously installed version is configured (obviously, because I used that method ;-).

Therefore, if using equery within an ebuild can generally cause a forkbomb, I suggest that this is prevented by portage, either by preventing ebuilds from using equery or by preventing equery from forking indefinitely. Both ways should throw some kind of error message saying that there is an error in the ebuild.

However, if it is just the special way that I used equery, I think it is safe to close this bug as wontfix.

> Set SLOT=0 inside the ebuild an leave it that way.

Will do that. Thanks for the advice!

Comment 30 Zac Medico gentoo-dev 2009-12-16 10:32:13 UTC

(In reply to comment #29)
> Do I understand this issue correctly? Any use of equery within an ebuild will
> cause a forkbomb. Or is it just the particular way I used it.

It will only cause a forkbomb when used in global scope. Generally, it's poor practice to spawn any program in global scope because it's relatively slow and we want to be able to source the ebuild as quickly as possible. We already have interceptor functions in place which are used to detect calls to common programs, but we can't cover everything.

> To me it does not seem so completely absurd, to use equery within an ebuild to
> find out how a previously installed version is configured (obviously, because I
> used that method ;-).

Typically, the has_version function is sufficient for this (although it shouldn't be called in global scope). With EAPI 2, has_version works with atoms containing USE dependencies, so you can use it to check which USE flags are enabled on the installed version of a package (has_version is documented in `man 5 ebuild`).
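
A hedged sketch of what that could look like in an ebuild: global scope holds only plain assignments, and the query about the installed package moves into a phase function. The USE flag name "someflag" is hypothetical, and the two stand-in definitions at the top exist only so the fragment can be run outside Portage, where has_version and einfo are not available.

```shell
# Fragment in the style of an ebuild; the flag name "someflag" is hypothetical.

# Stand-ins so the fragment also runs outside Portage. Inside a real ebuild
# environment, has_version and einfo are provided by Portage itself.
command -v has_version >/dev/null 2>&1 || has_version() { return 1; }
command -v einfo >/dev/null 2>&1 || einfo() { echo " * $*"; }

# Global scope: plain assignments only, no external programs.
EAPI=2
SLOT="0"

# Queries about installed packages belong in a phase function.
pkg_setup() {
    if has_version 'app-backup/backuppc[someflag]'; then
        einfo "Installed backuppc has USE=someflag enabled"
    else
        einfo "No installed backuppc with USE=someflag found"
    fi
}
```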

> Therefore, if using equery within an ebuild can generally cause a forkbomb, I
> suggest that this is prevented by portage. Either by preventing ebuilds to use
> equery, or by preventing equery to fork indefinitely both ways should throw
> some kind of error message saying that there is an error in the ebuild.
> 
> However, if it is just the special way that I used equery, I think it is safe
> to close this bug as wontfix.

In svn r15098 I've added an interceptor for equery so that it generates a QA warning when called in global scope.
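
For readers unfamiliar with the technique, here is a minimal sketch of how such an interceptor can work in bash. It is not the code from r15098: the variable IN_GLOBAL_SCOPE and the exact wording of the message are assumptions made for illustration. The idea is that a shell function shadows the program of the same name, so a call from global scope prints a warning and returns without spawning anything.

```shell
# Hypothetical marker for "the ebuild is currently being sourced".
IN_GLOBAL_SCOPE=1

# A shell function takes precedence over a program of the same name on PATH.
equery() {
    if [ "${IN_GLOBAL_SCOPE}" = 1 ]; then
        echo "QA Notice: 'equery' called in global scope" >&2
        return 1
    fi
    # Outside global scope, fall through to the real program.
    command equery "$@"
}

# A global-scope call now hits the function instead of starting a process.
installed=$(equery list app-backup/backuppc 2>/dev/null) || installed=""
```

Once sourcing is finished, the marker would be cleared (or the function removed with `unset -f equery`) so that phase functions can still reach the real program.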

Comment 31 Zac Medico gentoo-dev 2009-12-17 04:35:59 UTC

This is fixed in 2.1.7.15 and 2.2_rc60.

Comment 32 Till Korten 2009-12-17 09:28:42 UTC

Great! Thanks a lot.