Summary: | equery called in global scope of ebuild triggers fork bomb | ||
---|---|---|---|
Product: | Portage Development | Reporter: | Till Korten <webmaster> |
Component: | Core - Ebuild Support | Assignee: | Portage team <dev-portage> |
Status: | RESOLVED FIXED | ||
Severity: | critical | CC: | che |
Priority: | High | Keywords: | InVCS |
Version: | unspecified | ||
Hardware: | x86 | ||
OS: | Linux | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Bug Depends on: | |||
Bug Blocks: | 288499 | ||
Attachments: |
output of emerge --info
the troublesome ebuild
file required for the ebuild
file required for the ebuild
file required for the ebuild
file required for the ebuild
file required for the ebuild
here is the manifest with which the problem occurs |
Description
Till Korten
2009-12-13 11:47:01 UTC
Created attachment 212861 [details]
output of emerge --info
Created attachment 212863 [details]
the troublesome ebuild
Please note again: I don't want to debug that ebuild here. The ebuild is merely meant to demonstrate the problem. However, it will probably be necessary to figure out what is wrong with the ebuild in order to fix Portage's reaction to that error.
Created attachment 212864 [details]
file required for the ebuild
Created attachment 212865 [details]
file required for the ebuild
Created attachment 212867 [details]
file required for the ebuild
Created attachment 212868 [details]
file required for the ebuild
Created attachment 212870 [details]
file required for the ebuild
p.s. The ebuild was in my portage overlay (/usr/local/portage/app-backup/backuppc)

If you can reproduce the problem, please try to get a backtrace with the SIGUSR1 procedure described in bug 266853, comment #3.

This time it happened while trying to digest the ebuild, so I killed the ebuild.sh command as described. However, the result was not very informative (as far as I can tell):

Sandboxed process killed by signal: User defined signal 1
close failed in file object destructor:
Error in sys.excepthook:
Original exception was:
close failed in file object destructor:
Error in sys.excepthook:

The last three lines were repeated several times (20 times or so)...

Now I tried again, and this time killall did not help. I had to press ^C, and that gave me a traceback:

Sandboxed process killed by signal: User defined signal 1
^CTraceback (most recent call last):
  File "/usr/bin/equery", line 25, in <module>
    import gentoolkit
  File "/usr/lib/gentoolkit/pym/gentoolkit/__init__.py", line 46, in <module>
    settings = portage.config(clone=portage.settings)
  File "/usr/lib/portage/pym/portage/__init__.py", line 1223, in __init__
daffy backuppc # if clone:
  File "/usr/lib/portage/pym/portage/proxy/objectproxy.py", line 66, in __nonzero__
    return bool(object.__getattribute__(self, '_get_target')())
  File "/usr/lib/portage/pym/portage/__init__.py", line 8254, in _get_target
    init_legacy_globals()
  File "/usr/lib/portage/pym/portage/__init__.py", line 8339, in init_legacy_globals
    db = create_trees(**kwargs)
  File "/usr/lib/portage/pym/portage/__init__.py", line 8216, in create_trees
    config_incrementals=portage.const.INCREMENTALS)
  File "/usr/lib/portage/pym/portage/__init__.py", line 1503, in __init__
    expand=expand_map)
  File "/usr/lib/portage/pym/portage/util.py", line 483, in getconfig
    mykeys[key] = varexpand(val, expand_map)
  File "/usr/lib/portage/pym/portage/util.py", line 512, in varexpand
    while (pos<len(mystring)):
KeyboardInterrupt
close failed in file object destructor:
Error in sys.excepthook:

In top I see a lot of python 2.6 processes popping up, each taking about 20MB memory and 2.5 seconds processing time. And I just remembered that the problem first appeared after the upgrade to python 2.6. I hope this helps...

(In reply to comment #10)
> this time it happened trying to digest the ebuild, so I killed the ebuild.sh
> command as described

The SIGUSR1 thing only works for the python ebuild program. If the problem is in ebuild.sh then the --debug option is going to be more helpful, since that will give us a trace of what's happening in the bash interpreter.

(In reply to comment #11)
> File "/usr/lib/portage/pym/portage/__init__.py", line 1503, in __init__
> expand=expand_map)
> File "/usr/lib/portage/pym/portage/util.py", line 483, in getconfig
> mykeys[key] = varexpand(val, expand_map)
> File "/usr/lib/portage/pym/portage/util.py", line 512, in varexpand
> while (pos<len(mystring)):

Here it was reading the content of /etc/profile.env. I checked that loop and it doesn't seem possible for the loop to be infinite. Is there anything strange about your /etc/profile.env? Is its size reasonable? Sometimes filesystem corruption can make files appear to be abnormally large and cause problems for programs that try to read their entire content.

Created attachment 212961 [details]
here is the manifest with which the problem occurs
Unfortunately, I was too quick and deleted the Manifest. After I delete the manifest, the digest works. The Manifest hasn't changed, though (I compared it with the version I uploaded). I also tried to change the timestamp of the Manifest to an older time using touch but that did not have an effect, either. Could it be that portage caches the ebuilds somewhere and the problem is in the cached ebuild?

my /etc/profile.env is 1.6kB... (as reported by both ls and stat). this should be correct, right?

Since the problem seemed to occur every week, especially on the weekends, I am going to check whether there is something in my crontab that could cause a race condition...

(In reply to comment #16)
> my /etc/profile.env is 1.6kB... (as reported by both ls and stat). this should
> be correct, right?

That's normal.

I checked the crontab, but there is nothing happening within 2 hours of the emerge --sync && emerge -uDN world so the checking of dependencies should be long done before any other process is started...

(In reply to comment #12)
> In top I see a lot of python 2.6 processes popping up, each taking about 20MB
> memory and 2.5 seconds processing time.

That might be FEATURES=parallel-fetch. They shouldn't be a problem because only one is running at a given time. OTOH, if you have lots of identical processes that are spawned simultaneously then you may be experiencing some kind of accidental forkbomb.

>
> That might be FEATURES=parallel-fetch. They shouldn't be a problem because only
> one is running at a given time.
>
No, this was during "ebuild xxx.ebuild digest" of a single ebuild (the files of which were already downloaded on the system before), so definitely no parallel fetching!!!
Can it be that an endless loop occurs because equery is run in global scope requesting the slot of the same package, causing the metadata to be generated for the package, running equery in global scope, etc etc?

(In reply to comment #21)
> can it be that an endless loop occurs because equery is run in global scope
> requesting the slot of the same package, causing the metadata to be generated
> for the package, running equery in global scope, etc etc?
>

Something like that. However, as described, this does not happen every time, and especially not initially after creating the Manifest or installing the ebuild. At some point, though, something outside of the ebuild gets screwed up. After this event (which happens approx every week on my machine), every time ebuild.sh or emerge touches the ebuild, the memory leak occurs. I suspected that the part that gets screwed up is this file:

/var/cache/edb/dep/usr/local/portage/app-backup/backuppc-3.1.0

And lo and behold, deleting that file invariably triggers the problem. Deleting the last line of that file, "_mtime_=1254015152", or any of its characters also triggers the problem.

So I suspect that the following things happen: at some point the file /var/cache/edb/dep/usr/local/portage/app-backup/backuppc-3.1.0 gets truncated, corrupted or deleted (or not recreated fast enough). Then some bug in the ebuild triggers a loop (maybe the suggested equery loop?) which eats up all memory and crashes the system.

So as a fix, I suggest some consistency check that verifies that the cache file is correct and otherwise recreates it. Maybe the check could be whether the _mtime_= line is there AND whether a newline is at the end of the file.

Calling equery in global scope like that can't be supported. It's impossible.

What you are doing with equery looks strange. Anyway, you might be able to use the portageq command to do what you want without triggering a recursive forkbomb.

Actually, we can fix this by using a lockfile.
So then you'll get a deadlock instead of a forkbomb.

(In reply to comment #25)
> Actually, we can fix this by using a lockfile. So then you'll get a deadlock
> instead of a forkbomb.

That said, it's probably not worth the trouble since it's very unlikely that anyone else will ever trigger this sort of thing.

(In reply to comment #26)
> (In reply to comment #25)
> > Actually, we can fix this by using a lockfile. So then you'll get a deadlock
> > instead of a forkbomb.
>
> That said, it's probably not worth the trouble since it's very unlikely that
> anyone else will ever trigger this sort of thing.
>

Actually, I think this is a big security issue, since one single (purposefully or accidentally) tainted ebuild could cause EVERY gentoo machine out there to crash!

That said, what would be the correct way to make the ebuild install into the same slot as the previous installation?

(In reply to comment #27)
> Actually, I think this is a big security issue, since one single (purposefully
> or accidentally) tainted ebuild could cause EVERY gentoo machine out there to
> crash!

Fork bombs are a well known issue that can happen whenever any kind of code is executed. If somebody is able to taint your code then they can do practically anything they want, and it's your responsibility to avoid tainted code. The only thing portage can do to help you in this area would be to check digital signatures in order to establish trust in the code being executed.

> That said, what would be the correct way to make the ebuild install into the
> same slot as the previous installation?

Set SLOT=0 inside the ebuild and leave it that way. The SLOT is supposed to remain constant for a given ebuild since the value is cached (among other reasons). We do have some "multislot" ebuilds that modify SLOT in unsanctioned ways, but it causes problems and therefore isn't recommended. We may add official support for dynamic SLOTs in some future EAPI (see bug 174407), but we haven't done it yet.
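
The recursion suspected in comment #21, and the reporter's observation that it only starts once the cache entry is missing or damaged, can be illustrated with a toy model. This is only a sketch under assumptions: it does not call Portage or equery, every function and variable name is hypothetical, and the depth cap exists solely so that the model terminates (the loop described in this report had no such cap).

```bash
#!/usr/bin/env bash
# Toy model only: nothing here calls Portage or equery.
# All names below are hypothetical stand-ins.

MAX_DEPTH=5     # artificial cap so the model terminates
cache_valid=no  # "yes" models an intact metadata cache entry
spawned=0       # how many times the ebuild's global scope was evaluated

# Stand-in for sourcing the ebuild: its global scope runs a query tool.
source_ebuild() {
    local depth=$1
    spawned=$((spawned + 1))
    query_slot "$depth"
}

# Stand-in for the query tool asking for the SLOT of the same package.
query_slot() {
    local depth=$1
    if [ "$cache_valid" = yes ]; then
        return 0    # answered from the cache, nothing is re-sourced
    fi
    if [ "$depth" -ge "$MAX_DEPTH" ]; then
        echo "giving up at depth $depth" >&2
        return 1
    fi
    # No usable cache entry: metadata must be regenerated, which means
    # sourcing the ebuild again, which runs the query again.
    source_ebuild $((depth + 1))
}
```

With `cache_valid=yes`, calling `source_ebuild 0` evaluates the global scope once and returns. With `cache_valid=no`, every evaluation triggers another one until the cap is reached, which is the shape of the process pile-up seen in top.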
> Fork bombs are a well known issue that can happen whenever any kind of code is
> executed. If somebody is able to taint your code then they can do practically
> anything they want, and it's your responsibility to avoid tainted code. The
> only thing portage can do to help you in this area would be to check digital
> signatures in order to establish trust in the code being executed.
>

That makes total sense...

Do I understand this issue correctly? Will any use of equery within an ebuild cause a forkbomb, or is it just the particular way I used it?

To me it does not seem completely absurd to use equery within an ebuild to find out how a previously installed version is configured (obviously, because I used that method ;-).

Therefore, if using equery within an ebuild can generally cause a forkbomb, I suggest that portage prevent this, either by preventing ebuilds from using equery or by preventing equery from forking indefinitely. Both ways should throw some kind of error message saying that there is an error in the ebuild.

However, if it is just the special way that I used equery, I think it is safe to close this bug as wontfix.

> Set SLOT=0 inside the ebuild an leave it that way.

Will do that. Thanks for the advice!

(In reply to comment #29)
> Do I understand this issue correctly? Any use of equery within an ebuild will
> cause a forkbomb. Or is it just the particular way I used it.

It will only cause a forkbomb when used in global scope. Generally, it's poor practice to spawn any program in global scope because it's relatively slow and we want to be able to source the ebuild as quickly as possible. We already have interceptor functions in place which are used to detect calls to common programs, but we can't cover everything.

> To me it does not seem so completely absurd, to use equery within an ebuild to
> find out how a previously installed version is configured (obviously, because I
> used that method ;-).
Typically, the has_version function is sufficient for this (although it shouldn't be called in global scope). With EAPI 2, has_version works with atoms containing USE dependencies, so you can use it to check which USE flags are enabled on the installed version of a package (has_version is documented in `man 5 ebuild`).

> Therefore, if using equery within an ebuild can generally cause a forkbomb, I
> suggest that this is prevented by portage. Either by preventing ebuilds to use
> equery, or by preventing equery to fork indefinitely both ways should throw
> some kind of error message saying that there is an error in the ebuild.
>
> However, if it is just the special way that I used equery, I think it is safe
> to close this bug as wontfix.

In svn r15098 I've added an interceptor for equery so that it generates a QA warning when called in global scope.

This is fixed in 2.1.7.15 and 2.2_rc60.

Great! Thanks a lot. |
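
For readers unfamiliar with the interceptor idea mentioned in the comments, the following is a simplified sketch and not the code from r15098. It assumes that `EBUILD_PHASE=depend` marks metadata generation (the point at which global scope is evaluated), `qa_warn` is a hypothetical helper, and the choice to refuse the call rather than run the real program afterwards is an assumption of this sketch; the report only states that a QA warning is generated.

```bash
#!/usr/bin/env bash
# Simplified sketch of a global-scope interceptor, not Portage's actual code.
# Assumption: EBUILD_PHASE=depend marks metadata generation.

# Hypothetical helper that prints a QA notice to stderr.
qa_warn() {
    echo "QA Notice: $*" >&2
}

# Shadow the external command with a shell function of the same name.
equery() {
    if [ "${EBUILD_PHASE:-}" = depend ]; then
        qa_warn "'equery' called in global scope: ${CATEGORY:-?}/${PF:-?}"
        return 1    # sketch choice: refuse to run instead of recursing
    fi
    command equery "$@"    # outside global scope, run the real program
}
```

Because a shell function takes precedence over a program of the same name on PATH, an ebuild that calls `equery` while being sourced for metadata would hit the function first and get the warning instead of spawning a new process.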