Gemato sometimes "hangs" during verification process, so I need to interrupt it with Ctrl+C and restart sync. Any ideas how this can be debugged? INFO:root:Refreshing keys from keyserver... INFO:root:Keys refreshed. INFO:root:Manifest timestamp: 2018-02-17 21:38:31 UTC INFO:root:Valid OpenPGP signature found: INFO:root:- primary key: DCD05B71EAB94199527F44ACDB6B8C1F96D8BF6D INFO:root:- subkey: E1D6ABB63BFCFB4BA02FDF1CEC590EEAC9189250 INFO:root:- timestamp: 2018-02-17 21:38:31 UTC INFO:root:Verifying /var/db/repos/gentoo... ^CProcess ForkPoolWorker-13: Exiting on signal Signals.SIGINT Process ForkPoolWorker-12: Process ForkPoolWorker-11: Process ForkPoolWorker-10: Process ForkPoolWorker-9: Traceback (most recent call last): File "/usr/lib64/python3.5/multiprocessing/pool.py", line 716, in next item = self._items.popleft() IndexError: pop from an empty deque During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib/python-exec/python3.5/gemato", line 11, in <module> load_entry_point('gemato==11.2', 'console_scripts', 'gemato')() File "/usr/lib64/python3.5/site-packages/gemato/cli.py", line 530, in setuptools_main sys.exit(main(argv)) File "/usr/lib64/python3.5/site-packages/gemato/cli.py", line 516, in main return vals.cmd() File "/usr/lib64/python3.5/site-packages/gemato/cli.py", line 212, in __call__ ret &= m.assert_directory_verifies(relpath, **self.kwargs) File "/usr/lib64/python3.5/site-packages/gemato/recursiveloader.py", line 596, in assert_directory_verifies entry_dict = self.get_file_entry_dict(path) File "/usr/lib64/python3.5/site-packages/gemato/recursiveloader.py", line 529, in get_file_entry_dict verify=verify_manifests) File "/usr/lib64/python3.5/site-packages/gemato/recursiveloader.py", line 411, in load_manifests_for_path self.loaded_manifests.update(manifests) File "/usr/lib64/python3.5/multiprocessing/pool.py", line 347, in <genexpr> return (item for chunk in result for item in chunk) File "/usr/lib64/python3.5/multiprocessing/pool.py", line 720, in next self._cond.wait(timeout) File "/usr/lib64/python3.5/threading.py", line 293, in wait waiter.acquire() KeyboardInterrupt $ ps auxwww | grep gemato portage 18442 0.1 0.5 400676 40748 pts/1 SNl+ 01:11 0:01 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg portage 18470 0.0 0.2 318908 23628 pts/1 SN+ 01:11 0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg portage 18471 0.0 0.3 321392 26144 pts/1 SN+ 01:11 0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg portage 18472 0.0 0.2 318800 23460 pts/1 SN+ 01:11 0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg portage 18473 0.0 0.2 318588 23188 pts/1 SN+ 01:11 0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg portage 18474 0.0 0.2 317856 22656 pts/1 SN+ 01:11 0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg portage 18475 0.0 0.2 317000 20776 pts/1 SN+ 01:11 0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato 
verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg portage 18476 0.0 0.2 317000 20776 pts/1 SN+ 01:11 0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg portage 18477 0.0 0.2 317000 20776 pts/1 SN+ 01:11 0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg $ sudo strace -f -p18442 strace: Process 18442 attached with 4 threads [pid 18480] read(5, <unfinished ...> [pid 18479] futex(0x7fa84c0019e0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=43981} <unfinished ...> [pid 18442] futex(0x56081e02d4b0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...> [pid 18478] <... select resumed> ) = 0 (Timeout) [pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18472, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout) [pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18472, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout) [pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18472, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout) [pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18472, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout) [pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18472, 
0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout) [pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18472, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0 [pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0^Cstrace: Process 18442 detached <detached ...> $ sudo strace -p18470 strace: Process 18470 attached read(3, ^Cstrace: Process 18470 detached <detached ...> $ sudo strace -p18471 strace: Process 18471 attached futex(0x7fa8561f8000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff^Cstrace: Process 18471 detached <detached ...> $ sudo strace -p18477 strace: Process 18477 attached futex(0x7fa856201930, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 18477 detached <detached ...> $ sudo strace -p18475 strace: Process 18475 attached futex(0x7fa856201930, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 18475 detached <detached ...> $ sudo strace -p18472 strace: Process 18472 attached futex(0x7fa8561f8000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff^Cstrace: Process 18472 detached <detached ...>
Created attachment 519908 [details] emerge --info gemato
Could be this; it's the most recently reported multiprocessing pool deadlock, which python-3.5.4-r1 doesn't have a fix for:

https://bugs.python.org/issue32576
https://github.com/python/cpython/commit/ab74504346a6e2569b3255b7b621c589716888c4.patch
I've heard legends of things hanging but never seen it myself. If you can reproduce it, then please try the patch pointed out by Zac. If it seems to solve it, we'll backport it.
No data to contribute, but FWIW the same thing happened to me with an emerge --sync launched from cron more than a week ago (I only spotted and killed it just now, prior to investigating and finding the forum thread about this). Verification ran normally and completed in a couple of minutes when performing the emerge --sync manually after that.
Could someone at least test which versions of Python are affected?
I've observed it in both python 3.5 and 3.6.
Chris, could you try the attached patch then, on 3.6? It should work there without extra backports.
Same issue in two consecutive sync runs, one from cron and one from console.

(In reply to Michał Górny from comment #7)
> Chris, could you try the attached patch then, on 3.6? It should work there
> without extra backports.

Nope:

  File "/var/tmp/portage/dev-lang/python-3.6.4/work/Python-3.6.4/Lib/concurrent/futures/thread.py", line 107, in __init__
    self._work_queue = queue.SimpleQueue()
AttributeError: module 'queue' has no attribute 'SimpleQueue'
make: *** [Makefile:597: sharedmods] Error 1
 * ERROR: dev-lang/python-3.6.4::gentoo failed (compile phase):
 *   emake failed

Guess this depends on https://github.com/python/cpython/commit/94e1696d04c65e19ea52e5c8664079c9d9aa0e54.patch but I haven't yet tried whether the combination of these two works without further tweaks.
In the process of trying the patch from comment 2, I switched my default Python interpreter from 3.5 to 3.6. I had one successful sync with this, but the next one failed again. So it's not fixed in 3.6 yet, and it's not 100% reproducible.

94e1696d04c65e19ea52e5c8664079c9d9aa0e54.patch fails to apply to python-3.6.4, so backporting SimpleQueue is more complicated than just dropping the unmodified patch into /etc/portage/patches/dev-lang/python-3.6.4.
Oh, I'm sorry. I confused it with multiprocessing.queues.SimpleQueue, which is present in py3.6. Maybe it's a suitable replacement... I'll try to write a patch soonish.
I'm sorry to say but I'm unable to write a working patch.
Created attachment 523086 [details, diff] 0001-HACK-Disable-GC-for-multiprocessing.patch Could you try testing with the attached hack-patch applied to app-portage/gemato? If it works around the problem, I'll make a proper patch for it.
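Roughly speaking, the idea is to disable the cyclic garbage collector while the pool is in use; a standalone sketch of that idea only (this is not the patch's contents, and worker() is just a placeholder for gemato's real per-Manifest task):

import gc
import multiprocessing

def worker(path):
    # placeholder for gemato's real per-Manifest work
    return path

def verify_all(paths, jobs=4):
    gc.disable()  # keep cyclic GC (and its finalizers) from running mid-pool
    try:
        with multiprocessing.Pool(processes=jobs) as pool:
            return list(pool.imap_unordered(worker, paths))
    finally:
        gc.enable()

if __name__ == "__main__":
    print(verify_all(["a", "b", "c"]))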
(In reply to Michał Górny from comment #12)
> Created attachment 523086 [details, diff]
> 0001-HACK-Disable-GC-for-multiprocessing.patch
>
> Could you try testing with the attached hack-patch applied to
> app-portage/gemato? If it works around the problem, I'll make a proper patch
> for it.

Looks hung again: no CPU time and no other child processes after a few minutes. So perhaps it's not that queue bug after all.
(In reply to Martin von Gagern from comment #13)
> Looks hung again: no CPU time and no other child processes after a few
> minutes. So perhaps it's not that queue bug after all.

I might have been wrong. Experimenting with this issue I found that gemato builds a list of files first and does the verification second. The former depends a lot on fs caches, and with cold caches it will be seriously I/O bound. I guess that might be what made me assume it was hung, that and the fact that the regular SIGINT stack trace looks very much like the one originally reported. Use strace to see whether any process is doing I/O.

Re-ran sync once and gemato a couple of times afterwards, so that looks good. On the other hand, I can't reproduce with plain vanilla gemato right now either, so it's not saying much, I'm afraid. I wonder whether the fact that I closed some memory-hungry apps and thus freed space for use as caches has anything to do with the reproducibility. It certainly seems to have something to do with runtime, and I guess a longer runtime might increase the chances of encountering a problematic GC.

Do you have any idea where the problematic call from the GC finalizer to the thread pool is set up? Is that something inherent in multiprocessing, or something about how it's being used?
I'm sorry, but this is really too low-level magic for me. I have tried a few scenarios to reproduce it (including 'nice 19' + high load to make it very slow) and I couldn't. What would be really nice is if someone could, in order:
1. confirm it hanging,
2. apply the patch (preferably straight to /usr/lib*/python*/site-packages),
3. try hard to make it hang again.
I can confirm it hanging. 4 Cores HT (i7), plenty of RAM, virtually no load while syncing in the morning and evening via eix-sync cron job. Usually sync and verify takes less than a minute. Hangs every 3rd or 4th run until I issue 'killall gemato'. I have applied the patch to both python3.5/site-packages and python3.6/site-packages and try hard ;-)
(In reply to Michael Brakemeier from comment #16)
> I can confirm it hanging. 4 Cores HT (i7), plenty of RAM, virtually no load
> while syncing in the morning and evening via eix-sync cron job. Usually sync
> and verify takes less than a minute. Hangs every 3rd or 4th run until I
> issue 'killall gemato'.
>
> I have applied the patch to both python3.5/site-packages and
> python3.6/site-packages and try hard ;-)

Does that mean that you haven't been able to reproduce it or are you still testing?
(In reply to Michał Górny from comment #17)
> (In reply to Michael Brakemeier from comment #16)
> [...]
> > I have applied the patch to both python3.5/site-packages and
> > python3.6/site-packages and try hard ;-)
>
> Does that mean that you haven't been able to reproduce it or are you still
> testing?

Still testing...
Sorry, but no change. Gemato hanging again in fifth sync.
I'd be curious to see if swapping out the imap_unordered with plain map suppresses the problem, since the traceback in comment #0 shows it's waiting in IMapIterator.next().

This would do it:

sed 's:imap_unordered:map:' \
    -i /usr/lib*/python*/site-packages/gemato/recursiveloader.py
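For context, the difference between the two calls (illustrative sketch only; worker() is a stand-in for gemato's real Manifest-loading task, not its actual code): map blocks in the parent until every result is back and returns a list, while imap_unordered returns an iterator whose next() waits on an internal condition variable, which is exactly where the traceback in comment #0 is stuck.

import multiprocessing

def worker(path):
    # placeholder for loading/verifying one Manifest
    return path.upper()

if __name__ == "__main__":
    paths = ["a/Manifest", "b/Manifest", "c/Manifest"]
    with multiprocessing.Pool(processes=2) as pool:
        # imap_unordered: results arrive as workers finish; the consumer
        # blocks in IMapIterator.next() between results
        for result in pool.imap_unordered(worker, paths):
            print(result)
        # map: one blocking call, results returned in input order
        print(pool.map(worker, paths))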
(In reply to Zac Medico from comment #20)
> I'd be curious to see if swapping out the imap_unordered with plain map
> suppresses the problem, since the traceback in comment #0 shows it's waiting
> in IMapIterator.next().
>
> This would do it:
>
> sed 's:imap_unordered:map:' \
>     -i /usr/lib*/python*/site-packages/gemato/recursiveloader.py

No, tried that on top of the patch. Still hanging. Is there anything else I could test?
I'm sorry but I really have no idea how to solve this. In the worst case, we can remove the parallel code and resort back to serial operations which will be up to N times slower on N-core systems.
If you replace imap_unordered with map as suggested in comment #20 then how does the traceback look if you Ctrl+C while it is deadlocked? It should look different than the traceback in comment #0.
I've reverted all changes. Now I will try sync-rsync-verify-jobs = 1 in repos.conf/gentoo.conf as mentioned here https://forums.gentoo.org/viewtopic-t-1076430.html. As a side note: I have two boxes here, both running ~amd64 with very similar configurations. The first one, my 24/7 running "home-server" is a i3-3240 w/ 2 cores HT and spinning disks, never experienced this issue. The second one, i7-7700K with fast SSD, reliably hangs every 3rd or 4th run. So, removing the parallel code completely is probably not the best option. @zmedico: I will try that next.
It might help to have a gemato debug option that enables debug logging for multiprocessing, as documented here:

https://docs.python.org/3/library/multiprocessing.html#logging

For example:

multiprocessing.log_to_stderr().setLevel(logging.DEBUG)
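Wiring that up could look roughly like this (sketch only; the --debug-multiprocessing option name is hypothetical, gemato has no such flag today):

import argparse
import logging
import multiprocessing

def add_debug_option(parser):
    # hypothetical CLI flag, shown only to illustrate the idea
    parser.add_argument("--debug-multiprocessing", action="store_true",
                        help="enable multiprocessing debug logging")

def maybe_enable_mp_logging(args):
    if args.debug_multiprocessing:
        # route multiprocessing's internal logger to stderr at DEBUG level
        multiprocessing.log_to_stderr().setLevel(logging.DEBUG)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    add_debug_option(parser)
    maybe_enable_mp_logging(parser.parse_args())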
(In reply to Alexander Tsoy from comment #0)
> [pid 18442] futex(0x56081e02d4b0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
> [pid 18478] <... select resumed> ) = 0 (Timeout)
> [pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] wait4(18472, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
> [pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] wait4(18472, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0
> [pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0

The wait4/select loop appears to indicate that it's waiting on a hung gnupg process.
The hung command would be this command, with the Manifest signature fed through stdin:

gpg2 --batch --status-fd 1 --verify
Since gemato is using python's subprocess.communicate() method here, I checked the code for any recent changes. The code looks good to me, with the last change in the relevant _communicate method being the addition of BrokenPipeError handling 2 years ago:

https://github.com/python/cpython/commit/1ef8c7e886ea5260e5a6967ec2b8a4c32640f1a8

If we can get some strace output for a hung /usr/bin/gpg2 process then hopefully it will provide a clue.
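For reference, the interaction in question is roughly this (a simplified sketch of piping a signed Manifest through the gpg2 command above via communicate(); not the exact gemato code):

import subprocess

def verify_signature(signed_manifest_text):
    # feed the OpenPGP-signed Manifest through stdin and collect the
    # machine-readable status lines written to --status-fd 1
    proc = subprocess.Popen(
        ["gpg2", "--batch", "--status-fd", "1", "--verify"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate(signed_manifest_text.encode("utf-8"))
    return proc.returncode, out

A hang at this point would show up as communicate() blocking forever on the child's pipes, which is why strace output from the gpg2 side would be useful.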
(In reply to Zac Medico from comment #26)
> The wait4/select loop appears to indicate that it's waiting on a hung gnupg
> process.

PIDs it's waiting for correspond to the child gemato processes. And I don't see any gpg2 processes:

$ pgrep -au portage
13757 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
13762 gpg-agent --homedir /tmp/tmpoqn42xe5 --use-standard-socket --daemon
13767 dirmngr --daemon --homedir /tmp/tmpoqn42xe5
13780 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
13781 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
13782 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
13783 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
(In reply to Alexander Tsoy from comment #29)
> (In reply to Zac Medico from comment #26)
> > The wait4/select loop appears to indicate that it's waiting on a hung gnupg
> > process.
>
> PIDs it's waiting for correspond to the child gemato processes. And I don't
> see any gpg2 processes:

Yes, now I see that the wait4 pids in comment #0 correspond to the python processes. I don't understand the reason for those wait4 calls, maybe it's a clue, but it seems clear that gpg is not at fault.
For the record: Last night the notorious box did 41 cycles of "sleep 600 && rm /usr/portage/metadata/timestamp.chk && emaint -a sync" in a loop run from terminal with the gemato code patched according to comment #20 without hanging. Still testing, now using a cron job. BBL.
Before you get banned for abusing rsync, I think you should be able to reproduce it by just calling:

gemato verify -P /your/repo/location

(-P can be used to rule GPG issues out)
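If you want to leave that running unattended, a loop along these lines should catch a hang (rough sketch only; the repository path and the timeout value are arbitrary, and a timeout is simply treated as a hang):

import subprocess

REPO = "/var/db/repos/gentoo"  # adjust to your repository location

def run_until_hang(max_iterations=1000, timeout=1800):
    for i in range(1, max_iterations + 1):
        try:
            # -P rules GPG issues out, as noted above
            subprocess.run(["gemato", "verify", "-P", REPO], timeout=timeout)
        except subprocess.TimeoutExpired:
            print("possible hang on iteration", i)
            return i
    print("no hang in", max_iterations, "iterations")
    return None

if __name__ == "__main__":
    run_until_hang()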
Hopefully using map instead of imap_unordered will prevent the deadlocks. If we don't find a way to avoid multiprocessing.Pool deadlocks, as an alternative there's the concurrent.futures.ProcessPoolExecutor class which is implemented differently and has a compatible map method: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map
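The drop-in shape would be roughly this (sketch; worker() again just stands in for the per-Manifest task):

import concurrent.futures

def worker(path):
    # stand-in for gemato's per-Manifest work
    return path

def load_all(paths, jobs=4):
    # Executor.map has the same calling convention as Pool.map, but the
    # worker management underneath is implemented differently, so it may
    # not share the same deadlock.
    with concurrent.futures.ProcessPoolExecutor(max_workers=jobs) as executor:
        return list(executor.map(worker, paths))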
(In reply to Zac Medico from comment #33)
> Hopefully using map instead of imap_unordered will prevent the deadlocks.

I'm sorry, but forget about the promising results from comment #31. I missed that the default USE flag rsync-verify had been reverted while updating on Sunday.
Created attachment 523828 [details, diff] Use concurrent.futures.ProcessPoolExecutor for python3.5+ With this patch, gemato will use concurrent.futures.ProcessPoolExecutor for python3.5+. The underlying implementation is different from multiprocessing.Pool, so hopefully it doesn't deadlock.
For the record:
Current state - unable to reproduce the deadlock.

What I did:
- Actively selected python 3.6 using "eselect python update" (never did this before on the notoriously hanging box).
- Update world, including gemato 12.0.
- Re-merged portage with use-flag rsync-verify enabled.
-> one week over-night testing with sync once per hour via cronjob, no hang.

- Update world, including gemato 12.1
-> a weekend 24h testing with sync once per hour via cronjob, no hang.
(In reply to Michael Brakemeier from comment #36)
> For the record:
> Current state - unable to reproduce the deadlock.
>
> What I did:
> - Actively selected python 3.6 using "eselect python update" (never did this
>   before on the notoriously hanging box).
> - Update world, including gemato 12.0.
> - Re-merged portage with use-flag rsync-verify enabled.
> -> one week over-night testing with sync once per hour via cronjob, no hang.
>
> - Update world, including gemato 12.1
> -> a weekend 24h testing with sync once per hour via cronjob, no hang.

It would be nice if you could return to 3.5 and confirm it deadlocks again. Then we'd at least have a really reliable confirmation that it doesn't happen with 3.6.
Given no hope of finding the issue, I think I'll be removing multiprocessing support from gemato in a few days in order to finally make it possible to stabilize it. Stability is more important than speed.
(In reply to Michał Górny from comment #38)
> Given no hope of finding the issue, I think I'll be removing multiprocessing
> support from gemato in a few days in order to finally make it possible to
> stabilize it. Stability is more important than speed.

Maybe do something like the Pool class from attachment 523828 [details, diff] that reduces it to a plain map() call for the processes == 1 case.
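Something roughly like this, i.e. a thin wrapper that skips multiprocessing entirely for a single job (untested sketch of the idea, not the code from the attachment; the class name is made up):

import multiprocessing

class SerialFallbackPool(object):
    """Run in-process when processes == 1, otherwise delegate to a real
    multiprocessing.Pool."""

    def __init__(self, processes):
        self._pool = multiprocessing.Pool(processes) if processes > 1 else None

    def imap_unordered(self, func, iterable):
        if self._pool is None:
            # plain serial map: no worker processes, nothing to deadlock
            return map(func, iterable)
        return self._pool.imap_unordered(func, iterable)

    def close(self):
        if self._pool is not None:
            self._pool.close()
            self._pool.join()

That way the processes == 1 path never touches multiprocessing at all, while users who explicitly ask for more jobs keep the parallel (and currently risky) code path.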
That doesn't help much if we would either have to force '-j 1' for everyone or tell users to check that it doesn't hang.
Hmm, maybe I'll try replacing the other imap_unordered() with a map() fallback for <3.5.4, but I suppose it won't help with this.
Ok, please try 12.2. If it doesn't work, it will probably be the long-term ~arch target for 'brave downgrade'.
I've tried to reproduce the deadlock with gemato 12.1 on python 3.6, and also on 3.5 as requested, for the last three weeks without result. Whatever has been changed, it seems that at least for me these changes were successful.
Ping. Anyone else can/cannot reproduce this? This is crucial to knowing whether we can stabilize gemato or not.
I finally managed to get access to a system where I saw most of the deadlocks previously (Ivy Bridge, 4C/8T) and am also testing gemato-12.2 on other systems. No problems so far, but ideally it needs several days to make sure the new gemato really fixed the problem.
I tested gemato-12.2 + python-3.5 in a loop and unfortunately reproduced the deadlock on two different systems. It took 257 iterations on one of the systems and 588 iterations on the other.
Created attachment 526914 [details, diff] Use concurrent.futures.ProcessPoolExecutor for python3.5+ Rebase on gemato-12.2.
(In reply to Zac Medico from comment #47)
> Created attachment 526914 [details, diff]
> Use concurrent.futures.ProcessPoolExecutor for python3.5+
>
> Rebase on gemato-12.2.

Got deadlock with this patch on the 5th iteration of the loop. =/
(In reply to Alexander Tsoy from comment #48)
> (In reply to Zac Medico from comment #47)
> > Created attachment 526914 [details, diff]
> > Use concurrent.futures.ProcessPoolExecutor for python3.5+
> >
> > Rebase on gemato-12.2.
>
> Got deadlock with this patch on the 5th iteration of the loop. =/

It would be interesting to see what the traceback looks like when you kill it.
(In reply to Zac Medico from comment #49)

Process Process-10:
Process Process-9:
Traceback (most recent call last):
  File "/usr/lib64/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/usr/lib64/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.5/concurrent/futures/process.py", line 169, in _process_worker
    call_item = call_queue.get(block=True)
  File "/usr/lib64/python3.5/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/usr/lib64/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.5/gemato", line 11, in <module>
    load_entry_point('gemato==12.2', 'console_scripts', 'gemato')()
  File "/usr/lib64/python3.5/site-packages/gemato/cli.py", line 549, in setuptools_main
    sys.exit(main(argv))
  File "/usr/lib64/python3.5/site-packages/gemato/cli.py", line 535, in main
    return vals.cmd()
  File "/usr/lib64/python3.5/site-packages/gemato/cli.py", line 231, in __call__
    ret &= m.assert_directory_verifies(relpath, **self.kwargs)
  File "/usr/lib64/python3.5/site-packages/gemato/recursiveloader.py", line 587, in assert_directory_verifies
    entry_dict = self.get_file_entry_dict(path)
  File "/usr/lib64/python3.5/site-packages/gemato/recursiveloader.py", line 520, in get_file_entry_dict
    verify=verify_manifests)
  File "/usr/lib64/python3.5/site-packages/gemato/recursiveloader.py", line 407, in load_manifests_for_path
    self.loaded_manifests.update(manifests)
  File "/usr/lib64/python3.5/concurrent/futures/_base.py", line 556, in result_iterator
    yield future.result()
  File "/usr/lib64/python3.5/concurrent/futures/_base.py", line 400, in result
    self._condition.wait(timeout)
  File "/usr/lib64/python3.5/threading.py", line 293, in wait
    waiter.acquire()
KeyboardInterrupt
(In reply to Alexander Tsoy from comment #50)
>   File "/usr/lib64/python3.5/concurrent/futures/_base.py", line 400, in result
>     self._condition.wait(timeout)

It's waiting for a set_result call there, and those calls are managed by a _queue_management_worker thread that reads from a result_queue which is an instance of multiprocessing.SimpleQueue. So, it looks like the deadlock probably happens somewhere in the queue, much like https://bugs.python.org/issue32576.
Created attachment 526984 [details, diff] gc-hack.diff Updated GC hack, in case anyone wanted to try it.
(In reply to Michał Górny from comment #52)
> Created attachment 526984 [details, diff]
> gc-hack.diff
>
> Updated GC hack, in case anyone wanted to try it.

Thanks. I'm testing it now.
The GC hack doesn't help. Just reproduced the deadlock on two different machines again.
Could you try with -j1? That would confirm whether it happens only with simultaneous operation or whether it's a problem with message passing.
(In reply to Michał Górny from comment #55) No deadlocks for 2 days of testing with -j1.
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=ff513d3274d3d103d701025bcc8255d49b157456

commit ff513d3274d3d103d701025bcc8255d49b157456
Author:     Michał Górny <mgorny@gentoo.org>
AuthorDate: 2018-04-13 13:31:07 +0000
Commit:     Michał Górny <mgorny@gentoo.org>
CommitDate: 2018-04-13 13:31:14 +0000

    app-portage/gemato: Bump to 13.0 (non-parallel version)

    Bug: https://bugs.gentoo.org/647964

 app-portage/gemato/Manifest           |  1 +
 app-portage/gemato/gemato-13.0.ebuild | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=7b448e90034de00bc177b3d809aeaf9b94d55ee2

commit 7b448e90034de00bc177b3d809aeaf9b94d55ee2
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2018-04-13 17:01:45 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2018-04-17 02:25:25 +0000

    rsync: default to sync-rsync-verify-jobs = 1 (bug 650696)

    Some users have reported that using all processors to verify manifests
    results in overloading, therefore default to using a single processor.
    On modern hardware, verification of the gentoo repository completes in
    less than 20 seconds, so using multiple processors is not really
    necessary. Also, gemato-13.0 disables parallel verification due to the
    unresolved deadlock issue reported in bug 647964, so this brings the
    default portage configuration into alignment with current gemato
    behavior.

    Bug: https://bugs.gentoo.org/650696
    Bug: https://bugs.gentoo.org/647964

 cnf/repos.conf                          | 1 +
 man/portage.5                           | 8 +++++---
 pym/portage/sync/modules/rsync/rsync.py | 7 ++++++-
 3 files changed, 12 insertions(+), 4 deletions(-)
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=896aadcc7d554a275953ac86458b0ecbbc0df55a

commit 896aadcc7d554a275953ac86458b0ecbbc0df55a
Author:     Michał Górny <mgorny@gentoo.org>
AuthorDate: 2018-08-01 11:02:31 +0000
Commit:     Michał Górny <mgorny@gentoo.org>
CommitDate: 2018-08-01 11:36:24 +0000

    app-portage/gemato: Add 14.0+multiprocessing version

    Bug: https://bugs.gentoo.org/647964

 app-portage/gemato/Manifest            |  1 +
 app-portage/gemato/gemato-14.0m.ebuild | 53 ++++++++++++++++++++++++++++++++++
 profiles/package.mask                  |  1 +
 3 files changed, 55 insertions(+)

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=ec821e92460679181bf2d63a8f899f0b8fbf7e24

commit ec821e92460679181bf2d63a8f899f0b8fbf7e24
Author:     Michał Górny <mgorny@gentoo.org>
AuthorDate: 2018-08-01 10:09:10 +0000
Commit:     Michał Górny <mgorny@gentoo.org>
CommitDate: 2018-08-01 11:36:24 +0000

    app-portage/gemato: Add a live ebuild for multiprocessing branch

    Add a separate ebuild for multiprocessing branch and keep it p.masked
    due to the known hangs. Hopefully it will help testing and finding a
    proper solution to that.

    Bug: https://bugs.gentoo.org/647964

 app-portage/gemato/gemato-9999m.ebuild | 50 ++++++++++++++++++++++++++++++++++
 profiles/package.mask                  |  6 ++++
 2 files changed, 56 insertions(+)
I'd appreciate it if you could try reproducing the issues with the newest version. Once you confirm it still happens, please try whether python3.7 solves it.
(In reply to Michał Górny from comment #60)
> I'd appreciate it if you could try reproducing the issues with the newest
> version. Once you confirm it still happens, please try whether python3.7
> solves it.

Can't reproduce it anymore. I tested it with python3.6.
This may have been due to mixing threading with the default 'fork' multiprocessing start method, which is what happens inside multiprocessing.Pool. Due to deadlocks like this, in https://github.com/python/cpython/issues/84559 they are planning to change the default multiprocessing start method to 'spawn' in Python 3.14.
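For anyone hitting similar fork-plus-threads deadlocks, explicitly requesting the 'spawn' start method avoids forking a process that may already have lock-holding threads (minimal sketch; worker() is illustrative):

import multiprocessing

def worker(path):
    return path

if __name__ == "__main__":
    # 'spawn' starts fresh interpreters instead of fork()ing a parent whose
    # threads may be holding locks; per the issue above it is slated to
    # become the default in Python 3.14.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(worker, ["a", "b", "c"]))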
See https://github.com/python/cpython/issues/84559#issuecomment-1093869867 too...