Bug 647964 - app-portage/gemato-14.1m: deadlock?
Summary: app-portage/gemato-14.1m: deadlock?
Status: UNCONFIRMED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: Michał Górny
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 650144
Reported: 2018-02-17 22:43 UTC by Alexander Tsoy
Modified: 2024-03-03 22:24 UTC
CC List: 13 users

See Also:
Package list:
Runtime testing required: ---


Attachments:
emerge --info gemato (emerge.info, 6.96 KB, text/plain), 2018-02-17 22:45 UTC, Alexander Tsoy
0001-HACK-Disable-GC-for-multiprocessing.patch (1.45 KB, patch), 2018-03-09 10:45 UTC, Michał Górny
Use concurrent.futures.ProcessPoolExecutor for python3.5+ (executor_python3_5.patch, 1.94 KB, patch), 2018-03-13 23:52 UTC, Zac Medico
Use concurrent.futures.ProcessPoolExecutor for python3.5+ (0001-Use-ProcessPoolExecutor-bug-647964.patch, 2.37 KB, patch), 2018-04-09 16:10 UTC, Zac Medico
gc-hack.diff (777 bytes, patch), 2018-04-10 06:35 UTC, Michał Górny

Description Alexander Tsoy 2018-02-17 22:43:04 UTC
Gemato sometimes "hangs" during the verification process, so I need to interrupt it with Ctrl+C and restart the sync. Any ideas how this can be debugged?


INFO:root:Refreshing keys from keyserver...
INFO:root:Keys refreshed.
INFO:root:Manifest timestamp: 2018-02-17 21:38:31 UTC
INFO:root:Valid OpenPGP signature found:
INFO:root:- primary key: DCD05B71EAB94199527F44ACDB6B8C1F96D8BF6D
INFO:root:- subkey: E1D6ABB63BFCFB4BA02FDF1CEC590EEAC9189250
INFO:root:- timestamp: 2018-02-17 21:38:31 UTC
INFO:root:Verifying /var/db/repos/gentoo...
^CProcess ForkPoolWorker-13:


Exiting on signal Signals.SIGINT
Process ForkPoolWorker-12:
Process ForkPoolWorker-11:
Process ForkPoolWorker-10:
Process ForkPoolWorker-9:
Traceback (most recent call last):
  File "/usr/lib64/python3.5/multiprocessing/pool.py", line 716, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.5/gemato", line 11, in <module>
    load_entry_point('gemato==11.2', 'console_scripts', 'gemato')()
  File "/usr/lib64/python3.5/site-packages/gemato/cli.py", line 530, in setuptools_main
    sys.exit(main(argv))
  File "/usr/lib64/python3.5/site-packages/gemato/cli.py", line 516, in main
    return vals.cmd()
  File "/usr/lib64/python3.5/site-packages/gemato/cli.py", line 212, in __call__
    ret &= m.assert_directory_verifies(relpath, **self.kwargs)
  File "/usr/lib64/python3.5/site-packages/gemato/recursiveloader.py", line 596, in assert_directory_verifies
    entry_dict = self.get_file_entry_dict(path)
  File "/usr/lib64/python3.5/site-packages/gemato/recursiveloader.py", line 529, in get_file_entry_dict
    verify=verify_manifests)
  File "/usr/lib64/python3.5/site-packages/gemato/recursiveloader.py", line 411, in load_manifests_for_path
    self.loaded_manifests.update(manifests)
  File "/usr/lib64/python3.5/multiprocessing/pool.py", line 347, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/usr/lib64/python3.5/multiprocessing/pool.py", line 720, in next
    self._cond.wait(timeout)
  File "/usr/lib64/python3.5/threading.py", line 293, in wait
    waiter.acquire()
KeyboardInterrupt


$ ps auxwww | grep gemato
portage  18442  0.1  0.5 400676 40748 pts/1    SNl+ 01:11   0:01 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
portage  18470  0.0  0.2 318908 23628 pts/1    SN+  01:11   0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
portage  18471  0.0  0.3 321392 26144 pts/1    SN+  01:11   0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
portage  18472  0.0  0.2 318800 23460 pts/1    SN+  01:11   0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
portage  18473  0.0  0.2 318588 23188 pts/1    SN+  01:11   0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
portage  18474  0.0  0.2 317856 22656 pts/1    SN+  01:11   0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
portage  18475  0.0  0.2 317000 20776 pts/1    SN+  01:11   0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
portage  18476  0.0  0.2 317000 20776 pts/1    SN+  01:11   0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
portage  18477  0.0  0.2 317000 20776 pts/1    SN+  01:11   0:00 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg

$ sudo strace -f -p18442
strace: Process 18442 attached with 4 threads
[pid 18480] read(5,  <unfinished ...>
[pid 18479] futex(0x7fa84c0019e0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=43981} <unfinished ...>
[pid 18442] futex(0x56081e02d4b0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 18478] <... select resumed> )      = 0 (Timeout)                            
[pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0              
[pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0                   
[pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0                  
[pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0                   
[pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0             
[pid 18478] wait4(18472, 0x7fa84affbd84, WNOHANG, NULL) = 0                   
[pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0               
[pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0                    
[pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
[pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0                
[pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0           
[pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0                    
[pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0                 
[pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0               
[pid 18478] wait4(18472, 0x7fa84affbd84, WNOHANG, NULL) = 0                
[pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0                
[pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0                
[pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
[pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0                                                                                                                                                          
[pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0                                                                                                                                                          
[pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0                                                                                                                                                          
[pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0                                                                                                                                                          
[pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0                                                                                                                                                          
[pid 18478] wait4(18472, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
[pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18472, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
[pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18472, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
[pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18472, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0
[pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0^Cstrace: Process 18442 detached
 <detached ...>

$ sudo strace -p18470
strace: Process 18470 attached
read(3, ^Cstrace: Process 18470 detached
 <detached ...>

$ sudo strace -p18471
strace: Process 18471 attached
futex(0x7fa8561f8000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff^Cstrace: Process 18471 detached
 <detached ...>

$ sudo strace -p18477
strace: Process 18477 attached
futex(0x7fa856201930, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 18477 detached
 <detached ...>

$ sudo strace -p18475
strace: Process 18475 attached
futex(0x7fa856201930, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 18475 detached
 <detached ...>

$ sudo strace -p18472
strace: Process 18472 attached
futex(0x7fa8561f8000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff^Cstrace: Process 18472 detached
 <detached ...>
Comment 1 Alexander Tsoy 2018-02-17 22:45:26 UTC
Created attachment 519908 [details]
emerge --info gemato
Comment 2 Zac Medico gentoo-dev 2018-02-18 03:38:50 UTC
Could be this; it's the most recently reported multiprocessing pool deadlock, which python-3.5.4-r1 doesn't have a fix for:

https://bugs.python.org/issue32576
https://github.com/python/cpython/commit/ab74504346a6e2569b3255b7b621c589716888c4.patch
Comment 3 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-02-18 08:37:51 UTC
I've heard legends of things hanging but never seen it myself. If you can reproduce it, then please try the patch pointed out by Zac. If it seems to solve it, we'll backport it.
Comment 4 Robin Bankhead 2018-03-04 21:41:18 UTC
No data to contribute, but FWIW the same thing happened to me with an emerge --sync launched from cron more than a week ago (I only spotted and killed it just now, prior to investigating and finding the forum thread about this).

Verification ran normally and completed in a couple of minutes when I performed the emerge --sync manually after that.
Comment 5 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-03-05 17:44:04 UTC
Could someone at least test which versions of Python are affected?
Comment 6 Chris Reffett (RETIRED) gentoo-dev Security 2018-03-05 22:48:50 UTC
I've observed it in both python 3.5 and 3.6.
Comment 7 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-03-06 07:39:30 UTC
Chris, could you try the attached patch then, on 3.6? It should work there without extra backports.
Comment 8 Martin von Gagern 2018-03-08 22:01:47 UTC
Same issue in two consecutive sync runs, one from cron and one from console.

(In reply to Michał Górny from comment #7)
> Chris, could you try the attached patch then, on 3.6? It should work there
> without extra backports.

Nope:
  File "/var/tmp/portage/dev-lang/python-3.6.4/work/Python-3.6.4/Lib/concurrent/futures/thread.py", line 107, in __init__
    self._work_queue = queue.SimpleQueue()
AttributeError: module 'queue' has no attribute 'SimpleQueue'
make: *** [Makefile:597: sharedmods] Error 1   
 * ERROR: dev-lang/python-3.6.4::gentoo failed (compile phase):
 *   emake failed

Guess this depends on https://github.com/python/cpython/commit/94e1696d04c65e19ea52e5c8664079c9d9aa0e54.patch but I haven't yet tried whether the combination of these two works without further tweaks.
Comment 9 Martin von Gagern 2018-03-08 22:16:30 UTC
In the process of trying the patch from comment 2, I switched my default Python interpreter from 3.5 to 3.6. I had one successful sync with this, but the next one failed again. So it's not fixed in 3.6 yet, and it's not 100% reproducible.

94e1696d04c65e19ea52e5c8664079c9d9aa0e54.patch fails to apply to python-3.6.4, so backporting SimpleQueue is more complicated than just dropping the unmodified patch into /etc/portage/patches/dev-lang/python-3.6.4.
Comment 10 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-03-09 06:59:11 UTC
Oh, I'm sorry. I confused it with multiprocessing.queues.SimpleQueue, which is present in py3.6. Maybe it's a suitable replacement... I'll try to write a patch soonish.
Comment 11 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-03-09 08:20:07 UTC
I'm sorry to say but I'm unable to write a working patch.
Comment 12 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-03-09 10:45:30 UTC
Created attachment 523086 [details, diff]
0001-HACK-Disable-GC-for-multiprocessing.patch

Could you try testing with the attached hack-patch applied to app-portage/gemato? If it works around the problem, I'll make a proper patch for it.
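
Roughly, the general shape of such a workaround (an illustrative sketch only; the actual attached patch may differ) is to disable the cyclic garbage collector for the lifetime of the pool, so a collection cannot run in the middle of worker handling:

    import gc
    import multiprocessing


    def work(item):
        # stand-in for gemato's per-Manifest loading function
        return item * item


    def run_pool(items):
        gc.disable()  # keep the cyclic collector from running while the pool is alive
        try:
            with multiprocessing.Pool(processes=4) as pool:
                return list(pool.imap_unordered(work, items))
        finally:
            gc.enable()   # restore normal GC behaviour afterwards
            gc.collect()


    if __name__ == '__main__':
        print(sorted(run_pool(range(10))))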
Comment 13 Martin von Gagern 2018-03-10 13:45:59 UTC
(In reply to Michał Górny from comment #12)
> Created attachment 523086 [details, diff]
> 0001-HACK-Disable-GC-for-multiprocessing.patch
> 
> Could you try testing with the attached hack-patch applied to
> app-portage/gemato? If it works around the problem, I'll make a proper patch
> for it.

Looks hung again: no CPU time and no other child processes after a few minutes. So perhaps it's not that queue bug after all.
Comment 14 Martin von Gagern 2018-03-10 18:09:23 UTC
(In reply to Martin von Gagern from comment #13)
> Looks hung again: no CPU time and no other child processes after a few
> minutes. So perhaps it's not that queue bug after all.

I might have been wrong. Experimenting with this issue, I found that gemato builds a list of files first and does the verification second. The former depends a lot on filesystem caches and, with cold caches, will be seriously I/O bound. I guess that might be what made me assume it was hung, together with the fact that the regular SIGINT stack trace looks very much like the one originally reported. Use strace to see whether any process is doing I/O.

I re-ran sync once and gemato a couple of times afterwards, so that looks good. On the other hand, I can't reproduce it with plain vanilla gemato right now either, so it's not saying much, I'm afraid. I wonder whether the fact that I closed some memory-hungry apps, and thus freed space for use as caches, has anything to do with the reproducibility. It certainly seems to have something to do with runtime, and I guess a longer runtime might increase the chances of encountering a problematic GC run.

Do you have any idea where the problematic call from the GC finalizer into the thread pool is set up? Is that something inherent in multiprocessing, or something about how it's being used?
Comment 15 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-03-10 18:25:02 UTC
I'm sorry, but this is really too low-level magic for me. I have tried a few scenarios to reproduce it (including 'nice 19' + high load to make it very slow) and I couldn't.

What would be really nice is if someone could, in order:

1. confirm it hanging,

2. apply the patch (preferably straight to /usr/lib*/python*/site-packages),

3. try hard to make it hang again.
Comment 16 Michael Brakemeier 2018-03-10 20:05:30 UTC
I can confirm it hanging. 4 Cores HT (i7), plenty of RAM, virtually no load while syncing in the morning and evening via eix-sync cron job. Usually sync and verify takes less than a minute. Hangs every 3rd or 4th run until I issue 'killall gemato'.

I have applied the patch to both python3.5/site-packages and python3.6/site-packages and try hard ;-)
Comment 17 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-03-10 21:51:32 UTC
(In reply to Michael Brakemeier from comment #16)
> I can confirm it hanging. 4 Cores HT (i7), plenty of RAM, virtually no load
> while syncing in the morning and evening via eix-sync cron job. Usually sync
> and verify takes less than a minute. Hangs every 3rd or 4th run until I
> issue 'killall gemato'.
> 
> I have applied the patch to both python3.5/site-packages and
> python3.6/site-packages and try hard ;-)

Does that mean that you haven't been able to reproduce it or are you still testing?
Comment 18 Michael Brakemeier 2018-03-10 22:25:28 UTC
(In reply to Michał Górny from comment #17)
> (In reply to Michael Brakemeier from comment #16)
> [...]
> > I have applied the patch to both python3.5/site-packages and
> > python3.6/site-packages and try hard ;-)
> 
> Does that mean that you haven't been able to reproduce it or are you still
> testing?

Still testing...
Comment 19 Michael Brakemeier 2018-03-11 00:35:09 UTC
Sorry, but no change. Gemato hanging again in fifth sync.
Comment 20 Zac Medico gentoo-dev 2018-03-11 01:03:29 UTC
I'd be curious to see if swapping out the imap_unordered with plain map suppresses the problem, since the traceback in comment #0 shows it's waiting in IMapIterator.next().

This would do it:

sed 's:imap_unordered:map:' \
-i /usr/lib*/python*/site-packages/gemato/recursiveloader.py
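
For illustration only (a toy sketch, not gemato's actual code), the two calls differ in how results come back; the traceback in comment #0 is blocked inside the iterator that imap_unordered() returns:

    import multiprocessing


    def load(path):
        # stand-in for loading and parsing one Manifest
        return path, len(path)


    if __name__ == '__main__':
        paths = ['a/Manifest', 'b/Manifest', 'c/Manifest']
        with multiprocessing.Pool() as pool:
            # imap_unordered() yields results lazily via IMapIterator.next(),
            # which is where the original traceback shows the main process waiting
            lazy = dict(pool.imap_unordered(load, paths))
            # map() blocks once and returns a plain list, bypassing that iterator
            eager = dict(pool.map(load, paths))
        assert lazy == eager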
Comment 21 Michael Brakemeier 2018-03-11 08:55:36 UTC
(In reply to Zac Medico from comment #20)
> I'd be curious to see if swapping out the imap_unordered with plain map
> suppresses the problem, since the traceback in comment #0 shows it's waiting
> in IMapIterator.next().
> 
> This would do it:
> 
> sed 's:imap_unordered:map:' \
> -i /usr/lib*/python*/site-packages/gemato/recursiveloader.py

No, tried that on top of the patch. Still hanging.
Is there anything else I could test?
Comment 22 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-03-11 10:03:34 UTC
I'm sorry, but I really have no idea how to solve this. In the worst case, we can remove the parallel code and fall back to serial operation, which will be up to N times slower on N-core systems.
Comment 23 Zac Medico gentoo-dev 2018-03-11 10:28:54 UTC
If you replace imap_unordered with map as suggested in comment #20 then how does the traceback look if you Ctrl+C while it is deadlocked? It should look different than the traceback in comment #0.
Comment 24 Michael Brakemeier 2018-03-11 10:34:50 UTC
I've reverted all changes. Now I will try sync-rsync-verify-jobs = 1 in repos.conf/gentoo.conf as mentioned here https://forums.gentoo.org/viewtopic-t-1076430.html.

As a side note: I have two boxes here, both running ~amd64 with very similar configurations. The first one, my 24/7 "home server", is an i3-3240 (2 cores + HT) with spinning disks and has never experienced this issue. The second one, an i7-7700K with a fast SSD, reliably hangs every 3rd or 4th run.
So removing the parallel code completely is probably not the best option.

@zmedico: I will try that next.
Comment 25 Zac Medico gentoo-dev 2018-03-11 22:13:46 UTC
It might help to have a gemato debug option that enables debug logging for multiprocessing as documented here:

    https://docs.python.org/3/library/multiprocessing.html#logging

For example:

    multiprocessing.log_to_stderr().setLevel(logging.DEBUG)
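
A self-contained sketch of how that could be wired up (the option name and wiring here are only an assumption; the two library calls are taken from the docs above):

    import logging
    import multiprocessing


    def enable_mp_debug_logging():
        # multiprocessing.log_to_stderr() returns the module-level logger
        logger = multiprocessing.log_to_stderr()
        logger.setLevel(logging.DEBUG)


    if __name__ == '__main__':
        enable_mp_debug_logging()
        with multiprocessing.Pool(processes=2) as pool:
            # pool creation, worker handshakes and teardown are now logged to stderr
            pool.map(abs, [-1, -2, -3])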
Comment 26 Zac Medico gentoo-dev 2018-03-11 23:42:34 UTC
(In reply to Alexander Tsoy from comment #0)
> [pid 18442] futex(0x56081e02d4b0,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff
> <unfinished ...>
> [pid 18478] <... select resumed> )      = 0 (Timeout)                       
> 
> [pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0              
> [pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0                 
> 
> [pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0                  
> [pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0                 
> 
> [pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0             
> [pid 18478] wait4(18472, 0x7fa84affbd84, WNOHANG, NULL) = 0                 
> 
> [pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0               
> [pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0                 
> 
> [pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0
> (Timeout)
> [pid 18478] wait4(18477, 0x7fa84affbd84, WNOHANG, NULL) = 0                
> [pid 18478] wait4(18476, 0x7fa84affbd84, WNOHANG, NULL) = 0           
> [pid 18478] wait4(18475, 0x7fa84affbd84, WNOHANG, NULL) = 0                 
> 
> [pid 18478] wait4(18474, 0x7fa84affbd84, WNOHANG, NULL) = 0                 
> [pid 18478] wait4(18473, 0x7fa84affbd84, WNOHANG, NULL) = 0               
> [pid 18478] wait4(18472, 0x7fa84affbd84, WNOHANG, NULL) = 0                
> [pid 18478] wait4(18471, 0x7fa84affbd84, WNOHANG, NULL) = 0                
> [pid 18478] wait4(18470, 0x7fa84affbd84, WNOHANG, NULL) = 0                
> [pid 18478] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0

The wait4/select loop appears to indicate that it's waiting on a hung gnupg process.
Comment 27 Zac Medico gentoo-dev 2018-03-12 00:16:19 UTC
The hung command would be this command, with the Manifest signature fed through stdin:

gpg2 --batch --status-fd 1 --verify
Comment 28 Zac Medico gentoo-dev 2018-03-12 01:07:40 UTC
Since gemato is using python's subprocess.communicate() method here, I checked the code for any recent changes. The code looks good to me, with the last change in the relevant _communicate method being the addition of BrokenPipeError handling 2 years ago:

https://github.com/python/cpython/commit/1ef8c7e886ea5260e5a6967ec2b8a4c32640f1a8


If we can get some strace output for a hung /usr/bin/gpg2 process then hopefully it will provide a clue.
Comment 29 Alexander Tsoy 2018-03-12 06:33:23 UTC
(In reply to Zac Medico from comment #26)
> The wait4/select loop appears to indicate that it's waiting on a hung gnupg
> process.

The PIDs it's waiting for correspond to the child gemato processes, and I don't see any gpg2 processes:

$ pgrep -au portage 
13757 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
13762 gpg-agent --homedir /tmp/tmpoqn42xe5 --use-standard-socket --daemon
13767 dirmngr --daemon --homedir /tmp/tmpoqn42xe5
13780 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
13781 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
13782 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
13783 /usr/bin/python3.5 /usr/lib/python-exec/python3.5/gemato verify -s /var/db/repos/gentoo -K /var/lib/gentoo/gkeys/keyrings/gentoo/release/pubring.gpg
Comment 30 Zac Medico gentoo-dev 2018-03-12 07:11:18 UTC
(In reply to Alexander Tsoy from comment #29)
> (In reply to Zac Medico from comment #26)
> > The wait4/select loop appears to indicate that it's waiting on a hung gnupg
> > process.
> 
> PIDs it's waiting for correspond to the child gemato processes. And I don't
> see any gpg2 processes:

Yes, now I see that the wait4 pids in comment #0 correspond to the python processes. I don't understand the reason for those wait4 calls; maybe it's a clue, but it seems clear that gpg is not at fault.
Comment 31 Michael Brakemeier 2018-03-12 18:05:21 UTC
For the record:
Last night the notorious box did 41 cycles of "sleep 600 && rm /usr/portage/metadata/timestamp.chk && emaint -a sync" in a loop run from a terminal, with the gemato code patched according to comment #20, without hanging.

Still testing, now using a cron job. BBL.
Comment 32 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-03-12 19:31:33 UTC
Before you get banned for abusing rsync, I think you should be able to reproduce it by just calling:

  gemato verify -P /your/repo/location

(-P can be used to rule out GPG issues)
Comment 33 Zac Medico gentoo-dev 2018-03-13 22:24:36 UTC
Hopefully using map instead of imap_unordered will prevent the deadlocks.

If we don't find a way to avoid multiprocessing.Pool deadlocks, an alternative is the concurrent.futures.ProcessPoolExecutor class, which is implemented differently and has a compatible map method:

https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map
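
Roughly, the swap would look like this (an illustrative sketch, not the patch attached below):

    import concurrent.futures
    import multiprocessing


    def load(path):
        # stand-in for loading one Manifest
        return path, len(path)


    def with_pool(paths):
        with multiprocessing.Pool() as pool:
            return dict(pool.imap_unordered(load, paths))


    def with_executor(paths):
        # ProcessPoolExecutor.map() has a compatible call shape but a different
        # underlying implementation (futures plus a queue-management thread)
        with concurrent.futures.ProcessPoolExecutor() as executor:
            return dict(executor.map(load, paths))


    if __name__ == '__main__':
        paths = ['a/Manifest', 'b/Manifest']
        assert with_pool(paths) == with_executor(paths)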
Comment 34 Michael Brakemeier 2018-03-13 22:54:47 UTC
(In reply to Zac Medico from comment #33)
> Hopefully using map instead of imap_unordered will prevent the deadlocks.

I'm sorry, but forget about the promising results from comment #31. I missed that the default USE flag rsync-verify had been reverted while I was updating on Sunday.
Comment 35 Zac Medico gentoo-dev 2018-03-13 23:52:57 UTC
Created attachment 523828 [details, diff]
Use concurrent.futures.ProcessPoolExecutor for python3.5+

With this patch, gemato will use concurrent.futures.ProcessPoolExecutor for python3.5+. The underlying implementation is different from multiprocessing.Pool, so hopefully it doesn't deadlock.
Comment 36 Michael Brakemeier 2018-03-18 21:01:06 UTC
For the record:
Current state - unable to reproduce the deadlock.

What I did:
- Actively selected python 3.6 using "eselect python update" (never did this before on the notoriously hanging box).
- Update world, including gemato 12.0.
- Re-merged portage with use-flag rsync-verify enabled.
-> one week over-night testing with sync once per hour via cronjob, no hang.

- Update world, including gemato 12.1
-> a weekend 24h testing with sync once per hour via cronjob, no hang.
Comment 37 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-03-18 21:03:07 UTC
(In reply to Michael Brakemeier from comment #36)
> For the record:
> Current state - unable to reproduce the deadlock.
> 
> What I did:
> - Actively selected python 3.6 using "eselect python update" (never did this
> before on the notoriously hanging box).
> - Update world, including gemato 12.0.
> - Re-merged portage with use-flag rsync-verify enabled.
> -> one week over-night testing with sync once per hour via cronjob, no hang.
> 
> - Update world, including gemato 12.1
> -> a weekend 24h testing with sync once per hour via cronjob, no hang.

It would be nice if you could return to 3.5 and confirm it deadlocks again. Then we'd at least have really reliable confirmation that it doesn't happen with 3.6.
Comment 38 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-04-03 22:04:59 UTC
Given no hope of finding the issue, I think I'll be removing multiprocessing support from gemato in a few days in order to finally make it possible to stabilize it. Stability is more important than speed.
Comment 39 Zac Medico gentoo-dev 2018-04-04 00:19:46 UTC
(In reply to Michał Górny from comment #38)
> Given no hope of finding the issue, I think I'll be removing multiprocessing
> support from gemato in a few days in order to finally make it possible to
> stabilize it. Stability is more important than speed.

Maybe do something like the Pool class from attachment 523828 [details, diff] that reduces it to a plain map() call for the processes == 1 case.
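
Something along these lines, as a hypothetical sketch of that idea (the class name and interface here are made up, not taken from the attachment):

    import multiprocessing


    class MaybeParallelMap:
        """Run map() serially for one job, via a process pool otherwise."""

        def __init__(self, processes):
            self.processes = processes

        def map(self, func, iterable):
            if self.processes == 1:
                # plain built-in map: no pool, no fork, no deadlock risk
                return list(map(func, iterable))
            with multiprocessing.Pool(self.processes) as pool:
                return pool.map(func, iterable)

A -j/--jobs value of 1 would then take the serial path without touching multiprocessing at all.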
Comment 40 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-04-04 07:11:49 UTC
That doesn't help much if we would either have to force '-j 1' for everyone or tell users to check whether it hangs.
Comment 41 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-04-04 07:28:34 UTC
Hmm, maybe I'll try replacing the other imap_unordered() with a map() fallback for <3.5.4, but I suppose it won't help with this.
Comment 42 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-04-04 17:09:33 UTC
Ok, please try 12.2. If it doesn't work, it will probably be the long-term ~arch target for a 'brave downgrade'.
Comment 43 Michael Brakemeier 2018-04-04 18:58:44 UTC
I've tried to reproduce the deadlock with python 3.6, and also 3.5 as requested, and gemato 12.1 for the last three weeks, without result.

Whatever was changed, it seems that, at least for me, these changes were successful.
Comment 44 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-04-07 06:59:23 UTC
Ping. Can anyone else reproduce this, or not? This is crucial for knowing whether we can stabilize gemato.
Comment 45 Alexander Tsoy 2018-04-07 09:49:59 UTC
I finally managed to get access to a system where I previously saw most of the deadlocks (Ivy Bridge, 4C/8T), and I'm also testing gemato-12.2 on other systems. No problems so far, but ideally it needs several days of testing to make sure the new gemato really fixed the problem.
Comment 46 Alexander Tsoy 2018-04-09 09:47:49 UTC
I tested gemato-12.2 + python-3.5 in a loop and unfortunately reproduced the deadlock on two different systems. It took 257 iterations on one of the systems and 588 iterations on the other.
Comment 47 Zac Medico gentoo-dev 2018-04-09 16:10:59 UTC
Created attachment 526914 [details, diff]
Use concurrent.futures.ProcessPoolExecutor for python3.5+

Rebase on gemato-12.2.
Comment 48 Alexander Tsoy 2018-04-09 17:45:00 UTC
(In reply to Zac Medico from comment #47)
> Created attachment 526914 [details, diff]
> Use concurrent.futures.ProcessPoolExecutor for python3.5+
> 
> Rebase on gemato-12.2.

Got deadlock with this patch on the 5th iteration of the loop. =/
Comment 49 Zac Medico gentoo-dev 2018-04-09 18:37:21 UTC
(In reply to Alexander Tsoy from comment #48)
> (In reply to Zac Medico from comment #47)
> > Created attachment 526914 [details, diff]
> > Use concurrent.futures.ProcessPoolExecutor for python3.5+
> > 
> > Rebase on gemato-12.2.
> 
> Got deadlock with this patch on the 5th iteration of the loop. =/

It would be interesting to see what the traceback looks like when you kill it.
Comment 50 Alexander Tsoy 2018-04-09 19:10:22 UTC
(In reply to Zac Medico from comment #49)
Process Process-10:
Process Process-9:
Traceback (most recent call last):
  File "/usr/lib64/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/usr/lib64/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.5/concurrent/futures/process.py", line 169, in _process_worker
    call_item = call_queue.get(block=True)
  File "/usr/lib64/python3.5/multiprocessing/queues.py", line 93, in get
    with self._rlock:
  File "/usr/lib64/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.5/gemato", line 11, in <module>
    load_entry_point('gemato==12.2', 'console_scripts', 'gemato')()
  File "/usr/lib64/python3.5/site-packages/gemato/cli.py", line 549, in setuptools_main
    sys.exit(main(argv))
  File "/usr/lib64/python3.5/site-packages/gemato/cli.py", line 535, in main
    return vals.cmd()
  File "/usr/lib64/python3.5/site-packages/gemato/cli.py", line 231, in __call__
    ret &= m.assert_directory_verifies(relpath, **self.kwargs)
  File "/usr/lib64/python3.5/site-packages/gemato/recursiveloader.py", line 587, in assert_directory_verifies
    entry_dict = self.get_file_entry_dict(path)
  File "/usr/lib64/python3.5/site-packages/gemato/recursiveloader.py", line 520, in get_file_entry_dict
    verify=verify_manifests)
  File "/usr/lib64/python3.5/site-packages/gemato/recursiveloader.py", line 407, in load_manifests_for_path
    self.loaded_manifests.update(manifests)
  File "/usr/lib64/python3.5/concurrent/futures/_base.py", line 556, in result_iterator
    yield future.result()
  File "/usr/lib64/python3.5/concurrent/futures/_base.py", line 400, in result
    self._condition.wait(timeout)
  File "/usr/lib64/python3.5/threading.py", line 293, in wait
    waiter.acquire()
KeyboardInterrupt
Comment 51 Zac Medico gentoo-dev 2018-04-09 23:13:02 UTC
(In reply to Alexander Tsoy from comment #50)
>   File "/usr/lib64/python3.5/concurrent/futures/_base.py", line 400, in
> result
>     self._condition.wait(timeout)

It's waiting for a set_result call there, and those calls are managed by a _queue_management_worker thread that reads from a result_queue, which is an instance of multiprocessing.SimpleQueue. So it looks like the deadlock probably happens somewhere in the queue, much like https://bugs.python.org/issue32576.
Comment 52 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-04-10 06:35:57 UTC
Created attachment 526984 [details, diff]
gc-hack.diff

Updated GC hack, in case anyone wanted to try it.
Comment 53 Alexander Tsoy 2018-04-10 08:00:48 UTC
(In reply to Michał Górny from comment #52)
> Created attachment 526984 [details, diff]
> gc-hack.diff
> 
> Updated GC hack, in case anyone wanted to try it.
Thanks. I'm testing it now.
Comment 54 Alexander Tsoy 2018-04-10 10:44:29 UTC
The GC hack doesn't help. I just reproduced the deadlock on two different machines again.
Comment 55 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-04-10 11:32:55 UTC
Could you try with -j1? That would confirm whether it happens only with simultaneous operation or whether it's a problem with message passing.
Comment 56 Alexander Tsoy 2018-04-12 22:26:52 UTC
(In reply to Michał Górny from comment #55)
No deadlocks for 2 days of testing with -j1.
Comment 57 Larry the Git Cow gentoo-dev 2018-04-13 13:31:26 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=ff513d3274d3d103d701025bcc8255d49b157456

commit ff513d3274d3d103d701025bcc8255d49b157456
Author:     Michał Górny <mgorny@gentoo.org>
AuthorDate: 2018-04-13 13:31:07 +0000
Commit:     Michał Górny <mgorny@gentoo.org>
CommitDate: 2018-04-13 13:31:14 +0000

    app-portage/gemato: Bump to 13.0 (non-parallel version)
    
    Bug: https://bugs.gentoo.org/647964

 app-portage/gemato/Manifest           |  1 +
 app-portage/gemato/gemato-13.0.ebuild | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)
Comment 58 Larry the Git Cow gentoo-dev 2018-04-17 02:26:40 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/proj/portage.git/commit/?id=7b448e90034de00bc177b3d809aeaf9b94d55ee2

commit 7b448e90034de00bc177b3d809aeaf9b94d55ee2
Author:     Zac Medico <zmedico@gentoo.org>
AuthorDate: 2018-04-13 17:01:45 +0000
Commit:     Zac Medico <zmedico@gentoo.org>
CommitDate: 2018-04-17 02:25:25 +0000

    rsync: default to sync-rsync-verify-jobs = 1 (bug 650696)
    
    Some users have reported that using all processors to verify
    manifests results in overloading, therefore default to using
    a single processor. On modern hardware, verification of the
    gentoo repository completes in less than 20 seconds, so using
    multiple processors is not really necessary. Also, gemato-13.0
    disables parallel verification due to the unresolved deadlock
    issue reported in bug 647964, so this brings the default
    portage configuration into alignment with current gemato
    behavior.
    
    Bug: https://bugs.gentoo.org/650696
    Bug: https://bugs.gentoo.org/647964

 cnf/repos.conf                          | 1 +
 man/portage.5                           | 8 +++++---
 pym/portage/sync/modules/rsync/rsync.py | 7 ++++++-
 3 files changed, 12 insertions(+), 4 deletions(-)
Comment 59 Larry the Git Cow gentoo-dev 2018-08-01 11:36:31 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=896aadcc7d554a275953ac86458b0ecbbc0df55a

commit 896aadcc7d554a275953ac86458b0ecbbc0df55a
Author:     Michał Górny <mgorny@gentoo.org>
AuthorDate: 2018-08-01 11:02:31 +0000
Commit:     Michał Górny <mgorny@gentoo.org>
CommitDate: 2018-08-01 11:36:24 +0000

    app-portage/gemato: Add 14.0+multiprocessing version
    
    Bug: https://bugs.gentoo.org/647964

 app-portage/gemato/Manifest            |  1 +
 app-portage/gemato/gemato-14.0m.ebuild | 53 ++++++++++++++++++++++++++++++++++
 profiles/package.mask                  |  1 +
 3 files changed, 55 insertions(+)

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=ec821e92460679181bf2d63a8f899f0b8fbf7e24

commit ec821e92460679181bf2d63a8f899f0b8fbf7e24
Author:     Michał Górny <mgorny@gentoo.org>
AuthorDate: 2018-08-01 10:09:10 +0000
Commit:     Michał Górny <mgorny@gentoo.org>
CommitDate: 2018-08-01 11:36:24 +0000

    app-portage/gemato: Add a live ebuild for multiprocessing branch
    
    Add a separate ebuild for multiprocessing branch and keep it p.masked
    due to the known hangs.  Hopefully it will help testing and finding
    a proper solution to that.
    
    Bug: https://bugs.gentoo.org/647964

 app-portage/gemato/gemato-9999m.ebuild | 50 ++++++++++++++++++++++++++++++++++
 profiles/package.mask                  |  6 ++++
 2 files changed, 56 insertions(+)
Comment 60 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-08-01 12:21:35 UTC
I'd appreciate it if you could try reproducing the issue with the newest version. Once you confirm it still happens, please check whether python3.7 solves it.
Comment 61 Alexander Tsoy 2018-08-05 11:08:22 UTC
(In reply to Michał Górny from comment #60)
> I'd appreciate if you could try reproducing the issues with the newest
> version.  Once you confirm it still happens, please try if python3.7 solves
> it.

Can't reproduce it anymore. I tested it with python3.6.
Comment 62 Zac Medico gentoo-dev 2023-09-28 05:19:59 UTC
This may have been due to mixing threading with the default 'fork' multiprocessing start method, which is what happens inside multiprocessing.Pool. Due to deadlocks, in https://github.com/python/cpython/issues/84559 they are planning to change the default multiprocessing start method to 'spawn' in python 3.14.
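
For reference, a caller can already opt out of 'fork' explicitly by requesting a 'spawn' context (a minimal sketch, not anything gemato currently does):

    import multiprocessing


    def work(item):
        return item * item


    if __name__ == '__main__':
        # 'spawn' starts fresh interpreter processes, so workers do not inherit
        # the parent's thread and lock state the way fork()ed children do
        ctx = multiprocessing.get_context('spawn')
        with ctx.Pool(processes=2) as pool:
            print(pool.map(work, range(5)))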