Summary: | sys-libs/libomp-5.0.9999 test suite randomly deadlocks (in multiple tests) | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Kent Fredric (IRC: kent\n) (RETIRED) <kentnl> |
Component: | Current packages | Assignee: | Bernard Cafarelli <voyageur> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | hahnjo, llvm, mgorny |
Priority: | Normal | Keywords: | TESTFAILURE |
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
See Also: | https://bugs.llvm.org/show_bug.cgi?id=35731 | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: | sys-libs/libomp-5.0.0:20171121-070509.log |
Description
Kent Fredric (IRC: kent\n) (RETIRED)
2017-11-21 22:17:47 UTC
Curiously enough, I just had the same problem with worksharing/for/kmp_sch_simd_guided.c.

The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=22506561ddd8202cd93fd85f92152637a418d600

commit 22506561ddd8202cd93fd85f92152637a418d600
Author: Michał Górny <mgorny@gentoo.org>
AuthorDate: 2017-12-22 15:56:31 +0000
Commit: Michał Górny <mgorny@gentoo.org>
CommitDate: 2017-12-22 15:57:25 +0000

sys-libs/libomp: Restrict tests to avoid hangs

Bug: https://bugs.gentoo.org/638410

sys-libs/libomp/libomp-4.0.1.ebuild | 2 ++
sys-libs/libomp/libomp-5.0.0.ebuild | 2 ++
sys-libs/libomp/libomp-5.0.1.ebuild | 2 ++
sys-libs/libomp/libomp-9999.ebuild | 3 ++-
4 files changed, 8 insertions(+), 1 deletion(-)

Ok, good news is, I've been able to figure out the cause of my hang. It was due to the PDS scheduler from -pf kernels. Do you use that as well or does your hang have a different cause?

The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=635af6abf583f1e17860c0eb72cfe74b474bdfdf

commit 635af6abf583f1e17860c0eb72cfe74b474bdfdf
Author: Michał Górny <mgorny@gentoo.org>
AuthorDate: 2017-12-30 22:30:29 +0000
Commit: Michał Górny <mgorny@gentoo.org>
CommitDate: 2017-12-30 22:43:39 +0000

sys-libs/libomp: Disallow kernels with PDS scheduler

The PDS scheduler (used e.g. in current versions of -pf kernel) does not implement the sched_yield() call which is used by the OpenMP implementation to switch between threads. As a result, using OpenMP with this scheduler results in horrible performance with 100% CPU usage on looped noop syscall calls.
Closes: https://bugs.gentoo.org/638410

sys-libs/libomp/libomp-4.0.1.ebuild | 13 ++++++++++---
sys-libs/libomp/libomp-5.0.0.ebuild | 13 ++++++++++---
sys-libs/libomp/libomp-5.0.1.ebuild | 13 ++++++++++---
sys-libs/libomp/libomp-9999.ebuild | 13 ++++++++++---
4 files changed, 40 insertions(+), 12 deletions(-)

Would it be possible to do the config check only when tests are enabled? Regular (work-accomplishing) OpenMP apps and coordination primitives run just fine with PDS since they don't needlessly bang their heads together without making progress.

In any case I've just complained on the PDS blog. The sched_yield() removal commit in PDS can be easily reverted if one is so inclined, but still this is disappointing.

Technically it is possible but I'm not convinced it's a good idea. I don't know much about this library but I can see that the sched_yield() calls can legally occur within the library code itself. I'm not sure how likely this problem is for regular programs but I suspect it can harm performance. Finally, upstream declared that they require POSIX-compliant CPU scheduler behavior and this could cause any kind of breakage in the future.

(In reply to Michał Górny from comment #6)
> Technically it is possible but I'm not convinced it's a good idea. I don't
> know much about this library but I can see that the sched_yield() calls can
> legally occur within the library code itself.

The calls to sched_yield() are made for barriers when one thread reaches the synchronization point and wants to give other threads the possibility to finish as well. Thread switching is most important here when the machine is oversubscribed, i.e. there are more threads than cores. When every thread has its own core, sched_yield() will find no other runnable thread anyway because all threads are already executing in parallel.

> I'm not sure how likely this problem is for regular programs but I suspect it
> can harm performance.
Barriers are found in _EVERY_ OpenMP program, literally: There are explicit and implicit barriers, the latter for example at the end of a parallel region. Not oversubscribing the machine might work, but I'd rather not risk it.

> Finally, upstream declared that they require POSIX-compliant CPU scheduler
> behavior and this could cause any kind of breakage in the future.

Please rather take this as my personal opinion :-) My point is that applications and libraries are building on lower level parts of a system. POSIX is one of the most fundamental standards, declares sched_yield() and defines what functionality it should provide (see LLVM Bugzilla for my full analysis). If that's not met, there isn't much the library can do...

I have been able to convince the PDS maintainer to restore sched_yield() support, and the latest release is now available for 4.14.x, see: http://cchalpha.blogspot.de/2018/01/pds-098i-release.html

It now has a meaningful implementation of sched_yield() again, enabled by default. \o/

As proof I have been running the libomp tests myself with different settings and have learned that the test failures were NOT repeatably caused by this (probably some other early 5.0 bug); in fact I didn't have any failures even with sched_yield() as nop. However, excessive runtime certainly is observable. With sched_yield() restored to actually do something, the tests generally run much faster (in hindsight obvious).

Revert please? :)

Maybe instead of preventing the build outright, just issue a warning?

(In reply to Holger Hoffstätte from comment #9)
> Maybe instead of preventing the build outright, just issue a warning?

Unless I've done something wrong, it *was* supposed to be a warning and not a fatal error.

(In reply to Michał Górny from comment #10)
> (In reply to Holger Hoffstätte from comment #9)
> > Maybe instead of preventing the build outright, just issue a warning?
>
> Unless I've done something wrong, it *was* supposed to be a warning and not
> a fatal error.

You are right! I misread the ebuild, it's all good and works. Cheers! \o/

If you wouldn't mind helping me a bit, I'd find it helpful if you found out what version range is affected.

(In reply to Michał Górny from comment #12)
> If you wouldn't mind helping me a bit, I'd find it helpful if you found out
> what version range is affected.

It was set to 0 (no yield) in 0.98c (http://cchalpha.blogspot.de/2017/10/pds-098c-release.html), removed in 0.98f (http://cchalpha.blogspot.de/2017/11/pds-098f-release.html) and restored in 0.98i (see above).

The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=a4d33e643f3aa8cb5cf8b0555328e846e4f6f9de

commit a4d33e643f3aa8cb5cf8b0555328e846e4f6f9de
Author: Michał Górny <mgorny@gentoo.org>
AuthorDate: 2018-01-18 21:23:13 +0000
Commit: Michał Górny <mgorny@gentoo.org>
CommitDate: 2018-01-18 21:25:17 +0000

sys-libs/libomp: List broken PDS scheduler versions

The sched_yield() call has been reintroduced in PDS 0.98i. Improve the kernel check to explicitly list which PDS versions are affected, and which -pf kernels are affected (sadly, no fixed version yet).

Big thanks to Holger Hoffstätte for convincing upstream to fix this and all the research!
Bug: https://bugs.gentoo.org/638410

sys-libs/libomp/libomp-4.0.1.ebuild | 5 ++++-
sys-libs/libomp/libomp-5.0.1.ebuild | 5 ++++-
sys-libs/libomp/libomp-6.0.9999.ebuild | 5 ++++-
sys-libs/libomp/libomp-9999.ebuild | 5 ++++-
4 files changed, 16 insertions(+), 4 deletions(-)

The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=0d9c0f4a785d5021d3c01307388d0ecbf0f63cd4

commit 0d9c0f4a785d5021d3c01307388d0ecbf0f63cd4
Author: Michał Górny <mgorny@gentoo.org>
AuthorDate: 2018-01-30 19:29:00 +0000
Commit: Michał Górny <mgorny@gentoo.org>
CommitDate: 2018-01-30 19:31:21 +0000

sys-libs/libomp: Perform PDS checks only for relevant kernel versions

Update the PDS check logic to apply only when running the Linux kernel, versions between 4.13 and 4.15. That covers the range of -pf kernels that have the broken PDS version, and I think we can reasonably assume users will not be updating the patch along with the kernel.

Also, perform the check only once in pkg_pretend. There is really no point in repeating it as packages do not alter kernel configuration.

Closes: https://bugs.gentoo.org/638410

sys-libs/libomp/libomp-4.0.1.ebuild | 24 +++++++++++++++---------
sys-libs/libomp/libomp-5.0.1.ebuild | 24 +++++++++++++++---------
sys-libs/libomp/libomp-6.0.9999.ebuild | 24 +++++++++++++++---------
sys-libs/libomp/libomp-9999.ebuild | 24 +++++++++++++++---------
4 files changed, 60 insertions(+), 36 deletions(-)
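
For readers following the final commit message, the narrowed check could be sketched roughly as below. This is only an illustration in Python, not the ebuild code (which is shell and is not reproduced in this report). The names `parse_kernel_release` and `should_warn_about_pds`, the `pds_enabled` parameter, and the reading of "between 4.13 and 4.15" as covering every 4.13.x through 4.15.x release are all assumptions made for the example.

```python
import re

# Assumption: "between 4.13 and 4.15" is read inclusively (4.13.x .. 4.15.x).
AFFECTED_MIN = (4, 13)
AFFECTED_MAX = (4, 15)


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '4.14.12-pf3'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def should_warn_about_pds(system, release, pds_enabled):
    """Hypothetical helper: is the PDS warning relevant for this kernel?

    system      -- OS name, e.g. the value of platform.system()
    release     -- kernel release string, e.g. platform.release()
    pds_enabled -- whether the kernel configuration selects the PDS
                   scheduler (detecting that is outside this sketch)
    """
    # Only Linux kernels built with PDS can be affected at all.
    if system != "Linux" or not pds_enabled:
        return False
    # Tuple comparison orders by major version first, then minor.
    return AFFECTED_MIN <= parse_kernel_release(release) <= AFFECTED_MAX
```

With these assumptions, a 4.14 -pf kernel with PDS enabled would trigger the warning, while a 4.16 kernel or a non-Linux system would not. Whether the real check treats the 4.15 boundary as inclusive is not stated in the commit message.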