Bug 638410

Summary: sys-libs/libomp-5.0.9999 test suite randomly deadlocks (in multiple tests)
Product: Gentoo Linux
Component: Current packages
Status: RESOLVED FIXED
Severity: normal
Priority: Normal
Version: unspecified
Hardware: All
OS: Linux
Reporter: Kent Fredric (IRC: kent\n) (RETIRED) <kentnl>
Assignee: Bernard Cafarelli <voyageur>
CC: hahnjo, llvm, mgorny
Keywords: TESTFAILURE
See Also: https://bugs.llvm.org/show_bug.cgi?id=35731
Whiteboard:
Package list:
Runtime testing required: ---
Attachments: sys-libs/libomp-5.0.0:20171121-070509.log

Description Kent Fredric (IRC: kent\n) (RETIRED) gentoo-dev 2017-11-21 22:17:47 UTC
Created attachment 505584 [details]
sys-libs/libomp-5.0.0:20171121-070509.log

This was left running overnight and was still going when I woke in the morning.

I then sent one of the workers a SIGTERM to abort the tests.


>>> Working in BUILD_DIR: "/var/tmp/portage/sys-libs/libomp-5.0.0/work/openmp-5.0.0.src-abi_x86_64.amd64"
ninja -v -j3 -l0 check-libomp
[0/1] cd /var/tmp/portage/sys-libs/libomp-5.0.0/work/openmp-5.0.0.src-abi_x86_64.amd64/runtime/test && /var/tmp/portage/sys-libs/libomp-5.0.0/temp/python2.7/bin/python /usr/bin/lit -sv --show-unsupported --show-xfail /var/tmp/portage/sys-libs/libomp-5.0.0/work/openmp-5.0.0.src-abi_x86_64.amd64/runtime/test
-- Testing: 108 tests, 2 threads --
Testing: 0 .. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90..
FAIL: libomp :: misc_bugs/cancellation_for_sections.c (108 of 108)
******************** TEST 'libomp :: misc_bugs/cancellation_for_sections.c' FAILED ********************
Script:
--
/usr/bin/x86_64-pc-linux-gnu-clang -fopenmp -I /var/tmp/portage/sys-libs/libomp-5.0.0/work/openmp-5.0.0.src/runtime/test -I /var/tmp/portage/sys-libs/libomp-5.0.0/work/openmp-5.0.0.src-abi_x86_64.amd64/runtime/src -L /var/tmp/portage/sys-libs/libomp-5.0.0/work/openmp-5.0.0.src-abi_x86_64.amd64/runtime/src  /var/tmp/portage/sys-libs/libomp-5.0.0/work/openmp-5.0.0.src/runtime/test/misc_bugs/cancellation_for_sections.c -o /var/tmp/portage/sys-libs/libomp-5.0.0/work/openmp-5.0.0.src-abi_x86_64.amd64/runtime/test/misc_bugs/Output/cancellation_for_sections.c.tmp -lm -latomic && env OMP_CANCELLATION=true /var/tmp/portage/sys-libs/libomp-5.0.0/work/openmp-5.0.0.src-abi_x86_64.amd64/runtime/test/misc_bugs/Output/cancellation_for_sections.c.tmp
--
Exit Code: -15

Command Output (stdout):
--
$ "/usr/bin/x86_64-pc-linux-gnu-clang" "-fopenmp" "-I" "/var/tmp/portage/sys-libs/libomp-5.0.0/work/openmp-5.0.0.src/runtime/test" "-I" "/var/tmp/portage/sys-libs/libomp-5.0.0/work/openmp-5.0.0.src-abi_x86_64.amd64/runtime/src" "-L" "/var/tmp/portage/sys-libs/libomp-5.0.0/work/openmp-5.0.0.src-abi_x86_64.amd64/runtime/src" "/var/tmp/portage/sys-libs/libomp-5.0.0/work/openmp-5.0.0.src/runtime/test/misc_bugs/cancellation_for_sections.c" "-o" "/var/tmp/portage/sys-libs/libomp-5.0.0/work/openmp-5.0.0.src-abi_x86_64.amd64/runtime/test/misc_bugs/Output/cancellation_for_sections.c.tmp" "-lm" "-latomic"
$ "/var/tmp/portage/sys-libs/libomp-5.0.0/work/openmp-5.0.0.src-abi_x86_64.amd64/runtime/test/misc_bugs/Output/cancellation_for_sections.c.tmp"
note: command had no output on stdout or stderr
error: command failed with exit status: -15

--

********************
Testing Time: 49394.21s
********************
Failing Tests (1):
    libomp :: misc_bugs/cancellation_for_sections.c

...

********************
Expected Failing Tests (1):
    libomp :: worksharing/for/omp_for_bigbounds.c

  Expected Passes    : 90
  Expected Failures  : 1
  Unsupported Tests  : 16
  Unexpected Failures: 1


 * Package:    sys-libs/libomp-5.0.0
 * Repository: gentoo
 * Maintainer: voyageur@gentoo.org llvm@gentoo.org
 * USE:        abi_x86_64 amd64 elibc_glibc kernel_linux test userland_GNU
 * FEATURES:   ccache preserve-libs sandbox test userpriv usersandbox
Comment 1 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2017-12-02 16:33:04 UTC
Curiously enough, I just had the same problem with worksharing/for/kmp_sch_simd_guided.c.
Comment 2 Larry the Git Cow gentoo-dev 2017-12-22 15:57:33 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=22506561ddd8202cd93fd85f92152637a418d600

commit 22506561ddd8202cd93fd85f92152637a418d600
Author:     Michał Górny <mgorny@gentoo.org>
AuthorDate: 2017-12-22 15:56:31 +0000
Commit:     Michał Górny <mgorny@gentoo.org>
CommitDate: 2017-12-22 15:57:25 +0000

    sys-libs/libomp: Restrict tests to avoid hangs
    
    Bug: https://bugs.gentoo.org/638410

 sys-libs/libomp/libomp-4.0.1.ebuild | 2 ++
 sys-libs/libomp/libomp-5.0.0.ebuild | 2 ++
 sys-libs/libomp/libomp-5.0.1.ebuild | 2 ++
 sys-libs/libomp/libomp-9999.ebuild  | 3 ++-
 4 files changed, 8 insertions(+), 1 deletion(-)
Comment 3 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2017-12-30 18:35:38 UTC
OK, the good news is that I've been able to figure out the cause of my hang: it was due to the PDS scheduler from -pf kernels. Do you use that as well, or does your hang have a different cause?
Comment 4 Larry the Git Cow gentoo-dev 2017-12-30 22:43:46 UTC
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=635af6abf583f1e17860c0eb72cfe74b474bdfdf

commit 635af6abf583f1e17860c0eb72cfe74b474bdfdf
Author:     Michał Górny <mgorny@gentoo.org>
AuthorDate: 2017-12-30 22:30:29 +0000
Commit:     Michał Górny <mgorny@gentoo.org>
CommitDate: 2017-12-30 22:43:39 +0000

    sys-libs/libomp: Disallow kernels with PDU scheduler
    
    The PDU scheduler (used e.g. in current versions of -pf kernel) does not
    implement the sched_yield() call which is used by the OpenMP
    implementation to switch between threads. As a result, using OpenMP with
    this scheduler results in horrible performance with 100% CPU usage
    on looped noop syscall calls.
    
    Closes: https://bugs.gentoo.org/638410

 sys-libs/libomp/libomp-4.0.1.ebuild | 13 ++++++++++---
 sys-libs/libomp/libomp-5.0.0.ebuild | 13 ++++++++++---
 sys-libs/libomp/libomp-5.0.1.ebuild | 13 ++++++++++---
 sys-libs/libomp/libomp-9999.ebuild  | 13 ++++++++++---
 4 files changed, 40 insertions(+), 12 deletions(-)
Comment 5 Holger Hoffstätte 2017-12-30 23:39:33 UTC
Would it be possible to do the config check only when tests are enabled?

Regular (work-accomplishing) OpenMP apps and coordination primitives run just fine with PDS since they don't needlessly bang their heads together without making progress.

In any case I've just complained on the PDS blog. The sched_yield() removal commit in PDS can be easily reverted if one is so inclined, but still this is disappointing.
Comment 6 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-01-01 11:05:01 UTC
Technically it is possible but I'm not convinced it's a good idea. I don't know much about this library but I can see that the sched_yield() calls can legally occur within the library code itself. I'm not sure how likely this problem is for regular programs but I suspect it can harm performance. Finally, upstream declared that they require POSIX-compliant CPU scheduler behavior and this could cause any kind of breakage in the future.
Comment 7 Jonas Hahnfeld 2018-01-01 13:11:06 UTC
(In reply to Michał Górny from comment #6)
> Technically it is possible but I'm not convinced it's a good idea. I don't
> know much about this library but I can see that the sched_yield() calls can
> legally occur within the library code itself.

The calls to sched_yield() are made for barriers when one thread reaches the synchronization point and wants to give other threads the possibility to finish as well. Thread switching is most important here when the machine is oversubscribed, i.e. there are more threads than cores. When every thread has its own core, sched_yield() will find no other runnable thread anyway because all threads are already executing in parallel.

> I'm not sure how likely this problem is for regular programs but I suspect it
> can harm performance.

Barriers are found in _EVERY_ OpenMP program, literally: There are explicit and implicit barriers, the latter for example at the end of a parallel region. Not oversubscribing the machine might work, but I'd rather not risk it.

> Finally, upstream declared that they require POSIX-compliant CPU scheduler
> behavior and this could cause any kind of breakage in the future.

Please rather take this as my personal opinion :-) My point is that applications and libraries are building on lower level parts of a system. POSIX is one of the most fundamental standards, declares sched_yield() and defines what functionality it should provide (see LLVM Bugzilla for my full analysis). If that's not met, there isn't much the library can do...
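
For illustration, here is a toy sketch of the spin-and-yield pattern described in this comment, written in Python rather than the C of the real runtime. It is not libomp's barrier: `SpinBarrier`, `run` and `yield_cpu` are hypothetical names, and the fallback to `time.sleep(0)` is an assumption for platforms where `os.sched_yield()` is missing. It only shows where the yield sits: inside the loop that waits for the remaining threads.

```python
import os
import threading
import time

# Assumption: use os.sched_yield() where the platform offers it, otherwise
# approximate a yield with a zero-length sleep.
yield_cpu = getattr(os, "sched_yield", lambda: time.sleep(0))


class SpinBarrier:
    """Single-use barrier (hypothetical): waiters spin and yield between checks."""

    def __init__(self, parties):
        self.parties = parties
        self.arrived = 0
        self.lock = threading.Lock()

    def wait(self):
        # Register this thread's arrival at the synchronization point.
        with self.lock:
            self.arrived += 1
        spins = 0
        # Busy-wait until every party has arrived. If the yield were a no-op,
        # this loop would keep the CPU busy without letting late threads run,
        # which matters most when there are more threads than cores.
        while self.arrived < self.parties:
            yield_cpu()
            spins += 1
        return spins


def run(num_threads):
    """Start num_threads workers that meet at one barrier; return spin counts."""
    barrier = SpinBarrier(num_threads)
    spins = [None] * num_threads

    def worker(index):
        time.sleep(0.001 * index)  # stagger arrivals so early threads must wait
        spins[index] = barrier.wait()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return spins
```

Calling `run(4)` returns one spin count per thread. The last thread to arrive does not need to spin at all, while earlier arrivals typically spin until it shows up.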
Comment 8 Holger Hoffstätte 2018-01-16 11:05:49 UTC
I have been able to convince the PDS maintainer to restore sched_yield() support, and the latest release is now available for 4.14.x,
see: http://cchalpha.blogspot.de/2018/01/pds-098i-release.html
It now has a meaningful implementation of sched_yield() again, enabled
by default. \o/

As proof I have been running the libomp tests myself with different
settings and have learned that the test failures were NOT repeatably
caused by this (probably some other early 5.0 bug); in fact I didn't have
any failures even with sched_yield() as nop. However, excessive runtime
certainly is observable. With sched_yield() restored to actually do
something, the tests generally run much faster (in hindsight obvious).

Revert please? :)
Comment 9 Holger Hoffstätte 2018-01-16 11:07:05 UTC
Maybe instead of preventing the build outright, just issue a warning?
Comment 10 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-01-16 11:28:40 UTC
(In reply to Holger Hoffstätte from comment #9)
> Maybe instead of preventing the build outright, just issue a warning?

Unless I've done something wrong, it *was* supposed to be a warning and not a fatal error.
Comment 11 Holger Hoffstätte 2018-01-18 10:40:52 UTC
(In reply to Michał Górny from comment #10)
> (In reply to Holger Hoffstätte from comment #9)
> > Maybe instead of preventing the build outright, just issue a warning?
> 
> Unless I've done something wrong, it *was* supposed to be a warning and not
> a fatal error.

You are right! I misread the ebuild, it's all good and works.
Cheers! \o/
Comment 12 Michał Górny archtester Gentoo Infrastructure gentoo-dev Security 2018-01-18 11:27:08 UTC
If you wouldn't mind helping me a bit, I'd find it helpful if you found out what version range is affected.
Comment 13 Holger Hoffstätte 2018-01-18 20:30:39 UTC
(In reply to Michał Górny from comment #12)
> If you wouldn't mind helping me a bit, I'd find it helpful if you found out
> what version range is affected.

It was set to 0 (no yield) in 0.98c (http://cchalpha.blogspot.de/2017/10/pds-098c-release.html), removed in 0.98f (http://cchalpha.blogspot.de/2017/11/pds-098f-release.html) and restored in 0.98i (see above).
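
The version history above can be written down as a small lookup. This is only a sketch in Python: the function name `pds_yield_state` and the returned labels are made up for illustration, and releases before 0.98c are reported as unknown because this comment does not describe them.

```python
import re


def pds_yield_state(version):
    """Classify a PDS 0.98x release string by its sched_yield() handling.

    Based only on the three releases named in this comment; the labels are
    hypothetical and not taken from PDS itself.
    """
    match = re.fullmatch(r"0\.98([a-z])", version)
    if match is None:
        raise ValueError("only 0.98<letter> releases are covered: %r" % version)
    letter = match.group(1)
    if letter < "c":
        return "unknown"    # not described in this comment
    if letter < "f":
        return "disabled"   # yield set to 0 (no yield) starting with 0.98c
    if letter < "i":
        return "removed"    # yield support removed starting with 0.98f
    return "restored"       # sched_yield() restored in 0.98i
```

Under this reading, 0.98c through 0.98h are the releases a check would want to flag.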
Comment 14 Larry the Git Cow gentoo-dev 2018-01-18 21:25:25 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=a4d33e643f3aa8cb5cf8b0555328e846e4f6f9de

commit a4d33e643f3aa8cb5cf8b0555328e846e4f6f9de
Author:     Michał Górny <mgorny@gentoo.org>
AuthorDate: 2018-01-18 21:23:13 +0000
Commit:     Michał Górny <mgorny@gentoo.org>
CommitDate: 2018-01-18 21:25:17 +0000

    sys-libs/libomp: List broken PDS scheduler versions
    
    The sched_yield() call has been reintroduced in PDS 0.98i. Improve
    the kernel check to explicitly list which PDS versions are affected,
    and which -pf kernels are affected (sadly, no fixed version yet).
    Big thanks to Holger Hoffstätte for convincing upstream to fix this
    and all the research!
    
    Bug: https://bugs.gentoo.org/638410

 sys-libs/libomp/libomp-4.0.1.ebuild    | 5 ++++-
 sys-libs/libomp/libomp-5.0.1.ebuild    | 5 ++++-
 sys-libs/libomp/libomp-6.0.9999.ebuild | 5 ++++-
 sys-libs/libomp/libomp-9999.ebuild     | 5 ++++-
 4 files changed, 16 insertions(+), 4 deletions(-)
Comment 15 Larry the Git Cow gentoo-dev 2018-01-30 19:31:31 UTC
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=0d9c0f4a785d5021d3c01307388d0ecbf0f63cd4

commit 0d9c0f4a785d5021d3c01307388d0ecbf0f63cd4
Author:     Michał Górny <mgorny@gentoo.org>
AuthorDate: 2018-01-30 19:29:00 +0000
Commit:     Michał Górny <mgorny@gentoo.org>
CommitDate: 2018-01-30 19:31:21 +0000

    sys-libs/libomp: Perform PDS checks only for relevant kernel versions
    
    Update the PDS check logic to apply only when running the Linux kernel,
    versions between 4.13 and 4.15. That covers the range of -pf kernels
    that have the broken PDS version, and I think we can reasonably assume
    users will not be updating the patch along with the kernel.
    
    Also, perform the check only once in pkg_pretend. There is really
    no point in repeating it as packages do not alter kernel configuration.
    
    Closes: https://bugs.gentoo.org/638410

 sys-libs/libomp/libomp-4.0.1.ebuild    | 24 +++++++++++++++---------
 sys-libs/libomp/libomp-5.0.1.ebuild    | 24 +++++++++++++++---------
 sys-libs/libomp/libomp-6.0.9999.ebuild | 24 +++++++++++++++---------
 sys-libs/libomp/libomp-9999.ebuild     | 24 +++++++++++++++---------
 4 files changed, 60 insertions(+), 36 deletions(-)
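
The gating logic described in this last commit message can be sketched as follows. This is Python for illustration, not the ebuild's actual shell code; `parse_kernel_release` and `needs_pds_check` are hypothetical names, and treating "between 4.13 and 4.15" as inclusive at both ends is an assumption, since the commit message does not say.

```python
import re


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '4.14.13-pf'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def needs_pds_check(system, release):
    """Return True when the PDS configuration check is worth running.

    Assumption: the range 4.13 to 4.15 is inclusive at both ends.
    """
    if system != "Linux":
        return False
    return (4, 13) <= parse_kernel_release(release) <= (4, 15)
```

On a live system the inputs could come from `platform.system()` and `platform.release()`, for example `needs_pds_check(platform.system(), platform.release())` after `import platform`.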