Bug 740110 - sys-libs/glibc-2.32-r1 - SIGILL in libm with USE="multiarch" when running emerge/portageq...

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Gentoo Toolchain Maintainers

URL:	https://sourceware.org/PR26534
Whiteboard:
Keywords:	PATCH

Duplicates (1):	744586
Depends on:
Blocks:

Reported:	2020-09-02 19:21 UTC by Sven E.
Modified:	2020-09-25 18:54 UTC
CC List:	1 user

See Also:
Package list:
Runtime testing required:	---

Attachments
emerge --info (emerge.info,5.21 KB, application/x-info) 2020-09-02 19:25 UTC, Sven E.	Details
build.log (sys-libs:glibc-2.32-r1:20200903-193925.log.bz2,214.24 KB, text/plain) 2020-09-03 20:20 UTC, Sven E.	Details
libm.so (libm-2.32.so.bz2,605.23 KB, application/octet-stream) 2020-09-03 20:23 UTC, Sven E.	Details
backtrace (backtrace.bz2,1.11 KB, application/octet-stream) 2020-09-03 20:25 UTC, Sven E.	Details
cpuid.txt.bz2 (cpuid.txt.bz2,4.97 KB, text/plain) 2020-09-03 21:48 UTC, Sven E.	Details
cpuid.raw.txt.bz2 (cpuid.raw.txt.bz2,485 bytes, text/plain) 2020-09-03 21:48 UTC, Sven E.	Details

Description Sven E. 2020-09-02 19:21:48 UTC

After upgrading to glibc-2.32-r1 emerge and portageq receive SIGILL and terminate:
# emerge
Illegal instruction
# portageq
Illegal instruction


Reproducible: Always

Steps to Reproduce:
1. Can reproduce it on two similar machines just by upgrading glibc

Actual Results:  
Broken glibc

Expected Results:  
Working glibc

dmesg:
[17259097.333206] traps: emerge[7462] trap invalid opcode ip:7f9d56390868 sp:7ffde023e558 error:0
[17259097.333212]  in libm-2.32.so[7f9d56317000+9b000]
[17259112.509811] traps: portageq[7465] trap invalid opcode ip:7f1e2e351868 sp:7ffe829d55c8 error:0
[17259112.509816]  in libm-2.32.so[7f1e2e2d8000+9b000]

The offset 9b000 is identical on both systems, but changes with the gcc version (a minor difference in layout, I assume).
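
A side note on reading the trap lines above: in "ip:... in libm-2.32.so[start+size]" the bracket holds the start address and the length of the mapping, so the position of the faulting instruction inside the mapping is ip minus start. A minimal C sketch of that arithmetic (the helper name fault_offset is made up for illustration):

```c
#include <stdint.h>

/* Position of a faulting instruction inside a mapped object:
   the instruction pointer minus the start address of the mapping.
   Both values are taken from the kernel's "traps:" line. */
uint64_t fault_offset(uint64_t ip, uint64_t map_start)
{
    return ip - map_start;
}
```

Fed with the two trap lines above (ip 7f9d56390868 with start 7f9d56317000, and ip 7f1e2e351868 with start 7f1e2e2d8000), this gives 0x79868 in both cases, which suggests the same instruction faults on both machines.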

Building with USE="-multiarch" yields a working glibc. I will post follow-ups with additional information.

Comment 1 Sven E. 2020-09-02 19:25:29 UTC

Created attachment 657966 [details]
emerge --info

emerge --info

Comment 2 Sven E. 2020-09-02 21:46:23 UTC

With FEATURES="test", I am getting for USE="multiarch":
FAIL: elf/tst-ldconfig-ld_so_conf-update
FAIL: io/tst-copy_file_range
FAIL: math/test-double-acos
FAIL: math/test-double-asin
FAIL: math/test-double-pow
FAIL: math/test-double-tgamma
FAIL: math/test-double-vlen2-pow
FAIL: math/test-double-vlen4-pow
FAIL: math/test-float32x-acos
FAIL: math/test-float32x-asin
FAIL: math/test-float32x-pow
FAIL: math/test-float32x-tgamma
FAIL: math/test-float64-acos
FAIL: math/test-float64-asin
FAIL: math/test-float64-pow
FAIL: math/test-float64-tgamma
FAIL: stdlib/tst-system
FAIL: string/tst-strerror
FAIL: string/tst-strsignal
Summary of test results:
     19 FAIL
--
With FEATURES="test" and USE="-multiarch":
FAIL: elf/tst-ldconfig-ld_so_conf-update
FAIL: io/tst-copy_file_range
FAIL: stdlib/tst-system
FAIL: string/tst-strerror
FAIL: string/tst-strsignal
Summary of test results:
      5 FAIL
--
There seems to be quite a difference regarding the math tests.

Comment 3 Sven E. 2020-09-03 13:38:30 UTC

Short addition since the description was updated.

The problem occurs with glibc versions 2.32 and 2.32-r1 (so I assume it's not the patches added in r1 causing this). I tried gcc versions 9.2, 9.3, 10.1, 10.2 and binutils 2.32, 2.33.1, 2.34. So as far as I can tell, gcc and binutils versions do not really matter.

Comment 4 Sergei Trofimovich (RETIRED) gentoo-dev 2020-09-03 17:44:54 UTC

Please attach glibc's build.log and get a backtrace of the illegal instruction.

Usually you can use a core dump and gdb for that. Something like:

  $ gdb path/to/executable path/to/core
  (gdb) bt
  (gdb) disassemble

Comment 5 Sven E. 2020-09-03 18:16:15 UTC

Can I force emerge to keep the build.log?

Because the problems start in postrm:
   /usr/bin/sprof
   /usr/bin/pldd
   /sbin/sln
   /sbin/ldconfig

>>> Installing (1 of 1) sys-libs/glibc-2.32-r1::gentoo
 * Defaulting /etc/host.conf:multi to on
/usr/lib/portage/python3.7/phase-functions.sh: line 931:  4114 Illegal instruction     "$PORTAGE_BIN_PATH"/ebuild-ipc exit $?

Regarding gdb, will need to emerge that first, might take a while.

Comment 6 Sergei Trofimovich (RETIRED) gentoo-dev 2020-09-03 18:39:10 UTC

You can use the PORTAGE_LOGDIR variable to keep the build logs of successful builds. Something like:
    # PORTAGE_LOGDIR=/path/to/result emerge -v1 glibc

Comment 7 Sergei Trofimovich (RETIRED) gentoo-dev 2020-09-03 18:41:52 UTC

Can you also upload the bad libm-2.32.so binary? I'll try to look at the exact instruction that hides at the problematic offset.

Comment 8 Sven E. 2020-09-03 20:20:33 UTC

Created attachment 658202 [details]
build.log

Full build.log

Comment 9 Sven E. 2020-09-03 20:23:32 UTC

Created attachment 658204 [details]
libm.so

Defective libm-2.32.so (compressed)

Comment 10 Sven E. 2020-09-03 20:25:24 UTC

Created attachment 658206 [details]
backtrace

Comment 11 Sven E. 2020-09-03 20:27:07 UTC

Disassemble didn't work.

Let me emerge glibc again and see if I can create a minimal program to trigger the problem.

Comment 12 Sven E. 2020-09-03 21:11:05 UTC

Okay, the minimal example I wrote in C calling pow indeed gets SIGILL as well:

Program received signal SIGILL, Illegal instruction.
0x00007ffff7f09868 in ?? () from /lib64/libm.so.6

(gdb) bt
#0  0x00007ffff7f09868 in ?? () from /lib64/libm.so.6
#1  0x00007ffff7ec3484 in powf64 () from /lib64/libm.so.6
#2  0x00005555555551df in main () at math.c:8
(gdb) disassemble 0x00007ffff7f09860,0x00007ffff7f098ff
Dump of assembler code from 0x7ffff7f09860 to 0x7ffff7f098ff:
   0x00007ffff7f09860:  add    %rcx,%rdx
   0x00007ffff7f09863:  vmovq  %rax,%xmm6
=> 0x00007ffff7f09868:  vfmaddsd %xmm4,0x8(%rdx),%xmm6,%xmm0
   0x00007ffff7f0986f:  vmovsd 0x594e9(%rip),%xmm6        # 0x7ffff7f62d60
   0x00007ffff7f09877:  vmulsd 0x594f1(%rip),%xmm0,%xmm9        # 0x7ffff7f62d70
   0x00007ffff7f0987f:  vfmaddsd 0x18(%rdx),%xmm6,%xmm2,%xmm3
   0x00007ffff7f09886:  vmovsd 0x594da(%rip),%xmm6        # 0x7ffff7f62d68

This would be an FMA4 instruction, if I am not mistaken.

Comment 13 Sergei Trofimovich (RETIRED) gentoo-dev 2020-09-03 21:42:24 UTC

(In reply to Sven E. from comment #12)
> Okay, the minimalistic example I wrote in C calling pow indeed gets SIGILL
> aswell:
> 
> Program received signal SIGILL, Illegal instruction.
> 0x00007ffff7f09868 in ?? () from /lib64/libm.so.6
> 
> (gdb) bt
> #0  0x00007ffff7f09868 in ?? () from /lib64/libm.so.6
> #1  0x00007ffff7ec3484 in powf64 () from /lib64/libm.so.6
> #2  0x00005555555551df in main () at math.c:8
> (gdb) disassemble 0x00007ffff7f09860,0x00007ffff7f098ff
> Dump of assembler code from 0x7ffff7f09860 to 0x7ffff7f098ff:
>    0x00007ffff7f09860:  add    %rcx,%rdx
>    0x00007ffff7f09863:  vmovq  %rax,%xmm6
> => 0x00007ffff7f09868:  vfmaddsd %xmm4,0x8(%rdx),%xmm6,%xmm0
>    0x00007ffff7f0986f:  vmovsd 0x594e9(%rip),%xmm6        # 0x7ffff7f62d60
>    0x00007ffff7f09877:  vmulsd 0x594f1(%rip),%xmm0,%xmm9        #
> 0x7ffff7f62d70
>    0x00007ffff7f0987f:  vfmaddsd 0x18(%rdx),%xmm6,%xmm2,%xmm3
>    0x00007ffff7f09886:  vmovsd 0x594da(%rip),%xmm6        # 0x7ffff7f62d68
> 
> This would be an FMA4 Instruction, if I am not mistaken.

Yeah, it's a 4-operand FMA. That means glibc mistakenly detects your CPU as capable of AVX+FMA4 and enables that implementation for trigonometry.

There are a few moving parts here:
1. basic kernel needs support for AVX context save/restore
2. glibc cpuid detection of features on your CPU

Can you install sys-apps/cpuid and upload the output of 'cpuid' and 'cpuid --raw'? It should report a bunch of leaf values and might ease tracing through the CPU feature detection.

Or you can try yourself. glibc detects features at:

sysdeps/x86/cpu-features.c:init_cpu_features():

https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86/cpu-features.c;h=b0ded20486f299535fa3cbcb2f9021aaf3ab8503;hb=HEAD#l359

The specific things we are looking for are bits that enable fma implementation.

I think it is in

sysdeps/x86_64/fpu/multiarch/e_powf.c:

https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/fpu/multiarch/e_powf.c;h=c5bd42b099b581efd35c5b166829661dfb83d0f2;hb=HEAD#l30

There glibc uses only the FMA selector:

#include "ifunc-fma.h"

https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/fpu/multiarch/ifunc-fma.h;h=0a25a44ab083093f5374f4c492ff073d5fdb8d91;hb=HEAD#l24

"""
  24 static inline void *
  25 IFUNC_SELECTOR (void)
  26 {
  27   const struct cpu_features* cpu_features = __get_cpu_features ();
  28 
  29   if (CPU_FEATURE_USABLE_P (cpu_features, FMA)
  30       && CPU_FEATURE_USABLE_P (cpu_features, AVX2))
  31     return OPTIMIZE (fma);
  32 
  33   return OPTIMIZE (sse2);
  34 }
"""

So the ultimate question is whether your CPU and kernel support AVX2+FMA.

Comment 14 Sven E. 2020-09-03 21:48:39 UTC

Created attachment 658222 [details]
cpuid.txt.bz2

cpuid

Comment 15 Sven E. 2020-09-03 21:48:58 UTC

Created attachment 658224 [details]
cpuid.raw.txt.bz2

cpuid --raw

Comment 16 Sven E. 2020-09-03 21:51:31 UTC

/proc/cpuinfo says:
model name      : Intel(R) Xeon(R) CPU E5-2650L v4 @ 1.70GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl tsc_reliable nonstop_tsc pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx hypervisor lahf_lm kaiser arat

cpuid identifies the CPU as Haswell; however, that Xeon CPU is the Sandy Bridge uarch, if the research I did earlier is correct.
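
The flags line can also be checked mechanically for the features the glibc selectors look at. A small sketch in C, assuming the flags are available as one space-separated string (has_cpu_flag is a hypothetical helper written for this comment, not part of any library):

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if `name` occurs as a whole whitespace-separated token in
   `flags` (e.g. the "flags" line of /proc/cpuinfo), otherwise 0.
   Substring hits such as "avx" inside "avx2" do not count. */
int has_cpu_flag(const char *flags, const char *name)
{
    size_t len = strlen(name);
    const char *p = flags;

    if (len == 0)
        return 0;

    while ((p = strstr(p, name)) != NULL) {
        int starts_token = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        char end = p[len];
        int ends_token = end == '\0' || end == ' ' || end == '\t' || end == '\n';

        if (starts_token && ends_token)
            return 1;
        p += 1; /* keep searching after this partial match */
    }
    return 0;
}
```

With the flags line quoted above, has_cpu_flag(flags, "fma") and has_cpu_flag(flags, "avx") return 1, while "avx2" and "fma4" return 0. So the CPU, as exposed to this system, offers FMA but neither AVX2 nor FMA4.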

Comment 17 Sergei Trofimovich (RETIRED) gentoo-dev 2020-09-03 21:53:51 UTC

(In reply to Sergei Trofimovich from comment #13)
> (In reply to Sven E. from comment #12)
> > Okay, the minimalistic example I wrote in C calling pow indeed gets SIGILL
> > aswell:
> > 
> > Program received signal SIGILL, Illegal instruction.
> > 0x00007ffff7f09868 in ?? () from /lib64/libm.so.6
> > 
> > (gdb) bt
> > #0  0x00007ffff7f09868 in ?? () from /lib64/libm.so.6
> > #1  0x00007ffff7ec3484 in powf64 () from /lib64/libm.so.6
> > #2  0x00005555555551df in main () at math.c:8
> > (gdb) disassemble 0x00007ffff7f09860,0x00007ffff7f098ff
> > Dump of assembler code from 0x7ffff7f09860 to 0x7ffff7f098ff:
> >    0x00007ffff7f09860:  add    %rcx,%rdx
> >    0x00007ffff7f09863:  vmovq  %rax,%xmm6
> > => 0x00007ffff7f09868:  vfmaddsd %xmm4,0x8(%rdx),%xmm6,%xmm0
> >    0x00007ffff7f0986f:  vmovsd 0x594e9(%rip),%xmm6        # 0x7ffff7f62d60
> >    0x00007ffff7f09877:  vmulsd 0x594f1(%rip),%xmm0,%xmm9        #
> > 0x7ffff7f62d70
> >    0x00007ffff7f0987f:  vfmaddsd 0x18(%rdx),%xmm6,%xmm2,%xmm3
> >    0x00007ffff7f09886:  vmovsd 0x594da(%rip),%xmm6        # 0x7ffff7f62d68
> > 
> > This would be an FMA4 Instruction, if I am not mistaken.
> 
> Yeah, it's a 4-operand FMA. That means glibc mistakenly detects your CPU as
> capable of AVX+FMA4 and enable that implementation for trigonometry.
> 
> There are a few moving parts here:
> 1. basic kernel needs support for AVX context save/restore
> 2. glibc cpuid detection of features on your CPU
> 
> Can you install sys-apps/cpuid and upload output of 'cupid' and 'cpuid
> --raw'? It should report a bunch of leaf values and might ease tracing
> through CPU features detections.
> 
> Or you can try yourself. glibc detects features at:
> 
> sysdeps/x86/cpu-features.c:init_cpu_features():
> 
> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86/cpu-features.c;
> h=b0ded20486f299535fa3cbcb2f9021aaf3ab8503;hb=HEAD#l359
> 
> The specific things we are looking for are bits that enable fma
> implementation.
> 
> I think it is a
> 
> sysdeps/x86_64/fpu/multiarch/e_powf.c:

Correction: having looked at your libm.so, the crash happens in __ieee754_pow_fma4. That is sysdeps/x86_64/fpu/multiarch/e_pow.c with the fma4 ifunc:

"""
static inline void *
IFUNC_SELECTOR (void)
{
  const struct cpu_features* cpu_features = __get_cpu_features ();

  if (CPU_FEATURE_USABLE_P (cpu_features, FMA)
      && CPU_FEATURE_USABLE_P (cpu_features, AVX2))
    return OPTIMIZE (fma);

  if (CPU_FEATURE_USABLE_P (cpu_features, FMA))
    return OPTIMIZE (fma4);

  return OPTIMIZE (sse2);
}
"""

And it looks like we have a bug here. It should be an 'if (CPU_FEATURE_USABLE_P (cpu_features, FMA4))' in the second condition.
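
To make the effect concrete, here is a simplified, standalone model of the selector as quoted and with the second condition changed. It is only a sketch: struct cpu_flags, enum pow_impl and both function names are invented for illustration and are not glibc's API; the real code uses CPU_FEATURE_USABLE_P and OPTIMIZE as quoted above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the feature bits the selector consults. */
struct cpu_flags {
    bool fma;   /* 3-operand FMA (FMA3) */
    bool fma4;  /* 4-operand FMA (FMA4) */
    bool avx2;
};

enum pow_impl { POW_SSE2, POW_FMA, POW_FMA4 };

/* Mirrors the quoted selector: the second test checks FMA, not FMA4. */
enum pow_impl select_pow_buggy(const struct cpu_flags *f)
{
    if (f->fma && f->avx2)
        return POW_FMA;
    if (f->fma)            /* wrong bit: FMA without AVX2 lands here */
        return POW_FMA4;
    return POW_SSE2;
}

/* Same selector with the second test checking FMA4. */
enum pow_impl select_pow_fixed(const struct cpu_flags *f)
{
    if (f->fma && f->avx2)
        return POW_FMA;
    if (f->fma4)
        return POW_FMA4;
    return POW_SSE2;
}
```

For a CPU that reports FMA but neither AVX2 nor FMA4 (as in the /proc/cpuinfo flags from comment 16), select_pow_buggy picks POW_FMA4 while select_pow_fixed falls back to POW_SSE2. For CPUs with FMA+AVX2, or with FMA and FMA4 both present, the two variants agree.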

Comment 18 Sven E. 2020-09-03 22:01:09 UTC

Yes, I think you are right, this looks plain wrong.

One question though:
> Correction: having looked at your libm.so the crash happens at __ieee754_pow_fma4. That is sysdeps/x86_64/fpu/multiarch/e_pow.c with fma4 ifunc:

How did you find this out, if I may ask?

Comment 19 Sergei Trofimovich (RETIRED) gentoo-dev 2020-09-03 22:01:23 UTC

This was fixed upstream. If you are feeling brave you can try the upstream patch:
    https://sourceware.org/git/?p=glibc.git;a=patch;h=23af890b3f04e80da783ba64e6b6d94822e01d54

You will need to save it with a .patch extension in /etc/portage/patches/sys-libs/glibc and rebuild glibc.

Comment 20 Sergei Trofimovich (RETIRED) gentoo-dev 2020-09-03 22:06:12 UTC

(In reply to Sven E. from comment #18)
> Yes, I think you are right, this looks plain wrong.
> 
> One question though:
> Correction: having looked at your libm.so the crash happens at
> __ieee754_pow_fma4. That is sysdeps/x86_64/fpu/multiarch/e_pow.c with fma4
> ifunc:
> 
> How did you find this out, if I may ask?

I cheated a bit and built glibc locally with the same CFLAGS you were using (I only added -ggdb3 on top). Then I searched for the 'vfmaddsd %xmm4,0x8(%rdx),%xmm6,%xmm0' instruction you had in the gdb output and was lucky to find the same snippet. The instruction offset matched perfectly as well.

Debugging symbols make it more obvious and show that the instruction is part of the __ieee754_pow_fma4 function. I think you would see the same with -ggdb3 in CFLAGS.

Comment 21 Sven E. 2020-09-03 22:27:03 UTC

Ah, thanks, that explains it. For the sake of completeness I just built glibc with FEATURES="nostrip", and then gdb does indeed display the name in the backtrace too.

Should have thought of that earlier :-/.

---

Will this be taken upstream?

Comment 22 Sergei Trofimovich (RETIRED) gentoo-dev 2020-09-03 22:35:23 UTC

(In reply to Sven E. from comment #21)
> Will this be taken upstream?

The patch above mentions the existing upstream bug: https://sourceware.org/PR26534

Comment 23 Sergei Trofimovich (RETIRED) gentoo-dev 2020-09-03 22:53:30 UTC

Queued into 2.32 patchset as: https://gitweb.gentoo.org/fork/glibc.git/commit/?h=gentoo/2.32&id=5752df8c01162de92e83a031f61e4441b4ea432b

Comment 24 Sven E. 2020-09-03 23:36:56 UTC

Thanks for your efforts. In the meantime I had already made a quick patch myself and dropped it in as a user patch.

Since it is identical, we can close this as soon as you are done rolling it out.

Comment 25 Massimo Burcheri 2020-09-25 11:15:12 UTC

*** Bug 744586 has been marked as a duplicate of this bug. ***

Comment 26 Larry the Git Cow gentoo-dev 2020-09-25 18:54:43 UTC

The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=55104ab0a33759928f0cb6bb8edc9a39dc3f5079

commit 55104ab0a33759928f0cb6bb8edc9a39dc3f5079
Author:     Andreas K. Hüttel <dilfridge@gentoo.org>
AuthorDate: 2020-09-25 18:53:13 +0000
Commit:     Andreas K. Hüttel <dilfridge@gentoo.org>
CommitDate: 2020-09-25 18:54:25 +0000

    sys-libs/glibc: Revbump to 2.32 patchset 2
    
    Contains the following fix:
      x86-64: Fix FMA4 detection in ifunc [BZ #26534]
    
    Closes: https://bugs.gentoo.org/740110
    Package-Manager: Portage-3.0.4, Repoman-3.0.1
    Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>

 sys-libs/glibc/Manifest             |    1 +
 sys-libs/glibc/glibc-2.32-r2.ebuild | 1505 +++++++++++++++++++++++++++++++++++
 2 files changed, 1506 insertions(+)