Bug 949016

Summary:	dev-lang/ruby-3.2.6-r3: Fails to compile with lib/cgi/util.rb:93: [BUG] Segmentation fault at 0xfffffffffffffff8
Product:	Gentoo Linux	Reporter:	bajcsielias78
Component:	Current packages	Assignee:	Gentoo Ruby Team <ruby>
Status:	RESOLVED FIXED
Severity:	normal	CC:	bajcsielias78
Priority:	Normal
Version:	unspecified
Hardware:	AMD64
OS:	Linux
See Also:	https://bugs.gentoo.org/show_bug.cgi?id=150413 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84521 https://bugs.ruby-lang.org/issues/13758 https://bugs.ruby-lang.org/issues/14480 https://bugs.gentoo.org/show_bug.cgi?id=633422 https://bugs.gentoo.org/show_bug.cgi?id=949124
Whiteboard:
Package list:		Runtime testing required:	---
Attachments:	Build log file; emerge --info; stack trace; Valgrind stack trace; new build log with valgrand; Valrgind environment; Valgrind new; build noommitfp; valgrind idk :)

Description bajcsielias78 2025-01-29 16:55:03 UTC

Created attachment 917852 [details]
Build log file

Ruby no longer compiles; the build fails with ruby-3.2.6/lib/cgi/util.rb:93: [BUG] Segmentation fault at 0xfffffffffffffff8

Comment 1 bajcsielias78 2025-01-29 16:56:40 UTC

Created attachment 917853 [details]
emerge --info

Comment 2 Sam James archtester 2025-01-29 17:32:22 UTC

Did this version build for you before? If so, when? Could you perhaps give qlop -v output between whenever it last built and now, so we can see what packages got updated?
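
If qlop is not at hand, roughly the same information can be read from Portage's merge log. The following is a minimal sketch, assuming the usual `/var/log/emerge.log` location and its `<epoch>:  ::: completed emerge (N of M) <package> to /` line format; the function name `merged_since` and the sample log entries are made up for illustration.

```shell
#!/bin/sh
# List packages whose merge completed at or after a given Unix timestamp.
# Assumes Portage's emerge.log line format:
#   <epoch>:  ::: completed emerge (N of M) <category/package-version> to /
merged_since() {
    since=$1
    log=${2:-/var/log/emerge.log}
    awk -v since="$since" '
        /::: completed emerge/ && ($1 + 0) >= since {
            ts = $1
            sub(/:$/, "", ts)      # drop the trailing colon of the timestamp
            print ts, $8           # field 8 is category/package-version
        }
    ' "$log"
}

# Demo on a made-up log (timestamps and package names are illustrative only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
1735000000:  ::: completed emerge (1 of 1) dev-lang/ruby-3.2.6 to /
1738000000:  >>> emerge (1 of 2) cat-a/pkg-a-1.0 to /
1738000500:  ::: completed emerge (1 of 2) cat-a/pkg-a-1.0 to /
1738001000:  ::: completed emerge (2 of 2) cat-b/pkg-b-2.0 to /
EOF
merged_since 1736000000 "$sample"
```

Passing the timestamp of the last successful Ruby build as the first argument narrows the list to what changed since then.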

Comment 3 Sam James archtester 2025-01-29 17:32:36 UTC

Also, how about with distcc off?

Comment 4 bajcsielias78 2025-01-29 18:49:07 UTC

> Did this version build for you before?
The r3 version, no, just plain ruby-3.2.6.

> Also, how about with distcc off?
Guess what? After several attempts to update ruby (it last built about a month ago), it finally built, but only after I synced the edgets overlay, which I added only yesterday and have already had to sync 3 times... makes sense /s.

Just to clarify, I didn't disable distcc, and ruby and its deps come from ::gentoo.

So perhaps there's an invalid memory access, such as a race condition, since the build seems to run in multiple threads under Ruby's own build system or something. (Not a ruby expert btw)

Comment 5 Mike Gilbert gentoo-dev 2025-01-29 19:18:50 UTC

Maybe flaky hardware? A memory test might turn something up.

Comment 6 bajcsielias78 2025-01-29 20:39:48 UTC

(In reply to Mike Gilbert from comment #5)
> Maybe flaky hardware? A memory test might turn something up.

Nope, just did a full memtest and it passed. Although I should run multiple tests to be 100% sure, the RAM sticks are still pretty new and robust.

Comment 7 Sam James archtester 2025-01-29 20:42:02 UTC

If you have 32GB of RAM, you can't do a thorough enough test in that timeframe ;)

Comment 8 bajcsielias78 2025-01-29 20:48:29 UTC

I also saw this report https://bugs.gentoo.org/932849 but since it had different filenames when the segfault appeared, I thought (and still partially do) that this is a different issue.

But at the same time, if there is a race condition, perhaps they are the same issue, no matter which filename shows up.

I don't know how to interpret it.

Comment 9 bajcsielias78 2025-01-29 20:48:53 UTC

(In reply to Sam James from comment #7)
> If you have 32GB of RAM, you can't do a thorough enough test in that
> timeframe ;)

One can only do so much :)

Comment 10 Sam James archtester 2025-01-29 21:12:01 UTC

I generally recommend running it overnight.

Comment 11 Sam James archtester 2025-01-29 21:14:08 UTC

You can also try running the failing command under Valgrind inside the build directory, hopefully it was:
./miniruby -I./lib -I. -I.ext/common  ./tool/generic_erb.rb -o builtin_binary.inc \
	./template/builtin_binary.inc.tmpl -- --cross=no

Comment 12 bajcsielias78 2025-01-29 21:27:08 UTC

Had to restart in order to run memtest, so the temp dirs got deleted.

But I started building ruby again, and not long after that, it crashed with the same filename:

/var/tmp/portage/dev-lang/ruby-3.2.6-r2/work/ruby-3.2.6/lib/cgi/util.rb:93: [BUG] Segmentation fault at 0xfffffffffffffff8

Here's the output of the command you provided, in the attachment:

Comment 13 bajcsielias78 2025-01-29 21:28:16 UTC

Created attachment 917883 [details]
stack trace

Comment 14 Sam James archtester 2025-01-29 21:28:54 UTC

(In reply to bajcsielias78 from comment #13)
> Created attachment 917883 [details]
> stack trace

Nice. Can you run it again under valgrind? (just prefix the command w/ 'valgrind', so valgrind ./miniruby ...)

Comment 15 bajcsielias78 2025-01-29 21:47:32 UTC

> Nice. Can you run it again under valgrind? (just prefix the command w/
> 'valgrind', so valgrind ./miniruby ...)

My bad, I didn't know what valgrind was until literally 3 minutes ago.

Comment 16 bajcsielias78 2025-01-29 21:47:58 UTC

Created attachment 917886 [details]
Valgrind stack trace

Comment 17 Sam James archtester 2025-01-29 21:59:35 UTC

Gah. The garbage collection stuff is either noise or a real problem that is far beyond my ability to help.

Can you try again with USE=valgrind on Ruby, and also debugging symbols enabled (see https://wiki.gentoo.org/wiki/Debugging#Per-package)?

USE=valgrind on Ruby should mean it has suppressions for the GC noise.
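
A minimal sketch of what those two settings could look like as per-package Portage configuration. It assumes `package.env` and `package.use` are directories, and it writes under a scratch directory rather than the live `/etc/portage`; `SCRATCH_ROOT` and the file names `debugsyms` and `ruby` are arbitrary choices for this example.

```shell
#!/bin/sh
# Write per-package debug settings for dev-lang/ruby under a scratch root.
# SCRATCH_ROOT is a hypothetical variable for this sketch; it defaults to a
# temporary directory so nothing on the live system is touched.
SCRATCH_ROOT=${SCRATCH_ROOT:-$(mktemp -d)}
conf="$SCRATCH_ROOT/etc/portage"
mkdir -p "$conf/env" "$conf/package.env" "$conf/package.use"

# Extra debug info and unstripped binaries (the file name is arbitrary).
cat > "$conf/env/debugsyms" <<'EOF'
CFLAGS="${CFLAGS} -ggdb3"
CXXFLAGS="${CXXFLAGS} -ggdb3"
FEATURES="${FEATURES} splitdebug compressdebug -nostrip"
EOF

# Apply that environment to Ruby only.
echo 'dev-lang/ruby debugsyms' > "$conf/package.env/ruby"

# Enable the valgrind USE flag for Ruby only.
echo 'dev-lang/ruby valgrind' > "$conf/package.use/ruby"

echo "wrote settings under $conf"
```

After reviewing the generated files, the same content can be placed under `/etc/portage` and Ruby rebuilt.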

Comment 18 bajcsielias78 2025-01-29 22:39:44 UTC

Created attachment 917889 [details]
new build log with valgrand

Comment 19 bajcsielias78 2025-01-29 22:41:10 UTC

Created attachment 917890 [details]
Valrgind environment

Comment 20 bajcsielias78 2025-01-29 22:42:07 UTC

Okay, done. Do I have to strip any debug symbols with gdb by any chance, or is this enough?

Comment 21 Sam James archtester 2025-01-29 22:44:01 UTC

What you've done is enough, but it looks like we need to run another command under Valgrind, as that last log looks fine.


make[2]: Entering directory '/var/tmp/portage/dev-lang/ruby-3.2.6-r2/work/ruby-3.2.6/ext/rbconfig/sizeof'
../../../miniruby -I'../../..' -I'../../.././lib' -I'../../../.ext/x86_64-linux' -I'../../../.ext/common' ../../.././tool/generic_erb.rb --output=sizes.c \
	../../.././template/sizes.c.tmpl \
	../../.././configure.ac \
	../../.././ext/rbconfig/sizeof/extconf.rb

Try that one? So..

cd /var/tmp/portage/dev-lang/ruby-3.2.6-r2/work/ruby-3.2.6/ext/rbconfig/sizeof
valgrind ../../../miniruby -I'../../..' -I'../../.././lib' -I'../../../.ext/x86_64-linux' -I'../../../.ext/common' ../../.././tool/generic_erb.rb --output=sizes.c \
	../../.././template/sizes.c.tmpl \
	../../.././configure.ac \
	../../.././ext/rbconfig/sizeof/extconf.rb

But it's really weird that it's now a different file and also the error is:

> /var/tmp/portage/dev-lang/ruby-3.2.6-r2/work/ruby-3.2.6/.ext/x86_64-linux/cgi/escape.so: [BUG] Illegal instruction at 0x000055bd7c9b1e40

I'm afraid I do have to say again that there's a real chance of this ending up being a hardware problem. We do see it every so often (in fact, just last week someone built a new PC, and it turned out to be a HW problem). But let's see.

Comment 22 bajcsielias78 2025-01-29 22:47:31 UTC

Created attachment 917891 [details]
Valgrind new

Comment 23 Sam James archtester 2025-01-29 22:50:08 UTC

```
==32703== 
==32703== Warning: client switching stacks?  SP change: 0x1ffe8020d0 --> 0x1fff0000f0
==32703==          to suppress, use: --max-stackframe=8380448 or greater
vex amd64->IR: unhandled instruction bytes: 0xF3 0x48 0xF 0xAE 0xEE 0x48 0x2D 0xFF 0x0 0x0
vex amd64->IR:   REX=1 REX.W=1 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=1
==32703== valgrind: Unrecognised instruction at address 0x1a5e40.
==32703==    at 0x1A5E40: rb_ec_tag_jump (eval_intern.h:162)
==32703==    by 0x1ABC17: rb_longjmp (eval.c:664)
==32703==    by 0x1ABDB3: rb_exc_exception (eval.c:677)
==32703==    by 0x1ABDD8: rb_exc_raise (eval.c:690)
==32703==    by 0x1A1CD4: raise_loaderror (error.c:3165)
==32703==    by 0x1A48CC: rb_loaderror (error.c:3177)
==32703==    by 0x1293C7: dln_load (dmydln.c:7)
==32703==    by 0x32C2C2: rb_vm_call_cfunc (vm.c:2679)
==32703==    by 0x1FFAF8: require_internal (load.c:1223)
==32703==    by 0x1FFD0C: rb_require_string_internal (load.c:1316)
==32703==    by 0x200322: rb_require_string (load.c:1309)
==32703==    by 0x32060B: vm_call_cfunc_with_frame (vm_insnhelper.c:3287)
==32703== Your program just tried to execute an instruction that Valgrind
==32703== did not recognise.  There are two possible reasons for this.
==32703== 1. Your program has a bug and erroneously jumped to a non-code
==32703==    location.  If you are running Memcheck and you just saw a
==32703==    warning about a bad jump, it's probably your program's fault.
==32703== 2. The instruction is legitimate but Valgrind doesn't handle it,
==32703==    i.e. it's Valgrind's fault.  If you think this is the case or
==32703==    you are not sure, please let us know and we'll try to fix it.
==32703== Either way, Valgrind will now raise a SIGILL signal which will
==32703== probably kill your program.
/var/tmp/portage/dev-lang/ruby-3.2.6-r2/work/ruby-3.2.6/.ext/x86_64-linux/cgi/escape.so: [BUG] Illegal instruction at 0x00000000001a5e40
ruby 3.2.6 (2024-10-30 revision 63aeb018eb) [x86_64-linux]
```

The illegal instruction under Valgrind is different from the thing I pasted above. I suspect it jumps to garbage (and then Valgrind sees it's garbage/unknown and dies because it can't decode it). The longjmp is suspicious.
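
A hedged reading of the byte dump above, not something stated in the thread: in the x86-64 encoding, 0xF3 is a prefix, 0x48 is REX.W, 0x0F 0xAE is a two-byte opcode, and 0xEE is the ModRM byte. With a register operand and opcode extension /5, that pattern appears to correspond to INCSSPQ, a CET shadow-stack instruction, which would fit the second of Valgrind's two explanations. The sketch only does the bit arithmetic on the ModRM byte; the instruction identification is an assumption to verify against the Intel manual.

```shell
#!/bin/sh
# Split the ModRM byte 0xEE from the Valgrind dump into its three fields.
modrm=$((0xEE))
mod=$(( (modrm >> 6) & 3 ))   # addressing mode; 3 means a register operand
reg=$(( (modrm >> 3) & 7 ))   # opcode extension (the "/digit") for 0F AE
rm=$(( modrm & 7 ))           # register number of the operand
echo "mod=$mod reg=$reg rm=$rm"
```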

I find it interesting that this is so reproducible for you. Tomorrow I'll try your *FLAGS and see if I can hit it.

One thing for you to try: can you try -fno-omit-frame-pointer in *FLAGS too?

Comment 24 Sam James archtester 2025-01-29 22:53:19 UTC

Notably, it does:
> checking for setjmp type... __builtin_setjmp

EXTRA_ECONF="--with-setjmp-type=setjmp" may or may not help.

It might be completely unrelated to this issue I've found though, just can't ignore that the crash has longjmp and then gibberish (which is a not-uncommon bug).
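
To check which setjmp flavour a given build picked, the configure line can be pulled out of the build log. This is a minimal sketch, assuming the log contains the `checking for setjmp type...` line with ANSI colour escapes around the result; the function name `setjmp_type` and the one-line sample log are made up for illustration.

```shell
#!/bin/sh
# Print the setjmp implementation that configure selected, with ANSI colour
# escapes removed. The argument is the path to whatever build log you have.
setjmp_type() {
    esc=$(printf '\033')
    sed -n "s/${esc}\[[0-9;]*m//g; s/^checking for setjmp type\.\.\. //p" "$1"
}

# Demo on a one-line sample that mimics the coloured configure output.
log=$(mktemp)
printf 'checking for setjmp type... \033[33;1m__builtin_setjmp\033[m\n' > "$log"
setjmp_type "$log"
```

Running it again on the log of a build configured with `--with-setjmp-type=setjmp` should show whether the override took effect.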

Comment 25 Sam James archtester 2025-01-29 22:54:57 UTC

(In reply to Sam James from comment #24)
> Notably, it does:
> > checking for setjmp type... __builtin_setjmp
> 
> EXTRA_ECONF="--with-setjmp-type=setjmp" may or may not help.
> 
> It might be completely unrelated to this issue I've found though, just can't
> ignore that the crash has longjmp and then gibberish (which is a
> not-uncommon bug).

Ah! In https://bugzilla.redhat.com/show_bug.cgi?id=1545239#c46, Jakub explains it wasn't really specific to arm64 and just luck, which sort of explains the bit I was worried about.

Comment 26 bajcsielias78 2025-01-29 22:57:20 UTC

Created attachment 917892 [details]
build noommitfp

Comment 27 bajcsielias78 2025-01-29 22:59:50 UTC

Created attachment 917893 [details]
valgrind idk :)

Comment 28 Sam James archtester 2025-01-29 23:01:08 UTC

```
==18014== Invalid write of size 8
==18014==    at 0x33BEEC: vm_exec_handle_exception (vm.c:2583)
==18014==    by 0x33BEEC: rb_vm_exec (vm.c:2381)
==18014==  Address 0xfffffffffffffff8 is not stack'd, malloc'd or (recently) free'd
```

is absolutely where we go wrong, the question is why it even ends up handling an exception to begin with.
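
For reference, the faulting address is the same one as in the bug summary, and it is the 64-bit two's-complement pattern for -8. One plausible reading, which is an assumption and not established in the thread, is that a pointer holding zero was moved down by one 8-byte slot before the write. The arithmetic can be checked with a one-liner (assuming a shell whose printf works on 64-bit integers):

```shell
#!/bin/sh
# Print -8 as an unsigned hexadecimal value; on a 64-bit system this is the
# address 8 bytes "below" zero.
printf '%x\n' -8
```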

Try: EXTRA_ECONF="--with-setjmp-type=setjmp" emerge -v1 ...

Comment 29 bajcsielias78 2025-01-29 23:01:53 UTC

(In reply to bajcsielias78 from comment #27)
> Created attachment 917893 [details]
> valgrind idk :)

This one's w/o the EXTRA_ECONF feature btw

Comment 30 bajcsielias78 2025-01-29 23:28:08 UTC

> Try: EXTRA_ECONF="--with-setjmp-type=setjmp" emerge -v1 ...

I added this and built successfully 5 times in a row. I'd call it the fix.
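
Since the crash was intermittent, repeating the build is a reasonable way to gain confidence. A minimal sketch of such a loop follows; the helper name `repeat_ok` is made up for this example, and `true` stands in for the real build command.

```shell
#!/bin/sh
# Run a command N times, stopping at the first failure.
# Returns 0 only if every run succeeded.
repeat_ok() {
    n=$1
    shift
    i=1
    while [ "$i" -le "$n" ]; do
        "$@" || { echo "run $i of $n failed" >&2; return 1; }
        i=$((i + 1))
    done
    echo "all $n runs succeeded"
}

# Harmless demo; 'true' stands in for the real build command.
repeat_ok 3 true
```

For the case in this bug, the call would look like `repeat_ok 5 env EXTRA_ECONF="--with-setjmp-type=setjmp" emerge -v1 dev-lang/ruby`.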

Comment 31 Sam James archtester 2025-01-29 23:31:44 UTC

Excellent! Thanks for persevering.

Comment 32 Sam James archtester 2025-01-29 23:33:56 UTC

While at it, we should also fix 'filter-flags -fomit-frame-pointer' to be 'append-flags -fno-omit-frame-pointer', or get rid of it (given -fomit-frame-pointer is implied by -O* on various arches, the filter is doing nothing right now).
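
The difference can be illustrated with simplified stand-ins for the two helpers. The sketch below does not use the real flag-o-matic code; `filter_flag` and `append_flag` are hypothetical functions written only to show why filtering a flag that was never passed explicitly changes nothing, and the `CFLAGS` value is illustrative.

```shell
#!/bin/sh
# Simplified stand-ins for the flag-o-matic helpers with a similar purpose.
# These are NOT the eclass implementations, only enough to show the effect.
filter_flag() {   # remove an exact flag ($1) from a flag string ($2)
    out=
    for f in $2; do
        [ "$f" = "$1" ] || out="${out:+$out }$f"
    done
    printf '%s\n' "$out"
}
append_flag() {   # add a flag ($1) at the end of a flag string ($2)
    printf '%s\n' "${2:+$2 }$1"
}

CFLAGS="-O2 -pipe"   # illustrative flags; no explicit -fomit-frame-pointer

# Filtering removes nothing, so whatever -O2 implies stays in effect.
filter_flag -fomit-frame-pointer "$CFLAGS"

# Appending the negative form states the intent explicitly.
append_flag -fno-omit-frame-pointer "$CFLAGS"
```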

Comment 33 bajcsielias78 2025-01-29 23:46:43 UTC

(In reply to Sam James from comment #31)
> Excellent! Thanks for persevering.

No problem. It's actually the first time I've had fun talking with the "support chat".

Wish you well!
- Elias

Comment 34 Larry the Git Cow gentoo-dev 2025-01-30 00:10:49 UTC

The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=950f851501f6dd30c32054048ca3b4af5dcda591

commit 950f851501f6dd30c32054048ca3b4af5dcda591
Author:     Sam James <sam@gentoo.org>
AuthorDate: 2025-01-30 00:08:07 +0000
Commit:     Sam James <sam@gentoo.org>
CommitDate: 2025-01-30 00:08:07 +0000

    dev-lang/ruby: disable dangerous __builtin_setjmp; fixup FP filtering
    
    * Disable dangerous __builtin_setjmp. As discussed in the bug, it
      really shouldn't be used pretty much ever - rather setjmp should be used.
    
      Ruby upstream are already disabling it for arm64 and others have pointed
      out that it should be done for all arches, but that hasn't happened yet.
    
      Anyway, a user hit the crash, so let's make the change on our end.
    
    * Fix -fno-omit-frame-pointer filtering. For quite some time, -O* on various
      arches already implies -fomit-frame-pointer, hence filtering -fomit-frame-pointer
      by itself isn't sufficient. Add an explicit `append-flags -fno-omit-frame-pointer`
      to get the desired effect. We can drop it entirely if desired but I'm not
      confident in doing that at this point.
    
    Closes: https://bugs.gentoo.org/949016
    Signed-off-by: Sam James <sam@gentoo.org>

 dev-lang/ruby/ruby-3.1.6-r3.ebuild | 289 +++++++++++++++++++++++++++++++++
 dev-lang/ruby/ruby-3.2.6-r4.ebuild | 295 ++++++++++++++++++++++++++++++++++
 dev-lang/ruby/ruby-3.3.7-r1.ebuild | 302 +++++++++++++++++++++++++++++++++++
 dev-lang/ruby/ruby-3.4.1-r1.ebuild | 316 +++++++++++++++++++++++++++++++++++++
 4 files changed, 1202 insertions(+)