Summary: | dev-lang/ruby-3.2.6-r3: Fails to compile with lib/cgi/util.rb:93: [BUG] Segmentation fault at 0xfffffffffffffff8 | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | bajcsielias78 |
Component: | Current packages | Assignee: | Gentoo Ruby Team <ruby> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | bajcsielias78 |
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | AMD64 | ||
OS: | Linux | ||
See Also: |
https://bugs.gentoo.org/show_bug.cgi?id=150413 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84521 https://bugs.ruby-lang.org/issues/13758 https://bugs.ruby-lang.org/issues/14480 https://bugs.gentoo.org/show_bug.cgi?id=633422 https://bugs.gentoo.org/show_bug.cgi?id=949124 |
||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: |
Build log file
emerge --info
stack trace
Valgrind stack trace
new build log with valgrand
Valrgind environment
Valgrind new
build noommitfp
valgrind idk :) |
Created attachment 917853 [details]
emerge --info
Did this version build for you before? If so, when? Could you perhaps give qlop -v output between whenever it last built and now, so we can see what packages got updated? Also, how about with distcc off?
> Did this version build for you before?
The r3 version, no, just plain ruby-3.2.6.
> Also, how about with distcc off?
Guess what? After trying to update ruby several times (the last time it built was about 1 month ago or so), it finally did, but not before syncing the edgets overlay which I only added yesterday and had to sync it 3 times already... makes sense /s. Just to clarify, I didn't disable distcc, and ruby and its deps come from ::gentoo.
So perhaps there's an invalid memory access, like a race condition, since it seems to compile with multiple threads using their own build system or something. (Not a ruby expert btw)
Maybe flaky hardware? A memory test might turn something up.
(In reply to Mike Gilbert from comment #5)
> Maybe flaky hardware? A memory test might turn something up.
Nope, just did a full memtest and it passed. Although I should do multiple tests to be 100% conclusive, the RAM sticks are still pretty new and robust.
If you have 32GB of RAM, you can't do a thorough enough test in that timeframe ;)
I also saw this report https://bugs.gentoo.org/932849 but since it had different filenames when the segfault appeared, I thought (and still partially do) that this is a different issue. But at the same time, if there is a race condition, perhaps they are the same issue, no matter which filename it uses. I don't know how to interpret it.
(In reply to Sam James from comment #7)
> If you have 32GB of RAM, you can't do a thorough enough test in that
> timeframe ;)
One can only do so much :)
I generally recommend running it overnight.
You can also try running the failing command under Valgrind inside the build directory, hopefully it was:
./miniruby -I./lib -I. -I.ext/common ./tool/generic_erb.rb -o builtin_binary.inc \
./template/builtin_binary.inc.tmpl -- --cross=no
Had to restart in order to run memtest, so the temp dirs got deleted. But I started building ruby again, and not long after that, it crashed with the same filename:
/var/tmp/portage/dev-lang/ruby-3.2.6-r2/work/ruby-3.2.6/lib/cgi/util.rb:93: [BUG] Segmentation fault at 0xfffffffffffffff8
As per the command you've provided, here's the output in the attachment:
Created attachment 917883 [details]
stack trace
(In reply to bajcsielias78 from comment #13)
> Created attachment 917883 [details]
> stack trace
Nice. Can you run it again under valgrind? (just prefix the command w/ 'valgrind', so valgrind ./miniruby ...)
> Nice. Can you run it again under valgrind? (just prefix the command w/
> 'valgrind', so valgrind ./miniruby ...)
My bad, I didn't know what valgrind was until literally 3 minutes ago.
Created attachment 917886 [details]
Valgrind stack trace
Gah. The garbage collection stuff is either noise or a real problem that is far beyond my ability to help.
Can you try again with USE=valgrind on Ruby, and also debugging symbols enabled (see https://wiki.gentoo.org/wiki/Debugging#Per-package)? USE=valgrind on Ruby should mean it has suppressions for the GC noise.
Created attachment 917889 [details]
new build log with valgrand
Created attachment 917890 [details]
Valrgind environment
Okay, done. Do I have to strip any debug symbols with gdb by any chance, or is this enough?
What you've done is enough, but it looks like we need to run another command under Valgrind, as that last log looks fine.
make[2]: Entering directory '/var/tmp/portage/dev-lang/ruby-3.2.6-r2/work/ruby-3.2.6/ext/rbconfig/sizeof'
../../../miniruby -I'../../..' -I'../../.././lib' -I'../../../.ext/x86_64-linux' -I'../../../.ext/common' ../../.././tool/generic_erb.rb --output=sizes.c \
../../.././template/sizes.c.tmpl \
../../.././configure.ac \
../../.././ext/rbconfig/sizeof/extconf.rb
Try that one?
So..
cd /var/tmp/portage/dev-lang/ruby-3.2.6-r2/work/ruby-3.2.6/ext/rbconfig/sizeof
valgrind ../../../miniruby -I'../../..' -I'../../.././lib' -I'../../../.ext/x86_64-linux' -I'../../../.ext/common' ../../.././tool/generic_erb.rb --output=sizes.c \
../../.././template/sizes.c.tmpl \
../../.././configure.ac \
../../.././ext/rbconfig/sizeof/extconf.rb
But it's really weird that it's now a different file and also the error is:
> /var/tmp/portage/dev-lang/ruby-3.2.6-r2/work/ruby-3.2.6/.ext/x86_64-linux/cgi/escape.so: [BUG] Illegal instruction at 0x000055bd7c9b1e40
I'm afraid I do have to say again that there's a real chance of this ending up being a hardware problem. We do see it every so often (in fact, just last week someone built a new PC, and it turned out to be a HW problem). But let's see.
Created attachment 917891 [details]
Valgrind new
```
==32703==
==32703== Warning: client switching stacks? SP change: 0x1ffe8020d0 --> 0x1fff0000f0
==32703== to suppress, use: --max-stackframe=8380448 or greater
vex amd64->IR: unhandled instruction bytes: 0xF3 0x48 0xF 0xAE 0xEE 0x48 0x2D 0xFF 0x0 0x0
vex amd64->IR: REX=1 REX.W=1 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=1
==32703== valgrind: Unrecognised instruction at address 0x1a5e40.
==32703==    at 0x1A5E40: rb_ec_tag_jump (eval_intern.h:162)
==32703==    by 0x1ABC17: rb_longjmp (eval.c:664)
==32703==    by 0x1ABDB3: rb_exc_exception (eval.c:677)
==32703==    by 0x1ABDD8: rb_exc_raise (eval.c:690)
==32703==    by 0x1A1CD4: raise_loaderror (error.c:3165)
==32703==    by 0x1A48CC: rb_loaderror (error.c:3177)
==32703==    by 0x1293C7: dln_load (dmydln.c:7)
==32703==    by 0x32C2C2: rb_vm_call_cfunc (vm.c:2679)
==32703==    by 0x1FFAF8: require_internal (load.c:1223)
==32703==    by 0x1FFD0C: rb_require_string_internal (load.c:1316)
==32703==    by 0x200322: rb_require_string (load.c:1309)
==32703==    by 0x32060B: vm_call_cfunc_with_frame (vm_insnhelper.c:3287)
==32703== Your program just tried to execute an instruction that Valgrind
==32703== did not recognise. There are two possible reasons for this.
==32703== 1. Your program has a bug and erroneously jumped to a non-code
==32703== location. If you are running Memcheck and you just saw a
==32703== warning about a bad jump, it's probably your program's fault.
==32703== 2. The instruction is legitimate but Valgrind doesn't handle it,
==32703== i.e. it's Valgrind's fault. If you think this is the case or
==32703== you are not sure, please let us know and we'll try to fix it.
==32703== Either way, Valgrind will now raise a SIGILL signal which will
==32703== probably kill your program.
/var/tmp/portage/dev-lang/ruby-3.2.6-r2/work/ruby-3.2.6/.ext/x86_64-linux/cgi/escape.so: [BUG] Illegal instruction at 0x00000000001a5e40
ruby 3.2.6 (2024-10-30 revision 63aeb018eb) [x86_64-linux]
```
The illegal instruction under Valgrind is different from the thing I pasted above. I suspect it jumps to garbage (and then Valgrind sees it's garbage/unknown and dies because it can't decode it). The longjmp is suspicious.
I find it interesting that this is so reproducible for you. Tomorrow I'll try your *FLAGS and see if I can hit it.
One thing for you to try: can you try -fno-omit-frame-pointer in *FLAGS too?
Notably, it does:
> checking for setjmp type... __builtin_setjmp
EXTRA_ECONF="--with-setjmp-type=setjmp" may or may not help.
It might be completely unrelated to this issue I've found though, just can't ignore that the crash has longjmp and then gibberish (which is a not-uncommon bug).
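To check which setjmp implementation configure picked on a given build, the relevant line can be pulled out of the build log. The following is only a sketch: `setjmp_type_from_log` is a hypothetical helper name, and it assumes the log contains the "checking for setjmp type..." line quoted above, possibly wrapped in ANSI colour codes.

```shell
#!/bin/sh
# Hypothetical helper (not part of Portage or Ruby): print the setjmp type
# that configure reported in the given log file.
setjmp_type_from_log() {
    esc=$(printf '\033')
    # 1. Strip ANSI colour sequences such as ESC[33;1m and ESC[m.
    # 2. Keep only the text after "checking for setjmp type... ".
    # 3. If configure ran more than once, report the last result.
    sed "s/${esc}\[[0-9;]*m//g" "$1" |
        sed -n 's/^.*checking for setjmp type\.\.\. *//p' |
        tail -n 1
}
```

It would be called with the path to a build log, for example `setjmp_type_from_log build.log`. A result of `__builtin_setjmp` matches the configuration discussed in this comment.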
(In reply to Sam James from comment #24)
> Notably, it does:
> > checking for setjmp type... __builtin_setjmp
>
> EXTRA_ECONF="--with-setjmp-type=setjmp" may or may not help.
>
> It might be completely unrelated to this issue I've found though, just can't
> ignore that the crash has longjmp and then gibberish (which is a
> not-uncommon bug).
Ah! In https://bugzilla.redhat.com/show_bug.cgi?id=1545239#c46, Jakub explains it wasn't really specific to arm64 and just luck, which sort of explains the bit I was worried about.
Created attachment 917892 [details]
build noommitfp
Created attachment 917893 [details]
valgrind idk :)
```
==18014== Invalid write of size 8
==18014==    at 0x33BEEC: vm_exec_handle_exception (vm.c:2583)
==18014==    by 0x33BEEC: rb_vm_exec (vm.c:2381)
==18014==  Address 0xfffffffffffffff8 is not stack'd, malloc'd or (recently) free'd
```
is absolutely where we go wrong, the question is why it even ends up handling an exception to begin with.
Try: EXTRA_ECONF="--with-setjmp-type=setjmp" emerge -v1 ...
(In reply to bajcsielias78 from comment #27)
> Created attachment 917893 [details]
> valgrind idk :)
This one's w/o the EXTRA_ECONF feature btw
> Try: EXTRA_ECONF="--with-setjmp-type=setjmp" emerge -v1 ...
I added this and built successfully 5 times in a row. I'd call it the fix.
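For anyone on an affected revision who wants the workaround to survive later rebuilds rather than passing EXTRA_ECONF on the command line each time, Portage's per-package environment files are one option. The sketch below is illustrative only: the function name `write_ruby_setjmp_env` and the env file name `ruby-setjmp` are made up for this example, and it assumes /etc/portage/package.env is a plain file rather than a directory.

```shell
#!/bin/sh
# Sketch: persist EXTRA_ECONF for dev-lang/ruby via Portage per-package env.
# The first argument is an optional root prefix, so the function can be tried
# against a scratch directory before touching the real /etc/portage.
write_ruby_setjmp_env() {
    root_dir=${1:-}
    env_name=ruby-setjmp   # hypothetical file name, any name works

    mkdir -p "${root_dir}/etc/portage/env" || return 1

    # The env file carries the same setting that was tested in this bug.
    printf '%s\n' 'EXTRA_ECONF="--with-setjmp-type=setjmp"' \
        > "${root_dir}/etc/portage/env/${env_name}" || return 1

    # package.env maps the package atom to the env file above.
    # Assumption: package.env is a file here, not a directory.
    printf '%s\n' "dev-lang/ruby ${env_name}" \
        >> "${root_dir}/etc/portage/package.env"
}
```

Running `write_ruby_setjmp_env /tmp/scratch` writes the two files under a scratch prefix for inspection; calling it with no argument as root would write to /etc/portage itself.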
Excellent! Thanks for persevering.
While at it, we should also fix 'filter-flags -fomit-frame-pointer' to be 'append-flags -fno-omit-frame-pointer' (given it's implied by -O* on various arches, right now it's doing nothing.. or get rid of it).
(In reply to Sam James from comment #31)
> Excellent! Thanks for persevering.
No problem. It's actually the first time I had fun in talking with the "support chat". Wish you well! - Elias
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=950f851501f6dd30c32054048ca3b4af5dcda591

commit 950f851501f6dd30c32054048ca3b4af5dcda591
Author: Sam James <sam@gentoo.org>
AuthorDate: 2025-01-30 00:08:07 +0000
Commit: Sam James <sam@gentoo.org>
CommitDate: 2025-01-30 00:08:07 +0000

dev-lang/ruby: disable dangerous __builtin_setjmp; fixup FP filtering

* Disable dangerous __builtin_setjmp. As discussed it in the bug, it really shouldn't be used pretty much ever - rather setjmp should be used. Ruby upstream are already disabling it for arm64 and others have pointed out that it should be done for all arches, but that hasn't happened yet. Anyway, a user hit the crash, so let's make the change on our end.

* Fix -fno-omit-frame-pointer filtering. For quite some time, -O* on various arches already implies -fomit-frame-pointer, hence filtering -fomit-frame-pointer by itself isn't sufficient. Add an explicit `append-flags -fno-omit-frame-pointer` to get the desired effect. We can drop it entirely if desired but I'm not confident in doing that at this point.

Closes: https://bugs.gentoo.org/949016
Signed-off-by: Sam James <sam@gentoo.org>

dev-lang/ruby/ruby-3.1.6-r3.ebuild | 289 +++++++++++++++++++++++++++++++++
dev-lang/ruby/ruby-3.2.6-r4.ebuild | 295 ++++++++++++++++++++++++++++++++++
dev-lang/ruby/ruby-3.3.7-r1.ebuild | 302 +++++++++++++++++++++++++++++++++++
dev-lang/ruby/ruby-3.4.1-r1.ebuild | 316 +++++++++++++++++++++++++++++++++++++
4 files changed, 1202 insertions(+) |
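The point about flag filtering in the commit message can be illustrated with a toy model. The functions below are simplified stand-ins written for this example, not the real flag-o-matic `filter-flags` and `append-flags`, and the starting CFLAGS value is an assumption. They show that removing a flag which was never spelled out in CFLAGS changes nothing, whereas appending the negated flag does.

```shell
#!/bin/sh
# Toy stand-ins for flag-o-matic helpers, operating on a plain string.
# These are NOT the eclass implementations, only an illustration.

# filter_flag "<flags>" <flag>: print <flags> with every exact match removed.
filter_flag() {
    out=
    for f in $1; do
        [ "$f" = "$2" ] || out="${out:+$out }$f"
    done
    printf '%s\n' "$out"
}

# append_flag "<flags>" <flag>: print <flags> with <flag> added at the end.
append_flag() {
    printf '%s\n' "${1:+$1 }$2"
}

# Assumed user flags: -fomit-frame-pointer is not listed explicitly, it is
# only implied by the optimisation level on some arches.
CFLAGS="-O2 -pipe"

filtered=$(filter_flag "$CFLAGS" -fomit-frame-pointer)
appended=$(append_flag "$CFLAGS" -fno-omit-frame-pointer)

echo "filtered: $filtered"   # unchanged, nothing explicit to remove
echo "appended: $appended"   # now carries an explicit -fno-omit-frame-pointer
```

In this model the filtered result equals the original CFLAGS, which is the "right now it's doing nothing" situation described above.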
Created attachment 917852 [details]
Build log file
Ruby no longer compiles and seems to have an issue in ruby-3.2.6/lib/cgi/util.rb:93: [BUG] Segmentation fault at 0xfffffffffffffff8