Summary: | games-strategy/freeorion-0.5.0.1 fails to compile with boost-1.85, then hangs at runtime if fixed | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Torsten Kaiser <Storklerk> |
Component: | Current packages | Assignee: | Gentoo Games <games> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | ionen, jstein, O01eg, Storklerk |
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
URL: | https://github.com/freeorion/freeorion/issues/4949 | ||
See Also: | https://github.com/boostorg/container/issues/281 https://github.com/boostorg/container/issues/252 https://bugs.gentoo.org/show_bug.cgi?id=933289 | ||
Whiteboard: | Workaround dep applied | ||
Package list: | Runtime testing required: | --- | |
Bug Depends on: | |||
Bug Blocks: | 915000, 930498 | ||
Attachments: | Log from failed build; freeorion-logs.tar.xz; fix for wrong types in Pathfinder.cpp | ||
Description
Torsten Kaiser
2024-05-26 09:47:28 UTC
Created attachment 894428 [details]
Log from failed build
Upstream Bug: https://github.com/freeorion/freeorion/issues/4897
Upstream PR: https://github.com/freeorion/freeorion/pull/4899/commits

https://patch-diff.githubusercontent.com/raw/freeorion/freeorion/pull/4899.patch -> adding this to /etc/portage/patches makes the build work.

Does the game work for you with boost-1.85? It builds fine with the patch, but the tests fail (hang), and if I try to run it normally with Quick Start it also seemingly hang forever. If I go back to boost-1.84.0-r3 and rebuild, all is fine (even with the patch, so it's probably not caused by the patch).

(not to say there may not be something else going on, my boost-1.84 binpkg is quite old and think it was built with gcc13 -- so if it works for you maybe could look around these avenues)

Test start also only produced a black screen for me. Both 'Single player' and 'Quick start'. Loading an old savegame partly worked, at least some UI came up. But it was still broken, but probably because the save was from 2018.

Maybe it's https://github.com/freeorion/freeorion/issues/4151
That issue referenced another bug and there the same symptom "client window stays black" is mentioned: https://github.com/freeorion/freeorion/pull/4116#issuecomment-1257178010

I try to see, if I can find something useful wrt. the hang. But on the surface that seems more of an upstream issue than boost or gcc related.

16:22:24.559874 {0x00007f864b555780} [debug] python : empires.py:225 : Trying to find 7 home systems that are at least 1 jumps apart...
16:22:24.559895 {0x00007f864b555780} [debug] python : empires.py:229 : ...use complete pool
16:22:24.559951 {0x00007f864b555780} [debug] python : empires.py:161 : Failing in find_home_systems_for_min_jump_distance because current_merit_lower_bound = 0 trims local pool to 0 systems which is less than num_home_systems 7.
16:22:24.559981 {0x00007f864b555780} [debug] python : empires.py:246 : ...only 0 systems found
16:22:24.560006 {0x00007f864b555780} [error] python : util.py:45 : Python generate_home_system_list: requested 7 homeworlds in a galaxy with 151 systems
16:22:24.560027 {0x00007f864b555780} [error] python : util.py:45 : Python create_universe: couldn't get any home systems, ABORTING!
16:22:24.562542 {0x00007f864b555780} [error] server : PythonCommon.cpp:101 : Traceback (most recent call last):
16:22:24.562564 {0x00007f864b555780} [error] server : PythonCommon.cpp:101 : File "/usr/share/freeorion/default/python/common/listeners.py", line 42, in wrapper
    res = funct(*args)
          ^^^^^^^^^^^^
16:22:24.562569 {0x00007f864b555780} [error] server : PythonCommon.cpp:101 : File "/usr/share/freeorion/default/python/universe_generation/universe_generator.py", line 121, in create_universe
    raise Exception(err_msg)
16:22:24.562574 {0x00007f864b555780} [error] server : PythonCommon.cpp:101 : Exception: Python create_universe: couldn't get any home systems, ABORTING!

... the game itself seems to be OK, but is unable to create useful AI start positions and aborts.
... and then the client just hangs with a black screen.

I will retry later after removing old configs...

I see. Don't really have time to look closer at this myself right now (I don't really know/play the game, I just try to keep the ebuild working), if you find something / report upstream it'd be great (may be worth trying master too).

It's nasty but maybe I'll just put an upper bound on boost for now given build-time fixes won't mean much if it doesn't run.

CC Oleg in case interested, maybe have an idea -- not sure if runtime been tested at all with 1.85 upstream and if it was just build fixes, if it works then guess will have to experiment more from this side.
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=2e84fcf1c3010ecf7210ab8cf115c5c9009e89ee

commit 2e84fcf1c3010ecf7210ab8cf115c5c9009e89ee
Author: Ionen Wolkens <ionen@gentoo.org>
AuthorDate: 2024-05-26 14:49:49 +0000
Commit: Ionen Wolkens <ionen@gentoo.org>
CommitDate: 2024-05-26 14:51:25 +0000

    games-strategy/freeorion: limit to <boost-1.85 for now

    No need for revbump given binding operator sorts it out for us.

    Bug: https://bugs.gentoo.org/932780
    Signed-off-by: Ionen Wolkens <ionen@gentoo.org>

 games-strategy/freeorion/freeorion-0.5.0.1.ebuild | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Logs suggest that the problem is somewhere inside universe/Pathfinder.cpp WithinJumps(). That function always seems to return an empty list and then no suitable home systems can be found during universe creation.

21:00:03.096798 {0x00007fcef1299780} [trace] server : ServerWrapper.cpp:1005 : within 3 jumps: 0 systems
21:00:03.096811 {0x00007fcef1299780} [trace] server : Pathfinder.cpp:748 : Cache MISS ii: 2
21:00:03.096822 {0x00007fcef1299780} [trace] server : Pathfinder.cpp:1178 : Cache Hit ii: 2 jumps: 3
21:00:03.096826 {0x00007fcef1299780} [trace] server : ServerWrapper.cpp:1005 : within 3 jumps: 0 systems
21:00:03.096839 {0x00007fcef1299780} [trace] server : Pathfinder.cpp:748 : Cache MISS ii: 3
21:00:03.096848 {0x00007fcef1299780} [trace] server : Pathfinder.cpp:1178 : Cache Hit ii: 3 jumps: 3
21:00:03.096852 {0x00007fcef1299780} [trace] server : ServerWrapper.cpp:1005 : within 3 jumps: 0 systems
21:00:03.096864 {0x00007fcef1299780} [trace] server : Pathfinder.cpp:748 : Cache MISS ii: 4
21:00:03.096873 {0x00007fcef1299780} [trace] server : Pathfinder.cpp:1178 : Cache Hit ii: 4 jumps: 3
21:00:03.096877 {0x00007fcef1299780} [trace] server : ServerWrapper.cpp:1005 : within 3 jumps: 0 systems

As that function does use boost (boost::breadth_first_search() in Pathfinder::PathfinderImpl::HandleCacheMiss()) that might be related to the boost upgrade. I have not found out, what goes wrong. I will try to add more debug in the next days...

(In reply to Ionen Wolkens from comment #7)
> CC Oleg in case interested, maybe have an idea -- not sure if runtime been
> tested at all with 1.85 upstream and if it was just build fixes, if it works
> then guess will have to experiment more from this side.

No, latest boost version tested is 1.83 with Debian sid, Fedora rawhide, Manjaro and Void(musl). If use latest `release-0.5` commit do tests fail too?

(In reply to Oleg from comment #10)
> (In reply to Ionen Wolkens from comment #7)
> > CC Oleg in case interested, maybe have an idea -- not sure if runtime been
> > tested at all with 1.85 upstream and if it was just build fixes, if it works
> > then guess will have to experiment more from this side.
>
> No, latest boost version tested is 1.83 with Debian sid, Fedora rawhide,
> Manjaro and Void(musl). If use latest `release-0.5` commit do tests fail too?

I see, thanks. Tried release-v0.5 092ea03 built against boost-1.85.0 (unpatched given builds fine as-is now), but still getting:

    Start 1: TestEnumParser
1/6 Test #1: TestEnumParser ................... Passed 0.07 sec
    Start 2: TestPythonParser
2/6 Test #2: TestPythonParser ................. Passed 1.82 sec
    Start 3: SmokeTestGame
(hanging there)

Similarly hangs if try manually w/ Quick Start as noted in other comments, first see a black screen with high cpu usage seemingly preparing things but then usage stops completely and screen remains black.

If go back to 1.84.0, 092ea03 passes tests fine.

ftr also using python3.12, albeit given it works with 1.84 I assume it's unrelated (could try py3.10-11 if need be though). Being a ~testing system everything is somewhat recent (incl. gcc14), typical gentoo stable systems are still on boost-1.84.
I'm trying to reproduce failure in Docker, but not sure what also could be broken

```
10:43:21.053823 {0x00007f0a16e5d380} [info] server : ServerApp.cpp:101 : v0.5.0.1 [build 2024-04-10.092ea03] CMake
10:43:21.053841 {0x00007f0a16e5d380} [info] server : DependencyVersions.cpp:85 : Dependency versions from headers:
10:43:21.053856 {0x00007f0a16e5d380} [info] server : DependencyVersions.cpp:88 : Boost: 1_85
10:43:21.053864 {0x00007f0a16e5d380} [info] server : DependencyVersions.cpp:88 : Python: 3.11.9
10:43:21.053869 {0x00007f0a16e5d380} [info] server : DependencyVersions.cpp:88 : zlib: 1.3.1
```

And ctest outputs:

```
Test project /tmp/build
    Start 1: TestEnumParser
1/5 Test #1: TestEnumParser ................... Passed 0.06 sec
    Start 2: TestPythonParser
2/5 Test #2: TestPythonParser ................. Passed 1.94 sec
    Start 3: SmokeTestGame
3/5 Test #3: SmokeTestGame .................... Passed 18.59 sec
    Start 4: SmokeTestHostless
4/5 Test #4: SmokeTestHostless ................ Passed 45.88 sec
    Start 5: TestChecksum
5/5 Test #5: TestChecksum ..................... Passed 2.06 sec

100% tests passed, 0 tests failed out of 5
```

Do you have logs in ~/.local/share/freeorion/freeorion.log and ~/.local/share/freeorion/freeoriond.log when it hangups?

Created attachment 894513 [details]
freeorion-logs.tar.xz

(In reply to Oleg from comment #12)
> Do you have logs in ~/.local/share/freeorion/freeorion.log and
> ~/.local/share/freeorion/freeoriond.log when it hangups?
Attached the whole log directory, I do see a python traceback in freeoriond.log, don't know if related:

08:18:20.725162 {0x00007ffff3e6cc00} [error] server : PythonCommon.cpp:101 : Traceback (most recent call last):
08:18:20.725177 {0x00007ffff3e6cc00} [error] server : PythonCommon.cpp:101 : File "/usr/share/freeorion/default/python/common/listeners.py", line 42, in wrapper
    res = funct(*args)
          ^^^^^^^^^^^^
08:18:20.725182 {0x00007ffff3e6cc00} [error] server : PythonCommon.cpp:101 : File "/usr/share/freeorion/default/python/universe_generation/universe_generator.py", line 121, in create_universe
    raise Exception(err_msg)
08:18:20.725185 {0x00007ffff3e6cc00} [error] server : PythonCommon.cpp:101 : Exception: Python create_universe: couldn't get any home systems, ABORTING!

> DependencyVersions.cpp:88 : Python: 3.11.9

Maybe I'll go ahead and try 3.11 too, maybe something that only trigger with 3.12+1.85

(In reply to Ionen Wolkens from comment #13)
> Maybe I'll go ahead and try 3.11 too, maybe something that only trigger with
> 3.12+1.85

Nope, still happens if linked with 3.11.9's libpython -- versions looking identical to your own test case now:

08:46:56.599233 {0x00007ffff3f83c00} [info] server : ServerApp.cpp:101 : v0.5.0.1 [build 2024-04-10.092ea03] CMake
08:46:56.599242 {0x00007ffff3f83c00} [info] server : DependencyVersions.cpp:85 : Dependency versions from headers:
08:46:56.599248 {0x00007ffff3f83c00} [info] server : DependencyVersions.cpp:88 : Boost: 1_85
08:46:56.599252 {0x00007ffff3f83c00} [info] server : DependencyVersions.cpp:88 : Python: 3.11.9
08:46:56.599254 {0x00007ffff3f83c00} [info] server : DependencyVersions.cpp:88 : zlib: 1.3.1

I'll try a few other random things on the sides in case anything gentoo-specific, but don't really have many ideas.
I'd try to build boost+freeorion with gcc13 (rather than 14) but I'd need to setup a new test system given can't safely downgrade libstdc++, albeit used 14 to build the working boost-1.84+freeorion so it seems like a long shot anyway.

I'm using docker image from https://github.com/gentoo/gentoo-docker-images?tab=readme-ov-file#using-the-portage-container-as-a-data-volume to experiment

I've changed python version to 3.12 but tests still are passed

13:26:47.828764 {0x00007f092968e380} [info] server : ServerApp.cpp:101 : v0.5.0.1 [build 2024-04-10.092ea03] CMake
13:26:47.828779 {0x00007f092968e380} [info] server : DependencyVersions.cpp:85 : Dependency versions from headers:
13:26:47.828792 {0x00007f092968e380} [info] server : DependencyVersions.cpp:88 : Boost: 1_85
13:26:47.828806 {0x00007f092968e380} [info] server : DependencyVersions.cpp:88 : Python: 3.12.3
13:26:47.828813 {0x00007f092968e380} [info] server : DependencyVersions.cpp:88 : zlib: 1.3.1

I also opened issue in upstream https://github.com/freeorion/freeorion/issues/4949

Thanks. Guess gcc14 can't be fully ruled out, maybe there is some UB that goes wrong only with it + boost 1.85. I assume the docker image is on gcc13.

(In reply to Ionen Wolkens from comment #16)
> Guess gcc14 can't be fully ruled out, maybe there is some UB that goes wrong
> only with it + boost 1.85. I assume the docker image is on gcc13.

That theory just got stronger given tests pass if use -O0 to build freeorion itself (boost is still on -O2, albeit it could still be an issue in boost headers if not freeorion).

Still haven't tried gcc-13 myself, but I do assume gcc-14 is related.

Guess if could find what specific optimization cause the issue could disable it as a less invasive workaround than a upper bound.

(In reply to Ionen Wolkens from comment #17)
> Guess if could find what specific optimization cause the issue could disable
> it as a less invasive workaround than a upper bound.
Well, tried the obvious -fno-strict-aliasing first and tests pass again.

Guess good enough for a downstream "fix" until something is figured out. Albeit I'll push that tomorrow after re-testing from a clean env to be sure I didn't just confuse something.

The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=998fae593fd83b10e798c8fd9bbbad5b73ed7ac0

commit 998fae593fd83b10e798c8fd9bbbad5b73ed7ac0
Author: Ionen Wolkens <ionen@gentoo.org>
AuthorDate: 2024-05-27 15:30:44 +0000
Commit: Ionen Wolkens <ionen@gentoo.org>
CommitDate: 2024-05-27 15:51:46 +0000

    games-strategy/freeorion: fix build+runtime w/ boost-1.85

    Re-tested in a clean env and seems fine. Hoping -fno-sa won't
    be permanent, but it'll do better than an upper bound.

    Closes: https://bugs.gentoo.org/932780
    Signed-off-by: Ionen Wolkens <ionen@gentoo.org>

 .../files/freeorion-0.5.0.1-boost1.85.patch | 57 ++++++++++++++++++++++
 ...-0.5.0.1.ebuild => freeorion-0.5.0.1-r1.ebuild} | 8 ++-
 2 files changed, 63 insertions(+), 2 deletions(-)

It's not gcc-14. Building with gcc-13 fails the same way.
Starlanes get generated normally, then PathFinder->WithinJumps() only returns empty lists. So no home systems can get generated and the client 'hangs'.

gcc-14 and "append-flags -fno-strict-aliasing" works, I am able to start a new game.
Excerpt from the failing log:

21:14:12.271351 {0x00007f3c70114780} [trace] server : System.cpp:426 : Added starlane from system (2552) system 1624
21:14:12.271356 {0x00007f3c70114780} [trace] server : System.cpp:426 : Added starlane from system (2552) system 2344
21:14:12.271361 {0x00007f3c70114780} [trace] server : System.cpp:426 : Added starlane from system (2552) system 2400
21:14:12.271367 {0x00007f3c70114780} [debug] server : UniverseGenerator.cpp:728 : Initializing System Graph
21:14:12.271498 {0x00007f3c70114780} [debug] python : universe_generator.py:113 : Starlanes generated
21:14:12.271515 {0x00007f3c70114780} [debug] python : universe_generator.py:115 : Compile list of home systems...
21:14:12.271552 {0x00007f3c70114780} [debug] python : empires.py:342 : Compile home system list: 7 systems requested
21:14:12.271569 {0x00007f3c70114780} [trace] server : ServerWrapper.cpp:1005 : within 3 jumps: 0 systems
21:14:12.271579 {0x00007f3c70114780} [trace] server : Pathfinder.cpp:748 : Cache MISS ii: 1
21:14:12.271590 {0x00007f3c70114780} [trace] server : Pathfinder.cpp:1178 : Cache Hit ii: 1 jumps: 3
21:14:12.271595 {0x00007f3c70114780} [trace] server : ServerWrapper.cpp:1005 : within 3 jumps: 0 systems
21:14:12.271603 {0x00007f3c70114780} [trace] server : Pathfinder.cpp:748 : Cache MISS ii: 2
21:14:12.271613 {0x00007f3c70114780} [trace] server : Pathfinder.cpp:1178 : Cache Hit ii: 2 jumps: 3
21:14:12.271617 {0x00007f3c70114780} [trace] server : ServerWrapper.cpp:1005 : within 3 jumps: 0 systems

(In reply to Torsten Kaiser from comment #20)
> It's not gcc-14. Building with gcc-13 fails the same way.
> Starlanes get generated normally, then PathFinder->WithinJumps() only
> returns empty lists. So no home systems can get generated and the client
> 'hangs'.

I see, in that case I am at a loss as to why Oleg was not able to reproduce.

It seems to be caused by universe/Pathfinder.cpp.
If I only recompile this file with -fno-strict-aliasing I can quickstart a new game.
I found a bug in that file that looked promising, but even with this fixed it's still not working, so the aliasing problem must be somewhere else in that file.

The bug:
https://github.com/freeorion/freeorion/commit/7fa78ab26ba3d27ac7899c038501031ed47c52ad
That changed all size_t to std::size_t
https://github.com/freeorion/freeorion/commit/7f88018ea720ce05612b1bd36b14e80ed5fd7173
That changed all short to int16_t
But:
https://github.com/freeorion/freeorion/commit/93a0968631d135c6eddc3c0d667215f4ee2312fd
Re-added a few size_t and short.

I will attach the fix for that, but even with this patch I still needed the -fno-strict-aliasing when building Pathfinder.cpp.

Created attachment 894716 [details, diff]
fix for wrong types in Pathfinder.cpp
(In reply to Torsten Kaiser from comment #22)
> It seems to be caused by universe/Pathfinder.cpp.
> If I only recompile this file with -fno-strict-aliasing I can quickstart a
> new game.
>

You can try bisect the file via pragmas: https://stackoverflow.com/a/2220565. It's a bit brittle but it should be fine for what you're doing here.

You can try it with "-fno-strict-aliasing" but if that doesn't work, just do it with -O1/-O2/-O3 instead.

I was just doing that. :-)

The following "patch" fixes the "no home system found" problem, but the game then still gets stuck at something else.

--- a/universe/Pathfinder.cpp 2024-05-31 16:01:26.974982757 +0200
+++ b/universe/Pathfinder.cpp 2024-05-31 18:30:09.934918205 +0200
@@ -1,4 +1,7 @@
+#pragma GCC push_options
+#pragma GCC optimize ("no-strict-aliasing")
 #include "Pathfinder.h"
+#pragma GCC pop_options
 
 #include <algorithm>
 #include <limits>

This is with dev-libs/boost-1.85.0, dev-lang/python-3.12.3-r1 and sys-devel/gcc-14.1.1_p20240518.
I think the boost was still built with gcc-13, but as recompiling just freeorion with and without partly strict-aliasing shows the problems that probably does not matter.

I do use "unrecommended" CFLAGS:
CFLAGS="-pipe -march=znver1 -O3 -fomit-frame-pointer -fivopts -fweb -frename-registers -ftracer"

Next try: moving the #pragma into the Pathfinder.h, to see if that "fixes" the other hang.
I'm also not sure what this really changes. It seems that Pathfinder.h does not contain any code, only the class definition.

:)

You can try the `may_alias` attribute too on structs (presumably classes too) but it's IMO kind of fragile...
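For readers unfamiliar with the `may_alias` suggestion above, here is a minimal sketch of what it does. This is not code from FreeOrion or boost, and the names (`float_bits_memcpy`, `float_bits_may_alias`, `aliasing_u32`, `check_bits_agree`) are made up for illustration. It assumes GCC or Clang and an IEEE 754 `float`. Reading a `float` through a plain `std::uint32_t*` would be the kind of strict-aliasing violation that -O2 is allowed to miscompile. The two functions below avoid that: one with `memcpy` (portable), one with a `may_alias` typedef (compiler extension).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit float");
static_assert(std::numeric_limits<float>::is_iec559,
              "sketch assumes IEEE 754 floats");

// Portable variant: memcpy copies the object representation and is
// always permitted, regardless of -fstrict-aliasing.
std::uint32_t float_bits_memcpy(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// GCC/Clang extension: accesses through a may_alias type are exempt
// from type-based alias analysis, much like accesses through char.
typedef std::uint32_t __attribute__((__may_alias__)) aliasing_u32;

std::uint32_t float_bits_may_alias(float f) {
    // With a plain std::uint32_t* this read would be undefined behaviour.
    const aliasing_u32* p = reinterpret_cast<const aliasing_u32*>(&f);
    return *p;
}

// Hypothetical self-check: both variants must see the same bits.
void check_bits_agree(float f) {
    assert(float_bits_memcpy(f) == float_bits_may_alias(f));
}
```

Calling `check_bits_agree()` on a few values is a cheap sanity test. The attribute only covers accesses made through the annotated type, so every offending access has to be found and changed, which is probably why it was called fragile above.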
--- e/universe/UniverseObject.h 2024-05-31 20:42:56.614860571 +0200
+++ f/universe/UniverseObject.h 2024-05-31 21:58:49.574827634 +0200
@@ -5,7 +5,10 @@
 #include <set>
 #include <string>
 #include <vector>
+#pragma GCC push_options
+#pragma GCC optimize ("no-strict-aliasing")
 #include <boost/container/flat_map.hpp>
+#pragma GCC pop_options
 #include <boost/python/detail/destroy.hpp>
 #include <boost/signals2/optional_last_value.hpp>
 #include <boost/signals2/signal.hpp>

... which makes sense that some change in boost triggered the problems / the hang.

But it's totally unclear where the real bug is. UniverseObject is the base class of "everything" and this patch will probably turn off strict aliasing for most of the flat_map's that are used everywhere.

And this is still worse than the global "append-flags -fno-strict-aliasing", because this only fixes the generation of the home systems, but then the game client will still hang.

I took a quick look at recent boost::container fixes (https://github.com/boostorg/container/commits/develop/) and there's a bunch of UB and aliasing changes, e.g.
* https://github.com/boostorg/container/commit/978bbb113ab2e72f84f626d3e52039cf06501853
* https://github.com/boostorg/container/commit/03abe8c02c17e4b9876b006174375a927a1cb2cb
* https://github.com/boostorg/container/commit/20ad12f20e661978e90dc7f36d8ab8ac05e5a5a9
(and more)

welp, thanks for looking into this, figures this may be causing problems in more packages then

I'm not sure what to do yet. The pragma is fragile but I feel like leaving boost-1.85 as-is surely isn't sustainable or safe either.

Reverting the bisect result (https://github.com/boostorg/container/commit/1a4a205ea6ef7b4e67a2faab7c7d745711807695) isn't great as it's both: a) a massive commit; b) not the real fix anyway.

I'll file a new bug for discussion.
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=5bcbe7a421e108bfea3b77723040f37d87741dae

commit 5bcbe7a421e108bfea3b77723040f37d87741dae
Author: Ionen Wolkens <ionen@gentoo.org>
AuthorDate: 2024-06-01 05:28:23 +0000
Commit: Ionen Wolkens <ionen@gentoo.org>
CommitDate: 2024-06-01 05:31:34 +0000

    games-strategy/freeorion: limit boost workaround to 1.85.0

    Been confirmed that the issue is in boost itself, so there's
    no reason to always do this. Odds are will be fixed if there
    is a -r1 too, so preemptively limit to -r0 too.

    Bug: https://bugs.gentoo.org/932780
    Signed-off-by: Ionen Wolkens <ionen@gentoo.org>

 games-strategy/freeorion/freeorion-0.5.0.1-r1.ebuild | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=4bbea521152678132cf0b02e3a152481523b3078

commit 4bbea521152678132cf0b02e3a152481523b3078
Author: Sam James <sam@gentoo.org>
AuthorDate: 2024-06-03 01:19:54 +0000
Commit: Sam James <sam@gentoo.org>
CommitDate: 2024-06-03 01:20:45 +0000

    dev-libs/boost: fix aliasing violation in boost::container

    Note that we have to crank the subslot for this. I've added a fudge .1
    which we should drop on 1.86.0.

    Closes: https://bugs.gentoo.org/933289
    Bug: https://github.com/freeorion/freeorion/issues/4949
    Bug: https://github.com/boostorg/container/issues/252
    Bug: https://github.com/boostorg/container/issues/281
    Bug: https://bugs.gentoo.org/932780
    Bug: https://bugs.gentoo.org/931587
    Signed-off-by: Sam James <sam@gentoo.org>

 dev-libs/boost/boost-1.85.0-r1.ebuild | 348 ++++++++++++++++++
 .../files/boost-1.85.0-container-aliasing.patch | 408 +++++++++++++++++++++
 2 files changed, 756 insertions(+) |