928339 – sci-libs/caffe2-2.2.1-r1 emerge fail with dev-util/nvidia-cuda-toolkit-12.4.0

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Tupone Alfredo

Reported:	2024-04-01 02:51 UTC by znjameswu
Modified:	2024-05-02 10:12 UTC
CC List:	4 users

Attachments
emerge --info (emerge-info, 9.45 KB, text/plain) 2024-04-01 02:53 UTC, znjameswu
emerge --pqv (emerge-pqv, 355 bytes, text/plain) 2024-04-01 02:54 UTC, znjameswu
head of build.log (file_928339.txt, 98.34 KB, text/plain) 2024-04-01 03:06 UTC, znjameswu
tail of build.log (file_928339.txt, 85.31 KB, text/plain) 2024-04-01 03:07 UTC, znjameswu

Description znjameswu 2024-04-01 02:51:53 UTC

After I updated to dev-util/nvidia-cuda-toolkit-12.4.0, sci-libs/caffe2-2.2.1-r1 fails to rebuild.

Downgrading to dev-util/nvidia-cuda-toolkit-12.3.2 makes the rebuild succeed.



Reproducible: Always

Steps to Reproduce:
System GCC:
sys-devel/gcc-12.3.1_p20240209
sys-devel/gcc-13.2.1_p20240210

$ gcc-config -c
x86_64-pc-linux-gnu-13

System nvidia driver:
x11-drivers/nvidia-drivers-535.171.04
(This compile failure is also reproducible with x11-drivers/nvidia-drivers-550.67.)

Actual Results:
Excerpt from the failing compiler output:

/var/tmp/portage/sci-libs/caffe2-2.2.1-r1/work/pytorch-2.2.1/aten/src/ATen/core/IListRef_inl.h:171:13: warning: possibly dangling reference to a temporary [-Wdangling-reference]
  171 |     const auto& ivalue = (*it).get();
      |             ^~~~~~
/var/tmp/portage/sci-libs/caffe2-2.2.1-r1/work/pytorch-2.2.1/aten/src/ATen/core/IListRef_inl.h:171:33: note: the temporary was destroyed at the end of the full expression ‘(& it)->c10::impl::ListIterator<std::optional<at::Tensor>, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator*().c10::impl::ListElementReference<std::optional<at::Tensor>, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::get()’
  171 |     const auto& ivalue = (*it).get();
      |                      ~~~~~~~~~~~^~
/var/tmp/portage/sci-libs/caffe2-2.2.1-r1/work/pytorch-2.2.1/aten/src/ATen/core/boxing/impl/boxing.h: At global scope:
/var/tmp/portage/sci-libs/caffe2-2.2.1-r1/work/pytorch-2.2.1/aten/src/ATen/core/boxing/impl/boxing.h:41:104: error: expected primary-expression before ‘>’ token
   41 | struct has_ivalue_to<T, guts::void_t<decltype(std::declval<IValue>().to<T>())>>
      |                                                                                                        ^
/var/tmp/portage/sci-libs/caffe2-2.2.1-r1/work/pytorch-2.2.1/aten/src/ATen/core/boxing/impl/boxing.h:41:107: error: expected primary-expression before ‘)’ token
   41 | struct has_ivalue_to<T, guts::void_t<decltype(std::declval<IValue>().to<T>())>>
      |                                                                                                           ^
/var/tmp/portage/sci-libs/caffe2-2.2.1-r1/work/pytorch-2.2.1/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h: In lambda function:
/var/tmp/portage/sci-libs/caffe2-2.2.1-r1/work/pytorch-2.2.1/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h:154:24: warning: possibly dangling reference to a temporary [-Wdangling-reference]
  154 |         for (const at::Tensor& tensor : ivalue.toTensorList()) {
      |                        ^~~~~~
/var/tmp/portage/sci-libs/caffe2-2.2.1-r1/work/pytorch-2.2.1/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h:154:53: note: the temporary was destroyed at the end of the full expression ‘__for_begin .c10::impl::ListIterator<at::Tensor, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator*().c10::impl::ListElementReference<at::Tensor, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator std::conditional_t<true, const at::Tensor&, at::Tensor>()’
  154 |         for (const at::Tensor& tensor : ivalue.toTensorList()) {
      |                                                     ^
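
The two hard errors above point at the `has_ivalue_to` detection trait in `boxing.h`, where a call to a member template (`.to<T>()`) sits directly inside `decltype` within a partial specialization. The sketch below reproduces the same detection idiom in isolation, moving the `decltype` into an alias template and adding the optional `template` disambiguator. `Value`, `to_result_t`, and `has_to` are hypothetical stand-ins, not PyTorch names, and this spelling has not been verified against nvcc 12.4 here; the patches linked in the comments are the authoritative fix.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for c10::IValue (an assumption, not the real class):
// to<T>() takes part in overload resolution only for arithmetic T.
struct Value {
  double payload = 0.0;

  template <class T, std::enable_if_t<std::is_arithmetic<T>::value, int> = 0>
  T to() const {
    return static_cast<T>(payload);
  }
};

// Result type of Value::to<T>(), named through an alias template.
// The explicit `template` keyword marks `to` as a member template, so the
// following `<` cannot be read as a less-than operator.
template <class T>
using to_result_t = decltype(std::declval<Value>().template to<T>());

// Primary template: assume the conversion does not exist.
template <class T, class Enable = void>
struct has_to : std::false_type {};

// Chosen only when to_result_t<T> is well-formed (SFINAE via std::void_t).
template <class T>
struct has_to<T, std::void_t<to_result_t<T>>> : std::true_type {};

static_assert(has_to<int>::value, "to<int>() should be detected");
static_assert(!has_to<std::string>::value, "to<std::string>() should not be detected");
```

Because the alias template is substituted in the immediate context, an ill-formed `to<T>()` still results in `has_to<T>` being `false` rather than a hard error. The block needs C++17 for `std::void_t`.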

Comment 1 znjameswu 2024-04-01 02:53:57 UTC

Created attachment 889168 [details]
emerge --info

Comment 2 znjameswu 2024-04-01 02:54:16 UTC

Created attachment 889169 [details]
emerge --pqv

Comment 3 znjameswu 2024-04-01 03:06:24 UTC

Created attachment 889175 [details]
head of build.log

The build.log is too large (7.8 MB), so I truncated it and attached only its head and tail sections. I hope that's sufficient.

Comment 4 znjameswu 2024-04-01 03:07:47 UTC

Created attachment 889176 [details]
tail of build.log

Comment 5 Ștefan Talpalaru 2024-04-01 05:10:48 UTC

Upstream issue, with patches: https://github.com/pytorch/pytorch/issues/122169

These two patches fix the build for me:

https://github.com/pytorch/pytorch/commit/2a440348958b3f0a2b09458bd76fe5959b371c0c.patch

https://gitlab.archlinux.org/archlinux/packaging/packages/python-pytorch/-/blob/main/python-pytorch-fix-cuda-12_4.patch?ref_type=heads

Comment 6 Mike Gilbert gentoo-dev 2024-04-01 19:22:10 UTC

(In reply to znjameswu from comment #3)
> The build.log is too large (7.8MB). I can only truncate it and paste the
> head and the tail section of the log. Hope that's sufficient

In the future, please compress the log using gzip to make it small enough to attach.

Comment 7 Larry the Git Cow gentoo-dev 2024-04-04 09:20:53 UTC

The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=e74afa2c5f42fbc0ff31574a7796c7ee39727e9f

commit e74afa2c5f42fbc0ff31574a7796c7ee39727e9f
Author:     Alfredo Tupone <tupone@gentoo.org>
AuthorDate: 2024-04-04 09:19:35 +0000
Commit:     Alfredo Tupone <tupone@gentoo.org>
CommitDate: 2024-04-04 09:20:03 +0000

    sci-libs/caffe2: add 2.2.2
    
    Closes: https://bugs.gentoo.org/928339
    Signed-off-by: Alfredo Tupone <tupone@gentoo.org>

 sci-libs/caffe2/Manifest            |   1 +
 sci-libs/caffe2/caffe2-2.2.2.ebuild | 269 ++++++++++++++++++++++++++++++++++++
 2 files changed, 270 insertions(+)

Comment 8 Ștefan Talpalaru 2024-04-04 13:07:59 UTC

This bug has not been fixed by caffe2-2.2.2 and the two patches I linked earlier are still required.

Comment 9 Tupone Alfredo gentoo-dev 2024-04-05 18:43:12 UTC

(In reply to Ștefan Talpalaru from comment #8)
> This bug has not been fixed by caffe2-2.2.2 and the two patches I linked
> earlier are still required.

I know. I put in a blocker and will wait for upstream, unless they take too long.

Thank you.

Comment 10 Ștefan Talpalaru 2024-05-02 10:03:16 UTC

> I put a blocking, wait for upstream, unless they 'll take too long.

Upstream fixed it in 2.3.0, but the ebuild was not updated to remove the dep version limitation in "<dev-util/nvidia-cuda-toolkit-12.4.0:=[profiler]".

Comment 11 Ștefan Talpalaru 2024-05-02 10:12:48 UTC

Correction: it was a partial fix. They missed a patch - https://github.com/pytorch/pytorch/issues/122169