656582 – sci-libs/tensorflow-1.8.0-r1: can't install with cuda-9.1

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Jason Zaman

URL:
Whiteboard:
Keywords:

Duplicates (1):	659462
Depends on:
Blocks:

Reported:	2018-05-26 15:42 UTC by Yi Yang
Modified:	2018-07-08 15:40 UTC
CC List:	3 users

See Also:
Package list:
Runtime testing required:	---

Description Yi Yang 2018-05-26 15:42:13 UTC

The current ebuild seems to rely on the default behavior of the ./configure script shipped with Tensorflow, which includes a default value of 9.0 for the CUDA version. As a consequence, trying to install tensorflow with any CUDA version other than 9.0 fails.

Snippet of build log:

 * python2_7: running bazel_multibuild_wrapper do_configure
WARNING: ignoring LD_PRELOAD in environment.
Extracting Bazel installation...
You have bazel 0.13.0- (@non-git) installed.
Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 

Please specify the location where CUDA 9.0 toolkit is installed. Refer to README.md for more details. [Default is /opt/cuda]: 

Invalid path to CUDA 9.0 toolkit. /opt/cuda/lib64/libcudart.so.9.0 cannot be found
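
One way to avoid the hard-coded 9.0 default is to derive the version from the installed runtime library and hand it to the configure script through its environment variables. The following is only a sketch under assumptions, not the fix that went into the ebuild: the helper name detect_cuda_version is hypothetical, the /opt/cuda path is taken from the log above, and it is written in plain POSIX shell so it can be tried outside Portage.

```shell
# Hypothetical helper: print the MAJOR.MINOR version of the CUDA runtime found
# in the given library directory, or return non-zero if none is found.
# Note: it sets plain (global) shell variables to stay POSIX-compatible.
detect_cuda_version() {
    for lib in "$1"/libcudart.so.*; do
        [ -e "$lib" ] || continue        # glob did not match anything
        ver=${lib##*/libcudart.so.}      # e.g. "9.1" or "9.1.85"
        case $ver in
            [0-9]*.[0-9]*)
                major=${ver%%.*}
                rest=${ver#*.}
                minor=${rest%%.*}
                echo "${major}.${minor}"
                return 0
                ;;
        esac
    done
    return 1
}

# Assumed usage: pre-answer the configure prompts shown in the log above.
if cuda_ver=$(detect_cuda_version /opt/cuda/lib64); then
    export TF_NEED_CUDA=1
    export TF_CUDA_VERSION="$cuda_ver"
    export CUDA_TOOLKIT_PATH=/opt/cuda
fi
```

With a CUDA 9.1 toolkit installed, the script would then look for libcudart.so.9.1 instead of libcudart.so.9.0.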

Comment 1 Yi Yang 2018-05-27 15:58:53 UTC

It turns out that this line in the ebuild is also problematic:

> export GCC_HOST_COMPILER_PATH=$(tc-getCC)

Here is some explanation: tc-getCC returns the *name* of the C compiler (usually GCC), such as "x86_64-pc-linux-gnu-gcc". However, the configuration script of Tensorflow expects an absolute path of the C compiler. Hence it will complain that it cannot find the provided toolchain.

I worked around this problem by sneaking a "which" into that line, but that is probably not cross-compiler friendly or prefix friendly, and my vim syntax highlighting flags it as an error. Maybe a better fix would be to sed the configuration script.
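
For illustration, the name-to-path lookup can also be done with the shell's own "command -v" rather than the external "which". This is a minimal sketch under assumptions: resolve_tool_path is a hypothetical helper, and "${CC:-gcc}" stands in for the output of tc-getCC, which is only available inside an ebuild. It does not address the cross-compile and prefix concerns mentioned above.

```shell
# Hypothetical helper: print the absolute path of a tool found in PATH.
# Returns non-zero if the name is unknown or resolves to a shell builtin,
# function or alias instead of a file.
resolve_tool_path() {
    tool_path=$(command -v "$1") || return 1
    case $tool_path in
        /*) echo "$tool_path" ;;
        *)  return 1 ;;
    esac
}

# Assumed usage: give the configure script an absolute compiler path.
if cc_path=$(resolve_tool_path "${CC:-gcc}"); then
    export GCC_HOST_COMPILER_PATH="$cc_path"
fi
```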

Comment 2 younky.yang 2018-06-04 10:05:29 UTC

I can confirm the issue with CUDA 9.1. When I build tensorflow directly from source, CUDA 9.1 works as expected, but I don't know how to change the ebuild so that it detects the installed CUDA version.

Comment 3 Jason Zaman gentoo-dev 2018-06-13 04:19:55 UTC

(In reply to Yi Yang from comment #1)
> It turns out that this line in the ebuild is also problematic:
> 
> > export GCC_HOST_COMPILER_PATH=$(tc-getCC)
> 
> Here is some explanation: tc-getCC returns the *name* of the C compiler
> (usually GCC), such as "x86_64-pc-linux-gnu-gcc". However, the configuration
> script of Tensorflow expects an absolute path of the C compiler. Hence it
> will complain that it cannot find the provided toolchain.
> 
> I resolved this problem by sneaking a "which" in that line, but that is
> probably not so cross-compiler friendly or prefix friendly, as my vim syntax
> highlighting suggested that is an error. Maybe a better fix would be to sed
> the configuration script.

I changed this to =$(which $(tc-getCC)) in the 1.9_rc0 ebuild; does that one work any better? It's probably not correct in the long run, but it's worth a shot for now.

Comment 4 Jason Zaman gentoo-dev 2018-06-13 04:22:41 UTC

(In reply to younky.yang from comment #2)
> I can confirm the issue with CUDA 9.1, actually if I just built tensorflow
> from source, that CUDA 9.1 does work as expected. But I just don't know how
> to change the ebuild to make sure it can detect the installed CUDA version.

Can you show me the differences between the .tf_configure.bazelrc from your working source build and the one in /var/tmp/portage/sci-libs/tensorflow*/work/tensorflow*python3_6/.tf_configure.bazelrc? Hopefully they differ in some setting that we can add to the ebuild.
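
A comparison like the one requested could be produced roughly as follows. This is a sketch only: diff_bazelrc is a hypothetical helper, and sorting the lines first is an assumption made to keep the output short. It hides differences in option order, which is usually acceptable for a first look.

```shell
# Hypothetical helper: print a unified diff of two bazelrc files after sorting
# their lines, so that option order alone does not show up as a difference.
# Exit status follows diff: 0 = same settings, 1 = settings differ.
diff_bazelrc() {
    sorted_a=$(mktemp) || return 2
    sorted_b=$(mktemp) || { rm -f "$sorted_a"; return 2; }
    sort "$1" > "$sorted_a"
    sort "$2" > "$sorted_b"
    diff -u "$sorted_a" "$sorted_b"
    status=$?
    rm -f "$sorted_a" "$sorted_b"
    return "$status"
}

# Assumed usage (both paths are placeholders for the two build trees):
#   diff_bazelrc /path/to/source-build/.tf_configure.bazelrc \
#                /path/to/portage-build/.tf_configure.bazelrc
```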

Comment 5 Jason Zaman gentoo-dev 2018-06-27 06:59:03 UTC

I just pushed sci-libs/tensorflow-1.9.0_rc1-r2 to the tree; can you try it out?

I added stuff to set the cudnn and cuda versions properly now.
If you need to set the CUDA compute capabilities, you can set e.g.
TF_CUDA_COMPUTE_CAPABILITIES="6.1"
in your make.conf. I don't have a CUDA GPU yet, so it's still untested, but it does build for me at least.
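
As a concrete fragment, the setting would sit in make.conf like this. The file location is the usual Portage default and is an assumption here; the value 6.1 is just the example from above and has to match the compute capability of the GPU actually installed.

```shell
# /etc/portage/make.conf (assumed default location)
# Compute capability of the target GPU; "6.1" is only the example value.
TF_CUDA_COMPUTE_CAPABILITIES="6.1"
```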

I also added a system-libs USE-flag which unbundles a bunch of deps. It would be cool if you guys could test with that on too.

Comment 6 Jason Zaman gentoo-dev 2018-06-28 06:07:16 UTC

*** Bug 659462 has been marked as a duplicate of this bug. ***

Comment 7 ZongyuZ 2018-06-28 08:14:57 UTC

(In reply to Jason Zaman from comment #5)
> I just pushed sci-libs/tensorflow-1.9.0_rc1-r2 to the tree, can you try it
> out.
> 
> I added stuff to set the cudnn and cuda versions properly now.
> If you need to set the cuda capabilities, you can set eg 
> TF_CUDA_COMPUTE_CAPABILITIES="6.1"
> in your make.conf. I don't have a CUDA GPU yet so its still untested but it
> does build for me at least.
> 
> I also added a system-libs USE-flag which unbundles a bunch of deps. It
> would be cool if you guys could test with that on too.

Hi Jason, I've tried _rc1-r2, and it printed out a message like this:
">=dev-util/nvidia-cuda-toolkit-9.0[profiler] required by (sci-libs/tensorflow-1.9.0_rc1-r2:0/0::gentoo, ebuild scheduled for merge".
But I can't emerge any >=nvidia-drivers-391.0, since those versions do not support my GPU, and >=cuda-toolkit-9.0 depends on >=nvidia-drivers-391.0 (or has other dependencies like this).
So I still can't emerge tensorflow right now, and I still need your help. Thanks!

Comment 8 Jason Zaman gentoo-dev 2018-07-04 02:44:11 UTC

(In reply to ZongyuZ from comment #7)
> Hi Jason, I've tried _rc1-r2, and it printed out a message like this:
> ">=dev-util/nvidia-cuda-toolkit-9.0[profiler] required by
> (sci-libs/tensorflow-1.9.0_rc1-r2:0/0::gentoo, ebuild scheduled for merge".
> But I can't emerge any >=nvidia-drivers-391.0, which is not capable for my
> gpu, and >=cuda-toolkit-9.0 depends on >=nvidia-drivers-391.0, or other
> dependencies like this.
> So the situation is that I still can't emerge tensorflow right now, and I
> still need your help. Thanks!

I am preparing a bump to _rc2 now and will release it soon after some more tests. If I lower the nvidia-cuda-toolkit version requirement to >=8 instead, will that work for you? The versions before _rc1-r2 didn't do the CUDA setup properly, so there's no point trying to get them working. You're better off going up instead.

Comment 9 ZongyuZ 2018-07-06 14:10:01 UTC

> tests. If I lower the nvidia-cuda-toolkit version requirement to >=8
> instead, will that work for you? 

I modified the ebuild of tensorflow-...-rc2 and tried to compile it, but it seems cuda-8.0 is not going to work with glibc-2.26.

Here is a link:
https://devtalk.nvidia.com/default/topic/1023776/-request-add-nvcc-compatibility-with-glibc-2-26/

So I think lowering the requirement won't work for me. And thank you for your help. 
I'll try to downgrade glibc later (or just give up cuda)...

Comment 10 Jason Zaman gentoo-dev 2018-07-08 15:40:43 UTC

tensorflow-1.9.0_rc2 should work with cuda now. I lowered the cuda requirement to >=8 too.