695428 – sci-libs/tensorflow: ebuilds break configuration of CUDA compute capability level

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Jason Zaman

Reported:	2019-09-23 02:03 UTC by Soren Harward
Modified:	2019-12-08 17:26 UTC
CC List:	0 users

Description Soren Harward 2019-09-23 02:03:23 UTC

When compiling tensorflow with CUDA support, the user can set the CUDA compute capability level (see https://developer.nvidia.com/cuda-gpus for more info). It's analogous to gcc's -march flag: GPU kernels compiled for a lower CUDA compute capability will run on hardware with a higher capability, but less efficiently. Matching the tensorflow build to the compute capability supported by the user's GPU is preferable, especially for compute-intensive applications like tensorflow.

Tensorflow's ./configure allows the compute capability level to be set one of three ways:

#1. Running /opt/cuda/extras/demo_suite/deviceQuery and parsing the output
#2. Prompting the user during the ./configure script
#3. Reading the environment variable TF_CUDA_COMPUTE_CAPABILITIES

#1 fails because sandboxing doesn't allow external programs to run. #2 fails because the ebuild doesn't allow interaction with the ./configure script while it's running. #3 fails because even if the user exports TF_CUDA_COMPUTE_CAPABILITIES in the shell before emerging tensorflow, that environment variable isn't passed through to the configuration script. So tensorflow, when built on Gentoo, always falls back to the default CUDA compute levels, which means building GPU kernels for both the 3.5 and 7.0 capability levels. This is bad for users whose GPUs have a different capability level, because they get less efficient compute kernels. It also makes building tensorflow take significantly longer, because all the GPU compute kernels have to be compiled twice.
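
For reference, what option #1 amounts to can be sketched in a few lines of shell. This is only an illustration, not the code in Tensorflow's configure script: the function name parse_cuda_capabilities is hypothetical, and it assumes deviceQuery prints one line per GPU of the form "CUDA Capability Major/Minor version number:    7.5".

```shell
# Hypothetical helper: read deviceQuery output on stdin and print the
# detected capability levels as a comma-separated list (e.g. "6.1,7.5").
# Assumes lines of the form
#   "  CUDA Capability Major/Minor version number:    7.5"
parse_cuda_capabilities() {
    awk -F: '/CUDA Capability/ { gsub(/[ \t]/, "", $2); print $2 }' \
        | sort -u \
        | paste -sd, -
}
```

Outside the sandbox it could be fed with: /opt/cuda/extras/demo_suite/deviceQuery | parse_cuda_capabilities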

This ebuild defect can be worked around by setting TF_CUDA_COMPUTE_CAPABILITIES via /etc/portage/env. Ebuild hackers may be able to figure out a better way to handle this. But for now, I suggest displaying a warning message at configure time if the TF_CUDA_COMPUTE_CAPABILITIES environment variable is not set, along with brief instructions for setting it, so that users know that their build may be falling back to the wrong CUDA compute capability level(s).
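
A minimal sketch of the /etc/portage/env workaround, assuming a GPU with compute capability 7.5. The file name tensorflow-cuda.conf is an arbitrary, hypothetical choice; only the two directory locations are Portage conventions.

```
# /etc/portage/env/tensorflow-cuda.conf  (hypothetical file name)
TF_CUDA_COMPUTE_CAPABILITIES="7.5"

# /etc/portage/package.env  (applies the file above to this package only)
sci-libs/tensorflow tensorflow-cuda.conf
```

Scoping the variable through package.env keeps it out of the environment of every other package build.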

Reproducible: Always

Comment 1 Soren Harward 2019-10-15 19:01:05 UTC

Update: I was wrong about #3.  emerge will find and use the TF_CUDA_COMPUTE_CAPABILITIES variable if it is set in the shell.  So maybe the best way to fix this bug is to have the ebuild display a warning during the configure stage if this environment variable is not set.  The warning could be something like:


WARNING: Tensorflow is being built with its default CUDA compute capabilities: 3.5 and 7.0.  These may not be optimal for your GPU.

To configure Tensorflow with the CUDA compute capability that is optimal for your GPU, set the environment variable TF_CUDA_COMPUTE_CAPABILITIES and then re-emerge tensorflow.  For example, to use CUDA capability 7.5, run:

$ TF_CUDA_COMPUTE_CAPABILITIES=7.5 emerge sci-libs/tensorflow

You can look up your GPU's CUDA compute capability at https://developer.nvidia.com/cuda-gpus or by running

$ /opt/cuda/extras/demo_suite/deviceQuery | grep "CUDA Capability"
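
The check behind such a warning could look roughly like this. It is a sketch under stated assumptions, not the code from any actual ebuild: the function name tf_warn_default_capabilities is hypothetical, and a real ebuild would presumably emit each line with Portage's ewarn helper instead of printf.

```shell
# Hypothetical helper: warn when TF_CUDA_COMPUTE_CAPABILITIES is unset or empty.
tf_warn_default_capabilities() {
    if [ -z "${TF_CUDA_COMPUTE_CAPABILITIES:-}" ]; then
        # In an ebuild, these lines would go through ewarn rather than printf.
        printf '%s\n' \
            "WARNING: Tensorflow is being built with its default CUDA compute capabilities: 3.5 and 7.0." \
            "Set TF_CUDA_COMPUTE_CAPABILITIES (for example, 7.5) and re-emerge to target your GPU." >&2
    fi
    # Never fail the build over a missing setting; this is advisory only.
    return 0
}

tf_warn_default_capabilities
```

The function deliberately returns success in both cases, so an unset variable produces a notice without aborting the build.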

Comment 2 Jason Zaman gentoo-dev 2019-12-08 10:44:48 UTC

You can also just put
TF_CUDA_COMPUTE_CAPABILITIES=7.5
in your make.conf, but yeah, I should put a note in the ebuild.

Comment 3 Larry the Git Cow gentoo-dev 2019-12-08 17:26:50 UTC

The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=a3dc69074dcad86d4c95e024d231c90c62483152

commit a3dc69074dcad86d4c95e024d231c90c62483152
Author:     Jason Zaman <perfinion@gentoo.org>
AuthorDate: 2019-12-08 11:18:22 +0000
Commit:     Jason Zaman <perfinion@gentoo.org>
CommitDate: 2019-12-08 17:25:26 +0000

    sci-libs/tensorflow: fix bazel, jsoncpp deps
    
    Also add a message about setting cuda compute capability
    
    Closes: https://bugs.gentoo.org/695428
    Closes: https://bugs.gentoo.org/697864
    Closes: https://bugs.gentoo.org/702222
    Package-Manager: Portage-2.3.79, Repoman-2.3.16
    Signed-off-by: Jason Zaman <perfinion@gentoo.org>

 sci-libs/tensorflow/Manifest                     |  1 +
 sci-libs/tensorflow/tensorflow-1.15.0_rc0.ebuild | 16 ++++++++++++++--
 sci-libs/tensorflow/tensorflow-2.0.0.ebuild      | 16 ++++++++++++++--
 sci-libs/tensorflow/tensorflow-2.1.0_rc0.ebuild  | 19 ++++++++++++++++---
 4 files changed, 45 insertions(+), 7 deletions(-)