Bug 695428 - sci-libs/tensorflow: ebuilds break configuration of CUDA compute capability level
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: Jason Zaman
Reported: 2019-09-23 02:03 UTC by Soren Harward
Modified: 2019-12-08 17:26 UTC
Description Soren Harward 2019-09-23 02:03:23 UTC
When compiling tensorflow with CUDA support, the user can set the CUDA compute capability level (see https://developer.nvidia.com/cuda-gpus for more info).  It's analogous to gcc's -march flag: GPU kernels compiled for a lower CUDA compute capability will run on hardware with a higher capability, but less efficiently.  Matching the tensorflow build to the CUDA compute level supported by the user's GPU is preferable, especially for compute-intensive applications like tensorflow.

Tensorflow's ./configure allows the compute capability level to be set in one of three ways:

#1. Running /opt/cuda/extras/demo_suite/deviceQuery and parsing the output
#2. Prompting the user during the ./configure script
#3. Reading the environment variable TF_CUDA_COMPUTE_CAPABILITIES

#1 fails because sandboxing doesn't allow external programs to run.  #2 fails because the ebuild doesn't allow interaction with the ./configure script while it's running.  #3 fails because even if the user exports TF_CUDA_COMPUTE_CAPABILITIES in the shell before emerging tensorflow, that environment variable isn't passed through to the configuration script.  So tensorflow, when built on Gentoo, always falls back to the default CUDA compute levels, which means building GPU kernels for both the 3.5 and 7.0 capability levels.  This is bad for two reasons: users whose GPUs have a different capability level get less efficient compute kernels, and the build takes significantly longer because every GPU compute kernel is compiled twice.

This ebuild defect can be worked around by setting TF_CUDA_COMPUTE_CAPABILITIES via /etc/portage/env.  Ebuild hackers may be able to figure out a better way to handle this.  But for now, I suggest displaying a warning message at configure time if the TF_CUDA_COMPUTE_CAPABILITIES environment variable is not set, along with brief instructions for setting it, so that users know that their build may be falling back to the wrong compute capability level(s).
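
For illustration, the /etc/portage/env workaround might look roughly like the fragment below.  This is only a sketch: the file name tensorflow-cuda.conf is made up for the example, and 7.5 is a placeholder that should be replaced with the capability of the GPU actually installed.

```shell
# File: /etc/portage/env/tensorflow-cuda.conf  (hypothetical file name)
# Placeholder value; use your GPU's compute capability.
TF_CUDA_COMPUTE_CAPABILITIES="7.5"

# File: /etc/portage/package.env
# Apply the env file above only to the tensorflow package:
#   sci-libs/tensorflow tensorflow-cuda.conf
```

The package.env entry is shown as a comment because it is a Portage mapping line (package atom, then env file name), not a shell assignment.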

Reproducible: Always
Comment 1 Soren Harward 2019-10-15 19:01:05 UTC
Update: I was wrong about #3.  emerge will find and use the TF_CUDA_COMPUTE_CAPABILITIES variable if it is set in the shell.  So maybe the best way to fix this bug is to have the ebuild display a warning during the configure stage if this environment variable is not set.  The warning could be something like:


WARNING: Tensorflow is being built with its default CUDA compute capabilities: 3.5 and 7.0.  These may not be optimal for your GPU.

To configure Tensorflow with the CUDA compute capability that is optimal for your GPU, set the environment variable TF_CUDA_COMPUTE_CAPABILITIES and then re-emerge tensorflow.  For example, to use CUDA capability 7.5, run:

$ TF_CUDA_COMPUTE_CAPABILITIES=7.5 emerge sci-libs/tensorflow

You can look up your GPU's CUDA compute capability at https://developer.nvidia.com/cuda-gpus or by running

$ /opt/cuda/extras/demo_suite/deviceQuery | grep "CUDA Capability"
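
The check itself could be sketched along the lines below.  This is not the ebuild's actual code: the function names are hypothetical, the messages are printed to stderr so the snippet runs outside Portage (an ebuild would presumably use its ewarn helper instead), and the parser assumes deviceQuery prints lines of the form "CUDA Capability Major/Minor version number:    7.5".

```shell
# Hypothetical helper: warn when TF_CUDA_COMPUTE_CAPABILITIES is unset or empty.
# Prints to stderr and always returns 0 so it never aborts the build.
warn_default_cuda_caps() {
    if [ -z "${TF_CUDA_COMPUTE_CAPABILITIES:-}" ]; then
        {
            echo "WARNING: Tensorflow is being built with its default CUDA compute capabilities: 3.5 and 7.0."
            echo "These may not be optimal for your GPU."
            echo "Set TF_CUDA_COMPUTE_CAPABILITIES (for example, 7.5) and re-emerge tensorflow."
        } >&2
    fi
    return 0
}

# Hypothetical helper: read deviceQuery output on stdin and print the
# capability (the last field of each "CUDA Capability" line), one per GPU.
# Assumes the line format described in the text above.
parse_cuda_caps() {
    awk '/CUDA Capability/ { print $NF }'
}
```

For example, on a machine where deviceQuery is allowed to run, its output could be piped through the parser: /opt/cuda/extras/demo_suite/deviceQuery | parse_cuda_caps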
Comment 2 Jason Zaman gentoo-dev 2019-12-08 10:44:48 UTC
you can also just put 
TF_CUDA_COMPUTE_CAPABILITIES=7.5
in your make.conf, but yeah i should put a note in the ebuild.
Comment 3 Larry the Git Cow gentoo-dev 2019-12-08 17:26:50 UTC
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=a3dc69074dcad86d4c95e024d231c90c62483152

commit a3dc69074dcad86d4c95e024d231c90c62483152
Author:     Jason Zaman <perfinion@gentoo.org>
AuthorDate: 2019-12-08 11:18:22 +0000
Commit:     Jason Zaman <perfinion@gentoo.org>
CommitDate: 2019-12-08 17:25:26 +0000

    sci-libs/tensorflow: fix bazel, jsoncpp deps
    
    Also add a message about setting cuda compute capability
    
    Closes: https://bugs.gentoo.org/695428
    Closes: https://bugs.gentoo.org/697864
    Closes: https://bugs.gentoo.org/702222
    Package-Manager: Portage-2.3.79, Repoman-2.3.16
    Signed-off-by: Jason Zaman <perfinion@gentoo.org>

 sci-libs/tensorflow/Manifest                     |  1 +
 sci-libs/tensorflow/tensorflow-1.15.0_rc0.ebuild | 16 ++++++++++++++--
 sci-libs/tensorflow/tensorflow-2.0.0.ebuild      | 16 ++++++++++++++--
 sci-libs/tensorflow/tensorflow-2.1.0_rc0.ebuild  | 19 ++++++++++++++++---
 4 files changed, 45 insertions(+), 7 deletions(-)