682672 – sci-libs/tensorflow-1.13.1 with dev-util/nvidia-cuda-toolkit-10.1.105 - ?

Summary: sci-libs/tensorflow-1.13.1 with dev-util/nvidia-cuda-toolkit-10.1.105 - ?

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Jason Zaman

URL:
Whiteboard:
Keywords:	PATCH

Depends on:
Blocks:

Reported:	2019-04-06 11:25 UTC by Jura
Modified:	2019-06-25 21:36 UTC
CC List:	2 users

See Also:	https://github.com/tensorflow/tensorflow/issues/26155
Package list:
Runtime testing required:	---

Description Jura 2019-04-06 11:25:01 UTC

1. Copied protobuf_temp_fix_cuda10.1.patch to /etc/portage/patches/dev-libs/protobuf-3.6.1.3/ and rebuilt dev-libs/protobuf.

2. Created symlinks (existing file -> link name):
libcublas.so.10.1.0.105 -> libcublas.so.10.1
libcufft.so.10.1.105 -> libcufft.so.10.1
libcurand.so.10.1.105 -> libcurand.so.10.1
libcusolver.so.10.1.105 -> libcusolver.so.10.1

3. tensorflow built successfully with nvidia-cuda-toolkit-10.1.105 and cudnn-7.5.0.56.

4. ln -s /opt/cuda/lib64/libcublas.so.10.1.0.105 /usr/lib64/libcublas.so.10.1

5. python cifar10.py (simple tensorflow.python.keras model)


works fine:


2019-04-06 14:22:40.053167: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-06 14:22:40.053719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.815
pciBusID: 0000:01:00.0
totalMemory: 7.76GiB freeMemory: 6.76GiB
2019-04-06 14:22:40.053732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-06 14:22:40.054277: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-06 14:22:40.054285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-04-06 14:22:40.054288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-04-06 14:22:40.054476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6579 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
WARNING:tensorflow:From /usr/lib64/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/lib64/python3.6/site-packages/tensorflow/python/keras/layers/core.py:143: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Train on 45000 samples, validate on 5000 samples
WARNING:tensorflow:From /usr/lib64/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2019-04-06 14:22:40.858529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-06 14:22:40.858574: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-06 14:22:40.858580: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-04-06 14:22:40.858584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-04-06 14:22:40.858792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6579 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
Epoch 1/25
2019-04-06 14:22:41.397546: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.1 locally
 - 6s - loss: 1.7876 - acc: 0.3388 - val_loss: 1.4605 - val_acc: 0.4720
Epoch 2/25
 - 5s - loss: 1.3256 - acc: 0.5225 - val_loss: 1.1906 - val_acc: 0.5732
Epoch 3/25
 - 5s - loss: 1.1481 - acc: 0.5893 - val_loss: 0.9876 - val_acc: 0.6540
Epoch 4/25
 - 5s - loss: 1.0334 - acc: 0.6350 - val_loss: 0.9249 - val_acc: 0.6732
Epoch 5/25
 - 5s - loss: 0.9532 - acc: 0.6674 - val_loss: 0.8389 - val_acc: 0.7138
.......
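
Steps 2 and 4 above can be sketched in Python roughly as follows. This is only an illustration, not the exact commands that were run: the helper name `link_sonames` is hypothetical, and it assumes the step 2 links live next to the libraries (the bug only gives /opt/cuda/lib64 for step 4). The demo runs in a temporary directory rather than touching /opt/cuda/lib64 or /usr/lib64.

```python
import os
import tempfile
from pathlib import Path

# Mapping taken from step 2: fully versioned file -> short link name.
CUDA_LINKS = {
    "libcublas.so.10.1.0.105": "libcublas.so.10.1",
    "libcufft.so.10.1.105": "libcufft.so.10.1",
    "libcurand.so.10.1.105": "libcurand.so.10.1",
    "libcusolver.so.10.1.105": "libcusolver.so.10.1",
}


def link_sonames(lib_dir, links=CUDA_LINKS, link_dir=None):
    """Create <link_dir>/<short name> symlinks pointing at <lib_dir>/<versioned file>.

    Hypothetical helper. Missing targets and already existing link names
    are skipped. Returns the list of links that were created.
    """
    lib_dir = Path(lib_dir)
    link_dir = Path(link_dir) if link_dir is not None else lib_dir
    created = []
    for versioned, short in links.items():
        target = lib_dir / versioned
        link = link_dir / short
        if not target.exists() or link.exists() or link.is_symlink():
            continue
        os.symlink(target, link)
        created.append(link)
    return created


if __name__ == "__main__":
    # Throwaway directory with empty stand-in files for the real libraries.
    with tempfile.TemporaryDirectory() as tmp:
        for name in CUDA_LINKS:
            (Path(tmp) / name).touch()
        for link in link_sonames(tmp):
            print(link.name, "->", os.readlink(link))
```

On a real system this would need root, with `lib_dir` set to /opt/cuda/lib64 and, for step 4, `link_dir` set to /usr/lib64 with only the libcublas entry passed in `links`.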

Comment 1 Jura 2019-04-06 11:25:53 UTC

source: https://github.com/tensorflow/tensorflow/issues/26155#issuecomment-476705051

Comment 2 Jeroen Roovers (RETIRED) gentoo-dev 2019-04-06 11:42:11 UTC

So you wanted to talk about a patch?

Comment 3 Jura 2019-04-06 11:49:04 UTC

Yes. This patch for dev-libs/protobuf works for me.

Comment 4 Raimund 2019-04-10 12:02:12 UTC

I can confirm that tensorflow compiles without problems this way (CUDA 10.1).
However, it did not compile with my old version of dev-libs/flatbuffers (1.8.0).

It works fine with flatbuffers-1.10.0, so the flatbuffers dependency in the RDEPEND section of the tensorflow ebuild "(python? ( ..." should be updated.


https://github.com/tensorflow/tensorflow/commit/b62cadc1513a73c1673094c9e35421c8a6c17645
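
To illustrate the version bound being asked for, here is a minimal sketch of a numeric version check. It is not how Portage compares versions (in an ebuild the bound would be written as a dependency atom such as `>=dev-libs/flatbuffers-1.10.0`). The function names are hypothetical, and treating 1.10.0 as the minimum is an assumption: only 1.8.0 (fails) and 1.10.0 (works) were tested here.

```python
def parse_version(text):
    """Split a plain dotted version such as '1.10.0' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(installed, minimum):
    """Return True when `installed` is at least `minimum`, compared numerically."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    # Assumed minimum, based only on the two versions reported in this comment.
    for version in ("1.8.0", "1.10.0"):
        print(version, satisfies_minimum(version, "1.10.0"))
```

The components are compared as integers because a plain string comparison would rank "1.8.0" above "1.10.0".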

Comment 5 Jason Zaman gentoo-dev 2019-06-19 09:18:54 UTC

Use tensorflow-1.14.0 for CUDA 10.1 instead.

Comment 6 LE GARREC Vincent 2019-06-25 21:35:27 UTC

The problem is still there with tensorflow-1.14 and CUDA 10.1.105. I had to update to dev-util/nvidia-cuda-toolkit-10.1.168 (an ebuild for it can easily be created from dev-util/nvidia-cuda-toolkit-10.1.105-r1).