Bug 800824 - sci-libs/tensorflow-2.5.0 - crosstool_wrapper_driver_is_not_gcc & class google::protobuf::util::status_internal::Status has no member named error_message
Summary: sci-libs/tensorflow-2.5.0 - crosstool_wrapper_driver_is_not_gcc & class googl...
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: AMD64 Linux
Importance: Normal normal
Assignee: Jason Zaman
URL:
Whiteboard:
Keywords:
Duplicates: 802660 804564 805305
Depends on:
Blocks:
 
Reported: 2021-07-06 10:28 UTC by Bjoern Olausson
Modified: 2022-11-18 04:15 UTC
CC List: 7 users

See Also:
Package list:
Runtime testing required: ---


Attachments
Tensorflow 2.5 build log (tensorflow-2.5.0-20210706-090837.log.bz2,110.16 KB, application/x-bzip)
2021-07-06 10:28 UTC, Bjoern Olausson
Details
emerge --info (emerge-info.log,7.11 KB, text/plain)
2021-07-06 10:29 UTC, Bjoern Olausson
Details
emerge -ept sci-libs/tensorflow (emerge-ept_sci-libs-tensorflow.log,47.25 KB, text/plain)
2021-07-06 10:30 UTC, Bjoern Olausson
Details
sci-libs:tensorflow-2.5.0:20210707-090010.log.bz2 (sci-libs_tensorflow-2.5.0_20210707-090010.log.bz2,251.04 KB, application/x-bzip)
2021-07-07 18:11 UTC, Bjoern Olausson
Details
tensorflow-2.5.0-r1.ebuild (tensorflow-2.5.0-r1.ebuild,15.23 KB, text/plain)
2021-07-14 19:13 UTC, Bjoern Olausson
Details
StatusMessage_TypeError.patch (StatusMessage_TypeError.patch,1.36 KB, patch)
2021-07-14 19:13 UTC, Bjoern Olausson
Details | Diff

Description Bjoern Olausson 2021-07-06 10:28:23 UTC
Created attachment 722338 [details]
Tensorflow 2.5 build log

Hej,

I tried to compile tensorflow 2.4 and 2.5; both fail.
A couple of months ago tensorflow 2.4 compiled just fine, but I had to uninstall Tensorflow because it was blocking the GCC update from 10.2 to 10.3 and the python 3.8/3.9 update.

Now I need Tensorflow again, so I masked GCC >= 10.3 and added python_targets_python3_8 to (hopefully) all packages Tensorflow depends on.
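(For reference, a minimal sketch of that setup - assuming the standard Portage config locations; adjust paths and versions to your own layout:)

```
# /etc/portage/package.mask
>=sys-devel/gcc-10.3

# /etc/portage/package.use
# enable the python3_8 target globally so Tensorflow's dependencies pick it up
*/* PYTHON_TARGETS: python3_8
```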

(In the meantime, the system was upgraded from a 4-core IvyBridge CPU with 24 GB RAM to a 6-core SkyLake CPU with 64 GB RAM.)

As a desperate measure I also tried to compile with GCC 9.3.0-r2; after that I did an "emerge -e @world" with GCC 10.2 - neither solved the issue.

I removed "-march=native" from CFLAGS, but it didn't help.
I even added a 64 GiB swapfile - just in case - but it didn't help.

I also tried the minimum and maximum supported versions of bazel and other dependencies (certainly not an exhaustive permutation, I admit...).

The errors I see in the attached build log:

[...]

[12,095 / 19,390] 6 actions running
    Compiling tensorflow/core/kernels/list_kernels.cu.cc; 24s local
    Compiling tensorflow/core/kernels/dynamic_stitch_op_gpu.cu.cc; 10s local
    Compiling .../kernels/strided_slice_op_gpu_number_types.cu.cc; 7s local
    Compiling tensorflow/core/kernels/example_parsing_ops.cc; 3s local
    Compiling tensorflow/core/kernels/linalg/matrix_set_diag_op.cc; 2s local
    Compiling tensorflow/core/kernels/list_kernels.cc; 1s local

ERROR: /var/tmp/portage/sci-libs/tensorflow-2.5.0/work/tensorflow-2.5.0-python3_8/tensorflow/core/kernels/BUILD:4393:18: C++ compilation of rule '//tensorflow/core/kernels:example_parsing_ops' failed (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command 
  (cd /var/tmp/portage/sci-libs/tensorflow-2.5.0/work/tensorflow-2.5.0-python3_8-bazel-base/execroot/org_tensorflow && \
  exec env - \
    CUDA_TOOLKIT_PATH=/opt/cuda \
    GCC_HOST_COMPILER_PATH=/usr/x86_64-pc-linux-gnu/gcc-bin/10.2.0/x86_64-pc-linux-gnu-gcc \
    HOME=/var/tmp/portage/sci-libs/tensorflow-2.5.0/homedir \
    KERAS_HOME=/var/tmp/portage/sci-libs/tensorflow-2.5.0/temp/.keras \
    PATH=/var/tmp/portage/sci-libs/tensorflow-2.5.0/temp/python3.8/bin:/usr/lib/portage/python3.9/ebuild-helpers/xattr:/usr/lib/portage/python3.9/ebuild-helpers:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin:/usr/lib/llvm/12/bin:/usr/lib/llvm/11/bin:/opt/cuda/bin \
    PWD=/proc/self/cwd \
    PYTHON_BIN_PATH=/usr/bin/python3.8 \
    PYTHON_LIB_PATH=/usr/lib/python3.8/site-packages \
    TF2_BEHAVIOR=1 \
    TF_CUDA_COMPUTE_CAPABILITIES=6.1 \
    TF_CUDA_PATHS=/opt/cuda \
    TF_CUDA_VERSION=11.1 \
    TF_CUDNN_VERSION=8.0 \
    TF_SYSTEM_LIBS=absl_py,astor_archive,astunparse_archive,boringssl,com_github_googlecloudplatform_google_cloud_cpp,com_github_grpc_grpc,com_google_protobuf,curl,cython,dill_archive,double_conversion,enum34_archive,flatbuffers,functools32_archive,gast_archive,gif,hwloc,icu,jsoncpp_git,libjpeg_turbo,lmdb,nasm,nsync,opt_einsum_archive,org_sqlite,pasta,pcre,png,pybind11,six_archive,snappy,tblib_archive,termcolor_archive,typing_extensions_archive,wrapt,zlib \
  external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -MD -MF bazel-out/k8-opt/bin/tensorflow/core/kernels/_objs/example_parsing_ops/example_parsing_ops.pic.d '-frandom-seed=bazel-out/k8-opt/bin/tensorflow/core/kernels/_objs/example_parsing_ops/example_parsing_ops.pic.o' -DTF_USE_SNAPPY -DEIGEN_MPL2_ONLY '-DEIGEN_MAX_ALIGN_BYTES=64' -iquote . -iquote bazel-out/k8-opt/bin -iquote external/com_google_absl -iquote bazel-out/k8-opt/bin/external/com_google_absl -iquote external/nsync -iquote bazel-out/k8-opt/bin/external/nsync -iquote external/eigen_archive -iquote bazel-out/k8-opt/bin/external/eigen_archive -iquote external/gif -iquote bazel-out/k8-opt/bin/external/gif -iquote external/libjpeg_turbo -iquote bazel-out/k8-opt/bin/external/libjpeg_turbo -iquote external/com_google_protobuf -iquote bazel-out/k8-opt/bin/external/com_google_protobuf -iquote external/com_googlesource_code_re2 -iquote bazel-out/k8-opt/bin/external/com_googlesource_code_re2 -iquote external/farmhash_archive -iquote bazel-out/k8-opt/bin/external/farmhash_archive -iquote external/fft2d -iquote bazel-out/k8-opt/bin/external/fft2d -iquote external/highwayhash -iquote bazel-out/k8-opt/bin/external/highwayhash -iquote external/zlib -iquote bazel-out/k8-opt/bin/external/zlib -iquote external/local_config_cuda -iquote bazel-out/k8-opt/bin/external/local_config_cuda -iquote external/local_config_rocm -iquote bazel-out/k8-opt/bin/external/local_config_rocm -iquote external/local_config_tensorrt -iquote bazel-out/k8-opt/bin/external/local_config_tensorrt -iquote external/double_conversion -iquote bazel-out/k8-opt/bin/external/double_conversion -iquote external/snappy -iquote bazel-out/k8-opt/bin/external/snappy -iquote external/curl -iquote bazel-out/k8-opt/bin/external/curl -iquote external/boringssl -iquote bazel-out/k8-opt/bin/external/boringssl -iquote external/jsoncpp_git -iquote bazel-out/k8-opt/bin/external/jsoncpp_git -Ibazel-out/k8-opt/bin/external/local_config_cuda/cuda/_virtual_includes/cuda_headers_virtual -Ibazel-out/k8-opt/bin/external/local_config_tensorrt/_virtual_includes/tensorrt_headers -Ibazel-out/k8-opt/bin/external/local_config_cuda/cuda/_virtual_includes/cudnn_header -isystem third_party/eigen3/mkl_include -isystem bazel-out/k8-opt/bin/third_party/eigen3/mkl_include -isystem external/eigen_archive -isystem bazel-out/k8-opt/bin/external/eigen_archive -isystem external/farmhash_archive/src -isystem bazel-out/k8-opt/bin/external/farmhash_archive/src -isystem external/local_config_cuda/cuda -isystem bazel-out/k8-opt/bin/external/local_config_cuda/cuda -isystem external/local_config_cuda/cuda/cuda/include -isystem bazel-out/k8-opt/bin/external/local_config_cuda/cuda/cuda/include -isystem external/local_config_rocm/rocm -isystem bazel-out/k8-opt/bin/external/local_config_rocm/rocm -isystem external/local_config_rocm/rocm/rocm/include -isystem bazel-out/k8-opt/bin/external/local_config_rocm/rocm/rocm/include -isystem external/local_config_rocm/rocm/rocm/include/rocrand -isystem bazel-out/k8-opt/bin/external/local_config_rocm/rocm/rocm/include/rocrand -isystem external/local_config_rocm/rocm/rocm/include/roctracer -isystem bazel-out/k8-opt/bin/external/local_config_rocm/rocm/rocm/include/roctracer -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -fPIC -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -Wall -fno-omit-frame-pointer -no-canonical-prefixes -fno-canonical-system-headers -DNDEBUG -g0 -O2 -ffunction-sections 
-fdata-sections -w -DAUTOLOAD_DYNAMIC_KERNELS -I/usr/include/jsoncpp '-std=c++14' -O2 -pipe -msse -msse2 -msse3 -msse4.1 -msse4.2 -mavx -mavx2 -mfma -DEIGEN_AVOID_STL_ARRAY -Iexternal/gemmlowp -Wno-sign-compare '-ftemplate-depth=900' -fno-exceptions '-DGOOGLE_CUDA=1' '-DTENSORFLOW_USE_NVCC=1' '-DTENSORFLOW_USE_XLA=1' -DINTEL_MKL -msse3 -pthread -DNV_CUDNN_DISABLE_EXCEPTION '-DGOOGLE_CUDA=1' -DNV_CUDNN_DISABLE_EXCEPTION '-DTENSORFLOW_USE_XLA=1' '-DINTEL_MKL=1' -c tensorflow/core/kernels/example_parsing_ops.cc -o bazel-out/k8-opt/bin/tensorflow/core/kernels/_objs/example_parsing_ops/example_parsing_ops.pic.o)
Execution platform: @local_execution_config_platform//:platform
[12,096 / 19,390] 5 actions running
    Compiling tensorflow/core/kernels/list_kernels.cu.cc; 25s local
    Compiling tensorflow/core/kernels/dynamic_stitch_op_gpu.cu.cc; 11s local
    Compiling .../kernels/strided_slice_op_gpu_number_types.cu.cc; 8s local
    Compiling tensorflow/core/kernels/linalg/matrix_set_diag_op.cc; 3s local
    Compiling tensorflow/core/kernels/list_kernels.cc; 1s local

In file included from ./tensorflow/core/framework/op_kernel.h:35,
                 from ./tensorflow/core/framework/numeric_op.h:19,
                 from tensorflow/core/kernels/example_parsing_ops.cc:27:
tensorflow/core/kernels/example_parsing_ops.cc: In member function ‘virtual void tensorflow::DecodeJSONExampleOp::Compute(tensorflow::OpKernelContext*)’:
tensorflow/core/kernels/example_parsing_ops.cc:1221:57: error: ‘class google::protobuf::util::status_internal::Status’ has no member named ‘error_message’; did you mean ‘error_message_’?
 1221 |                                           string(status.error_message())));
      |                                                         ^~~~~~~~~~~~~
./tensorflow/core/framework/op_requires.h:45:46: note: in definition of macro ‘OP_REQUIRES’
   45 |       (CTX)->CtxFailure(__FILE__, __LINE__, (STATUS));    \
      |                                              ^~~~~~
tensorflow/core/kernels/example_parsing_ops.cc:1221:57: error: ‘std::string google::protobuf::util::status_internal::Status::error_message_’ is private within this context
 1221 |                                           string(status.error_message())));
      |                                                         ^~~~~~~~~~~~~
./tensorflow/core/framework/op_requires.h:45:46: note: in definition of macro ‘OP_REQUIRES’
   45 |       (CTX)->CtxFailure(__FILE__, __LINE__, (STATUS));    \
      |                                              ^~~~~~
In file included from /usr/include/google/protobuf/stubs/logging.h:36,
                 from /usr/include/google/protobuf/io/coded_stream.h:150,
                 from bazel-out/k8-opt/bin/tensorflow/core/protobuf/error_codes.pb.h:23,
                 from ./tensorflow/core/platform/status.h:30,
                 from ./tensorflow/core/lib/core/status.h:19,
                 from ./tensorflow/core/lib/monitoring/counter.h:37,
                 from ./tensorflow/core/framework/metrics.h:19,
                 from ./tensorflow/core/common_runtime/metrics.h:22,
                 from tensorflow/core/kernels/example_parsing_ops.cc:23:
/usr/include/google/protobuf/stubs/status.h:97:15: note: declared private here
   97 |   std::string error_message_;
      |               ^~~~~~~~~~~~~~
[12,096 / 19,390] 5 actions running
    Compiling tensorflow/core/kernels/list_kernels.cu.cc; 25s local
    Compiling tensorflow/core/kernels/dynamic_stitch_op_gpu.cu.cc; 11s local
    Compiling .../kernels/strided_slice_op_gpu_number_types.cu.cc; 8s local
    Compiling tensorflow/core/kernels/linalg/matrix_set_diag_op.cc; 3s local
    Compiling tensorflow/core/kernels/list_kernels.cc; 1s local

INFO: Elapsed time: 2826.482s, Critical Path: 141.00s
[12,101 / 19,390] checking cached actions

INFO: 12101 processes: 5934 internal, 6167 local.
[12,101 / 19,390] checking cached actions

FAILED: Build did NOT complete successfully

FAILED: Build did NOT complete successfully
 * ERROR: sci-libs/tensorflow-2.5.0::gentoo failed (compile phase):
 *   ebazel failed
 * 
 * Call stack:
 *     ebuild.sh, line  127:  Called src_compile
 *   environment, line 4158:  Called ebazel 'build' '//tensorflow:libtensorflow_framework.so' '//tensorflow:libtensorflow.so'
 *   environment, line 2510:  Called die
 * The specific snippet of code:
 *       "${@}" || die "ebazel failed"
 * 
 * If you need support, post the output of `emerge --info '=sci-libs/tensorflow-2.5.0::gentoo'`,
 * the complete build log and the output of `emerge -pqv '=sci-libs/tensorflow-2.5.0::gentoo'`.
 * The complete build log is located at '/var/log/portage/sci-libs:tensorflow-2.5.0:20210706-090837.log'.
 * For convenience, a symlink to the build log is located at '/var/tmp/portage/sci-libs/tensorflow-2.5.0/temp/build.log'.
 * The ebuild environment file is located at '/var/tmp/portage/sci-libs/tensorflow-2.5.0/temp/environment'.
 * Working directory: '/var/tmp/portage/sci-libs/tensorflow-2.5.0/work/tensorflow-2.5.0-python3_8'
 * S: '/var/tmp/portage/sci-libs/tensorflow-2.5.0/work/tensorflow-2.5.0'

Any ideas how to make Tensorflow compile again are welcome :)

Cheers,
Bjoern
Comment 1 Bjoern Olausson 2021-07-06 10:29:14 UTC
Created attachment 722341 [details]
emerge --info
Comment 2 Bjoern Olausson 2021-07-06 10:30:00 UTC
Created attachment 722344 [details]
emerge -ept sci-libs/tensorflow
Comment 3 Arfrever Frehtes Taifersar Arahesis 2021-07-06 10:59:23 UTC
This has been the case since ProtoBuf 3.16.0:

https://github.com/protocolbuffers/protobuf/pull/8354
https://github.com/protocolbuffers/protobuf/commit/9ad97629be72eeecf8bc9fe8145e55ceaeab6b78#diff-26f14c21bd27b6500347fdacdeea49b8bccde636aab2ecae545515e76a5a48bdL96-L98

As can be seen just below the deleted function in that diff, the solution is to use message() instead of error_message(). (Both of them were defined identically.)
Comment 4 Bjoern Olausson 2021-07-06 13:12:04 UTC
(In reply to Arfrever Frehtes Taifersar Arahesis from comment #3)
> This is already since ProtoBuf 3.16.0:
> 
> https://github.com/protocolbuffers/protobuf/pull/8354
> https://github.com/protocolbuffers/protobuf/commit/
> 9ad97629be72eeecf8bc9fe8145e55ceaeab6b78#diff-
> 26f14c21bd27b6500347fdacdeea49b8bccde636aab2ecae545515e76a5a48bdL96-L98
> 
> As seen below this deleted function, the solution is to use message()
> instead of error_message(). (Both of them were defined identically.)

Thanks for that hint!

I masked >=dev-libs/protobuf-3.16.0 & >=dev-python/protobuf-python-3.16.0 and the compiler now seems to be beyond the point I reported above; it has been running for ~2 hours, compared to ~30 minutes before masking protobuf.
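(A minimal sketch of the corresponding mask entries, assuming the standard /etc/portage/package.mask location:)

```
# /etc/portage/package.mask
>=dev-libs/protobuf-3.16.0
>=dev-python/protobuf-python-3.16.0
```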

The most straightforward way to fix this would be to require dev-libs/protobuf-3.15.8 & dev-python/protobuf-python-3.15.8 in the Tensorflow-2.5 ebuild - am I really the first one to trip over the version incompatibility of protobuf >=3.16.0 with Tensorflow on Gentoo?

If not addressed upstream, would a patch to address the root cause, as mentioned by Arfrever Frehtes Taifersar Arahesis, be feasible?

I'll post an update once Tensorflow has been compiled entirely to confirm the above.

Cheers,
Bjoern
Comment 5 Bjoern Olausson 2021-07-06 20:15:10 UTC
After 5 hours, 32 minutes and 19 seconds Tensorflow 2.5 compiled successfully and is working as expected.

The solution to mask >=dev-libs/protobuf-3.16.0 & >=dev-python/protobuf-python-3.16.0 worked for me.

Cheers,
Bjoern
Comment 6 Bjoern Olausson 2021-07-06 21:44:38 UTC
Okay, maybe downgrading protobuf is not the way to go, or maybe it is simply a bug in tensorflow (https://github.com/tensorflow/tensorflow/issues/50545):

    model = keras.Sequential(
  File "/usr/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 522, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/keras/engine/sequential.py", line 114, in __init__
    super(functional.Functional, self).__init__(  # pylint: disable=bad-super-call
  File "/usr/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 522, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 318, in __init__
    self._init_batch_counters()
  File "/usr/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 522, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 326, in _init_batch_counters
    self._train_counter = variables.Variable(0, dtype='int64', aggregation=agg)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 262, in __call__
    return cls._variable_v2_call(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 244, in _variable_v2_call
    return previous_getter(
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 237, in <lambda>
    previous_getter = lambda **kws: default_variable_creator_v2(None, **kws)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/variable_scope.py", line 2662, in default_variable_creator_v2
    return resource_variable_ops.ResourceVariable(
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 264, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1584, in __init__
    self._init_from_args(
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1738, in _init_from_args
    handle = eager_safe_variable_handle(
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 237, in eager_safe_variable_handle
    return _variable_handle_from_shape_and_dtype(shape, dtype, shared_name, name,
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 177, in _variable_handle_from_shape_and_dtype
    cpp_shape_inference_pb2.CppShapeInferenceResult.HandleShapeAndType(
TypeError: Parameter to MergeFrom() must be instance of same class: expected tensorflow.TensorShapeProto got tensorflow.TensorShapeProto.
Comment 7 ykui 2021-07-07 13:52:03 UTC
I had the same experience (failed with later protobuf, masked it,
thought it worked with earlier -- ran into the runtime error above
when running a test-suite).

I was able to get past the error_message-member issue with the patch
suggested here https://bugs.gentoo.org/800824#c3 
```
--- a/tensorflow/core/kernels/example_parsing_ops.cc	2021-07-07 11:12:34.110293208 +0200
+++ b/tensorflow/core/kernels/example_parsing_ops.cc	2021-07-07 11:13:04.013291922 +0200
@@ -1218,7 +1218,7 @@
           resolver_.get(), "type.googleapis.com/tensorflow.Example", &in, &out);
       OP_REQUIRES(ctx, status.ok(),
                   errors::InvalidArgument("Error while parsing JSON: ",
-                                          string(status.error_message())));
+                                          string(status.message())));
     }
   }

```

but I am running into some later build issues (portions of the error messages/log are excerpted below)

```
ERROR: /var/tmp/portage/sci-libs/tensorflow-2.5.0/work/tensorflow-2.5.0-python3_8/tensorflow/core/kernels/BUILD:5337:18: C++ compilation of rule '//tensorflow/core/kernels:multinomial_op_gpu' failed (Exit 2): crosstool_wrapper_driver_is_not_gcc failed: error executing command 
  (cd /var/tmp/portage/sci-libs/tensorflow-2.5.0/work/tensorflow-2.5.0-python3_8-bazel-base/execroot/org_tensorflow && \
...
external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorMap.h(318): error: unrecognized token
external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorMap.h(318): error: expected a ","

2 errors detected in the compilation of "tensorflow/core/kernels/multinomial_op_gpu.cu.cc".
```
Comment 8 Bjoern Olausson 2021-07-07 18:11:18 UTC
Created attachment 722599 [details]
sci-libs:tensorflow-2.5.0:20210707-090010.log.bz2

Same partial success here... after hours of compile time...

ebuild /usr/portage/sci-libs/tensorflow/tensorflow-2.5.0.ebuild unpack

sed -e 's|status.error_message|status.message|g' -i /dev/shm/portage/sci-libs/tensorflow-2.5.0/work/tensorflow-2.5.0/tensorflow/core/kernels/example_parsing_ops.cc

ebuild /usr/portage/sci-libs/tensorflow/tensorflow-2.5.0.ebuild compile

[...]
[23,726 / 25,196] Compiling tensorflow/compiler/tf2xla/kernels/stateless_random_ops.cc [for host]; 15s local ... (12 actions, 11 running)
[26,009 / 27,736] Compiling tensorflow/core/kernels/unique_op_gpu.cu.cc; 86s local ... (12 actions, 11 running)
[27,937 / 29,220] Compiling tensorflow/compiler/xla/service/spmd/spmd_partitioner.cc; 22s local ... (12 actions, 11 running)
ERROR: /dev/shm/portage/sci-libs/tensorflow-2.5.0/work/tensorflow-2.5.0-python3_8-bazel-base/external/nccl_archive/BUILD.bazel:54:17: C++ compilation of rule '@nccl_archive//:device_lib' failed (Exit 6): crosstool_wrapper_driver_is_not_gcc failed: error executing command 
  (cd /dev/shm/portage/sci-libs/tensorflow-2.5.0/work/tensorflow-2.5.0-python3_8-bazel-base/execroot/org_tensorflow && \
  exec env - \
    CUDA_TOOLKIT_PATH=/opt/cuda \
    GCC_HOST_COMPILER_PATH=/usr/x86_64-pc-linux-gnu/gcc-bin/10.2.0/x86_64-pc-linux-gnu-gcc \
    HOME=/dev/shm/portage/sci-libs/tensorflow-2.5.0/homedir \
    KERAS_HOME=/dev/shm/portage/sci-libs/tensorflow-2.5.0/temp/.keras \
    PATH=/dev/shm/portage/sci-libs/tensorflow-2.5.0/temp/python3.8/bin:/dev/shm/portage/sci-libs/tensorflow-2.5.0/temp/python3.8/bin:/usr/lib/portage/python3.9/ebuild-helpers/xattr:/usr/lib/portage/python3.9/ebuild-helpers:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin:/usr/lib/llvm/12/bin:/usr/lib/llvm/11/bin:/opt/cuda/bin \
    PWD=/proc/self/cwd \
    PYTHON_BIN_PATH=/usr/bin/python3.8 \
    PYTHON_LIB_PATH=/usr/lib/python3.8/site-packages \
    TF2_BEHAVIOR=1 \
    TF_CUDA_COMPUTE_CAPABILITIES=6.1 \
    TF_CUDA_PATHS=/opt/cuda \
    TF_CUDA_VERSION=11.1 \
    TF_CUDNN_VERSION=8.0 \
    TF_SYSTEM_LIBS=absl_py,astor_archive,astunparse_archive,boringssl,com_github_googlecloudplatform_google_cloud_cpp,com_github_grpc_grpc,com_google_protobuf,curl,cython,dill_archive,double_conversion,enum34_archive,flatbuffers,functools32_archive,gast_archive,gif,hwloc,icu,jsoncpp_git,libjpeg_turbo,lmdb,nasm,nsync,opt_einsum_archive,org_sqlite,pasta,pcre,png,pybind11,six_archive,snappy,tblib_archive,termcolor_archive,typing_extensions_archive,wrapt,zlib \
  external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -MD -MF bazel-out/k8-opt/bin/external/nccl_archive/_objs/device_lib/max_f32_reduce.cu.d '-frandom-seed=bazel-out/k8-opt/bin/external/nccl_archive/_objs/device_lib/max_f32_reduce.cu.o' -iquote external/nccl_archive -iquote bazel-out/k8-opt/bin/external/nccl_archive -iquote external/local_config_cuda -iquote bazel-out/k8-opt/bin/external/local_config_cuda -Ibazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/device_hdrs -Ibazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs -Ibazel-out/k8-opt/bin/external/local_config_cuda/cuda/_virtual_includes/cuda_headers_virtual -Ibazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/src_hdrs -isystem external/local_config_cuda/cuda -isystem bazel-out/k8-opt/bin/external/local_config_cuda/cuda -isystem external/local_config_cuda/cuda/cuda/include -isystem bazel-out/k8-opt/bin/external/local_config_cuda/cuda/cuda/include -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -fPIE -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -Wall -fno-omit-frame-pointer -no-canonical-prefixes -fno-canonical-system-headers -DNDEBUG -g0 -O2 -ffunction-sections -fdata-sections -w -DAUTOLOAD_DYNAMIC_KERNELS -I/usr/include/jsoncpp '-std=c++14' '-march=native' -O2 -pipe -msse -msse2 -msse3 -msse4.1 -msse4.2 -mavx -mavx2 -mfma -x cuda '-DGOOGLE_CUDA=1' '-Xcuda-fatbinary=--compress-all' '--no-cuda-include-ptx=all' '--cuda-include-ptx=sm_61' '--cuda-gpu-arch=sm_61' -nvcc_options 'relocatable-device-code=true' -nvcc_options 'ptxas-options=-maxrregcount=96' -c bazel-out/k8-opt/bin/external/nccl_archive/src/collectives/device/max_f32_reduce.cu.cc -o bazel-out/k8-opt/bin/external/nccl_archive/_objs/device_lib/max_f32_reduce.cu.o)
Execution platform: @local_execution_config_platform//:platform
double free or corruption (out)
nvcc error   : 'cicc' died due to signal 6 
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 9494.469s, Critical Path: 261.97s
INFO: 24170 processes: 3290 internal, 20880 local.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
 * ERROR: sci-libs/tensorflow-2.5.0::gentoo failed (compile phase):
 *   ebazel failed
 * 
 * Call stack:
 *     ebuild.sh, line  127:  Called src_compile
 *   environment, line 4168:  Called python_foreach_impl 'run_in_build_dir' 'do_compile'
 *   environment, line 3760:  Called multibuild_foreach_variant '_python_multibuild_wrapper' 'run_in_build_dir' 'do_compile'
 *   environment, line 3236:  Called _multibuild_run '_python_multibuild_wrapper' 'run_in_build_dir' 'do_compile'
 *   environment, line 3234:  Called _python_multibuild_wrapper 'run_in_build_dir' 'do_compile'
 *   environment, line 1089:  Called run_in_build_dir 'do_compile'
 *   environment, line 4140:  Called do_compile
 *   environment, line 4164:  Called ebazel 'build' '//tensorflow/tools/pip_package:build_pip_package'
 *   environment, line 2512:  Called die
 * The specific snippet of code:
 *       "${@}" || die "ebazel failed"
 * 
 * If you need support, post the output of `emerge --info '=sci-libs/tensorflow-2.5.0::gentoo'`,
 * the complete build log and the output of `emerge -pqv '=sci-libs/tensorflow-2.5.0::gentoo'`.
 * The complete build log is located at '/var/log/portage/sci-libs:tensorflow-2.5.0:20210707-090010.log'.
 * For convenience, a symlink to the build log is located at '/dev/shm/portage/sci-libs/tensorflow-2.5.0/temp/build.log'.
 * The ebuild environment file is located at '/dev/shm/portage/sci-libs/tensorflow-2.5.0/temp/environment'.
 * Working directory: '/dev/shm/portage/sci-libs/tensorflow-2.5.0/work/tensorflow-2.5.0-python3_8'
 * S: '/dev/shm/portage/sci-libs/tensorflow-2.5.0/work/tensorflow-2.5.0'

Any ideas how to get around that?
Comment 9 Bjoern Olausson 2021-07-08 08:59:22 UTC
Okay, apparently 64 GB of RAM is not enough when using -j12 and /dev/shm to compile tensorflow.

I switched back to the default portage dirs
#PORTAGE_TMPFS="/dev/shm"
#PORTAGE_TMPDIR="/dev/shm"
#BUILD_PREFIX="/dev/shm"

and used 6 instead of 12 jobs
MAKEOPTS="-j6"

After that, the following procedure worked for me to compile sci-libs/tensorflow-2.5.0

ebuild /usr/portage/sci-libs/tensorflow/tensorflow-2.5.0.ebuild unpack

sed -e 's|status.error_message|status.message|g' -i /var/tmp/portage/sci-libs/tensorflow-2.5.0/work/tensorflow-2.5.0/tensorflow/core/kernels/example_parsing_ops.cc

MAKEOPTS="-j6" ; ebuild /usr/portage/sci-libs/tensorflow/tensorflow-2.5.0.ebuild compile

 * Package:    sci-libs/tensorflow-2.5.0
 * Repository: gentoo
 * Maintainer: perfinion@gentoo.org
 * USE:        abi_x86_64 amd64 cpu_flags_x86_avx cpu_flags_x86_avx2 cpu_flags_x86_fma3 cpu_flags_x86_sse cpu_flags_x86_sse2 cpu_flags_x86_sse3 cpu_flags_x86_sse4_1 cpu_flags_x86_sse4_2 cuda elibc_glibc kernel_linux python python_targets_python3_8 userland_GNU xla
 * FEATURES:   network-sandbox preserve-libs sandbox userpriv usersandbox
 * Checking for at least 5 GiB RAM ...
 [ ok ]
 * Checking for at least 10 GiB disk space at "/var/tmp/portage/sci-libs/tensorflow-2.5.0/temp" ...
 [ ok ]
 * Package:    sci-libs/tensorflow-2.5.0
 * Repository: gentoo
 * Maintainer: perfinion@gentoo.org
 * USE:        abi_x86_64 amd64 cpu_flags_x86_avx cpu_flags_x86_avx2 cpu_flags_x86_fma3 cpu_flags_x86_sse cpu_flags_x86_sse2 cpu_flags_x86_sse3 cpu_flags_x86_sse4_1 cpu_flags_x86_sse4_2 cuda elibc_glibc kernel_linux python python_targets_python3_8 userland_GNU xla
 * FEATURES:   network-sandbox preserve-libs sandbox userpriv usersandbox
 * TensorFlow 2.0 is a major release that contains some incompatibilities
 * with TensorFlow 1.x. For more information about migrating to TF2.0 see:
 * https://www.tensorflow.org/guide/migrate
 * python3_8: running count_impls
 * Checking for at least 5 GiB RAM ...
 [ ok ]
 * Checking for at least 16 GiB disk space at "/var/tmp/portage/sci-libs/tensorflow-2.5.0/temp" ...
 [ ok ]
>>> Unpacking source...
>>> Unpacking tensorflow-2.5.0.tar.gz to /var/tmp/portage/sci-libs/tensorflow-2.5.0/work
>>> Unpacking tensorflow-patches-2.5.0.tar.bz2 to /var/tmp/portage/sci-libs/tensorflow-2.5.0/work

[...]

[27,858 / 29,047] Compiling tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc; 17s local ... (6 actions, 5 running)
[31,211 / 31,624] Compiling tensorflow/compiler/tf2xla/kernels/lower_upper_bound_ops.cc; 9s local ... (6 actions, 5 running)
Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
  bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 12778.919s, Critical Path: 190.88s
INFO: 26436 processes: 3364 internal, 23072 local.
INFO: Build completed successfully, 26436 total actions
INFO: Build completed successfully, 26436 total actions
bazel --bazelrc=/var/tmp/portage/sci-libs/tensorflow-2.5.0/temp/bazelrc --output_base=/var/tmp/portage/sci-libs/tensorflow-2.5.0/work/tensorflow-2.5.0-bazel-base shutdown
WARNING: Running command "shutdown" in batch mode.  Batch mode is triggered
when not running Bazel within a workspace. If you intend to shutdown an
existing Bazel server, run "bazel shutdown" from the directory where
it was started.
WARNING: ignoring LD_PRELOAD in environment.
>>> Source compiled.


MAKEOPTS="-j6" ; ebuild /usr/portage/sci-libs/tensorflow/tensorflow-2.5.0.ebuild merge

[...]

>>> Completed installing sci-libs/tensorflow-2.5.0 into /var/tmp/portage/sci-libs/tensorflow-2.5.0/image                                                                          
                                                                                                                                                                                  
 * Final size of build directory: 17706292 KiB (16.8 GiB)                                                                                                                         
 * Final size of installed tree:   1888412 KiB ( 1.8 GiB)

 * QA Notice: DISTUTILS_USE_SETUPTOOLS is not used when DISTUTILS_OPTIONAL
 * is enabled.
 * Verifying compiled files in /usr/lib/python3.8/site-packages
 * 
 * QA Notice: This package seems to contain tests but they are not enabled.
 * Please either run tests (via distutils_enable_tests or declaring
 * python_test yourself), or add RESTRICT="test" along with an explanatory
 * comment if tests cannot be run.
 * 

[...]

>>> /usr/lib64/libtensorflow.so -> libtensorflow.so.2
>>> sci-libs/tensorflow-2.5.0 merged.
>>> Regenerating /etc/ld.so.cache...

emerge -1av sci-visualization/tensorboard

but still no luck using tensorflow:

model = keras.Sequential(
    [
        keras.Input(shape=(76, 36, 1)),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(10, activation="softmax"),
    ]
)


-> physical GPU (device: 0, name: NVIDIA GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
Traceback (most recent call last):
  File "./6-train-model.py", line 198, in <module>
    model = keras.Sequential(
  File "/usr/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 522, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/keras/engine/sequential.py", line 114, in __init__
    super(functional.Functional, self).__init__(  # pylint: disable=bad-super-call
  File "/usr/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 522, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 318, in __init__
    self._init_batch_counters()
  File "/usr/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 522, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 326, in _init_batch_counters
    self._train_counter = variables.Variable(0, dtype='int64', aggregation=agg)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 262, in __call__
    return cls._variable_v2_call(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 244, in _variable_v2_call
    return previous_getter(
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 237, in <lambda>
    previous_getter = lambda **kws: default_variable_creator_v2(None, **kws)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/variable_scope.py", line 2662, in default_variable_creator_v2
    return resource_variable_ops.ResourceVariable(
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/variables.py", line 264, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1584, in __init__
    self._init_from_args(
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1738, in _init_from_args
    handle = eager_safe_variable_handle(
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 237, in eager_safe_variable_handle
    return _variable_handle_from_shape_and_dtype(shape, dtype, shared_name, name,
  File "/usr/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 177, in _variable_handle_from_shape_and_dtype
    cpp_shape_inference_pb2.CppShapeInferenceResult.HandleShapeAndType(
TypeError: Parameter to MergeFrom() must be instance of same class: expected tensorflow.TensorShapeProto got tensorflow.TensorShapeProto.
Comment 10 ykui 2021-07-08 10:37:04 UTC
Yeah, I too currently get the "tensorflow.TensorShapeProto got tensorflow.TensorShapeProto" error.

At first I was not able to build against the current protobuf-3.17.3, so I downgraded to 3.15.8 and built tensorflow-2.5.0 successfully. With the older version of protobuf, however, I got the above error.

I then upgraded protobuf to 3.17.3 but without rebuilding tensorflow. I no longer got the above error, and the test-suite I was running (the object-detection one) reported no errors:
python3 object_detection/builders/model_builder_tf2_test.py

Nonetheless portage was adamant about tensorflow needing to be rebuilt due to the protobuf update. I ran into the crosstool_wrapper_driver_is_not_gcc-error while trying to build tensorflow against the now updated protobuf-3.17.3, however (as well as some sporadic other build errors that I never saw on the next rebuild).

I believe I rebuilt cudnn and grpc(io?) and nvidia-cuda-toolkit, and after that rebuilt tensorflow-2.5.0 successfully somehow. Unfortunately, after this rebuild I am where I am now and I receive the mentioned error when running the tests.
("tensorflow.TensorShapeProto got tensorflow.TensorShapeProto")

I have tried upgrading to protobuf-9999 without rebuilding tensorflow, same error. I tried rebuilding tensorflow against protobuf-9999, still same error. 

For what it is worth, I have a 56 GB tmpfs at /var/tmp/portage:
tmpfs	/var/tmp/portage	tmpfs	size=56G
and I can build tensorflow in memory with -j9 (I only have 8 threads, so any more is pointless, and even 9 is probably a stretch).
Comment 11 ykui 2021-07-08 10:46:39 UTC
Is this patch perhaps worth trying out? It looks like it is set to be included in tensorflow-2.6

https://github.com/tensorflow/tensorflow/issues/50545#issuecomment-872307752

https://github.com/tensorflow/tensorflow/commit/95abf88e4c117f8445308c3174cc42795a6694e6

I can not start another build to try it right now (probably when I go to bed).
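(For anyone who wants to try that upstream commit before it lands in an ebuild, a minimal sketch using Portage user patches - this assumes the tensorflow ebuild applies user patches via eapply_user, and the paths are only illustrative:)

```
mkdir -p /etc/portage/patches/sci-libs/tensorflow-2.5.0
cd /etc/portage/patches/sci-libs/tensorflow-2.5.0
wget https://github.com/tensorflow/tensorflow/commit/95abf88e4c117f8445308c3174cc42795a6694e6.patch
# then rebuild, e.g.: emerge -1av sci-libs/tensorflow
```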
Comment 12 ykui 2021-07-08 11:23:21 UTC
Swapping around the two import lines in
/usr/lib/python3.8/site-packages/tensorflow/python/__init__.py
that were mentioned in the GitHub comment, so that they now read

from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
from tensorflow.python.eager import context

seems to make the module/protobuf error go away.
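(A quick, hypothetical check for whether the workaround took effect - per the traceback above, merely constructing a Sequential model is enough to trigger the MergeFrom() TypeError while it is still present:)

```
python3.8 -c 'from tensorflow import keras; keras.Sequential([keras.layers.Dense(1)]); print("ok")'
```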
Comment 13 Bjoern Olausson 2021-07-08 11:27:02 UTC
(In reply to ykui from comment #11)
> Is this patch perhaps worth trying out? It looks like it is set to be
> included in tensorflow-2.6
> 
> https://github.com/tensorflow/tensorflow/issues/50545#issuecomment-872307752
> 
> https://github.com/tensorflow/tensorflow/commit/
> 95abf88e4c117f8445308c3174cc42795a6694e6
> 
> I can not start another build to try it right now (probably when I go to
> bed).

Interesting that everything seems to work when you compile against old protobuf and then simply update protobuf to the latest version without recompiling Tensorflow...

I did rebuild my entire @world tree at some point... so I don't think it is a problem of recompiling other packages against the latest protobuf version.

I applied the above-mentioned patch along with the example_parsing_ops.cc fix (sed line from previous comments).

TensorFlow 2.5 is compiling now. In somewhat less than 6 hours we will know whether this patch fixed the "expected tensorflow.TensorShapeProto got tensorflow.TensorShapeProto." issue.
Comment 14 Bjoern Olausson 2021-07-08 11:28:50 UTC
(In reply to ykui from comment #12)
> It seems swapping around the two import lines in 
> /usr/lib/python3.8/site-packages/tensorflow/python/__init__.py
> that were mentioned in the github-comment so that they now say
> 
> from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
> from tensorflow.python.eager import context
> 
> seems to make the module/protobuf-error go away.

I can confirm that this works!

[...]

-> physical GPU (device: 0, name: NVIDIA GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
WARNING:tensorflow:Please add `keras.layers.InputLayer` instead of `keras.Input` to Sequential model. `keras.Input` is intended to be used by Functional model.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 74, 34, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 37, 17, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 35, 15, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 17, 7, 64)         0         
_________________________________________________________________
flatten (Flatten)            (None, 7616)              0         
_________________________________________________________________
dropout (Dropout)            (None, 7616)              0         
_________________________________________________________________
dense (Dense)                (None, 10)                76170     
=================================================================
Total params: 94,986
Trainable params: 94,986
Non-trainable params: 0
_________________________________________________________________
2021-07-08 13:26:47.895867: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-07-08 13:26:47.896111: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3699850000 Hz
Epoch 1/50
2021-07-08 13:26:48.188259: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-07-08 13:26:48.740788: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8005
2021-07-08 13:26:50.748375: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-07-08 13:26:50.987907: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
654/654 [==============================] - 14s 15ms/step - loss: 5.4520 - accuracy: 0.9454 - val_loss: 12.9417 - val_accuracy: 0.1317
Epoch 2/50
654/654 [==============================] - 10s 15ms/step - loss: 0.1530 - accuracy: 0.9742 - val_loss: 23.6971 - val_accuracy: 0.1317
Epoch 3/50
654/654 [==============================] - 10s 15ms/step - loss: 0.4149 - accuracy: 0.9710 - val_loss: 52.9526 - val_accuracy: 0.1317
Epoch 4/50
654/654 [==============================] - 10s 15ms/step - loss: 0.4733 - accuracy: 0.9800 - val_loss: 102.9791 - val_accuracy: 0.1317
Epoch 5/50
654/654 [==============================] - 10s 15ms/step - loss: 0.1687 - accuracy: 0.9892 - val_loss: 82.1729 - val_accuracy: 0.1317
Epoch 6/50
654/654 [==============================] - 10s 15ms/step - loss: 0.3617 - accuracy: 0.9923 - val_loss: 154.5778 - val_accuracy: 0.1317
Epoch 7/50
654/654 [==============================] - 10s 15ms/step - loss: 0.0751 - accuracy: 0.9945 - val_loss: 84.9845 - val_accuracy: 0.1317
Epoch 8/50
617/654 [===========================>..] - ETA: 0s - loss: 0.0236 - accuracy: 0.9973
Comment 15 ykui 2021-07-08 12:31:28 UTC
I guess it is a different bug strictly speaking, but is tensorflow actually compatible with numpy-1.2{0,1}.x?

I get this error 

NotImplementedError: Cannot convert a symbolic Tensor (cond_2/strided_slice:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported

when running a model -- which from searching around has been suggested to be a numpy incompatibility. 
https://stackoverflow.com/questions/66207609/notimplementederror-cannot-convert-a-symbolic-tensor-lstm-2-strided-slice0-t/66207610

Portage does not have numpy-1.19 or earlier in the tree, and if I install an earlier numpy-version like numpy-1.18 with pip, I get this error:

RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
I do not believe I can rebuild tensorflow against pip-numpy.
Comment 16 Bjoern Olausson 2021-07-08 17:13:50 UTC
(In reply to Bjoern Olausson from comment #13)
> (In reply to ykui from comment #11)
> > Is this patch perhaps worth trying out? It looks like it is set to be
> > included in tensorflow-2.6
> > 
> > https://github.com/tensorflow/tensorflow/issues/50545#issuecomment-872307752
> > 
> > https://github.com/tensorflow/tensorflow/commit/
> > 95abf88e4c117f8445308c3174cc42795a6694e6
> > 
> > I can not start another build to try it right now (probably when I go to
> > bed).
> 
> Interesting that everything seems to work when you compile against old
> protobuf and then simply update protobuf to the latest version without
> recompiling Tensorflow...
> 
> I did rebuild my entire @world tree at some point... so I don't think it is
> a problem of recompiling other packages against the latest protobuf version.
> 
> I applied the above mentioned patch alongside with the example_parsing_ops
> .cc fix (sed line from previous comments).
> 
> TensorFlow 2.5 is compiling now. In somewhat less than 6h we know if this
> patch fixed the "expected tensorflow.TensorShapeProto got
> tensorflow.TensorShapeProto." issue.

The "patch" does not fix the "TypeError: Parameter to MergeFrom() must be instance of same class: expected tensorflow.TensorShapeProto got tensorflow.TensorShapeProto.". I still have to swap the lines in /usr/lib/python3.8/site-packages/tensorflow/python/__init__.py
Comment 17 picnic.sun 2021-07-12 04:02:04 UTC
ERROR: /var/tmp/portage/sci-libs/tensorflow-2.5.0/work/tensorflow-2.5.0-python3_8/tensorflow/core/kernels/BUILD:4393:18: C++ compilation of rule '//tensorflow/core/kernels:example_parsing_ops' failed (Exit 1): gcc failed: error executing command 
  (cd /var/tmp/portage/sci-libs/tensorflow-2.5.0/work/tensorflow-2.5.0-python3_8-bazel-base/execroot/org_tensorflow && \
  exec env - \
    HOME=/var/tmp/portage/sci-libs/tensorflow-2.5.0/homedir \
    KERAS_HOME=/var/tmp/portage/sci-libs/tensorflow-2.5.0/temp/.keras \
    PATH=/var/tmp/portage/sci-libs/tensorflow-2.5.0/temp/python3.8/bin:/var/tmp/portage/._portage_reinstall_.gcevqsmz/bin/ebuild-helpers/xattr:/usr/lib/portage/python3.9/ebuild-helpers/xattr:/var/tmp/portage/._portage_reinstall_.gcevqsmz/bin/ebuild-helpers:/usr/lib/portage/python3.9/ebuild-helpers:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin:/usr/lib/llvm/12/bin:/usr/lib/llvm/11/bin \
    PWD=/proc/self/cwd \
    PYTHON_BIN_PATH=/usr/bin/python3.8 \
    PYTHON_LIB_PATH=/usr/lib/python3.8/site-packages \
    TF2_BEHAVIOR=1 \
    TF_SYSTEM_LIBS=absl_py,astor_archive,astunparse_archive,boringssl,com_github_googlecloudplatform_google_cloud_cpp,com_github_grpc_grpc,com_google_protobuf,curl,cython,dill_archive,double_conversion,enum34_archive,flatbuffers,functools32_archive,gast_archive,gif,hwloc,icu,jsoncpp_git,libjpeg_turbo,lmdb,nasm,nsync,opt_einsum_archive,org_sqlite,pasta,pcre,png,pybind11,six_archive,snappy,tblib_archive,termcolor_archive,typing_extensions_archive,wrapt,zlib \
  /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections -fdata-sections '-std=c++0x' -MD -MF bazel-out/k8-opt/bin/tensorflow/core/kernels/_objs/example_parsing_ops/example_parsing_ops.pic.d '-frandom-seed=bazel-out/k8-opt/bin/tensorflow/core/kernels/_objs/example_parsing_ops/example_parsing_ops.pic.o' -fPIC -DTF_USE_SNAPPY -DEIGEN_MPL2_ONLY '-DEIGEN_MAX_ALIGN_BYTES=64' -iquote . -iquote bazel-out/k8-opt/bin -iquote external/com_google_absl -iquote bazel-out/k8-opt/bin/external/com_google_absl -iquote external/nsync -iquote bazel-out/k8-opt/bin/external/nsync -iquote external/eigen_archive -iquote bazel-out/k8-opt/bin/external/eigen_archive -iquote external/gif -iquote bazel-out/k8-opt/bin/external/gif -iquote external/libjpeg_turbo -iquote bazel-out/k8-opt/bin/external/libjpeg_turbo -iquote external/com_google_protobuf -iquote bazel-out/k8-opt/bin/external/com_google_protobuf -iquote external/com_googlesource_code_re2 -iquote bazel-out/k8-opt/bin/external/com_googlesource_code_re2 -iquote external/farmhash_archive -iquote bazel-out/k8-opt/bin/external/farmhash_archive -iquote external/fft2d -iquote bazel-out/k8-opt/bin/external/fft2d -iquote external/highwayhash -iquote bazel-out/k8-opt/bin/external/highwayhash -iquote external/zlib -iquote bazel-out/k8-opt/bin/external/zlib -iquote external/double_conversion -iquote bazel-out/k8-opt/bin/external/double_conversion -iquote external/snappy -iquote bazel-out/k8-opt/bin/external/snappy -iquote external/curl -iquote bazel-out/k8-opt/bin/external/curl -iquote external/boringssl -iquote bazel-out/k8-opt/bin/external/boringssl -iquote external/jsoncpp_git -iquote bazel-out/k8-opt/bin/external/jsoncpp_git -isystem third_party/eigen3/mkl_include -isystem bazel-out/k8-opt/bin/third_party/eigen3/mkl_include -isystem external/eigen_archive -isystem bazel-out/k8-opt/bin/external/eigen_archive -isystem external/farmhash_archive/src -isystem bazel-out/k8-opt/bin/external/farmhash_archive/src -w -DAUTOLOAD_DYNAMIC_KERNELS -I/usr/include/jsoncpp '-std=c++14' '-mtune=haswell' -O2 -pipe -msse -msse2 -msse3 -msse4.1 -msse4.2 -DEIGEN_AVOID_STL_ARRAY -Iexternal/gemmlowp -Wno-sign-compare '-ftemplate-depth=900' -fno-exceptions -DINTEL_MKL -msse3 -pthread '-DINTEL_MKL=1' -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c tensorflow/core/kernels/example_parsing_ops.cc -o bazel-out/k8-opt/bin/tensorflow/core/kernels/_objs/example_parsing_ops/example_parsing_ops.pic.o)
Execution platform: @local_execution_config_platform//:platform
In file included from ./tensorflow/core/framework/op_kernel.h:35,
                 from ./tensorflow/core/framework/numeric_op.h:19,
                 from tensorflow/core/kernels/example_parsing_ops.cc:27:
tensorflow/core/kernels/example_parsing_ops.cc: In member function 'virtual void tensorflow::DecodeJSONExampleOp::Compute(tensorflow::OpKernelContext*)':
tensorflow/core/kernels/example_parsing_ops.cc:1221:57: error: 'class google::protobuf::util::status_internal::Status' has no member named 'error_message'; did you mean 'error_message_'?
 1221 |                                           string(status.error_message())));
      |                                                         ^~~~~~~~~~~~~
./tensorflow/core/framework/op_requires.h:45:46: note: in definition of macro 'OP_REQUIRES'
   45 |       (CTX)->CtxFailure(__FILE__, __LINE__, (STATUS));    \
      |                                              ^~~~~~
tensorflow/core/kernels/example_parsing_ops.cc:1221:57: error: 'std::string google::protobuf::util::status_internal::Status::error_message_' is private within this context
 1221 |                                           string(status.error_message())));
      |                                                         ^~~~~~~~~~~~~
./tensorflow/core/framework/op_requires.h:45:46: note: in definition of macro 'OP_REQUIRES'
   45 |       (CTX)->CtxFailure(__FILE__, __LINE__, (STATUS));    \
      |                                              ^~~~~~
In file included from /usr/include/google/protobuf/stubs/logging.h:36,
                 from /usr/include/google/protobuf/io/coded_stream.h:150,
                 from bazel-out/k8-opt/bin/tensorflow/core/protobuf/error_codes.pb.h:23,
                 from ./tensorflow/core/platform/status.h:30,
                 from ./tensorflow/core/lib/core/status.h:19,
                 from ./tensorflow/core/lib/monitoring/counter.h:37,
                 from ./tensorflow/core/framework/metrics.h:19,
                 from ./tensorflow/core/common_runtime/metrics.h:22,
                 from tensorflow/core/kernels/example_parsing_ops.cc:23:
/usr/include/google/protobuf/stubs/status.h:97:15: note: declared private here
   97 |   std::string error_message_;
      |               ^~~~~~~~~~~~~~
INFO: Elapsed time: 12507.846s, Critical Path: 275.08s
INFO: 4726 processes: 346 internal, 4380 local.
FAILED: Build did NOT complete successfully
Comment 18 Bjoern Olausson 2021-07-14 19:13:01 UTC
Created attachment 723883 [details]
tensorflow-2.5.0-r1.ebuild

Until the next release, I created an ebuild (tensorflow-2.5.0-r1.ebuild) plus a new patch (StatusMessage_TypeError.patch) to address this bug.

Cheers,
Bjoern
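(A minimal sketch of how such an attached ebuild can be tested from a local overlay - the overlay name and paths are illustrative, and it assumes the ebuild picks the patch up from files/:)

```
# assuming a local overlay named "local" is configured in /etc/portage/repos.conf
mkdir -p /var/db/repos/local/sci-libs/tensorflow/files
cp tensorflow-2.5.0-r1.ebuild /var/db/repos/local/sci-libs/tensorflow/
cp StatusMessage_TypeError.patch /var/db/repos/local/sci-libs/tensorflow/files/
ebuild /var/db/repos/local/sci-libs/tensorflow/tensorflow-2.5.0-r1.ebuild manifest
emerge -1av sci-libs/tensorflow::local
```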
Comment 19 Bjoern Olausson 2021-07-14 19:13:55 UTC
Created attachment 723886 [details, diff]
StatusMessage_TypeError.patch

Patch required for tensorflow-2.5.0-r1.ebuild
Comment 20 Toralf Förster gentoo-dev 2021-07-26 19:48:34 UTC
*** Bug 802660 has been marked as a duplicate of this bug. ***
Comment 21 Toralf Förster gentoo-dev 2021-07-26 19:48:48 UTC
*** Bug 804564 has been marked as a duplicate of this bug. ***
Comment 22 Matthew Smith gentoo-dev 2021-07-31 17:02:38 UTC
*** Bug 805305 has been marked as a duplicate of this bug. ***
Comment 23 Larry the Git Cow gentoo-dev 2021-08-01 13:19:43 UTC
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=82a04476b0dd7ea1628495350b035908e72c1d94

commit 82a04476b0dd7ea1628495350b035908e72c1d94
Author:     Jason Zaman <perfinion@gentoo.org>
AuthorDate: 2021-08-01 13:13:54 +0000
Commit:     Jason Zaman <perfinion@gentoo.org>
CommitDate: 2021-08-01 13:19:12 +0000

    sci-libs/tensorflow: Add python3_9 and build against proto-3.16
    
    Protobuf 3.16 changed the status API in
    https://github.com/protocolbuffers/protobuf/commit/59ea5c8f19de47dc15cbce2e2e97d9de01d50fb9
    so must be patched. All deps now support python3_9 as well so enable
    support in TF
    
    Closes: https://bugs.gentoo.org/800824
    Closes: https://bugs.gentoo.org/802732
    Package-Manager: Portage-3.0.20, Repoman-3.0.2
    Signed-off-by: Jason Zaman <perfinion@gentoo.org>

 sci-libs/tensorflow/Manifest                   |   1 +
 sci-libs/tensorflow/tensorflow-2.5.0-r1.ebuild | 410 +++++++++++++++++++++++++
 2 files changed, 411 insertions(+)