TensorFlow: Could not identify NUMA node of platform GPU

hrirmatl  posted on 2023-01-13  in  Other

I am trying to start TensorFlow on my machine, but I always get the "Could not identify NUMA node" error message.
I am using a Conda environment with:

  • tensorflow-gpu 1.12.0
  • cudatoolkit 9.0
  • cudnn 7.1.2
  • nvidia-smi reports: Driver Version 418.43, CUDA Version 10.1

Here is the error output:

>>> import tensorflow as tf
>>> tf.Session()
2019-04-04 09:56:59.851321: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-04-04 09:56:59.950066: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:950] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2019-04-04 09:56:59.950762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce GTX 750 Ti major: 5 minor: 0 memoryClockRate(GHz): 1.0845
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.84GiB
2019-04-04 09:56:59.950794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-04-04 09:59:45.338767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-04 09:59:45.338799: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-04-04 09:59:45.338810: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-04-04 09:59:45.339017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1193] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Unfortunately, I do not know what to do about this error.
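
For context, the path named in the log is a Linux sysfs file that holds a single integer. In the log above, the NUMA messages are followed by normal device setup ("defaulting to 0"), and the process only terminates at std::bad_alloc. A minimal sketch of reading that file by hand, assuming a Linux host: read_numa_node and effective_numa_node are hypothetical helper names, not TensorFlow APIs, and treating a negative value the same as a missing file is an assumption modeled on the "defaulting to 0" message.

```python
from pathlib import Path
from typing import Optional

# Standard Linux sysfs location, as it appears in the log above.
SYSFS_PCI_DEVICES = Path("/sys/bus/pci/devices")


def read_numa_node(pci_bus_id: str, root: Path = SYSFS_PCI_DEVICES) -> Optional[int]:
    """Return the integer stored in <root>/<pci_bus_id>/numa_node.

    Returns None if the file is missing, unreadable, or not an integer.
    """
    try:
        return int((root / pci_bus_id / "numa_node").read_text().strip())
    except (OSError, ValueError):
        return None


def effective_numa_node(raw: Optional[int]) -> int:
    """Assumed fallback: a missing or negative value is treated as node 0."""
    return raw if raw is not None and raw >= 0 else 0


# PCI bus ID taken from the log above; adjust for your own device.
raw = read_numa_node("0000:01:00.0")
print("raw value:", raw, "-> effective node:", effective_numa_node(raw))
```

If this prints a raw value of None or -1, the machine simply exposes no NUMA information for the GPU, which matches the messages in the log.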

c3frrgcw  #1

I was able to fix it by creating a new conda environment:

conda create --name tf python=3
conda activate tf
conda install cudatoolkit=9.0 tensorflow-gpu=1.11.0

Here is a table of compatible CUDA/TF combinations. In my case, the combination of cudatoolkit=9.0 and tensorflow-gpu=1.12 inexplicably led to the std::bad_alloc error, whereas cudatoolkit=9.0 with tensorflow-gpu=1.11.0 works fine.
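
To keep track of which combinations have been tried, the two outcomes reported here can be recorded in a small lookup. This is only a sketch: REPORTED, normalize, and reported_outcome are hypothetical names, and the table contains nothing beyond what this thread states, so it is not a substitute for the official compatibility table.

```python
# Outcomes reported in this thread only; not an official compatibility table.
REPORTED = {
    ("9.0", "1.11.0"): "works",
    ("9.0", "1.12.0"): "std::bad_alloc",
}


def normalize(version: str) -> str:
    """Pad a version such as '1.12' to three components ('1.12.0')."""
    parts = version.strip().split(".")
    parts += ["0"] * (3 - len(parts))
    return ".".join(parts)


def reported_outcome(cudatoolkit: str, tensorflow_gpu: str) -> str:
    """Look up what this thread reports for a cudatoolkit/tensorflow-gpu pair."""
    key = (cudatoolkit.strip(), normalize(tensorflow_gpu))
    return REPORTED.get(key, "not reported in this thread")


print(reported_outcome("9.0", "1.12"))
print(reported_outcome("9.0", "1.11.0"))
```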

idfiyjo8  #2

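I ran into the same problem and eventually found that it was caused by using Adam to optimize the model. Once you switch to a different optimizer, it should work.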

omtl5h9j  #3

If you hit this error on a Mac and the error message contains the line "Metal device set to: Apple M1" (or any other Apple chip), uninstalling tensorflow-metal will resolve the error:
pip uninstall tensorflow-metal
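
Before uninstalling, it may help to confirm that both conditions actually hold. The following is a minimal sketch, assuming Python 3.8 or newer for importlib.metadata; log_mentions_metal and is_installed are hypothetical helper names, and the marker string is the one quoted above.

```python
from importlib import metadata  # available in Python 3.8+

# Marker line quoted in the answer above.
METAL_MARKER = "Metal device set to:"


def log_mentions_metal(log_text: str) -> bool:
    """Return True if the captured log contains the Metal device line."""
    return METAL_MARKER in log_text


def is_installed(dist_name: str) -> bool:
    """Return True if a distribution with this name is installed in the current environment."""
    try:
        metadata.version(dist_name)
        return True
    except metadata.PackageNotFoundError:
        return False


print("tensorflow-metal installed:", is_installed("tensorflow-metal"))
```

If the package is reported as installed and the saved log contains the marker line, the pip command above is the fix this answer suggests.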
