TensorFlow error on TPU v3-8 with XLA: RPC failed with status "Unavailable: Socket closed"

t98cgbkg · posted 4 months ago in Other

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): I am using the RoBERTa PyTorch/XLA example from https://cloud.google.com/tpu/docs/tutorials/roberta-pytorch.
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): debian-9-torch-xla-v20201225 (GCP image)
  • GCP machine type: custom, 8 vCPUs, 256 GB RAM
  • TensorFlow installed from (source or binary): provided on the GCP image
  • TensorFlow version (use command below): torch-xla-1.7
  • Python version: Python 3.6.10 :: Anaconda, Inc.
  • Bazel version (if compiling from source): N/A
  • GCC/compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: TPU v3-8

Describe the current behavior

After training with PyTorch/XLA for some time, the following error appeared:

2020-12-28 01:56:05.252085: W    1417 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.251970000","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC

*** Begin stack trace ***
        tensorflow::CurrentStackTrace()
        xla::XrtComputationClient::ReleaseHandles(std::vector<xla::XrtComputationClient::DeviceHandle, std::allocator<xla::XrtComputationClient::DeviceHandle> >*, std::function<xla::XrtSession::CachedNode const& (xla::XrtSession*, tensorflow::Scope const&, std::string const&)> const&, xla::metrics::Metric*, xla::metrics::Counter*)
        xla::XrtComputationClient::HandleReleaser()
        xla::util::TriggeredTask::Runner()

        clone
*** End stack trace ***

I followed the steps described there with the same network parameters, only a different dataset.
I had run into this problem before, but that time it was caused by an OOM on the VM after restoring a checkpoint, which is why I increased the VM memory.
It seems the TPU was somehow preempted, but I have no access to the runtime logs, because the error happened overnight and TFRC deleted the node automatically.
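One thing that helped with the earlier VM OOM on checkpoint restore, and may be worth noting here, is restoring the checkpoint into host RAM explicitly before pushing weights to the device. A minimal sketch (the toy model and checkpoint path are illustrative, not the actual RoBERTa setup):

```python
import torch
import torch.nn as nn

# Toy model standing in for the real network; path is illustrative.
model = nn.Linear(4, 2)
torch.save({'model': model.state_dict()}, '/tmp/ckpt.pt')

# map_location='cpu' materializes all tensors in host RAM during the
# restore, avoiding a large transient allocation on the accelerator.
state = torch.load('/tmp/ckpt.pt', map_location='cpu')
model.load_state_dict(state['model'])
```

With the state dict on CPU, the weights only move to the XLA device when the model itself is transferred, so the restore does not double-allocate on the device.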

Describe the expected behavior

Training should continue as expected.

Standalone code to reproduce the issue

https://cloud.google.com/tpu/docs/tutorials/roberta-pytorch
The training data is about 40 GB.

Other info / logs

Include any logs or source code that would be helpful in diagnosing the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

| epoch 002 | training on xla:0/1:   4151 / 10099 loss=1.804, nll_loss=1.804, wps=17811, ups=0, wpb=117675.783, bsz=296.068, num_updates=14249, lr=0.000491148, gnorm=0.345, oom=0.000, wall=27948, train_wall=92610, now=01:54:20
| epoch 002 | training on xla:0/7:   4151 / 10099 loss=1.805, nll_loss=1.805, wps=17811, ups=0, wpb=117678.381, bsz=296.074, num_updates=14249, lr=0.000491148, gnorm=0.345, oom=0.000, wall=27948, train_wall=92609, now=01:54:20
| epoch 002 | training on xla:0/3:   4151 / 10099 loss=1.805, nll_loss=1.805, wps=17810, ups=0, wpb=117668.251, bsz=296.137, num_updates=14249, lr=0.000491148, gnorm=0.345, oom=0.000, wall=27949, train_wall=92608, now=01:54:20
2020-12-28 01:56:05.252059: W    1436 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.251908254","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252085: W    1417 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.251970000","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252085: W    1416 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.251940620","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252162: W    1379 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252025037","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252205: W    1438 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252117134","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252251: W    1465 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252143973","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252279: W    1483 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252130522","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252398: W    1464 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252301762","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252431: W    1428 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252333413","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252452: W    1341 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252361631","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252472: W    1400 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252380523","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252456: W    1345 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252299434","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252541: W    1378 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252405772","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252553: W    1423 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252493206","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252618: W    1397 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252489431","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-12-28 01:56:05.252674: W    1480 tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1609120565.252561700","description":"Error received from peer ipv4:10.180.83.2:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
terminate called after throwing an instance of 'std::runtime_error'
  what():  tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1110 : Check failed: session->session()->Run( feed_inputs, {}, {cached_node.operations[0]}, &outputs) == ::tensorflow::Status::OK() (Aborted: Session a57840b79b1bd972 is not found. vs. OK)
*** Begin stack trace ***
        tensorflow::CurrentStackTrace()
        xla::XrtComputationClient::ReleaseHandles(std::vector<xla::XrtComputationClient::DeviceHandle, std::allocator<xla::XrtComputationClient::DeviceHandle> >*, std::function<xla::XrtSession::CachedNode const& (xla::XrtSession*, tensorflow::Scope const&, std::string const&)> const&, xla::metrics::Metric*, xla::metrics::Counter*)
        xla::XrtComputationClient::HandleReleaser()
        xla::util::TriggeredTask::Runner()

        clone
*** End stack trace ***

For now I have resumed training on another TPU node, but checking memory usage, it seems to grow with every training step. Could it be that the TPU ran out of memory and became unavailable?
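To confirm whether memory really grows monotonically per step (rather than just fluctuating), one can log peak RSS on the VM side each training step and flag sustained growth. A minimal stdlib sketch (the class name and window size are my own):

```python
import resource

def rss_mb():
    # Peak resident set size of this process; ru_maxrss is in KB on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

class GrowthMonitor:
    """Record one sample per training step and flag steady growth."""
    def __init__(self, sample_fn=rss_mb, window=5):
        self.sample_fn = sample_fn
        self.window = window
        self.samples = []

    def step(self):
        self.samples.append(self.sample_fn())
        self.samples = self.samples[-self.window:]

    def is_growing(self):
        # A full window of non-decreasing samples with a net increase.
        if len(self.samples) < self.window:
            return False
        return (all(b >= a for a, b in zip(self.samples, self.samples[1:]))
                and self.samples[-1] > self.samples[0])
```

On the device side, `torch_xla.debug.metrics.metrics_report()` prints allocation and transfer counters that can show the same trend for TPU memory.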


mi7gmzs6 1#

Hi,
Thanks for opening this issue. Since it has been open for a long time, the code/debug info in it may no longer be relevant to the current state of the codebase.
The TensorFlow team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration, which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings, along with all the debugging information that could help us investigate.
Please follow the release notes to stay up to date with the latest developments in the TensorFlow space.


6pp0gazn 2#

Hi soares-f. I ran into the same problem (albeit with a custom model), with a similar increase in TPU memory each epoch.
Did you ever figure out the cause?
