Collected TensorFlow profile is not recognized in TensorBoard

icomxhvb  posted 6 months ago in Other

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

tf 2.13.0-dev20230406

Custom code

OS platform and distribution

Linux Ubuntu 22.04

Mobile device

Python version

3.9.16

Bazel version

GCC/compiler version

CUDA/cuDNN version

11.8.0/8.6.0.163

GPU model and memory

NVIDIA GeForce RTX 3080 Ti 12GiB

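Current behavior?

I am following the tutorial at https://www.tensorflow.org/tutorials/quickstart/beginner. I modified the code according to the instructions at https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras to enable profiling for a range of batches during training. With this change, training appears to proceed normally, and the logs show that a profiler session is created and a profile is collected. The log directory contains a non-empty plugins/profile/<date>/<host>.xplane.pb file. However, when I run tensorboard on the logs (either the main release or tb-nightly), it does not detect the profile (the Profile tab is missing from the UI). I also confirmed that I ran pip install -U tensorboard-plugin-profile first. I would expect one of two outcomes: either tensorboard shows me the profile, or, if something went wrong while collecting or displaying it, an error message tells me so that I can fix the problem.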
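The check for a non-empty trace file can be scripted. The following is only a sketch: the helper name find_xplane_files is made up for illustration, and it assumes the plugins/profile/<date>/<host>.xplane.pb layout described above. It uses only the Python standard library and does not depend on TensorFlow or TensorBoard.

```python
from pathlib import Path


def find_xplane_files(log_root):
    """Return sorted (relative_path, size_in_bytes) pairs for profile traces.

    Hypothetical helper. Assumes the layout described in this report:
    <run>/plugins/profile/<date>/<host>.xplane.pb
    """
    root = Path(log_root)
    # "**" matches zero or more directories, so log_root may be either the
    # top-level logs/ directory or a single run directory.
    matches = root.glob("**/plugins/profile/*/*.xplane.pb")
    return sorted((p.relative_to(root).as_posix(), p.stat().st_size) for p in matches)


if __name__ == "__main__":
    # "logs" is the directory name used in this report; adjust as needed.
    for rel_path, size in find_xplane_files("logs"):
        status = "ok" if size > 0 else "EMPTY"
        print(f"{status:5} {size:>10} bytes  {rel_path}")
```

An empty result, or a file reported as EMPTY, would point at the collection step rather than at how TensorBoard reads the directory.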

Standalone code to reproduce the issue

# The code is at https://www.tensorflow.org/tutorials/quickstart/beginner
# I change the model.fit() call to use the TensorBoard callback to collect a profile:

import datetime  # needed for the timestamped log_dir below

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    profile_batch=(500, 600))
model.fit(x_train, y_train, epochs=5, callbacks=[tensorboard_callback])

Relevant log output

2023-04-06 23:17:28.048863: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2023-04-06 23:17:28.048880: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2023-04-06 23:17:28.048915: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1671] Profiler found 1 GPUs
2023-04-06 23:17:28.237604: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2023-04-06 23:17:28.237742: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1805] CUPTI activity buffer flushed
Epoch 1/5
2023-04-06 23:17:28.747772: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f08c0180cf0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-04-06 23:17:28.747785: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 3080 Ti, Compute Capability 8.6
2023-04-06 23:17:28.751189: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-04-06 23:17:28.834436: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:426] Loaded cuDNN version 8600
2023-04-06 23:17:28.868033: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-04-06 23:17:28.900180: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
 522/1875 [=======>......................] - ETA: 5s - loss: 0.4875 - accuracy: 0.8590
2023-04-06 23:17:30.991051: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2023-04-06 23:17:30.991106: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
 645/1875 [=========>....................] - ETA: 4s - loss: 0.4499 - accuracy: 0.8701
2023-04-06 23:17:31.542500: I tensorflow/tsl/profiler/lib/profiler_session.cc:70] Profiler session collecting data.
2023-04-06 23:17:31.545123: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1805] CUPTI activity buffer flushed
2023-04-06 23:17:31.570874: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_collector.cc:541]  GpuTracer has collected 6158 callback api events and 5891 activity events. 
2023-04-06 23:17:31.598454: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
1875/1875 [==============================] - 8s 4ms/step - loss: 0.3017 - accuracy: 0.9121
Epoch 2/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.1441 - accuracy: 0.9570
Epoch 3/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.1075 - accuracy: 0.9685
Epoch 4/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0878 - accuracy: 0.9732
Epoch 5/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0737 - accuracy: 0.9771

carvr3hs #1

It looks like we are hitting a similar issue. @stefanbucur, can you find events.out.tfevents.* in the directory that contains plugins?
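One way to answer that question without listing directories by hand is sketched below. The function name event_files_by_run is hypothetical, and the sketch assumes that the directory holding plugins is the run directory and that event files may sit in subdirectories of it (for example train/) rather than directly beside plugins.

```python
from pathlib import Path


def event_files_by_run(log_root):
    """Map each run directory (the parent of `plugins`) to its event files.

    Hypothetical helper. Event files are searched in the run directory and
    everything below it, on the assumption that they may be written to
    subdirectories such as train/.
    """
    root = Path(log_root)
    report = {}
    for plugins_dir in root.glob("**/plugins"):
        if not plugins_dir.is_dir():
            continue
        run_dir = plugins_dir.parent
        events = sorted(
            p.relative_to(run_dir).as_posix()
            for p in run_dir.glob("**/events.out.tfevents.*")
        )
        report[run_dir.relative_to(root).as_posix()] = events
    return report


if __name__ == "__main__":
    # "logs" is the directory name used in this thread; adjust as needed.
    for run, events in event_files_by_run("logs").items():
        print(run, "->", events if events else "no event files found")
```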

1rhkuytd #2

TF profiling works fine for me, but I cannot see any loss values in TensorBoard.

wwwo4jvm #3

Yes, here are the full directory contents:

$ ls -R logs/
logs/:
fit

logs/fit:
20230417-090906

logs/fit/20230417-090906:
plugins  train

logs/fit/20230417-090906/plugins:
profile

logs/fit/20230417-090906/plugins/profile:
2023_04_17_09_09_10

logs/fit/20230417-090906/plugins/profile/2023_04_17_09_09_10:
saturn.xplane.pb

logs/fit/20230417-090906/train:
events.out.tfevents.1681736947.saturn.73263.0.v2

r1zk6ea1 #4

I must have done something right, because it works now.

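2skhul33 #5

Update: I managed to get it working under certain conditions.
If I run $ tensorboard --logdir logs/ as described in the tutorial, the profile does not show up.
However, if I run tensorboard on the specific subdirectory that contains the profile ($ tensorboard --logdir logs/fit/20230417-090906/), the profile does show up, but only after I refresh the browser window once. Once I refresh the browser window and open the Profile tab, *.hlo_proto.pb files start appearing next to the original *.xplane.pb profile file.
Note that in the original case (opening the top-level logs/ directory), refreshing the window does not fix the problem.
This is only a guess, but perhaps there is a regression in how tensorboard traverses the log directory tree?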
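For anyone who wants to apply the same workaround across several runs, here is a rough sketch that prints one tensorboard --logdir command per run directory containing a profile. The function name tensorboard_commands is made up for illustration, the directory layout is assumed to match the listing earlier in this thread, and the sketch only builds the command strings without launching TensorBoard.

```python
import shlex
from pathlib import Path


def tensorboard_commands(log_root):
    """Build one `tensorboard --logdir <run>/` command per run with a profile.

    Hypothetical helper. A run directory is taken to be the directory that
    contains plugins/profile/<date>/.
    """
    root = Path(log_root)
    run_dirs = sorted(
        # p is <run>/plugins/profile/<date>, so parents[2] is <run>.
        {p.parents[2] for p in root.glob("**/plugins/profile/*") if p.is_dir()}
    )
    return [
        f"tensorboard --logdir {shlex.quote(run_dir.as_posix() + '/')}"
        for run_dir in run_dirs
    ]


if __name__ == "__main__":
    # "logs" is the top-level directory used in this thread; adjust as needed.
    for command in tensorboard_commands("logs"):
        print(command)
```

With the directory listing shown above, this should print the same per-run command that worked here. The browser refresh described above would still be needed.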
