Issue Type
Bug
Have you reproduced the bug with TF nightly?
Yes
Source
source
Tensorflow Version
tf 2.13.0-dev20230406
Custom Code
No
OS Platform and Distribution
Linux Ubuntu 22.04
Mobile device
None
Python version
3.9.16
Bazel version
None
GCC/Compiler version
None
CUDA/cuDNN version
11.8.0/8.6.0.163
GPU model and memory
NVIDIA GeForce RTX 3080 Ti 12GiB
Current Behaviour?
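I am following the tutorial at https://www.tensorflow.org/tutorials/quickstart/beginner. I modified the code as described in https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras to enable profiling for a range of batches during training. With this change, training appears to proceed normally, and the logs show that a profiler session is created and a profile is collected. The log directory contains a non-empty plugins/profile/<date>/<host>.xplane.pb file. However, when I run tensorboard on the logs (either the stable release or tb-nightly), it does not detect the profile (the Profile tab is missing from the UI). I also made sure to run pip install -U tensorboard-plugin-profile first.
I expected one of two outcomes: either tensorboard shows me the profile, or, if something went wrong while collecting or displaying it, an error message points me to the problem so that I can fix it.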
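Since the report depends on which of the stable and nightly packages are actually installed, it can help to record their versions next to the logs. The following is a minimal sketch using only the standard library; the function name installed_versions and the default package list are illustrative assumptions, not part of the original report.

```python
from importlib import metadata

# Package names mentioned in this report; adjust the tuple as needed.
DEFAULT_PACKAGES = (
    "tensorflow",
    "tf-nightly",
    "tensorboard",
    "tb-nightly",
    "tensorboard-plugin-profile",
)


def installed_versions(packages=DEFAULT_PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

A None entry for tensorboard-plugin-profile would be the first thing to rule out when the Profile tab is missing.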
Standalone code to reproduce the issue
# The code is at https://www.tensorflow.org/tutorials/quickstart/beginner
# (tf, model, x_train and y_train are defined there).
# I change the model.fit() call to use the Tensorboard callback to collect a profile:
import datetime

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    profile_batch=(500, 600))  # profile batches 500 through 600
model.fit(x_train, y_train, epochs=5, callbacks=[tensorboard_callback])
Relevant log output
2023-04-06 23:17:28.048863: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2023-04-06 23:17:28.048880: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2023-04-06 23:17:28.048915: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1671] Profiler found 1 GPUs
2023-04-06 23:17:28.237604: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2023-04-06 23:17:28.237742: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1805] CUPTI activity buffer flushed
Epoch 1/5
2023-04-06 23:17:28.747772: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f08c0180cf0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-04-06 23:17:28.747785: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 3080 Ti, Compute Capability 8.6
2023-04-06 23:17:28.751189: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-04-06 23:17:28.834436: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:426] Loaded cuDNN version 8600
2023-04-06 23:17:28.868033: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-04-06 23:17:28.900180: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
522/1875 [=======>......................] - ETA: 5s - loss: 0.4875 - accuracy: 0.8590
2023-04-06 23:17:30.991051: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2023-04-06 23:17:30.991106: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
645/1875 [=========>....................] - ETA: 4s - loss: 0.4499 - accuracy: 0.8701
2023-04-06 23:17:31.542500: I tensorflow/tsl/profiler/lib/profiler_session.cc:70] Profiler session collecting data.
2023-04-06 23:17:31.545123: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1805] CUPTI activity buffer flushed
2023-04-06 23:17:31.570874: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_collector.cc:541] GpuTracer has collected 6158 callback api events and 5891 activity events.
2023-04-06 23:17:31.598454: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
1875/1875 [==============================] - 8s 4ms/step - loss: 0.3017 - accuracy: 0.9121
Epoch 2/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.1441 - accuracy: 0.9570
Epoch 3/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.1075 - accuracy: 0.9685
Epoch 4/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0878 - accuracy: 0.9732
Epoch 5/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0737 - accuracy: 0.9771
5 Answers
carvr3hs1#
It looks like we are hitting a similar issue. @stefanbucur, can you find events.out.tfevents.* in the directory that contains plugins?
1rhkuytd2#
TF profiling works fine for me. However, I cannot see any loss values in TensorBoard.
wwwo4jvm3#
Yes, here are the full directory contents:
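A check of this kind can also be scripted. The sketch below assumes the layout described in the report (a plugins directory holding profile/<date>/<host>.xplane.pb) and lists, for every plugins directory under a log root, the event files that sit next to it. The function name describe_profile_runs is hypothetical.

```python
from pathlib import Path


def describe_profile_runs(log_root):
    """Report event files and captured profiles for each `plugins` directory.

    Returns {run_dir: {"event_files": [...], "xplane_files": [...]}}, where
    run_dir is the directory that directly contains `plugins`.
    """
    root = Path(log_root)
    report = {}
    if not root.is_dir():
        return report
    for plugins_dir in sorted(root.rglob("plugins")):
        if not plugins_dir.is_dir():
            continue
        run_dir = plugins_dir.parent
        report[str(run_dir)] = {
            # Event files located in the same directory as `plugins`.
            "event_files": sorted(
                p.name for p in run_dir.glob("events.out.tfevents.*")
            ),
            # Assumed layout: plugins/profile/<date>/<host>.xplane.pb
            "xplane_files": sorted(
                p.name for p in plugins_dir.glob("profile/*/*.xplane.pb")
            ),
        }
    return report


if __name__ == "__main__":
    for run_dir, found in describe_profile_runs("logs").items():
        print(run_dir)
        print("  event files:", found["event_files"] or "none")
        print("  xplane files:", found["xplane_files"] or "none")
```

An empty event_files list only means that no event file sits directly beside plugins; event files may still exist in other subdirectories of the run.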
r1zk6ea14#
I must have done something right, because it works now.
2skhul335#
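Update: I managed to get it working under certain conditions.
If I run
$ tensorboard --logdir logs/
as instructed in the tutorial, the profile is not shown. However, if I run tensorboard on the specific subdirectory that contains the profile,
$ tensorboard --logdir logs/fit/20230417-090906/
then the profile does show up, but only after I refresh the browser window once. Once I have refreshed the browser window and opened the Profile tab, *.hlo_proto.pb files start to appear next to the original *.xplane.pb profile file.
Note that in the original case (opening the top-level logs/ directory), refreshing the window does not fix the problem. This is only a guess, but perhaps there is a regression in how tensorboard traverses the log directory tree?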
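Until the traversal question is settled, the workaround above can be automated. The sketch below searches a top-level log directory for *.xplane.pb files and prints one tensorboard command per run directory that contains a profile. The function name tensorboard_commands is hypothetical, and the layout <run_dir>/plugins/profile/<date>/<host>.xplane.pb is assumed from the paths quoted in this thread.

```python
import shlex
from pathlib import Path


def tensorboard_commands(log_root):
    """Return one `tensorboard --logdir <run_dir>/` command per profiled run."""
    root = Path(log_root)
    if not root.is_dir():
        return []
    run_dirs = set()
    for xplane in root.rglob("*.xplane.pb"):
        parts = xplane.relative_to(root).parts
        # Keep only files matching .../plugins/profile/<date>/<host>.xplane.pb
        if len(parts) >= 4 and parts[-4:-2] == ("plugins", "profile"):
            run_dirs.add(root.joinpath(*parts[:-4]))
    return [
        "tensorboard --logdir " + shlex.quote(run_dir.as_posix() + "/")
        for run_dir in sorted(run_dirs)
    ]


if __name__ == "__main__":
    for command in tensorboard_commands("logs"):
        print(command)
```

Each printed command points tensorboard at a single run directory, which is the case reported to work after one browser refresh.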