aim: How can I find out why tracking stopped?

rkkpypqq  posted 2 months ago  in Other

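Hello. I am trying to run the Aim server as a Docker container and run my machine learning code in another container. The two containers are connected through a Docker bridge network. I am currently facing two problems.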

The first problem is that I cannot run more than two trainings at the same time. Some of the runs stop at random with an error like this:

Traceback (most recent call last):
  File "train.py", line 59, in main
    aim_logger.log_hyperparams(cfg)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py", line 48, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/appuser/tiamo/tiamo/utils/aim/pytorch_lightning.py", line 97, in log_hyperparams
    self.experiment.set(('hparams', key), value, strict=False)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loggers/base.py", line 42, in experiment
    return get_experiment() or DummyExperiment()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py", line 48, in wrapped_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loggers/base.py", line 40, in get_experiment
    return fn(self)
  File "/home/appuser/tiamo/tiamo/utils/aim/pytorch_lightning.py", line 73, in experiment
    self._run = Run(
  File "/usr/local/lib/python3.8/dist-packages/aim/sdk/run.py", line 437, in __init__
    self.meta_tree: TreeView = self.repo.request_tree(
  File "/usr/local/lib/python3.8/dist-packages/aim/sdk/repo.py", line 312, in request_tree
    return ProxyTree(self._client, name, sub, read_only, from_union)
  File "/usr/local/lib/python3.8/dist-packages/aim/storage/treeviewproxy.py", line 42, in __init__
    handler = self._rpc_client.get_resource_handler('TreeView', args=args)
  File "/usr/local/lib/python3.8/dist-packages/aim/ext/transport/client.py", line 63, in get_resource_handler
    response = self.remote.get_resource(request)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1654246083.032662701","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3093,"referenced_errors":[{"created":"@1654246083.032660970","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>

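The other problem is more critical. Even when only a single run is being logged, the tracker disconnects at some point during training without leaving any error, and the only way I can tell that it has stopped is through the Aim UI. So my question is: is there any log file that could help me find out why the tracker stopped?

Thanks.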

e5nqia27 #1

mihran113, could you take a look at this? Is it related to the remote server or to a gRPC issue?

ev7lccsx #2

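Hey, @hcw-00! The first problem you describe means that the client could not connect to the server. Possible causes: the server is down, or, depending on your setup, there is a limit on the number of connections.
As for the second problem, unfortunately there is no other log file that would indicate the cause of the failure. Could you share more information about your setup (Dockerfile, etc.) so that I can reproduce it on my side, along with an example of a client script that fails?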
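
To narrow down the "failed to connect to all addresses" case, it may help to check from inside the training container whether the server port is reachable at all. The following is a minimal sketch using only the Python standard library; the function name `can_reach` is hypothetical and is not part of Aim, and it only tests that a TCP connection can be opened, not that the Aim server behind it is healthy.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within `timeout` seconds."""
    try:
        # create_connection resolves the host name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

Calling, for example, `can_reach("<server address>", <server port>)` with the host and port from your `aim://` repo URL just before training starts, and again when a run stops, would at least show whether the failure is at the network level.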

yv5phkfx #3

Thank you!
Here is my Dockerfile:

FROM aimstack/aim:latest
ENTRYPOINT []

These are the run commands.

docker run -it --name aim_server -v /home/changwoo/models/aim_repo:/opt/aim aim:0.0.1 /bin/bash
$ aim server
---
docker run -it -v /home/changwoo/models/aim_repo:/opt/aim -p 43800:43800 --network="host" aim:0.0.1 /bin/bash
$ aim up

Network setup

docker network create aim-net
docker network connect aim-net aim_server
docker network connect aim-net my_ml_container
---
# in my training code
aim_logger = AimLogger(repo='aim://192.168.144.2:53800', ...)

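Unfortunately, I am not allowed to share the code in which the failure occurs. However, I will try to reproduce it with another script, and if I succeed I will share it with you.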
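
Until there is a reproduction, one client-side option is to wrap the call that fails in the traceback in a retry that logs each failed attempt, so that intermittent connection errors at least leave a trace. This is only a sketch under assumptions: `call_with_retry` is a hypothetical helper, not an Aim or PyTorch Lightning API, and it only helps with calls that raise an exception, not with the silent disconnects in the second problem.

```python
import logging
import time

logger = logging.getLogger("aim_retry")


def call_with_retry(fn, retries=5, base_delay=1.0, exceptions=(Exception,)):
    """Call fn(); on a listed exception, log it and retry with exponential backoff.

    The last failure is re-raised once `retries` attempts have been used up.
    """
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except exceptions as exc:
            if attempt == retries:
                raise
            # Delay doubles after every failed attempt: base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "attempt %d/%d failed: %r; retrying in %.1fs",
                attempt, retries, exc, delay,
            )
            time.sleep(delay)
```

In the training script this could be used as `call_with_retry(lambda: aim_logger.log_hyperparams(cfg), exceptions=(grpc.RpcError,))`, assuming `grpc` is imported; the `_InactiveRpcError` in the traceback is a subclass of `grpc.RpcError`.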

m1m5dgzv #4

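Hey, @hcw-00!
Just following up: were you able to reproduce it some other way and share it?
I have tried this setup and it seems to work fine on my side. I was able to connect multiple clients and train for quite a long time without any unexpected stops.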
