Problem using aim with fairseq

vom3gejh posted 25 days ago in Other

❓Question

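The fairseq library has built-in support for aim, but I'm struggling to get it working. I'm not sure whether I'm doing something wrong or whether the fairseq support is simply out of date, but the fairseq repo is fairly inactive, so I thought I'd ask here.
I run aim server locally and see: "Server is mounted on 0.0.0.0:53800".
I then add the following to the config.yaml file of my fairseq experiment: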

common:
   aim_repo: aim://0.0.0.0:53800

Then I run my experiment. It seems to work at first: aim detects the experiment, and the log starts with:

[2023-11-15 14:31:07,453][fairseq.logging.progress_bar][INFO] - Storing logs at Aim repo: aim://0.0.0.0:53800
[2023-11-15 14:31:07,480][aim.sdk.reporter][INFO] - creating RunStatusReporter for f6f19ecf0e2147b19e24d52f
[2023-11-15 14:31:07,482][aim.sdk.reporter][INFO] - starting from: {}
[2023-11-15 14:31:07,482][aim.sdk.reporter][INFO] - starting writer thread for <aim.sdk.reporter.RunStatusReporter object at 0x7f57117363e0>
[2023-11-15 14:31:08,471][fairseq.trainer][INFO] - begin training epoch 1
[2023-11-15 14:31:08,471][fairseq_cli.train][INFO] - Start iterating over samples
[2023-11-15 14:31:10,821][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 64.0
[2023-11-15 14:31:12,261][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 32.0
[2023-11-15 14:31:12,261][fairseq_cli.train][INFO] - begin validation on "valid" subset
[2023-11-15 14:31:12,266][fairseq.logging.progress_bar][INFO] - Storing logs at Aim repo: aim://0.0.0.0:53800
[2023-11-15 14:31:12,283][fairseq.logging.progress_bar][INFO] - Appending to run: f6f19ecf0e2147b19e24d52f

But then I hit an error:

...
  File "/lib/python3.10/site-packages/fairseq/logging/progress_bar.py", line 64, in progress_bar
    bar = AimProgressBarWrapper(
  File "/lib/python3.10/site-packages/fairseq/logging/progress_bar.py", line 365, in __init__
    self.run = get_aim_run(aim_repo, aim_run_hash)
  File "/lib/python3.10/site-packages/fairseq/logging/progress_bar.py", line 333, in get_aim_run
    return Run(run_hash=run_hash, repo=repo)
  File "/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 70, in wrapper
    _SafeModeConfig.exception_callback(e, func)
  File "/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 47, in reraise_exception
    raise e
  File "/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
    return func(*args, **kwargs)
  File "/lib/python3.10/site-packages/aim/sdk/run.py", line 828, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
  File "/lib/python3.10/site-packages/aim/sdk/run.py", line 276, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, force_resume=force_resume)
  File "/lib/python3.10/site-packages/aim/sdk/base_run.py", line 50, in __init__
    self._lock.lock(force=force_resume)
  File "/lib/python3.10/site-packages/aim/storage/lock_proxy.py", line 38, in lock
    return self._rpc_client.run_instruction(self._hash, self._handler, 'lock', (force,))
  File "/lib/python3.10/site-packages/aim/ext/transport/client.py", line 260, in run_instruction
    return self._run_read_instructions(queue_id, resource, method, args)
  File "/lib/python3.10/site-packages/aim/ext/transport/client.py", line 285, in _run_read_instructions
    raise_exception(status_msg.header.exception)
  File "lib/python3.10/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
    raise exception(*args) if args else exception()
TypeError: Timeout.__init__() missing 1 required positional argument: 'lock_file'
Exception in thread Thread-13 (worker):
Traceback (most recent call last):
  File "lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 55, in worker
    if self._try_exec_task(task_f, *args):
  File "/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 81, in _try_exec_task
    task_f(*args)
  File "/lib/python3.10/site-packages/aim/ext/transport/client.py", line 301, in _run_write_instructions
    raise_exception(response.exception)
  File "/python3.10/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
    raise exception(*args) if args else exception()
aim.ext.transport.message_utils.UnauthorizedRequestError: 3310c526-aa51-47ef-ba87-fbf75f80f610

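Does anyone know why this might be happening, or whether my approach is wrong? I've tried a range of aim versions (going back to releases from when fairseq was more actively developed), but I still get the error.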

insrf1ej #1

Adding @tmynn to this thread, since he put the integration together.

pw9qyyiw #2

@SGevorg, @henrycharlesworth, it looks like this line points to the real error:

TypeError: Timeout.__init__() missing 1 required positional argument: 'lock_file'

@henrycharlesworth, are you using the latest version of Aim?
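For anyone reading the traceback: the TypeError is probably a symptom rather than the root cause. The line shown in `message_utils.py` rebuilds the server-side exception on the client with `exception(*args) if args else exception()`, so if a `Timeout` exception arrives without its constructor arguments, constructing it fails. Here is a minimal sketch of that mechanism. The `Timeout` class below is a stand-in written for this illustration; I'm assuming the real one is a lock-timeout exception that requires a `lock_file` argument (as the error message suggests), and the thread does not confirm which package it comes from.

```python
class Timeout(Exception):
    """Stand-in for a lock-timeout exception whose constructor requires an argument."""

    def __init__(self, lock_file):
        super().__init__(lock_file)
        self.lock_file = lock_file


def raise_exception(exception, args=None):
    # Same shape as the line shown in the traceback (message_utils.py, line 76).
    raise exception(*args) if args else exception()


def demo():
    """Return the message of the TypeError raised when no args are available."""
    try:
        # No constructor arguments survived, so exception() is called bare.
        raise_exception(Timeout)
    except TypeError as err:
        return str(err)


if __name__ == "__main__":
    print(demo())
```

If this reading is right, the underlying event is a lock timeout on the run, and the TypeError only hides it.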

ep6jt1vc #3

I think so: I'm using version 3.17.5. I've tried some earlier versions, but that didn't seem to help.

fcwjkofz #4

Is there a workaround for this problem? I keep getting this error whenever I try to retrieve an existing run by its hash.
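No fix is confirmed in this thread, but one direction that might be worth trying is to avoid opening a second handle to a run that the same process already has open. In the log above, the failure comes right after "Appending to run" with the hash of the run that was just created, which would be consistent with a lock that is still held by the first handle. A minimal sketch of that idea follows. `RunHandleCache` and `open_run` are hypothetical names invented for this example, and it assumes the run object exposes its hash as a `hash` attribute.

```python
from typing import Callable, Dict, Optional, Tuple


class RunHandleCache:
    """Keep one open run handle per (repo, hash) and reuse it on later requests."""

    def __init__(self, open_run: Callable[[str, Optional[str]], object]):
        # open_run is a hypothetical factory supplied by the caller, e.g.
        # lambda repo, h: Run(run_hash=h, repo=repo)
        self._open_run = open_run
        self._handles: Dict[Tuple[str, Optional[str]], object] = {}

    def get(self, repo: str, run_hash: Optional[str] = None) -> object:
        key = (repo, run_hash)
        if key in self._handles:
            return self._handles[key]
        run = self._open_run(repo, run_hash)
        self._handles[key] = run
        # Also register the handle under its actual hash, so a run created
        # with run_hash=None is found when it is later requested by hash.
        actual_hash = getattr(run, "hash", None)
        if actual_hash is not None:
            self._handles[(repo, actual_hash)] = run
        return run
```

With a cache like this, the first request (no hash) creates the run, and a later request for the same hash returns the existing handle instead of trying to lock the run again. Whether this actually resolves the error with a remote `aim://` repo is untested here.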
