Aim: missing .ldb file when closing a run

2w3kk1z5 posted 2 months ago in Other

🐛 Bug

When I run aim runs close <hash>, I get the following error:
Closing runs: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/aim", line 8, in <module>
    sys.exit(cli_entry_point())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/aim/cli/runs/commands.py", line 159, in close_runs
    for _ in tqdm.tqdm(
  File "/usr/local/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.8/site-packages/aim/sdk/repo.py", line 974, in _close_run
    index_manager.index(run_hash)
  File "/usr/local/lib/python3.8/site-packages/aim/sdk/index_manager.py", line 174, in index
    meta_run_tree.finalize(index=index)
  File "aim/storage/containertreeview.py", line 30, in aim.storage.containertreeview.ContainerTreeView.finalize
  File "aim/storage/prefixview.py", line 66, in aim.storage.prefixview.PrefixView.finalize
  File "aim/storage/rockscontainer.pyx", line 165, in aim.storage.rockscontainer.RocksContainer.finalize
  File "aim/storage/rockscontainer.pyx", line 413, in aim.storage.rockscontainer.RocksContainer.items
  File "aim/storage/rockscontainer.pyx", line 593, in aim.storage.rockscontainer.RocksContainerItemsIterator.__init__
  File "src/aimrocks/lib_rocksdb.pyx", line 2338, in aimrocks.lib_rocksdb.BaseIterator.seek
  File "src/aimrocks/lib_rocksdb.pyx", line 2342, in aimrocks.lib_rocksdb.BaseIterator.seek
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: No such file or directory: While open a file for random read: /aim/.aim/meta/chunks/59f8474b38c647b4acf741d4/000512.ldb: No such file or directory'
I checked that directory and found no .ldb file there. The directories for other run hashes contain no .ldb files either. How can I restore them?
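For anyone repeating this check, here is a minimal sketch that counts the files in a run's chunk directory by extension, which makes it easy to see whether any .sst or .ldb files are present at all. The helper name count_files_by_suffix is hypothetical (it is not part of Aim), and the path is simply the one from the error message above; adjust it to your own repo location.

```python
from collections import Counter
from pathlib import Path


def count_files_by_suffix(chunk_dir):
    """Count regular files in chunk_dir, keyed by suffix (e.g. '.sst').

    Files without a suffix (such as CURRENT or LOCK) are keyed by their full name.
    Hypothetical helper for illustration; not part of the Aim SDK.
    """
    directory = Path(chunk_dir)
    return Counter(p.suffix or p.name for p in directory.iterdir() if p.is_file())


if __name__ == "__main__":
    # Path copied from the error message; change it to match your setup.
    chunk = Path("/aim/.aim/meta/chunks/59f8474b38c647b4acf741d4")
    if chunk.is_dir():
        for suffix, n in sorted(count_files_by_suffix(chunk).items()):
            print(f"{suffix}: {n}")
    else:
        print(f"{chunk} does not exist")
```

This only reports what is on disk. It does not tell you which files the database expects to find.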

To reproduce

In a new terminal window, run the following commands:
cd aim
aim runs close <hash_of_run>

Expected behavior

The run should close normally.

Environment

  • Aim version (3.17.4)
  • OS (e.g., Linux)

uurv41yg #1

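Hey @jennifer12121! RocksDB can become corrupted when the process is killed while a write operation is in progress, so the problem is not in the close command itself. Could you share more details about how the Run object is created and stopped?
As for the .ldb file: this is RocksDB-specific behavior. When it cannot find an .sst file, it starts looking for an .ldb file instead.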

lsmepo6l #2

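Hi @mihran113! Thanks for the feedback. I'll look into how to shut down RocksDB write operations.
Could you explain a bit more which information about the Run object you need? So far I have tried closing the run through both the CLI and the SDK. With the CLI I ran aim runs close 59f8474b38c647b4acf741d4, and with the SDK I ran: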

import aim
# The run hash must be passed as a string; <repo_url> is a placeholder for the repo location.
run = aim.Run(run_hash="59f8474b38c647b4acf741d4", repo=<repo_url>)
run.close()

With the CLI I got the error I posted above. With the SDK no error was raised, but the hash still shows up when I run repo.list_active_runs().

rhfm7lfc #3

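I'm looking for more information on how the runs are created and whether the process was terminated by a SIGKILL or SIGTERM signal.
For example, a similar issue exists when concurrent runs are created through Slurm: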
#2869
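
The difference between the two signals matters here: SIGTERM can be caught by the process, whereas SIGKILL cannot. As a rough sketch, assuming the training script itself owns the Run object, a SIGTERM handler could close the run before exiting. The helper install_sigterm_handler and its close_fn parameter are hypothetical names used for illustration and are not part of Aim. Whether Aim installs signal handlers of its own is not something this thread establishes.

```python
import signal
import sys


def install_sigterm_handler(close_fn):
    """Register a SIGTERM handler that calls close_fn() and then exits.

    close_fn is a hypothetical zero-argument callable, for example run.close.
    Must be called from the main thread, as required by signal.signal.
    """

    def handler(signum, frame):
        close_fn()
        # 128 + signal number is the conventional exit status for signal termination.
        sys.exit(128 + signum)

    signal.signal(signal.SIGTERM, handler)
    return handler
```

In a training script this would be called once, right after the run is created, for example install_sigterm_handler(run.close). It offers no protection against SIGKILL, which terminates the process without giving it a chance to run any cleanup.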
