ludwig compare_classifiers_performance_from_pred: HDF5 format is not supported for the ground truth file

0h4hbjxa  posted 2 months ago in Other

Describe the bug

The compare_classifiers_performance_from_pred visualization does not work because it fails with the following error:
ValueError: hdf5 is not supported for ground truth file, valid types are {'stata', 'dataframe', <class 'dask.dataframe.core.DataFrame'>, 'html', 'df', 'tsv', 'json', 'jsonl', <class 'pandas.core.frame.DataFrame'>, 'orc', 'parquet', 'sas', 'fwf', 'feather', 'csv', 'spss', 'excel', 'pickle'}
According to the documentation, the ground_truth parameter should be the name of the HDF5 file obtained during training preprocessing.
Documentation: https://ludwig.ai/latest/user_guide/visualizations/#compare_classifiers_performance_from_pred

To reproduce

Steps to reproduce the problem:

  1. Go to Google Colab and generate some training and prediction data.
  2. Generate a visualization:
!ludwig visualize --visualization compare_classifiers_performance_from_pred \
  --predictions predictions_20230827_183245.csv \
  --ground_truth train.hdf5 \
  --ground_truth_metadata 1dbf206244e911ee93d40242ac1c000c.meta.json \
  --output_feature_name MyTarget
  3. See the error:
Traceback (most recent call last):
  File "/usr/local/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 191, in main
    CLI()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 71, in __init__
    getattr(self, args.command)()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 116, in visualize
    visualize.cli(sys.argv[2:])
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 4172, in cli
    vis_func(**vars(args))
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 469, in compare_classifiers_performance_from_pred_cli
    ground_truth = _extract_ground_truth_values(ground_truth, output_feature_name, ground_truth_split, split_file)
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 264, in _extract_ground_truth_values
    ground_truth_df = _get_ground_truth_df(ground_truth) if isinstance(ground_truth, str) else ground_truth
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 233, in _get_ground_truth_df
    raise ValueError(
ValueError: hdf5 is not supported for ground truth file, valid types are {'stata', 'dataframe', <class 'dask.dataframe.core.DataFrame'>, 'html', 'df', 'tsv', 'json', 'jsonl', <class 'pandas.core.frame.DataFrame'>, 'orc', 'parquet', 'sas', 'fwf', 'feather', 'csv', 'spss', 'excel', 'pickle'}

Expected behavior

Some charts, like the ones shown in the documentation.

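Environment

  • OS: Google Colab - Linux Ubuntu
  • Python version: 3.10
  • Ludwig version: 0.8.1.post1

Additional context

I tried to find the bug in ludwig/utils/data_utils.py, but it looks fine. I also tried calling the function directly from a Jupyter Notebook (compare_classifiers_performance_from_pred_cli), but I still get the same error.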

lawou6xi1#

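Hey @iflow, thanks for reporting this! It looks like we left HDF5 out of the list of valid file formats in this check. Could you try running with the changes in #3557 and let me know whether that resolves the issue?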
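
To illustrate the kind of check that can produce this error, here is a minimal sketch of an extension-based format whitelist. It is not Ludwig's actual code: `VALID_FORMATS`, `EXTENSION_TO_FORMAT`, `detect_format`, and `check_ground_truth_format` are hypothetical names, and the format list is abbreviated. The point is only that leaving "hdf5" out of the set is enough to reject a train.hdf5 path.

```python
import os

# Hypothetical, abbreviated whitelist; "hdf5" is deliberately missing.
VALID_FORMATS = {"csv", "tsv", "json", "jsonl", "parquet", "feather", "pickle"}

# Hypothetical mapping from file extension to a format name.
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".parquet": "parquet",
    ".feather": "feather",
    ".pickle": "pickle",
    ".h5": "hdf5",
    ".hdf5": "hdf5",
}


def detect_format(path):
    """Guess the format name from the file extension (None if unknown)."""
    ext = os.path.splitext(path)[1].lower()
    return EXTENSION_TO_FORMAT.get(ext)


def check_ground_truth_format(path, valid_formats=VALID_FORMATS):
    """Raise ValueError if the detected format is not in the whitelist."""
    fmt = detect_format(path)
    if fmt not in valid_formats:
        raise ValueError(
            f"{fmt} is not supported for ground truth file, "
            f"valid types are {sorted(valid_formats)}"
        )
    return fmt
```

With this sketch, `check_ground_truth_format("train.hdf5")` raises, while passing `VALID_FORMATS | {"hdf5"}` as the whitelist accepts the same path.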

2eafrhcq2#

Hi iflow, could you confirm whether this fix resolves the issue? If it does, we can go ahead and merge it.

vs3odd8k3#

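Thanks for the quick fix! Unfortunately, I have not tried it yet, because I installed the library on Google Colab. I would have to set up my local machine first, which may take some time.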

dy2hfwbg4#

Hey @iflow, in Colab you can install Ludwig like this to test the branch:

!pip install "git+https://github.com/ludwig-ai/ludwig.git@fix-gt-formats#egg=ludwig[llm]" --quiet

y53ybaqx5#

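Thank you @tgaddair, I did not know about this great command :)
With the fixed version, the error "hdf5 is not supported..." no longer appears 👍
However, a different error now shows up: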
pyarrow.lib.ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
I guess it is unrelated to this issue?
Full traceback:

Traceback (most recent call last):
  File "/usr/local/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 191, in main
    CLI()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 71, in __init__
    getattr(self, args.command)()
  File "/usr/local/lib/python3.10/dist-packages/ludwig/cli.py", line 116, in visualize
    visualize.cli(sys.argv[2:])
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 4175, in cli
    vis_func(**vars(args))
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 475, in compare_classifiers_performance_from_pred_cli
    predictions_per_model = _get_cols_from_predictions(predictions, [col], metadata)
  File "/usr/local/lib/python3.10/dist-packages/ludwig/visualize.py", line 305, in _get_cols_from_predictions
    pred_df = pd.read_parquet(predictions_path)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parquet.py", line 503, in read_parquet
    return impl.read(
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parquet.py", line 251, in read
    result = self.api.parquet.read_table(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/parquet/__init__.py", line 2780, in read_table
    dataset = _ParquetDatasetV2(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/parquet/__init__.py", line 2368, in __init__
    [fragment], schema=schema or fragment.physical_schema,
  File "pyarrow/_dataset.pyx", line 898, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status

oug3syen6#

Hey @iflow, for --predictions, could you try using the parquet file generated by Ludwig instead of the CSV? There should be a file in the same folder named something like predictions_20230827_183245.parquet.
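
A quick way to confirm which file is actually a Parquet file before passing it to --predictions is to look at its magic bytes: an unencrypted Parquet file begins and ends with the 4 bytes PAR1, which is what the "Parquet magic bytes not found in footer" message refers to. The helper below is a minimal sketch using only the standard library; `looks_like_parquet` is a hypothetical name and not part of Ludwig, pandas, or pyarrow.

```python
PARQUET_MAGIC = b"PAR1"  # magic number of unencrypted Parquet files


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes.

    This only inspects the first and last 4 bytes; it does not validate
    the footer metadata or the data pages.
    """
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)  # jump to the end to learn the file size
        size = f.tell()
        # Smallest possible layout: magic + 4-byte footer length + magic.
        if size < 12:
            return False
        f.seek(-4, 2)  # last 4 bytes of the file
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

For a CSV such as predictions_20230827_183245.csv this check would return False, which matches the pyarrow error above: the file is being read as Parquet but is not one.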
