unstructured: clean up Table Transformer warning statements

Posted by hpxqektj 8 months ago in Other


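Using the table transformer in unstructured produces the warning message below. The goal of this issue is to clean up the warning. The original issue description follows.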

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Original issue

Describe the bug

When I call partition_pdf, I get the following:

from unstructured.partition.pdf import partition_pdf

path = "/app/example-docs/"
fname = "list-item-example.pdf"
raw_pdf_documents = partition_pdf(
    filename=path + fname,
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)

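Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

To reproduce

Just run the code above.

Expected behavior

partition_pdf runs without errors or warnings.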


Environment info

Docker image running on a MacPro with an Intel Core i9.
The Docker container is started as follows.
Create the container:
docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest
Open a bash shell inside the running container:
docker exec -it unstructured bash


Source: https://github.com/Unstructured-IO/unstructured/issues/3288

8 answers


f87krz0w1#

Hi magallardo, are you using the arm64 image? Are you getting an error, or just the warning?


hfsqlsce2#

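@MthwRobinson I am running the container on a MacPro with an AMD64 chip (not arm). I pulled the docker image with the following command, and when I list it on my machine it shows up as follows: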

docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
downloads.unstructured.io/unstructured-io/unstructured latest 24326ebafc76 10 hours ago 11.6GB

Every time I call the partition_pdf function, I get the reported message. I am not sure whether this is expected or unexpected.
Thanks,
Marcelo


0kjbasz63#

@magallardo Are you able to get output from partition_pdf? raw_pdf_documents should be a list of Element objects.
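
One quick way to confirm that the call finished and returned usable output is to tally the returned elements by type. The helper below is a hypothetical sketch: summarize_elements is not part of unstructured, it only assumes each element is a Python object that may expose a text attribute, and it uses stand-in classes so it runs without the library installed.

```python
from collections import Counter


def summarize_elements(elements):
    """Count returned elements by class name and report how many carry text.

    Hypothetical helper, not part of unstructured. It only assumes each
    element is an object that may expose a `text` attribute.
    """
    by_type = Counter(type(el).__name__ for el in elements)
    with_text = sum(1 for el in elements if getattr(el, "text", ""))
    return {"total": len(elements), "with_text": with_text, "by_type": dict(by_type)}


# Stand-in classes so the sketch runs without unstructured installed.
class CompositeElement:
    def __init__(self, text):
        self.text = text


class Table:
    def __init__(self, text):
        self.text = text


if __name__ == "__main__":
    fake_result = [CompositeElement("Intro"), Table("a | b"), CompositeElement("")]
    print(summarize_elements(fake_result))
```

With the real library, raw_pdf_documents from the call in the issue could be passed in place of fake_result; a non-empty total would suggest the partition completed despite the warning.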


tcbh2hod4#

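Thanks for the update.
Regarding your question, the operation does produce some output. I just wanted to confirm whether it actually returned a valid response or whether the operation did not complete, because the message is not very clear.
Thanks,
Marcelo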


kkih6yb85#

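Got it, thanks for clarifying @magallardo. Yes, your output is valid. We run unit tests from inside the docker container during the build and check the output.
That said, we should suppress these warning statements so they don't cause concern. I'll update the scope of this issue to reflect that.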


pu82cl6c6#

@MthwRobinson This is giving me different outputs from what I had before, when it did not show that warning, and the quality of the outputs is slightly worse. Can you please look into this further, as it might be a bigger issue than initially thought? This issue seems to have come up only recently, so maybe some of the underlying packages you are using have changed?


7gyucuyw7#

Thanks @atangsyd, we'll take a look. FYI @leah1985


ego6inou8#

Let me explain the problem and my thinking.
The TableTransformer model is implemented in the unstructured-inference library, so I reproduced the bug by simply running the following:

from unstructured_inference.models.tables import load_agent
load_agent()

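Indeed, I got the same warning. I then checked whether the warning appears on older versions, and it does: on 0.7.30 (May 1) and on 0.7.23 (January 18).
So overall there should be no problem. Changes in partition output are expected because many things keep evolving. In addition, table output is also affected by OCR and by other modules, such as the table detection performed by the OD model.
As a final step, I checked what the 'num_batches_tracked' variable is and whether it matters. As far as I can tell, if you look at BatchNorm2d (https://pytorch.org/docs/stable/_modules/torch/nn/modules/batchnorm.html#BatchNorm2d), it is a variable that counts the number of batches that have passed through the layer, and this statistic is used in computing the running mean and similar values.
So this variable is only used during training, not during inference. Moreover, if we look at the Hugging Face implementation (https://github.com/huggingface/transformers/blob/main/src/transformers/models/table_transformer/modeling_table_transformer.py#L218), it is a FrozenBatchNorm, which does not use this parameter.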
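
To illustrate why an unused batch counter should not affect inference, here is a pure-Python sketch of the arithmetic. It is a simplified single-channel toy, not PyTorch's or Hugging Face's implementation: ToyBatchNorm and frozen_batch_norm are names made up for this example, and the toy uses the biased variance everywhere for brevity.

```python
import math


def frozen_batch_norm(x, weight, bias, running_mean, running_var, eps=1e-5):
    """Normalize one channel's values with fixed (frozen) statistics.

    Pure-Python sketch of the arithmetic only; no batch counter is involved.
    """
    scale = weight / math.sqrt(running_var + eps)
    shift = bias - running_mean * scale
    return [v * scale + shift for v in x]


class ToyBatchNorm:
    """Toy single-channel batch norm that keeps a batch counter while training."""

    def __init__(self, momentum=0.1):
        self.running_mean = 0.0
        self.running_var = 1.0
        self.num_batches_tracked = 0  # bookkeeping, touched only in training mode
        self.momentum = momentum
        self.training = True

    def __call__(self, x):
        if self.training:
            # Training: normalize with batch statistics and update running ones.
            mean = sum(x) / len(x)
            var = sum((v - mean) ** 2 for v in x) / len(x)
            self.num_batches_tracked += 1
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            # Inference: use the stored statistics; the counter is never read.
            mean, var = self.running_mean, self.running_var
        return frozen_batch_norm(x, 1.0, 0.0, mean, var)


if __name__ == "__main__":
    bn = ToyBatchNorm()
    bn([1.0, 2.0, 3.0])          # one training step
    bn.training = False
    first = bn([1.0, 2.0, 3.0])
    bn.num_batches_tracked = 999  # tampering with the counter
    second = bn([1.0, 2.0, 3.0])
    print(first == second)
```

In this toy, the inference path reads only the running mean, running variance, weight, and bias, which mirrors the reasoning above about a frozen batch norm layer.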
In summary, I think everything is fine, and I will just hide this warning ;)
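
For anyone who wants to hide the message locally in the meantime, one possible approach is to raise the level of the logger that emits it while the model loads. This is a sketch under an assumption, not the fix adopted in the library: it assumes the message is logged at WARNING level by the Python logger named "transformers.modeling_utils", and quiet_logger is a hypothetical helper written for this example.

```python
import logging
from contextlib import contextmanager


@contextmanager
def quiet_logger(name, level=logging.ERROR):
    """Temporarily raise a named logger's level, then restore the old one."""
    logger = logging.getLogger(name)
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        logger.setLevel(previous)


# Assumption: the "Some weights ... were not used" message is emitted at
# WARNING level by the logger with this name.
ASSUMED_LOGGER = "transformers.modeling_utils"

if __name__ == "__main__":
    with quiet_logger(ASSUMED_LOGGER):
        # Load the model or call partition_pdf(...) here instead.
        logging.getLogger(ASSUMED_LOGGER).warning("hidden while inside the block")
```

transformers also provides transformers.logging.set_verbosity_error(), which silences its warnings globally; the scoped version above hides less and restores the previous level afterwards.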

