unstructured: bug with the "hi_res" strategy / bounding boxes are wrong

Posted by mwkjh3gx 6 months ago in Other

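Describe the bug

When element coordinates are used to draw bounding boxes, the coordinates returned by the default strategy differ from those returned by the 'hi_res' strategy.

Steps to reproduce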

sudo apt-get install -y poppler-utils  tesseract-ocr
pip install "unstructured[pdf]==0.12.5" PyMuPDF poppler-utils unstructured_inference==0.7.23 
# unstructured_inference 0.7.24 has a compatibility issue (Image.open()) with unstructured 0.12.5, so it is downgraded to 0.7.23

# Partition the PDF into elements
import fitz  # PyMuPDF
from unstructured.partition.pdf import partition_pdf
from unstructured.documents.elements import Element

# Define the inputs before partition_pdf uses them
document = "/content/1706.03762v7.pdf"
chunk_size = 0

elements_high_res = partition_pdf(
    filename=document,
    chunk_size=chunk_size,
    extract_images_in_pdf=True,
    extract_image_block_output_dir="/content/images",
    strategy="hi_res",
    use_gpu=True,
)

elements = partition_pdf(
    filename=document,
    chunk_size=chunk_size,
)

# Using hi_res strategy
output_pdf_path = "/content/1706.03762v7_modded_high_res.pdf"
pdf_document = fitz.open(document)

for element in elements_high_res:
    if isinstance(element, Element):
        page_number = element.metadata.page_number
        coordinates = element.metadata.coordinates
        # Skip elements that lack a page number or coordinates before using them
        if page_number is not None and coordinates is not None and coordinates.points is not None:
            bbox = coordinates.to_dict()
            top_left, bottom_right = bbox['points'][0], bbox['points'][2]
            page = pdf_document[page_number - 1]  # PyMuPDF uses 0-based indexing for pages
            rect = fitz.Rect(top_left, bottom_right)
            page.draw_rect(rect, color=(1, 0, 0), width=2)  # Draw a red rectangle with a width of 2

# Save the modified PDF
pdf_document.save(output_pdf_path)
pdf_document.close()

# Using default strategy
output_pdf_path = "/content/1706.03762v7_modded.pdf"
chunk_size = 0 
pdf_document = fitz.open(document)

for element in elements:
    if isinstance(element, Element):
        page_number = element.metadata.page_number
        coordinates = element.metadata.coordinates
        # Skip elements that lack a page number or coordinates before using them
        if page_number is not None and coordinates is not None and coordinates.points is not None:
            bbox = coordinates.to_dict()
            top_left, bottom_right = bbox['points'][0], bbox['points'][2]
            page = pdf_document[page_number - 1]  # PyMuPDF uses 0-based indexing for pages
            rect = fitz.Rect(top_left, bottom_right)
            page.draw_rect(rect, color=(1, 0, 0), width=2)  # Draw a red rectangle with a width of 2

# Save the modified PDF
pdf_document.save(output_pdf_path)
pdf_document.close()
[1706.03762v7_modded_high_res.pdf](https://github.com/Unstructured-IO/unstructured/files/15441444/1706.03762v7_modded_high_res.pdf)
[1706.03762v7_modded.pdf](https://github.com/Unstructured-IO/unstructured/files/15441445/1706.03762v7_modded.pdf)
[1712.05889v2.pdf](https://github.com/Unstructured-IO/unstructured/files/15441446/1712.05889v2.pdf)
[1706.03762v7.pdf](https://github.com/Unstructured-IO/unstructured/files/15441447/1706.03762v7.pdf)

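Expected behavior

The bounding boxes should not change when the strategy changes.

Screenshots

The screenshots are attached as PDFs, but here is a screenshot as well:
Default strategy

hi_res strategy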

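Environment info

Please run python scripts/collect_env.py and paste the output here. This will help us understand the environment in which the bug occurred.
Public notebook link: https://colab.research.google.com/drive/1z2dwE9t6zsgTcejx9RQzj_nTDHOdS4Vv?usp=sharing

Additional context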

ocebsuys #1

@leah1985 - Does this look like a problem with the model output, or with pre-/post-processing?

s3fp2yjn #2

I don't think this is a "hi_res" strategy problem. It looks like a "fast" strategy problem caused by the CoordinateSystem. I will investigate this further.
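One way to check this hypothesis is to compare the coordinate-system metadata that each strategy reports. The sketch below assumes that `element.metadata.coordinates.to_dict()` returns a dict with the keys `system`, `layout_width`, and `layout_height`. The helper name `summarize_coordinate_systems` and the sample values are hypothetical and are not taken from the issue.

```python
from collections import Counter


def summarize_coordinate_systems(coord_dicts):
    """Count elements per (system, layout_width, layout_height) triple."""
    summary = Counter()
    for coords in coord_dicts:
        if coords is None:
            continue  # some elements may carry no coordinates
        key = (
            coords.get("system"),
            coords.get("layout_width"),
            coords.get("layout_height"),
        )
        summary[key] += 1
    return summary


# Hypothetical sample dicts, shaped like the assumed to_dict() output.
sample = [
    {"points": ((0, 0), (0, 10), (10, 10), (10, 0)),
     "system": "PixelSpace", "layout_width": 1700, "layout_height": 2200},
    {"points": ((5, 5), (5, 20), (30, 20), (30, 5)),
     "system": "PixelSpace", "layout_width": 1700, "layout_height": 2200},
    None,
]

for key, count in summarize_coordinate_systems(sample).items():
    print(key, count)
```

With real data, the input would be something like `[e.metadata.coordinates.to_dict() for e in elements if e.metadata.coordinates is not None]`, built once per strategy. If the two strategies report different layout sizes, their raw points are not directly comparable.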

ajsxfq5m #3

Sounds good, thanks!

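pw9qyyiw #4

In my experience, hi_res outputs coordinates relative to the PDF after it has been converted to an image, a step the fast method does not need to perform. Converting the PDF to an image first produces a much higher pixel density, so the coordinates fall outside the page when they are drawn on a document loaded with fitz. Use from unstructured_inference.inference.layout import convert_pdf_to_image to load the images so that you get the correct format and coordinate system.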
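As an alternative to rendering the pages as images, the hi_res points can be rescaled onto the PDF page before drawing. This is a minimal sketch and not the approach suggested in the comment above. It assumes the coordinates dict exposes `layout_width` and `layout_height`, and that both coordinate spaces place the origin at the top-left corner. The function name `scale_box_to_page` and the example numbers are hypothetical.

```python
def scale_box_to_page(points, layout_width, layout_height, page_width, page_height):
    """Rescale corner points from layout space to page space.

    Returns (x0, y0, x1, y1) in page units. Assumes both spaces share a
    top-left origin, so only a per-axis scale factor is needed.
    """
    scale_x = page_width / layout_width
    scale_y = page_height / layout_height
    xs = [x * scale_x for x, _ in points]
    ys = [y * scale_y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical example: a box on a 1700x2200 px rendering of a 612x792 pt page.
corner_points = ((170, 220), (170, 440), (850, 440), (850, 220))
box = scale_box_to_page(corner_points, 1700, 2200, 612, 792)
print(box)
```

In the reproduction script, the page size could come from `page.rect.width` and `page.rect.height`, and the result could be drawn with `fitz.Rect(*box)`. When the layout size already equals the page size, the scale factors are 1 and the points are unchanged.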
