unstructured bug/Partitioning a two-column PDF produces incorrect text

Posted by ejk8hzay 2 months ago in Other


Describe the bug

When partitioning a two-column PDF, text extraction places characters in the wrong position.

To Reproduce

from unstructured.partition.auto import partition
elements = partition("two_col.pdf", strategy="fast")
text attribute of elements[2] = '1. Exchange of Information. The parties agree to exchange Confidential Information for the purpose of (the evaluating a potential business "Purpose") in accordance with this Agreement.'
text attribute of elements[3] = 'relationship'

The actual text in the PDF is: '1.Exchange of Information. The parties agree to exchange Confidential Information for the purpose of evaluating a potential business relationship (the "Purpose") in accordance with this Agreement.'
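
To make the mismatch concrete, the sketch below checks whether the extracted text contains exactly the same tokens as the PDF text, only in a different order. It uses just the standard library and the strings quoted above. The helper names `tokenize` and `is_pure_reordering` are hypothetical and not part of unstructured, and joining elements[2] and elements[3] with a space is an assumption about how the two elements would be combined.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Split into word tokens and single punctuation marks, ignoring whitespace,
    # so that '1.Exchange' and '1. Exchange' tokenize identically.
    return re.findall(r"\w+|[^\w\s]", text)


def is_pure_reordering(extracted: str, expected: str) -> bool:
    # True when both texts hold the same tokens with the same counts,
    # but the token order differs.
    a, b = tokenize(extracted), tokenize(expected)
    return a != b and Counter(a) == Counter(b)


# Texts copied from the report (hypothetical variable names).
element_2_text = '1. Exchange of Information. The parties agree to exchange Confidential Information for the purpose of (the evaluating a potential business "Purpose") in accordance with this Agreement.'
element_3_text = 'relationship'
expected_text = '1.Exchange of Information. The parties agree to exchange Confidential Information for the purpose of evaluating a potential business relationship (the "Purpose") in accordance with this Agreement.'

# Assumption: the two elements are joined with a single space.
extracted_text = f"{element_2_text} {element_3_text}"

print(is_pure_reordering(extracted_text, expected_text))
```

If this prints True, no text was lost or altered; the tokens were only emitted in the wrong order, which is consistent with a reading-order problem rather than a character-recognition problem.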

Expected behavior

The extracted text matches the actual text.

Screenshots

No screenshots available.

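Environment Info

Please run the following command and paste the output here.
OS version: macOS-14.5-arm64-arm-64bit
Python version: 3.9.6
unstructured version: 0.14.9
unstructured-inference version: 0.7.36
pytesseract version: 0.3.10
Torch version: 2.3.1
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /usr/share/file/magic
LibreOffice version: ==> libreoffice: 24.2.4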

Source: https://github.com/Unstructured-IO/unstructured/issues/3325