unstructured bug/hi_res: filled-in text is not extracted from a form

l2osamch  posted 3 months ago  in  Other

Describe the bug

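I am partitioning a user-fillable PDF form, filled.pdf, with the hi_res strategy, but the output does not include the text that was filled into the form.
Output with OCR_AGENT=tesseract: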

PDF Form Example

This is an example of a user fillable PDF form. Normally PDF is used as a final publishing format. However PDF has an option to be used as an entry form that can be edited and saved by the user.

The fields of this form have been selected to demonstrate as many as possible of the common entry fields.

This document and PDF form have been created with OpenOffice (version 3.4.0).

To fill out the form, make sure the PDF file is not read-only. If the file is read-only save it first to a folder or computer desktop. Close this file and open the saved file.

Please fill out the following fields. Important fields are marked yellow.

Given Name:

Family Name:  Address 1:    House nr:  Address 2:  Postcode:  City:      Country:  Gender:  Height (cm):  Driving License:  I speak and understand (tick all that apply):   Deutsch  English  Français  Esperanto  Latin              Favourite colour: 

Important: Save the completed PDF form (use menu File - Save).

The text "Luke" in the "Given Name" field was not extracted, and neither was any of the other filled-in text.
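
One way to make this failure explicit is a small check that compares the partition output against values known to be in the form. The sketch below is illustrative only: missing_values is a hypothetical helper, not part of unstructured, and the expected list contains only "Luke", the one filled value named in this report.

```python
def missing_values(output_text, expected_values):
    """Return the expected form values that do not appear in the output text.

    Hypothetical helper for illustration; not part of unstructured.
    """
    return [value for value in expected_values if value not in output_text]


# Fragment of the hi_res output quoted in this report:
# the field label is present, the filled-in value is not.
sample_output = "Given Name:\n\nFamily Name:  Address 1:    House nr:"
print(missing_values(sample_output, ["Luke"]))  # ['Luke']
```

An empty list would mean every expected value made it into the output.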

To reproduce

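import logging
import os

from unstructured.logger import logger
from unstructured.partition.auto import partition

# Show unstructured's INFO-level log messages.
logger.setLevel(logging.INFO)

# Select the OCR agent before partitioning.
os.environ["OCR_AGENT"] = "tesseract"

# Partition the filled-in form with the hi_res strategy.
elements = partition("filled.pdf", strategy="hi_res")

print("\n\n".join(str(el) for el in elements))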

Setting os.environ["OCR_AGENT"] = "paddle" does not work either.

Expected behavior

The output should include the text filled into the PDF form.
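
Until the filled values show up in the partition output, one possible workaround is to read the form fields directly from the PDF and keep them alongside the extracted text. The sketch below assumes the values are stored as AcroForm fields and that the pypdf package is installed; neither is confirmed in this thread. read_form_fields and format_fields are hypothetical helper names, and the field names in the last line are made up for illustration.

```python
def read_form_fields(pdf_path):
    """Return {field name: value} for the interactive form fields of a PDF.

    Assumes pypdf is installed and the PDF stores its values as AcroForm fields.
    """
    from pypdf import PdfReader  # third-party; imported lazily

    reader = PdfReader(pdf_path)
    # get_fields() returns None when the PDF has no interactive form.
    fields = reader.get_fields() or {}
    return {name: field.value for name, field in fields.items()}


def format_fields(fields):
    """Render non-empty field values as 'name: value' lines."""
    lines = []
    for name, value in fields.items():
        if value is None or str(value).strip() == "":
            continue  # skip fields the user left blank
        lines.append(f"{name}: {value}")
    return "\n".join(lines)


# Illustrative input only; real field names come from the PDF itself.
print(format_fields({"Given Name": "Luke", "Family Name": ""}))
```

For the file in this report, the call would be format_fields(read_form_fields("filled.pdf")). This reads the stored field values rather than running OCR, so it only helps if the form has not been flattened.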

oxcyiej7  1#

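I just tested Paddle OCR manually. Running with GPU acceleration in a Docker container, it worked flawlessly, and more importantly, it appears to pick up everything that Tesseract missed.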
