unstructured bug/hi_res cannot extract filled-in text from forms

Posted by l2osamch 9 months ago in Other


Describe the bug

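I am partitioning a user-fillable PDF form, filled.pdf, with the hi_res strategy, but the output does not include the text that was filled into the form.
Output with OCR_AGENT=tesseract: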

PDF Form Example
This is an example of a user fillable PDF form. Normally PDF is used as a final publishing format. However PDF has an option to be used as an entry form that can be edited and saved by the user.
The fields of this form have been selected to demonstrate as many as possible of the common entry fields.
This document and PDF form have been created with OpenOffice (version 3.4.0).
To fill out the form, make sure the PDF file is not read-only. If the file is read-only save it first to a folder or computer desktop. Close this file and open the saved file.
Please fill out the following fields. Important fields are marked yellow.
Given Name:
Family Name:  Address 1:    House nr:  Address 2:  Postcode:  City:      Country:  Gender:  Height (cm):  Driving License:  I speak and understand (tick all that apply):   Deutsch  English  Français  Esperanto  Latin              Favourite colour: 
Important: Save the completed PDF form (use menu File - Save).

It did not extract the text "Luke" from the "Given Name" field, nor any of the other filled-in text.

To reproduce

import logging
import os

from unstructured.logger import logger
from unstructured.partition.auto import partition

logger.setLevel(logging.INFO)

# Select Tesseract as the OCR agent before partitioning.
os.environ["OCR_AGENT"] = "tesseract"

# Partition the filled-in form with the hi_res strategy and print each element.
elements = partition("filled.pdf", strategy="hi_res")
print("\n\n".join(str(el) for el in elements))
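
To narrow down where the missing text lives, it may help to read the form-field values directly from the PDF, outside of unstructured. The following is a minimal sketch, not part of unstructured: it assumes the third-party pypdf package is installed and that filled.pdf stores its values as AcroForm fields. The helper names `filled_values` and `read_form_fields` are hypothetical.

```python
from typing import Any, Mapping, Optional


def filled_values(fields: Optional[Mapping[str, Mapping[str, Any]]]) -> dict[str, str]:
    """Keep only the fields that carry a non-empty value (the "/V" entry)."""
    result: dict[str, str] = {}
    for name, field in (fields or {}).items():
        value = field.get("/V")
        if value is not None and str(value).strip():
            result[name] = str(value)
    return result


def read_form_fields(pdf_path: str) -> dict[str, str]:
    """Read form-field values straight from the PDF, without layout analysis or OCR."""
    # Assumption: pypdf is installed. Imported lazily so filled_values() stays dependency-free.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # get_fields() returns None when the document has no interactive form.
    return filled_values(reader.get_fields())
```

If `read_form_fields("filled.pdf")` returns an entry containing "Luke", that would suggest the filled-in values are stored as form-field data rather than as page text, which could explain why they are absent from the hi_res output. Checkbox and radio fields may show up with name-like values rather than plain text.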