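Describe the bug
A user encounters a TesseractError when processing a particular document.
To Reproduce
Process a particular image-based document via an API call.
Expected behavior
The document is processed successfully.
Environment Info
Running on the self-hosted open-source API.
unstructured version 0.12.3.
Tesseract version 5.3.3.
Additional context
The user was able to process the document successfully with Tesseract version 4.1.1.
Stack trace: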
File "/home/notebook-user/unstructured/partition/pdf.py", line 213, in partition_pdf
return partition_pdf_or_image(
File "/home/notebook-user/unstructured/partition/pdf.py", line 298, in partition_pdf_or_image
elements = _partition_pdf_or_image_local(
File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
return func(*args, **kwargs)
File "/home/notebook-user/unstructured/partition/pdf.py", line 494, in _partition_pdf_or_image_local
final_document_layout = process_data_with_ocr(
File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 82, in process_data_with_ocr
merged_layouts = process_file_with_ocr(
File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
return func(*args, **kwargs)
File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 178, in process_file_with_ocr
raise e
File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 166, in process_file_with_ocr
merged_page_layout = supplement_page_layout_with_ocr(
File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
return func(*args, **kwargs)
File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 202, in supplement_page_layout_with_ocr
ocr_layout = ocr_agent.get_layout_from_image(
File "/home/notebook-user/unstructured/partition/utils/ocr_models/tesseract_ocr.py", line 48, in get_layout_from_image
ocr_df: pd.DataFrame = unstructured_pytesseract.image_to_data(
File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 591, in image_to_data
return {
File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 593, in <lambda>
Output.DATAFRAME: lambda: get_pandas_output(
File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 568, in get_pandas_output
return pd.read_csv(BytesIO(run_and_get_output(*args)), **kwargs)
File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 347, in run_and_get_output
run_tesseract(**kwargs)
File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 279, in run_tesseract
raise TesseractError(proc.returncode, get_errors(error_string))
unstructured_pytesseract.pytesseract.TesseractError: (-8, 'Estimating resolution as 252')
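A note on reading the last line: the trace shows TesseractError being raised with proc.returncode as its first field. Assuming proc is a subprocess.Popen object, Python reports a negative return code -N on POSIX when the child process was terminated by signal N. The sketch below decodes such a code with the standard library only. The helper name describe_returncode is hypothetical and is not part of unstructured or pytesseract.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Describe a subprocess return code (hypothetical helper for illustration)."""
    if returncode < 0:
        # On POSIX, subprocess reports -N when the child was killed by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # The number does not map to a known signal on this platform.
            name = "unknown signal"
        return f"terminated by signal {-returncode} ({name})"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


# The status code from the TesseractError above.
print(describe_returncode(-8))
```

On Linux, signal 8 is SIGFPE, which would suggest that the tesseract process crashed rather than returned an ordinary error. This is an inference from the return code, not something confirmed in the thread.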
3 answers
50pmv0ei1#
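Slack conversation: https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1713364225537139
We ran into this error before in #1920 and closed that issue with #1996. The user is running a version of unstructured that includes the merged fix, so the error is most likely occurring for a different reason this time.
1rhkuytd2#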
@qued, @scanny: are there any updates on the issue above?
hwazgwia3#
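Can you say what you're seeing and when? In particular, the exact error message (including the estimated resolution).
Can you provide a sample document that we can use to reproduce the problem?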
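For anyone gathering the details requested here, the status code and the estimated resolution can be pulled out of the error text. This is a minimal sketch, assuming the message has the same shape as the last line of the stack trace in the issue. The helper name parse_tesseract_error is hypothetical and is not part of unstructured or pytesseract.

```python
import re
from typing import Optional, Tuple


def parse_tesseract_error(text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (status_code, estimated_resolution) found in an error line.

    Hypothetical helper: either value is None when it is not present.
    """
    # The status code is the first field of the "(code, 'message')" tuple.
    status_match = re.search(r"\((-?\d+),", text)
    # Tesseract reports its guess as "Estimating resolution as <number>".
    resolution_match = re.search(r"Estimating resolution as (\d+)", text)
    status = int(status_match.group(1)) if status_match else None
    resolution = int(resolution_match.group(1)) if resolution_match else None
    return status, resolution


line = (
    "unstructured_pytesseract.pytesseract.TesseractError: "
    "(-8, 'Estimating resolution as 252')"
)
print(parse_tesseract_error(line))
```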