unstructured error: TesseractError: Estimating resolution as X

lxkprmvk  posted 2 months ago  in Other

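Describe the bug

A user hits a TesseractError when processing one specific document.

Steps to reproduce

Process a particular image-based document through an API call.

Expected behavior

The document is processed successfully.

Environment info

Running on the self-hosted open-source API.
unstructured version 0.12.3.
Tesseract version 5.3.3.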
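
Since the failure seems to depend on the Tesseract version, it can help to record both versions from inside the environment that serves the API. The following is a minimal sketch using only the standard library. The helper names `package_version` and `tesseract_version` are hypothetical (they are not part of unstructured), and it assumes the Tesseract binary is called `tesseract` and is on `PATH`.

```python
import shutil
import subprocess
from importlib import metadata
from typing import Optional


def package_version(name: str) -> Optional[str]:
    """Return the installed version of a Python package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def tesseract_version(binary: str = "tesseract") -> Optional[str]:
    """Return the first line of `<binary> --version`, or None if not found."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print("unstructured:", package_version("unstructured"))
    print("tesseract:", tesseract_version())
```

Both helpers return `None` instead of raising, so the output can be pasted into a report even when one of the components is missing.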

Additional context

The user was able to process the same document successfully with Tesseract version 4.1.1.
Stack trace:

File "/home/notebook-user/unstructured/partition/pdf.py", line 213, in partition_pdf
    return partition_pdf_or_image(
  File "/home/notebook-user/unstructured/partition/pdf.py", line 298, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
  File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf.py", line 494, in _partition_pdf_or_image_local
    final_document_layout = process_data_with_ocr(
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 82, in process_data_with_ocr
    merged_layouts = process_file_with_ocr(
  File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 178, in process_file_with_ocr
    raise e
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 166, in process_file_with_ocr
    merged_page_layout = supplement_page_layout_with_ocr(
  File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 202, in supplement_page_layout_with_ocr
    ocr_layout = ocr_agent.get_layout_from_image(
  File "/home/notebook-user/unstructured/partition/utils/ocr_models/tesseract_ocr.py", line 48, in get_layout_from_image
    ocr_df: pd.DataFrame = unstructured_pytesseract.image_to_data(
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 591, in image_to_data
    return {
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 593, in <lambda>
    Output.DATAFRAME: lambda: get_pandas_output(
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 568, in get_pandas_output
    return pd.read_csv(BytesIO(run_and_get_output(*args)), **kwargs)
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 347, in run_and_get_output
    run_tesseract(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 279, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
unstructured_pytesseract.pytesseract.TesseractError: (-8, 'Estimating resolution as 252')
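
One way to read the last line of the trace: the tuple is `(proc.returncode, message)`, as the `raise TesseractError(proc.returncode, ...)` frame shows. Assuming `proc` is a standard-library `subprocess` process, a negative return code on POSIX means the child was terminated by that signal number, so `-8` would correspond to signal 8 (SIGFPE on Linux). Under that reading, "Estimating resolution as 252" is just the last text Tesseract wrote to stderr before it crashed, not the cause of the failure. This is an interpretation and is not confirmed anywhere in the thread. The helper `describe_returncode` below is hypothetical and only illustrates the convention.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Explain a subprocess return code using Python's POSIX convention."""
    if returncode < 0:
        # subprocess reports "terminated by signal N" as -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {-returncode} ({name})"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    print(describe_returncode(-8))
```

If that reading is right, the problem is more likely a crash inside the Tesseract 5.3.3 binary on this particular image than a resolution-detection issue, which would be consistent with 4.1.1 handling the same document.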
50pmv0ei  #1

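Slack conversation: https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1713364225537139
We ran into this error before in #1920 and closed that issue with #1996. The user is running a version of unstructured that already includes that fix, so the error is most likely occurring for a different reason this time.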

1rhkuytd  #2

@qued, @scanny: are there any updates on this issue?

hwazgwia  #3

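Can you describe what you are seeing and when? In particular, the exact error message (including the estimated resolution).
Can you share a sample document that we can use to reproduce it?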
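
To collect the details requested above in a consistent form, something like the following could be used on the caller's side. It is only a sketch: `estimated_resolution` and `summarize_error` are hypothetical helper names, and it assumes the message has the same shape as the one in the stack trace, `'Estimating resolution as 252'`.

```python
import re
from typing import Optional

# Matches the message text seen in the stack trace above.
_RESOLUTION_RE = re.compile(r"Estimating resolution as (\d+)")


def estimated_resolution(message: str) -> Optional[int]:
    """Extract the estimated resolution from a Tesseract message, if present."""
    match = _RESOLUTION_RE.search(message)
    return int(match.group(1)) if match else None


def summarize_error(status: int, message: str) -> dict:
    """Bundle the status code, raw message, and parsed resolution for a report."""
    return {
        "status": status,
        "message": message,
        "estimated_resolution": estimated_resolution(message),
    }


if __name__ == "__main__":
    print(summarize_error(-8, "Estimating resolution as 252"))
```

Logging this summary per failing document would show whether the status code and the estimated resolution are the same across failures.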
