unstructured: IndexError in OCR after page splitting on some PDF pages (bug)

Posted by brgchamk 2 months ago in Other

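Describe the bug

A strange bug.
An IndexError: list index out of range is raised when OCR runs on part of a PDF document, but depending on the split size it does not always happen. My guess is that the first page matters.
Relevant stack trace: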

.venv/lib/python3.11/site-packages/unstructured/partition/ocr.py:171: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

filename = '/var/folders/5w/hcnw_g8d3cn9j_373dxm6jrm0000gn/T/tmpglza5pmp', out_layout = <unstructured_inference.inference.layout.DocumentLayout object at 0x16fddcdd0>, is_image = False
infer_table_structure = True, ocr_languages = 'eng', ocr_mode = 'entire_page', pdf_image_dpi = 200

    def process_file_with_ocr(
        filename: str,
        out_layout: "DocumentLayout",
        is_image: bool = False,
        infer_table_structure: bool = False,
        ocr_languages: str = "eng",
        ocr_mode: str = OCRMode.FULL_PAGE.value,
        pdf_image_dpi: int = 200,
    ) -> "DocumentLayout":
        """
        Process OCR data from a given file and supplement the output DocumentLayout
        from unsturcutured-inference with ocr.
    
        Parameters:
        - filename (str): The path to the input file, which can be an image or a PDF.
    
        - out_layout (DocumentLayout): The output layout from unstructured-inference.
    
        - is_image (bool, optional): Indicates if the input data is an image (True) or not (False).
            Defaults to False.
    
        - infer_table_structure (bool, optional):  If true, extract the table content.
    
        - ocr_languages (str, optional): The languages for OCR processing. Defaults to "eng" (English).
    
        - ocr_mode (str, optional): The OCR processing mode, e.g., "entire_page" or "individual_blocks".
            Defaults to "entire_page". If choose "entire_page" OCR, OCR processes the entire image
            page and will be merged with the output layout. If choose "individual_blocks" OCR,
            OCR is performed on individual elements by cropping the image.
    
        - pdf_image_dpi (int, optional): DPI (dots per inch) for processing PDF images. Defaults to 200.
    
        Returns:
            DocumentLayout: The merged layout information obtained after OCR processing.
        """
        merged_page_layouts = []
        try:
            if is_image:
                with PILImage.open(filename) as images:
                    image_format = images.format
                    for i, image in enumerate(ImageSequence.Iterator(images)):
                        image = image.convert("RGB")
                        image.format = image_format
                        merged_page_layout = supplement_page_layout_with_ocr(
                            out_layout.pages[i],
                            image,
                            infer_table_structure=infer_table_structure,
                            ocr_languages=ocr_languages,
                            ocr_mode=ocr_mode,
                        )
                        merged_page_layouts.append(merged_page_layout)
                    return DocumentLayout.from_pages(merged_page_layouts)
            else:
                with tempfile.TemporaryDirectory() as temp_dir:
                    _image_paths = pdf2image.convert_from_path(
                        filename,
                        dpi=pdf_image_dpi,
                        output_folder=temp_dir,
                        paths_only=True,
                    )
                    image_paths = cast(List[str], _image_paths)
                    for i, image_path in enumerate(image_paths):
                        with PILImage.open(image_path) as image:
                            merged_page_layout = supplement_page_layout_with_ocr(
>                               out_layout.pages[i],
                                image,
                                infer_table_structure=infer_table_structure,
                                ocr_languages=ocr_languages,
                                ocr_mode=ocr_mode,
                            )
E                           IndexError: list index out of range

.venv/lib/python3.11/site-packages/unstructured/partition/ocr.py:161: IndexError
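
The trace shows the loop iterating over the page images produced by pdf2image while indexing out_layout.pages with the same counter, so the error implies that more images were rendered than there were layout pages for this split. The following is a minimal sketch of that failure mode using plain lists, not the library's code. `pair_pages_with_images` is a hypothetical helper, and raising a descriptive error on a length mismatch is only one possible way to handle it.

```python
from typing import List, Sequence, Tuple, TypeVar

P = TypeVar("P")
I = TypeVar("I")


def pair_pages_with_images(pages: Sequence[P], images: Sequence[I]) -> List[Tuple[P, I]]:
    """Pair each layout page with its rendered image, failing loudly on a mismatch."""
    if len(pages) != len(images):
        raise ValueError(
            f"layout has {len(pages)} pages but {len(images)} images were rendered"
        )
    return list(zip(pages, images))


# Stand-ins for out_layout.pages and image_paths (hypothetical values).
layout_pages = ["page-1", "page-2"]
image_paths = ["img-1.ppm", "img-2.ppm", "img-3.ppm"]

# Same pattern as the loop in the trace: index the shorter list
# with the counter of the longer one.
try:
    for i, _image_path in enumerate(image_paths):
        layout_pages[i]
    failed = False
except IndexError:
    failed = True  # reached on the third image, because there is no third page
```

A check like this would not fix the underlying page-count mismatch, but it would report both counts instead of a bare IndexError.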

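To reproduce

Provide a code snippet that reproduces the issue.
0uupv_Artisi+-+Brochure+-+FINAL06.06.23.pdf
If you split it into chunks of 10 pages each, you will find that the 30-40 range throws this error while the rest are fine. The same happens with splits of 5 pages. For other split sizes, such as 40, there is no error.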
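
To make the split sizes concrete, here is a small sketch that computes the 1-based page ranges for a given split size. The helper name `split_ranges` and the 60-page total are assumptions for illustration; the issue does not state the document's page count or how the splitting was performed.

```python
from typing import List, Tuple


def split_ranges(total_pages: int, split_size: int) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or split_size <= 0:
        raise ValueError("total_pages must be >= 0 and split_size must be > 0")
    return [
        (start, min(start + split_size - 1, total_pages))
        for start in range(1, total_pages + 1, split_size)
    ]


# Hypothetical 60-page document split every 10 pages.
ranges = split_ranges(60, 10)
```

Under these assumptions the fourth range, (31, 40), would roughly correspond to the "30-40" split reported as failing. Partitioning each range separately may help isolate which pages trigger the error.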

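Expected behavior

It should not fail seemingly at random depending on the split size :)

Screenshots

Environment info

Python 3.11 on Mac; also seen on Ubuntu

Additional context

hi_res extraction; I have only encountered this error once, when processing this particular PDF file, as shown below.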


k5hmc34c #1

I ran into the same problem when using the API; it fails to extract images.

2024-06-17 14:05:24,476 uvicorn.error ERROR Exception in ASGI application
Traceback (most recent call last):
  File "/home/notebook-user/.local/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/notebook-user/.local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/fastapi/routing.py", line 193, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
  File "/home/notebook-user/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 859, in run
    result = context.run(func, *args)
  File "/home/notebook-user/prepline_general/api/general.py", line 850, in general_partition
    list(response_generator(is_multipart=False))[0]
  File "/home/notebook-user/prepline_general/api/general.py", line 785, in response_generator
    response = pipeline_api(
  File "/home/notebook-user/prepline_general/api/general.py", line 440, in pipeline_api
    elements = partition_pdf_splits(
  File "/home/notebook-user/prepline_general/api/general.py", line 220, in partition_pdf_splits
    return partition(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/auto.py", line 426, in partition
    elements = _partition_pdf(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/documents/elements.py", line 593, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 626, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 582, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 192, in partition_pdf
    return partition_pdf_or_image(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 288, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/utils.py", line 249, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 676, in _partition_pdf_or_image_local
    save_elements(
  File "/home/notebook-user/.local/lib/python3.10/site-packages/unstructured/partition/pdf_image/pdf_image_utils.py", line 195, in save_elements
    image_path = image_paths[page_number - 1]
IndexError: list index out of range
2024-06-17 14:05:24,478 unstructured_api INFO Backing off call_api(...) for 1.8s (fastapi.exceptions.HTTPException: 500: list index out of range)
2024-06-17 14:05:26,273 unstructured_api DEBUG pipeline_api input params: {"filename": "3a782d85-d311-45e4-a38a-f02f7d7ebce7.pdf", "response_type": "application/json", "coordinates": false, "encoding": "utf-8", "hi_res_model_name": null, "include_page_breaks": false, "ocr_languages": null, "pdf_infer_table_structure": true, "skip_infer_table_types": ["jpg,png"], "strategy": "auto", "xml_keep_tags": false, "languages": ["eng,deu,fas,ara,heb,fra"], "extract_image_block_types": ["Image"], "unique_element_ids": false, "chunking_strategy": null, "combine_under_n_chars": null, "max_characters": 2000, "multipage_sections": true, "new_after_n_chars": null, "overlap": 0, "overlap_all": false, "starting_page_number": 10}

I am using parallel mode, configured through environment variables.

UNSTRUCTURED_PARALLEL_MODE_ENABLED=true 
UNSTRUCTURED_PARALLEL_MODE_SPLIT_SIZE=10 
UNSTRUCTURED_PARALLEL_MODE_THREADS=10 
UNSTRUCTURED_PARALLEL_MODE_URL=http://localhost:8000/general/v0/general 
UNSTRUCTURED_PARALLEL_RETRY_ATTEMPTS=3
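
The failing line in this trace indexes image_paths with page_number - 1, and the logged request carries "starting_page_number": 10. One possible reading, which is an assumption and not confirmed anywhere in this thread, is that element page numbers are offset by the split's starting page while image_paths only holds the images of the current split. The sketch below illustrates that arithmetic with plain lists; `image_index` is a hypothetical helper, not part of unstructured.

```python
from typing import List


def image_index(page_number: int, starting_page_number: int = 1) -> int:
    """Map a document-level page number to a 0-based index into a split's image list."""
    index = page_number - starting_page_number
    if index < 0:
        raise ValueError("page_number precedes starting_page_number")
    return index


# Hypothetical split: 10 rendered images for pages 10..19 of the full document.
image_paths: List[str] = [f"page-{n}.ppm" for n in range(10, 20)]
starting_page_number = 10
page_number = 12  # document-level number carried by an element (assumed)

# The pattern from the trace: page_number - 1 == 11, past the end of a 10-item list.
try:
    image_paths[page_number - 1]
    naive_failed = False
except IndexError:
    naive_failed = True

# An offset-aware lookup stays within the split.
path = image_paths[image_index(page_number, starting_page_number)]
```

If this reading is right, any split whose starting page is greater than 1 could hit the error once an element's page number exceeds the number of images in the split.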
