unstructured: returning image data from Confluence

uklbhaso  posted 2 months ago in Other
from unstructured.ingest.connector.confluence import ConfluenceAccessConfig, SimpleConfluenceConfig
from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig, ReadConfig
from unstructured.ingest.runner import ConfluenceRunner

if __name__ == "__main__":
    runner = ConfluenceRunner(
        processor_config=ProcessorConfig(
            verbose=True,
            output_dir="confluence-ingest-output",
            num_processes=2,
        ),
        read_config=ReadConfig(),
        partition_config=PartitionConfig(
            strategy="hi_res",
            pdf_infer_table_structure=True,
            metadata_exclude=["filename", "file_directory", "metadata.data_source.date_processed"],
        ),
        connector_config=SimpleConfluenceConfig(
            access_config=ConfluenceAccessConfig(
                api_token="api-key",
            ),
            user_email="my-email",
            url="url",
        ),
    )
    runner.run()

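This returns a JSON list with a hierarchical structure, but even with hi_res and pdf_infer_table_structure=True I cannot access any image data. All I get is the text data, which I do need, but for my use case I also want to get the images from the same documents.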

xfb7svmp #1

2024-06-24 08:14:06,670 MainProcess DEBUG    updating download directory to: /root/.cache/unstructured/ingest/confluence/d78233987c
2024-06-24 08:14:06,674 MainProcess INFO     running pipeline: DocFactory -> Reader -> Partitioner -> Copier with config: {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "confluence-ingest-output2", "num_processes": 2, "raise_on_error": false}
2024-06-24 08:14:06,789 MainProcess INFO     Running doc factory to generate ingest docs. Source connector: {"processor_config": {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "confluence-ingest-output2", "num_processes": 2, "raise_on_error": false}, "read_config": {"download_dir": "/root/.cache/unstructured/ingest/confluence/d78233987c", "re_download": false, "preserve_downloads": false, "download_only": false, "max_docs": null}, "connector_config": {"user_email": "[emial], "access_config": {"api_token": "*******"}, "url": "*******", "max_num_of_spaces": 500, "max_num_of_docs_from_each_space": 100, "spaces": []}, "_confluence": null}
2024-06-24 08:14:21,820 MainProcess INFO     processing 155 docs via 2 processes
2024-06-24 08:14:21,879 MainProcess INFO     Calling Reader with 155 docs
2024-06-24 08:14:21,880 MainProcess INFO     Running source node to download data associated with ingest docs
2024-06-24 08:14:57,880 MainProcess INFO     Calling Partitioner with 155 docs
2024-06-24 08:14:57,882 MainProcess INFO     Running partition node to extract content from json files. Config: {"pdf_infer_table_structure": true, "strategy": "hi_res", "ocr_languages": null, "encoding": null, "additional_partition_args": {}, "skip_infer_table_types": null, "fields_include": ["element_id", "text", "type", "metadata", "embeddings"], "flatten_metadata": false, "metadata_exclude": ["filename", "file_directory", "metadata.data_source.date_processed"], "metadata_include": [], "partition_endpoint": "https://api.unstructured.io/general/v0/general", "partition_by_api": false, "api_key": "*******", "hi_res_model_name": null}, partition kwargs: {}]
2024-06-24 08:14:57,888 MainProcess INFO     Creating /root/.cache/unstructured/ingest/pipeline/partitioned
2024-06-24 08:15:00,732 MainProcess INFO     Calling Copier with 155 docs
2024-06-24 08:15:00,734 MainProcess INFO     Running copy node to move content to desired output location

jqjz2hbq #2

@christinestraub @scanny can anyone help me with this?

4sup72z8 #3

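"This returns a JSON list with a hierarchical structure, but even with hi_res and pdf_infer_table_structure=True I cannot access any image data. All I get is the text data, which I do need, but for my use case I also want to get the images from the same documents."
@ML-Abdula Are you saying that you cannot get any elements with the category "Image" in the returned JSON? Could you share the document you are trying to process?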
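
One way to answer that question is to count the element types in the ingest output. The following is a minimal sketch, assuming each output file is a JSON list of element dicts carrying a "type" field (the log above lists "type" among fields_include). The helper name count_element_types is hypothetical and not part of unstructured.

```python
import json
from collections import Counter
from pathlib import Path


def count_element_types(output_dir):
    """Count elements by their "type" field across all JSON files under output_dir."""
    counts = Counter()
    for path in Path(output_dir).rglob("*.json"):
        with open(path, encoding="utf-8") as f:
            elements = json.load(f)
        # Assumption: each output file holds a JSON list of element dicts.
        if not isinstance(elements, list):
            continue
        for element in elements:
            counts[element.get("type", "unknown")] += 1
    return counts
```

Calling count_element_types("confluence-ingest-output") and looking up the "Image" key shows whether any Image elements were produced at all.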

fjnneemd #4

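@ML-Abdula Confluence pages are web pages, right? So Confluence "documents" are sent to partition_html().
HTML does not embed images; it contains <img src=...> "links" that point to them. partition_html() does not currently follow those links to download the images. I'm confident the reason is the security risk inherent in downloading arbitrary image files.
So I think that explains why there are no Image elements in the Confluence connector output. You could file an enhancement request. Perhaps there is a way to let you download the images yourself, or to identify trusted domains, and so on. But that should go in a separate issue so it can be discussed on its own.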
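
As a rough illustration of the "download the images yourself" idea, the sketch below collects image URLs from a page's HTML using only the Python standard library. It is not part of unstructured or its Confluence connector; the names ImgSrcCollector, find_image_urls, and allowed_hosts are hypothetical. How you obtain the page HTML and how you authenticate the actual download are left out and depend on your Confluence setup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in an HTML document."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative links against the page URL.
            self.image_urls.append(urljoin(self.base_url, src))


def find_image_urls(html, base_url="", allowed_hosts=None):
    """Return image URLs found in html, optionally limited to trusted hosts."""
    parser = ImgSrcCollector(base_url)
    parser.feed(html)
    if allowed_hosts is None:
        return parser.image_urls
    # Keep only URLs whose host is explicitly trusted by the caller.
    return [u for u in parser.image_urls if urlparse(u).hostname in allowed_hosts]
```

Passing allowed_hosts restricts the result to hosts you trust, such as your own Confluence instance, which speaks to the security concern raised above about fetching arbitrary image files.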
