unstructured bug/partition_html returns different results depending on which argument is passed

Posted by 7fyelxc5 7 months ago in Other


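Bug description

When I pass url to partition_html, the output is correct. When I pass text instead, it extracts the wrong content.
I believe there is a bug in the source code related to passing text as the argument rather than url. I also tried calling the underlying source code directly, and that works correctly. The full code is below.

Code snippet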

import requests
from unstructured.partition.html import partition_html

url = "https://www.geeksforgeeks.org/difference-between-compiler-and-interpreter/"

# pass url: partition_html fetches the page itself
elements = partition_html(url=url, html_assemble_articles=False)
elements_dict = [elem.to_dict() for elem in elements]
print(len(elements))
# 71

# pass text: fetch the page with requests and hand over the HTML string
response = requests.get(url)

elements = partition_html(text=response.text, html_assemble_articles=False)
elements_dict = [elem.to_dict() for elem in elements]
print(len(elements))
# 7

# use the underlying source code directly on the same response.text
from unstructured.documents.html import HTMLDocument
from unstructured.partition.common import document_to_element_list
from unstructured.partition.lang import apply_lang_metadata

document = HTMLDocument.from_string(str(response.text))
elements = list(
    apply_lang_metadata(
        document_to_element_list(
            document,
            sortable=False,
            include_page_breaks=False,
            detection_origin=None,
        ),
        languages=["auto"],
        detect_language_per_element=False,
    ),
)
elements_dict = [elem.to_dict() for elem in elements]
print(len(elements))
# 71


Source: https://github.com/Unstructured-IO/unstructured/issues/3116

5 Answers


vom3gejh1#

Hi @KMayank29, thanks for the report. To clarify, which one gives the correct output: partition_html(url=url) or partition_html(text=response.text)?

7 months ago

ctehm74n2#

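@KMayank29 This sounds like an encoding error. What is the type of text when you pass it in?
If it is bytes and the HTML does not contain an encoding declaration, you need to decode it to str before passing it in. For example:

html_text = html_bytes.decode("utf-8")

You will need to work out the encoding for your case; it is not necessarily "utf-8".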
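
To make the decoding step above concrete, here is a minimal sketch that looks for a charset declaration in the first bytes of the HTML and falls back to a default when none is found. The helper name decode_html and its fallback parameter are made up for illustration; they are not part of unstructured, and real pages may need a more careful charset sniffer.

```python
import re

# Hypothetical helper, not part of unstructured: find a charset declaration
# such as <meta charset="..."> near the top of the raw HTML bytes.
_META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([\w-]+)", re.IGNORECASE)


def decode_html(html_bytes: bytes, fallback: str = "utf-8") -> str:
    """Decode raw HTML bytes, preferring the encoding the page declares."""
    match = _META_CHARSET.search(html_bytes[:2048])
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return html_bytes.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect declaration: use the fallback and replace
        # any bytes that cannot be decoded instead of raising.
        return html_bytes.decode(fallback, errors="replace")
```

The resulting str could then be passed as partition_html(text=...), under the assumption that the declared encoding is the right one for the page.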

7 months ago

iovurdzv3#

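In reply to: "Hi @KMayank29, thanks for the report. To clarify, which one gives the correct output: partition_html(url=url) or partition_html(text=response.text)?"
partition_html(url=url) gives the correct output. partition_html(text=response.text) outputs only two or three sentences, and only two types of elements.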
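
To put numbers on "only two types of elements", a small sketch like the following could be run on the elements_dict from each call and the two results compared. It assumes each dict produced by elem.to_dict() carries a "type" key; summarize_elements is a made-up name, and the sample data is a stand-in, not real output from the page in question.

```python
from collections import Counter


def summarize_elements(elements_dict):
    """Count how many elements of each type a partition call produced."""
    return Counter(item.get("type", "unknown") for item in elements_dict)


# Stand-in data for illustration only; in the issue this would be the
# elements_dict built from partition_html(url=...) or partition_html(text=...).
sample = [
    {"type": "Title", "text": "placeholder title"},
    {"type": "NarrativeText", "text": "placeholder paragraph one"},
    {"type": "NarrativeText", "text": "placeholder paragraph two"},
]
print(summarize_elements(sample))
```

Comparing the two counters side by side would show which element types go missing when text is passed.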

7 months ago

mnowg1ta4#

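In reply to: "@KMayank29 This sounds like an encoding error. What is the type of text when you pass it in?
If it is bytes and the HTML does not contain an encoding declaration, you need to decode it to str before passing it in. For example:

html_text = html_bytes.decode("utf-8")

You will need to work out the encoding for your case; it is not necessarily "utf-8"."
I am passing data of type str to partition_html(text=response.text).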

import requests
url = "https://www.geeksforgeeks.org/difference-between-compiler-and-interpreter/"
response = requests.get(url)
type(response.text)
# str
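
A side note on the check above: response.text is always a str, because requests decodes the body using response.encoding, so the type alone does not show which encoding was applied. The sketch below reports the declared and detected encodings next to the type. describe_response_text is a made-up helper, and the SimpleNamespace object is a stand-in with the same attribute names as requests.Response; whether an encoding mismatch is actually behind this issue is not established in the thread.

```python
from types import SimpleNamespace


def describe_response_text(response):
    """Report the text type plus the encodings requests declared and detected."""
    return {
        "text_type": type(response.text).__name__,
        "declared_encoding": response.encoding,
        "detected_encoding": response.apparent_encoding,
    }


# Stand-in for a real requests.Response so the sketch runs offline; the
# values are illustrative, not taken from the page in the issue.
fake_response = SimpleNamespace(
    text="<html></html>",
    encoding="ISO-8859-1",
    apparent_encoding="utf-8",
)
print(describe_response_text(fake_response))
```

If the two encodings differ for the real response, setting response.encoding = response.apparent_encoding before reading response.text would be one thing to try.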

7 months ago

6rqinv9w5#

Are there any updates?

7 months ago