unstructured bug(auto): file_and_type_from_url() fails to recognize a valid text/html; charset=utf8 Content-Type header

hk8txs48  posted 2 months ago in Other

Describe the bug

I came across a web page that is detected as a CSV file. It should be detected as HTML. The page returns the following content type:
Content-Type: text/html; charset=utf-8

To reproduce

from unstructured.partition.auto import file_and_type_from_url

file, filetype = file_and_type_from_url(
    url="https://sites.google.com/umich.edu/mm-post-award-manual/project-management/cost-sharing-commitments-internal-agreements?authuser=0",
    headers={"User-Agent": "Mozilla/5.0"},
)

Expected behavior

I expect the page to be detected as HTML, but it is actually detected as CSV.

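Additional context

When the page is read, only part of its content is read, because Google Sites formats the HTML in a way that causes the lines to be split as shown below: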

[
'<!DOCTYPE html><html lang="en-US" itemscope itemtype="http://schema.org/WebPage"><head><meta charset="utf-8"><script nonce="7aEdh0ByXXm4OdD9gpzmIA">var DOCS_timing={}; DOCS_timing[\'sl\']=new Date().getTime();</script><script nonce="7aEdh0ByXXm4OdD9gpzmIA">function _DumpException(e) {throw e;}</script><script nonce="7aEdh0ByXXm4OdD9gpzmIA">_docs_flag_initialData={"atari-emtpr":false,"atari-ebidm":true,"atari-ebids":true,"atari-edtm":true,"atari-eibrm":false,"atari-ectm":false,"atari-ects":false,"docs-text-elei":false,"docs-text-usc":true,"atari-bae":false,"docs-text-eessmkc":true,"docs-text-emtps":false,"docs-text-etsrdpn":false,"docs-text-etsrds":false,"docs-text-erdfs":false,"docs-text-encps":false,"docs-text-endes":false,"docs-text-escpv":true,"docs-text-ecfs":false,"docs-text-ecis":false,"docs-text-eessips":true,"docs-text-eectfs":false,"docs-text-edctzs":true,"docs-text-eetxpc":false,"docs-text-eetxp":false,"docs-text-lns":true,"docs-text-ertkmcp":true,"docs-text-ettctvs":false,"docs-text-ettts":false,"docs-text-issermps":false,"docs-text-emscts":false,"docs-text-ecgvd":false,"docs-text-esbbs":false,"docs-text-etccdts":false,"docs-text-etcchrs":false,"docs-text-etctrs":false,"docs-text-etctids":false,"docs-text-eltbbs":false,"docs-etshc":false,"docs-text-tbcb":2.0E7,"docs-efsmsdl":false,"docs-text-etb":false,"docs-text-esbefr":false,"docs-text-etof":false,"docs-text-ipi":false,"docs-text-ehlb":false,"docs-text-epa":true,"docs-text-ecls":true,"docs-text-dwit":false,"docs-text-elawp":false,"docs-eec":false,"docs-ecot":"","docs-text-enbcr":false,"docs-text-svofc":false,"docs-sup":"","umss":false,"docs-eldi":false,"docs-dli":false,"docs-liap":"/logImpressions","ilcm":{"eui":"AHKXmL0GP6UnOh4ObcyGLZyOq1-lslCu_VFbUQCm1RjpF5JAQeQnIevQskC6-rmVr_Xx1pjbMRTK","je":1,"sstu":1710189581077655,"si":"CJqZ6tOI7YQDFf4PbwYdF_IJYA","gsc":null,"ei":[5703839,5704621,5706832,5706836,5707711,5735806,5737800,5738529,5740814,5743124,5746992,5747261,5748029,5752694,5753329,5754229,575509
6,5758823,5760348,5762259,5764268,5765551,5766777,5770435,5773678,5774347,5774852,5776517,5777194,5783801,5784947,5784967,5791299,5791782,5792684,5796151,5796473,5797291,14101306,14101502,14101510,14101534,49372443,49375322,49451559,49453045,49472071,49512373,49622831,49623181,49644023,49769345,49822929,49823172,49824163,49833470,49842863,49924714,50082748,50166959,50221728,50266230,50273536,50335897,50360148,50390165,50492350,50515335,50520321,50529111,50533184,50580252,50606355,70979410,71008281,71035308,71038263,71079946,71085249,71123572,71152133,71178680,71185178,71197834,71230233,71238954,71260350,71289154,71301338,71330601,71346960,71407393,71471882,71478208,71483995,71528605,71530091,71531305,71533377,71573878,71600925,71624114,71625588,71632274,71659821,71671626,71689868,71733783,71881299,71924359,71960548,94339809,94353376,94373966,94492857],"crc":0,"cvi":[]},"docs-ccdil":false,"docs-eil":true,"info_params":{},"buildLabel":"editors.sites-viewer-frontend_20240227.02_p0","docs-show_debug_info":false,"atari-jefp":"/_/view/jserror","docs-jern":"view","atari-rhpp":"/_/view","docs-ecuach":false,"docs-cclt":2033,"docs-ecci":true,"docs-esi":false,"docs-efypr":true,"docs-eyprp":false,"docs-eytpgcv":0}; _docs_flag_cek= null ; if (window[\'DOCS_timing\']) {DOCS_timing[\'ifdld\']=new Date().getTime();}</script><meta name="viewport" content="width=device-width, initial-scale=1"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="referrer" content="origin"><link rel="icon" href="https://lh3.googleusercontent.com/D7Lls9cfTmXrQ3tPDeQx-niO5hKS3yXYMB2K8ttobrQ9pg0as-PMZc9KGFojk9fZoiboMQUBBzIvU_fpK5hwznF5jlSRvZxxdWqiJKIHo7NR1SM"><meta property="og:title" content="Cost Sharing, Commitments &amp; Internal Agreements"><meta property="og:type" content="website"><meta property="og:url" content="https://sites.google.com/umich.edu/mm-post-award-manual/project-management/cost-sharing-commitments-internal-agreements"><meta property="og:description" content="', 
'Cost Sharing, Commitments &amp; Internal Agreements"><meta itemprop="name" content="Cost Sharing, Commitments &amp;'
]

1yjd4xko #1

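How were you calling unstructured when you saw this issue?
When partitioning, you can set the content_type parameter to state the content type you already know, for cases where automatic detection struggles.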

5ssjco0h #2

I am using UnstructuredURLLoader:

loader = UnstructuredURLLoader(
    urls=["https://sites.google.com/umich.edu/mm-post-award-manual/project-management/cost-sharing-commitments-internal-agreements?authuser=0"],
    continue_on_failure=False,
    headers={"User-Agent": "Mozilla/5.0"},
)

Internally, this calls (pseudocode):

from unstructured.partition.auto import partition

elements = partition(
    url=url, headers=self.headers, **self.unstructured_kwargs
)

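I suppose this "bug report" might be better filed against the UnstructuredURLLoader library (since partition accepts a content_type parameter), but the bigger issue is that the document starts with a <!DOCTYPE html> declaration, which could inform the decision on whether it is a CSV.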
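As a rough illustration of that last point, here is a minimal sketch of a check that could run before any CSV sniffing. It is not how unstructured detects file types; the helper name looks_like_html and the regular expression are assumptions made up for this example.

```python
import re

# Hypothetical helper, not part of unstructured: matches a document that
# opens with an HTML doctype or an <html> tag, ignoring case and leading
# whitespace.
_HTML_START = re.compile(rb"^\s*(?:<!doctype\s+html|<html[\s>])", re.IGNORECASE)


def looks_like_html(head: bytes) -> bool:
    """Return True if the first bytes of a payload look like an HTML document."""
    # Drop a UTF-8 byte-order mark if one is present.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    return _HTML_START.match(head) is not None


print(looks_like_html(b'<!DOCTYPE html><html lang="en-US" itemscope>'))
print(looks_like_html(b"name,amount\nrent,100\n"))
```

A detector could consult a check like this first and only fall back to delimiter counting when it returns False, so a comma-heavy page title would no longer tip the result toward CSV.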

vwhgwdsa #3

One more thing I should point out: charset is a valid directive in the Content-Type header, not a non-standard one:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type
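To illustrate, the media type and the charset parameter can be separated with the Python standard library alone. This is only a sketch of the idea, not the code unstructured uses; the function name parse_content_type is an assumption made up for this example.

```python
from email.message import Message
from typing import Optional, Tuple


def parse_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type header value into (media type, charset)."""
    # email.message.Message already knows how to parse MIME-style headers,
    # including parameters such as charset.
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


print(parse_content_type("text/html; charset=utf-8"))
```

Comparing only the first element of the result against known media types keeps a trailing charset directive from interfering with the match.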

vddsk6oq #4

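Note that the comma in the page title may have since been removed, which might make the issue go away. For more context on which text is being used to detect CSV, see Additional context.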

vxqlmq5t #5

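Yes, I was just typing this :) This is a case we should handle.
For now you can try the content_type parameter as a workaround, but I will look into a fix.
I believe you can add {"content_type": "text/html"} as unstructured_kwargs to the UnstructuredURLLoader call so that it gets passed through to unstructured.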

u4vypkhs #6

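> Yes, I was just typing this :) This is a case we should handle.
Thanks!
> For now you can try the content_type parameter as a workaround, but I will look into a fix.
> I believe you can add {"content_type": "text/html"} as unstructured_kwargs to the UnstructuredURLLoader call so that it gets passed through to unstructured.
Ah - yes, I believe that would work. Since this is part of a larger, automated project, I can't make one-off adjustments (but it is worth considering).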
