I have a small spider hosted on Zyte, using Smart Proxy Manager.
My spider is fairly simple: it starts crawling from a list of URLs.
The parse method uses a simple LinkExtractor to pull the on-domain links out of each page, and those links are then crawled in turn.
Simplified parse method:
from scrapy import Request
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    internal_le = LinkExtractor(
        allow_domains=tld_t,  # try to stay on domain (tld_t is a tldextract of response.url)
        unique=True,          # de-dup
        # deny_extensions=self.deny_extensions
    )
    in_links = internal_le.extract_links(response)
    for link in in_links:
        if link.url:
            yield Request(
                link.url,
                callback=self.parse,
            )
Since deny_extensions defaults to scrapy.linkextractors.IGNORED_EXTENSIONS, which includes PDF files, I assumed PDF links would not be crawled. However, I have internal links that get redirected to externally hosted PDF files.
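For reference, a quick sanity check (my own snippet, not from the original post) confirms that both extensions appearing in the logs below are in that default deny list, so the extractor never yields a link whose URL itself ends in them:

    from scrapy.linkextractors import IGNORED_EXTENSIONS

    # Both extensions seen in the logs are filtered at extraction time;
    # the filter never sees the final URL of a redirect, though.
    print("pdf" in IGNORED_EXTENSIONS)   # True
    print("docx" in IGNORED_EXTENSIONS)  # True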
Here are a few excerpted log entries:
2023-11-27 23:41:01 ERROR [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/images/v1691073836/usd262net/renyendq5njmpmol8iko/2023-2024USD262ElementarySchoolStudentHandbookFinaldocx.pdf> (referer: https://west.usd262.net/about)
2023-11-27 23:41:02 ERROR [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/files/v1676910235/usd262net/kgtnfuk7buzu8zthtixk/102422RevisedSpanish22-23ElementaryHandbookSP4.docx> (referer: https://west.usd262.net/about)
2023-11-27 23:41:05 ERROR [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/images/v1676649887/usd262net/adlo2wuxxpqa7pmnxmkx/MiddleSchoolBellSchedule22_23docx.pdf> (referer: https://vcms.usd262.net/about)
2023-11-27 23:41:10 ERROR [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/images/v1691073617/usd262net/zjuysts6fymaf5gjumlc/VCMSStudentHandbook23-24Finaldocx.pdf> (referer: https://vcms.usd262.net/about)
Here is a single traceback:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/scrapy/utils/defer.py", line 279, in iter_errback
yield next(it)
^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapy/utils/python.py", line 350, in __next__
return next(self.data)
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapy/utils/python.py", line 350, in __next__
return next(self.data)
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
for r in iterable:
File "/usr/local/lib/python3.11/site-packages/sh_scrapy/middlewares.py", line 30, in process_spider_output
for x in result:
File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
for r in iterable:
File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in <genexpr>
return (r for r in result or () if self._filter(r, spider))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
for r in iterable:
File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/referer.py", line 352, in <genexpr>
return (self._set_referer(r, response) for r in result or ())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
for r in iterable:
File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/urllength.py", line 27, in <genexpr>
return (r for r in result or () if self._filter(r, spider))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
for r in iterable:
File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/depth.py", line 31, in <genexpr>
return (r for r in result or () if self._filter(r, response, spider))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
for r in iterable:
File "/tmp/unpacked-eggs/__main__.egg/edtech/spiders/edcrawler.py", line 117, in parse
ex_links = external_le.extract_links(response)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapy/linkextractors/lxmlhtml.py", line 239, in extract_links
base_url = get_base_url(response)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapy/utils/response.py", line 26, in get_base_url
text = response.text[0:4096]
^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/scrapy/http/response/__init__.py", line 137, in text
raise AttributeError("Response content isn't text")
AttributeError: Response content isn't text
I have tried various changes to my link extractors, but presumably the links look fine to the extractor. It's the redirects: the PDF files get downloaded and produce the error (a stopgap guard is sketched after the list below).
1. Example start URL
2. A link on that page is extracted into 'in_links'
3. That link redirects to a PDF document on the external web host
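A guard at the top of parse would at least suppress the traceback, though it does not stop the PDF from being downloaded; a minimal sketch, assuming the standard scrapy.http.TextResponse check:

    from scrapy.http import TextResponse

    def parse(self, response):
        # A PDF reached via redirect arrives as a plain Response rather
        # than a TextResponse, and extract_links() raises
        # AttributeError("Response content isn't text") on it.
        if not isinstance(response, TextResponse):
            return
        # ... link extraction as above ...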
The only fix I can come up with is to replace the redirect middleware with a custom one that looks for r".pdf$" in request.url:
https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.redirect
Am I missing something? I am using the latest Scrapy, 2.11.0. I have also logged this as issue 6159 on the Scrapy GitHub.
1 Answer
I think the best option in this case is to subclass RedirectMiddleware and simply add a few lines that check the Location header of the initial response for a .pdf extension, raising an IgnoreRequest exception when one is found. It can all be done in just a few lines.
Example:
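A minimal sketch of that approach (the class name, module path, and exact status-code list are my assumptions, not from the original answer):

    from scrapy.downloadermiddlewares.redirect import RedirectMiddleware
    from scrapy.exceptions import IgnoreRequest


    class PdfRedirectMiddleware(RedirectMiddleware):
        """Refuse to follow redirects whose Location header points at a PDF."""

        def process_response(self, request, response, spider):
            if response.status in (301, 302, 303, 307, 308):
                location = response.headers.get(b"Location", b"").decode(errors="ignore")
                # Strip any query string before testing the extension.
                if location.split("?")[0].lower().endswith(".pdf"):
                    raise IgnoreRequest(f"Redirect to PDF ignored: {location}")
            return super().process_response(request, response, spider)

Enable it in settings.py in place of the stock middleware (myproject.middlewares is a placeholder for your own module path; 600 is the slot the stock RedirectMiddleware normally occupies):

    DOWNLOADER_MIDDLEWARES = {
        "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
        "myproject.middlewares.PdfRedirectMiddleware": 600,
    }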