Python Scrapy - brotli.brotli.Error: Decompression error: incomplete compressed stream

dfty9e19, posted on 2023-11-19 in Python

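I am scraping restaurant reviews from Yelp, and to do so I access each restaurant's API. I am currently scraping 4-star reviews; for example, this restaurant page has this corresponding API.
This is the block of code that sends an HTTP request to the API while the spider is on a restaurant page: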

bizId = response.xpath("//meta[@name='yelp-biz-id']/@content").extract_first()
api_url = 'https://www.yelp.it/biz/' + bizId + '/review_feed?rr=' + str(n_star_filter)
yield response.follow(url=api_url, callback = self.parse_yelp_restaurant_api)

Sometimes the API is accessed correctly and I am able to scrape it. Most of the time, however, I get this error:

2023-10-27 15:57:39 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.yelp.it/biz/78t73jTxdUw5C-v44lj4Iw/review_feed?rr=4>
Traceback (most recent call last):
  File "/Users/mauri/anaconda3/lib/python3.11/site-packages/twisted/internet/defer.py", line 1697, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/Users/mauri/anaconda3/lib/python3.11/site-packages/scrapy/core/downloader/middleware.py", line 64, in process_response
    method(request=request, response=response, spider=spider)
  File "/Users/mauri/anaconda3/lib/python3.11/site-packages/scrapy/downloadermiddlewares/httpcompression.py", line 63, in process_response
    decoded_body = self._decode(response.body, encoding.lower())
  File "/Users/mauri/anaconda3/lib/python3.11/site-packages/scrapy/downloadermiddlewares/httpcompression.py", line 102, in _decode
    body = brotli.decompress(body)
  File "/Users/mauri/anaconda3/lib/python3.11/site-packages/brotli/brotli.py", line 90, in decompress
    d.finish()
  File "/Users/mauri/anaconda3/lib/python3.11/site-packages/brotli/brotli.py", line 464, in finish
    raise Error("Decompression error: incomplete compressed stream.")
brotli.brotli.Error: Decompression error: incomplete compressed stream.


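I don't understand what this means. It is really strange that some API responses download fine while others produce this error, even though the requests apparently do not differ from one another in any way.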

inn6fuwd #1

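This most likely violates Yelp's policies; sites like this do not like people scraping their data this way. For example, this policy says:
"Use any robot, spider, Service search/retrieval application, or other automated device, process or means to access, retrieve, copy, scrape, or index any portion of the Service or any Service Content, except as expressly permitted by Yelp (for example, as described at www.yelp.com/robots.txt);"
Based on the code and the behavior, it is likely that the server detects the automated scraping and cuts the response off midway, which leaves Scrapy with a partial body that brotli cannot finish decompressing. This is not a compression problem. You may want to look into Yelp API access via https://www.yelp.com/developers.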
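To illustrate why a response that is cut off midway surfaces as a decompression error, here is a minimal sketch. It uses the standard-library zlib as a stand-in for brotli (an assumption made so the example runs without extra packages; the exact error message differs from the brotli one in the traceback). The payload and the helper name is_complete_stream are made up for illustration.

```python
import zlib


def is_complete_stream(data: bytes) -> bool:
    """Return True if `data` contains a complete zlib stream (hypothetical helper)."""
    decompressor = zlib.decompressobj()
    try:
        decompressor.decompress(data)
    except zlib.error:
        return False
    # eof is only set once the end-of-stream marker has been seen.
    return decompressor.eof


# Made-up payload standing in for an API response body.
payload = b'{"reviews": []}' * 200
full = zlib.compress(payload)

# Simulate a server that closes the connection halfway through the body.
truncated = full[: len(full) // 2]

try:
    zlib.decompress(truncated)
    truncated_error = None
except zlib.error as exc:
    truncated_error = str(exc)

print("full stream complete:", is_complete_stream(full))
print("truncated stream complete:", is_complete_stream(truncated))
print("error on truncated stream:", truncated_error)
```

The compressed bytes themselves are valid in both cases; the second call fails only because the end of the stream never arrived, which is consistent with the server dropping the response rather than with a bug in the decompressor.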
