具体地说,我想从这个网站提取视频:equidia。第一个问题是当我启动scrapy shell https://www.equidia.fr/courses/2022-07-31/R1/C1 -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"
时。我用view(response)
检查,注意到视频的容器存在。我可以点击播放,但没有流,所以没有视频可提取。我猜蜘蛛在检查期间不能比我在Web浏览器中更有能力。
更新:多亏了palpitus_on_fire,我检查了更多的XHR
类型请求,而我只关注media
类型。
我感兴趣的XHR请求是:https://equidia-vodce-p-api.hexaglobe.net/video/info/20220731_35186008_00000HR/01NdvIX6xNbihxxc3U52RDBX9novEE47lA1ZmEDw/0b94eb69d04aa18ca33ba5e8767b1a85264c35b6acefc8bfbd7eabb5e259cf2c/1659574478
它返回一个响应json文件:
{
"mp4": {
"240": "https://private4-stream-ovp.vide.io/dl/0bee7ff9bce79419fb27bac012c5d090/62e904dc/equidia_2752987/videos/240p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-240p_full_mp4.mp4?v=0",
"360": "https://private3-stream-ovp.vide.io/dl/7c53dc70ae5cc46d7e2c9d43e3dfc685/62e904dc/equidia_2752987/videos/360p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-360p_full_mp4.mp4?v=0",
"480": "https://private3-stream-ovp.vide.io/dl/4261c5bd7b48595535710e1dc34e8f93/62e904dc/equidia_2752987/videos/480p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-480p_full_mp4.mp4?v=0",
"540": "https://private3-stream-ovp.vide.io/dl/d40592e39e27931f20245edde9f30152/62e904dc/equidia_2752987/videos/540p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-540p_full_mp4.mp4?v=0",
"720": "https://private8-stream-ovp.vide.io/dl/9cb5a24a43a8d22d27e73b164020894d/62e904dc/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0"
}
"hls": "https://private3-stream-ovp.vide.io/dl/6abf08b12203c84390a01fc4de65a4d5/62e904dc/equidia_2752987/videos/hls/63/03/S506303/V556523/playlist.m3u8?v=1",
"master": "https://private8-stream-ovp.vide.io/dl/6830b3a5039eef57f844bb08cd252584/62e904dc/equidia_2752987/videos/master/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523.mp4?v=0"
}
"720": "https://private8-stream-ovp.vide.io/dl/9cb5a24a43a8d22d27e73b164020894d/62e904dc/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0"
包含我想要的视频和分辨率。
现在主要存在两个问题:
1.如何获得XHR
请求,并且准确地说是以equidia-vodce-p-api.hexaglobe.net
开头的请求?流行的解决方案是使用像selenium
或playwright
这样的webtesters,但我不知道它们。
1.如何将它整合到一个 scrapy 蜘蛛中?
如果可能的话,我想得到一个这种形式的代码:
class VideoPageScraper(scrapy.Spider):
def __init__(self):
self.start_urls = ['https://www.equidia.fr/courses/2022-07-31/R1/C1']
def parse(self, response):
# code to catch the XHR request desired
urlvideo = 'https://private8-stream-ovp.vide.io/dl/9cb5a24a43a8d22d27e73b164020894d/62e904dc/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0'
yield scrapy.Request(url=urlvideo, callback=self.obtainvideo)
def obtainvideo(self, response):
with #open('/home/avy/binary.mp4', returns'wb') as wfile:
wfile.write(response.body)
更新内容
是否可能生成XHR的URL?
查看Firefox的开发工具,我试图了解如何完成所需的XHR请求。
脚本main.d685b66f8277bab3376b.js产生的元素使XHR的url(据我所知在第1761行)。
Object { endpoint: "https://equidia-vodce-p-api.hexaglobe.net/video/info", publicKey: "01NdvIX6xNbihxxc3U52RDBX9novEE47lA1ZmEDw", hash: "c79502b086116f30262dfbf7675225dc1c8bfb4c144df09da9a34b9106fccd39", expiresAt: "1659608062" }
然后,https://www.equidia.fr/polyfills.de328f7d13aac150968d.js
启动一个带有XHR的url的fecth。
您认为有没有一种方法可以基于上面所示的脚本获得生成的对象,以便发出所需的XHR请求?
附录与剧本尝试(无头浏览器)
我找到了一个库,可以获得我正在寻找的XHR
请求。playwright
是一个webtester,它与scrapy-playwright
兼容。
受这些例子的启发,我尝试了以下代码:
from playwright.sync_api import sync_playwright
url = 'https://www.equidia.fr/courses/2022-07-31/R1/C1'
with sync_playwright() as p:
def handle_response(response):
# the endpoint we are insterested in
if ("equidia-vodce-p-api" in response.url):
item = response.json()['mp4']
print(item)
browser = p.chromium.launch()
page = browser.new_page()
page.on("response", handle_response)
page.goto(url, wait_until="networkidle", timeout=900000)
page.context.close()
browser.close()
Exception in callback SyncBase._sync.<locals>.callback(<Task finishe...t responses')>) at /home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py:104
handle: <Handle SyncBase._sync.<locals>.callback(<Task finishe...t responses')>) at /home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py:104>
Traceback (most recent call last):
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 105, in callback
g_self.switch()
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_browser_context.py", line 124, in <lambda>
lambda params: self._on_response(
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_browser_context.py", line 396, in _on_response
page.emit(Page.Events.Response, response)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 176, in emit
handled = self._call_handlers(event, args, kwargs)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 154, in _call_handlers
self._emit_run(f, args, kwargs)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/asyncio.py", line 50, in _emit_run
self.emit("error", exc)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 179, in emit
self._emit_handle_potential_error(event, args[0] if args else None)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 139, in _emit_handle_potential_error
raise error
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/asyncio.py", line 48, in _emit_run
coro: Any = f(*args,**kwargs)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_impl_to_api_mapping.py", line 88, in wrapper_func
return handler(
File "<input>", line 7, in handle_response
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/sync_api/_generated.py", line 601, in json
self._sync("response.json", self._impl_obj.json())
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 111, in _sync
return task.result()
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_network.py", line 362, in json
return json.loads(await self.text())
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_network.py", line 358, in text
content = await self.body()
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_network.py", line 354, in body
binary = await self._channel.send("body")
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 39, in send
return await self.inner_send(method, params, False)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Response body is unavailable for redirect responses
{'240': 'https://private4-stream-ovp.vide.io/dl/ec808fddd1b10659cb7d3bfeb7cba591/62e9abc8/equidia_2752987/videos/240p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-240p_full_mp4.mp4?v=0', '360': 'https://private3-stream-ovp.vide.io/dl/7c5cc10f2bb466fd85d7109c698500d3/62e9abc8/equidia_2752987/videos/360p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-360p_full_mp4.mp4?v=0', '480': 'https://private3-stream-ovp.vide.io/dl/c65277c3e385fb825e0ea4f5c1447d00/62e9abc8/equidia_2752987/videos/480p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-480p_full_mp4.mp4?v=0', '540': 'https://private3-stream-ovp.vide.io/dl/036f88b63a1157ce51bf68a7e60791cc/62e9abc8/equidia_2752987/videos/540p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-540p_full_mp4.mp4?v=0', '720': 'https://private8-stream-ovp.vide.io/dl/459a1e569da63e69031d1ecbf1f87ed0/62e9abc8/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0'}
我设法得到了我想要的那块,但有很多例外,我不知道出了什么问题。
1条答案
按热度按时间nfg76nw01#
这是一个使用
scrapy-playwright
的完全可行的解决方案,我从下面的问题61和用户配置文件@lime-n中得到了这个想法。我们使用
playwright
工具和scrapy-playwright
下载发送的xhr
请求,并将这些请求存储到一个dict中。根据视频大小,下载需要几分钟时间。
main.py
items.py
pipelines.py
settings.py
第一次