如何使用Scrapy或其他python库抓取视频媒体?

hl0ma9xz  于 2022-11-09  发布在  其他
关注(0)|答案(1)|浏览(189)

具体地说,我想从这个网站提取视频:equidia。第一个问题是当我启动scrapy shell https://www.equidia.fr/courses/2022-07-31/R1/C1 -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"时。我用view(response)检查,注意到视频的容器存在。我可以点击播放,但没有流,所以没有视频可提取。我猜蜘蛛在检查期间不能比我在Web浏览器中更有能力。
更新:多亏了palpitus_on_fire,我检查了更多的XHR类型请求,而我只关注media类型。
我感兴趣的XHR请求是:https://equidia-vodce-p-api.hexaglobe.net/video/info/20220731_35186008_00000HR/01NdvIX6xNbihxxc3U52RDBX9novEE47lA1ZmEDw/0b94eb69d04aa18ca33ba5e8767b1a85264c35b6acefc8bfbd7eabb5e259cf2c/1659574478
它返回一个响应json文件:

{
"mp4": {
    "240": "https://private4-stream-ovp.vide.io/dl/0bee7ff9bce79419fb27bac012c5d090/62e904dc/equidia_2752987/videos/240p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-240p_full_mp4.mp4?v=0",
    "360": "https://private3-stream-ovp.vide.io/dl/7c53dc70ae5cc46d7e2c9d43e3dfc685/62e904dc/equidia_2752987/videos/360p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-360p_full_mp4.mp4?v=0",
    "480": "https://private3-stream-ovp.vide.io/dl/4261c5bd7b48595535710e1dc34e8f93/62e904dc/equidia_2752987/videos/480p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-480p_full_mp4.mp4?v=0",
    "540": "https://private3-stream-ovp.vide.io/dl/d40592e39e27931f20245edde9f30152/62e904dc/equidia_2752987/videos/540p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-540p_full_mp4.mp4?v=0",
    "720": "https://private8-stream-ovp.vide.io/dl/9cb5a24a43a8d22d27e73b164020894d/62e904dc/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0"
    }
"hls": "https://private3-stream-ovp.vide.io/dl/6abf08b12203c84390a01fc4de65a4d5/62e904dc/equidia_2752987/videos/hls/63/03/S506303/V556523/playlist.m3u8?v=1",
"master": "https://private8-stream-ovp.vide.io/dl/6830b3a5039eef57f844bb08cd252584/62e904dc/equidia_2752987/videos/master/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523.mp4?v=0"
}

"720": "https://private8-stream-ovp.vide.io/dl/9cb5a24a43a8d22d27e73b164020894d/62e904dc/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0"包含我想要的视频和分辨率。
现在主要存在两个问题:
1.如何获得XHR请求,并且准确地说是以equidia-vodce-p-api.hexaglobe.net开头的请求?流行的解决方案是使用像seleniumplaywright这样的webtesters,但我不知道它们。
1.如何将它整合到一个 scrapy 蜘蛛中?
如果可能的话,我想得到一个这种形式的代码:

class VideoPageScraper(scrapy.Spider):
    def __init__(self):
        self.start_urls = ['https://www.equidia.fr/courses/2022-07-31/R1/C1']
    def parse(self, response):
        # code to catch the XHR request desired
        urlvideo = 'https://private8-stream-ovp.vide.io/dl/9cb5a24a43a8d22d27e73b164020894d/62e904dc/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0'
       yield scrapy.Request(url=urlvideo, callback=self.obtainvideo)
    def obtainvideo(self, response):
        with #open('/home/avy/binary.mp4', returns'wb') as wfile:
            wfile.write(response.body)

更新内容
是否可能生成XHR的URL?
查看Firefox的开发工具,我试图了解如何完成所需的XHR请求。

脚本main.d685b66f8277bab3376b.js产生的元素使XHR的url(据我所知在第1761行)。

Object { endpoint: "https://equidia-vodce-p-api.hexaglobe.net/video/info", publicKey: "01NdvIX6xNbihxxc3U52RDBX9novEE47lA1ZmEDw", hash: "c79502b086116f30262dfbf7675225dc1c8bfb4c144df09da9a34b9106fccd39", expiresAt: "1659608062" }

然后,https://www.equidia.fr/polyfills.de328f7d13aac150968d.js启动一个带有XHR的url的fecth。
您认为有没有一种方法可以基于上面所示的脚本获得生成的对象,以便发出所需的XHR请求?
附录与剧本尝试(无头浏览器)
我找到了一个库,可以获得我正在寻找的XHR请求。playwright是一个webtester,它与scrapy-playwright兼容。
受这些例子的启发,我尝试了以下代码:

from playwright.sync_api import sync_playwright
url = 'https://www.equidia.fr/courses/2022-07-31/R1/C1'
with sync_playwright() as p:
    def handle_response(response):
        # the endpoint we are insterested in 
        if ("equidia-vodce-p-api" in response.url):
            item = response.json()['mp4']
            print(item)
    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", handle_response)
    page.goto(url, wait_until="networkidle", timeout=900000)
    page.context.close()
    browser.close()

Exception in callback SyncBase._sync.<locals>.callback(<Task finishe...t responses')>) at /home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py:104
handle: <Handle SyncBase._sync.<locals>.callback(<Task finishe...t responses')>) at /home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py:104>
Traceback (most recent call last):
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 105, in callback
    g_self.switch()
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_browser_context.py", line 124, in <lambda>
    lambda params: self._on_response(
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_browser_context.py", line 396, in _on_response
    page.emit(Page.Events.Response, response)
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 176, in emit
    handled = self._call_handlers(event, args, kwargs)
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 154, in _call_handlers
    self._emit_run(f, args, kwargs)
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/asyncio.py", line 50, in _emit_run
    self.emit("error", exc)
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 179, in emit
    self._emit_handle_potential_error(event, args[0] if args else None)
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 139, in _emit_handle_potential_error
    raise error
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/asyncio.py", line 48, in _emit_run
    coro: Any = f(*args,**kwargs)
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_impl_to_api_mapping.py", line 88, in wrapper_func
    return handler(
  File "<input>", line 7, in handle_response
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/sync_api/_generated.py", line 601, in json
    self._sync("response.json", self._impl_obj.json())
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 111, in _sync
    return task.result()
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_network.py", line 362, in json
    return json.loads(await self.text())
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_network.py", line 358, in text
    content = await self.body()
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_network.py", line 354, in body
    binary = await self._channel.send("body")
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 39, in send
    return await self.inner_send(method, params, False)
  File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Response body is unavailable for redirect responses
{'240': 'https://private4-stream-ovp.vide.io/dl/ec808fddd1b10659cb7d3bfeb7cba591/62e9abc8/equidia_2752987/videos/240p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-240p_full_mp4.mp4?v=0', '360': 'https://private3-stream-ovp.vide.io/dl/7c5cc10f2bb466fd85d7109c698500d3/62e9abc8/equidia_2752987/videos/360p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-360p_full_mp4.mp4?v=0', '480': 'https://private3-stream-ovp.vide.io/dl/c65277c3e385fb825e0ea4f5c1447d00/62e9abc8/equidia_2752987/videos/480p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-480p_full_mp4.mp4?v=0', '540': 'https://private3-stream-ovp.vide.io/dl/036f88b63a1157ce51bf68a7e60791cc/62e9abc8/equidia_2752987/videos/540p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-540p_full_mp4.mp4?v=0', '720': 'https://private8-stream-ovp.vide.io/dl/459a1e569da63e69031d1ecbf1f87ed0/62e9abc8/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0'}

我设法得到了我想要的那块,但有很多例外,我不知道出了什么问题。

nfg76nw0

nfg76nw01#

这是一个使用scrapy-playwright的完全可行的解决方案,我从下面的问题61和用户配置文件@lime-n中得到了这个想法。
我们使用playwright工具和scrapy-playwright下载发送的xhr请求,并将这些请求存储到一个dict中。
根据视频大小,下载需要几分钟时间。
main.py

import scrapy
from playwright.async_api import Response as PlaywrightResponse
import jsonlines
import pandas as pd
import json
from video.items import videoItem

headers = {
    'authority': 'api.equidia.fr',
    'accept': 'application/json, text/plain, */*',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'content-type': 'application/json',
    'origin': 'https://www.equidia.fr',
    'referer': 'https://www.equidia.fr/',
    'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"macOS"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-site',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}

class videoDownloadSpider(scrapy.Spider):
    name = 'video'
    start_urls = ['https://www.equidia.fr/courses/2022-07-31/R1/C1']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                #callback = self.parse, 
                meta = {
                    'playwright':True,
                    'playwright_page_event_handlers':{
                    "response": "handle_response",
                }

                }               
            )

    async def handle_response(self, response: PlaywrightResponse) -> None:
        """
        We can grab the post data with response.request.post - there are three different types for different needs.
        The method below helps grab those resource types of 'xhr' and 'fetch' until I can work out how to only send these to the download request.
        """
        self.logger.info(f'test the log of data: {response.request.resource_type, response.request.url, response.request.method}')
        jl_file = "videos.jl"
        data = {}
        if response.request.resource_type == "xhr":
            if response.request.method == "GET":
                if 'videos' in response.request.url:

                    data['resource_type']=response.request.resource_type,
                    data['request_url']=response.request.url,
                    data['method']=response.request.method
                    with jsonlines.open(jl_file, mode='a') as writer:
                        writer.write(data)

    def parse(self, response):
        video = pd.read_json('videos.jl', lines=True)
        print('KEST: %s' % video['request_url'][0][0])

        yield scrapy.FormRequest(
            url = video['request_url'][0][0],
            headers=headers,
            callback = self.parse_video_json
        )
    def parse_video_json(self, response):
        another_url = response.json().get('video_url')
        yield scrapy.FormRequest(
            another_url,
            headers=headers,
            callback=self.extract_videos
        )

    def extract_videos(self, response):
        videos = response.json().get('mp4')
        for keys, vids in videos.items():
            loader = videoItem()
            loader['video'] = [vids]
            yield loader

items.py

class videoItem(scrapy.Item):
    video = scrapy.Field()

pipelines.py

class downloadFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, item=None):
        file = request.url.split('/')[-1]
        video_file = f"{file}.mp4"
        return video_file

settings.py

from pathlib import Path
import os

BASE_DIR = Path(__file__).resolve().parent.parent

FILES_STORE = os.path.join(BASE_DIR, 'videos')

FILES_URLS_FIELD = 'video'
FILES_RESULT_FIELD = 'results'

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

ITEM_PIPELINES = {
    'video.pipelines.downloadFilesPipeline': 150,
}

第一次

相关问题