How to get Scrapy to parse CSS

Posted by cidc1ykv on 2022-11-09 in Other

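I followed this guide to scrape movie titles from my local cinema's website. I'm using a Scrapy spider with CSS selectors to do this. In the site's HTML, each movie title is structured like this: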

<div class="col-md-12 movie-description">
    <h2>Minions: The Rise of Gru</h2>
        ...

Here is my code, which tries to scrape this information:

import scrapy

class CinemaSpider(scrapy.Spider):
    name = "cinema"
    allowed_domains = ["cannonvalleycinema10.com"]
    start_urls = ["https://cannonvalleycinema10.com/"]

    def parse(self, response):
        movie_names = response.css(".col-md-12.movie-description h2::text").extract()
        for movie_name in movie_names:
            yield {
                'name': movie_name
            }

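The cinema's website is here. I've tried various selector combinations to get the titles I'm looking for into my JSON file, but I can't figure it out.
In case it helps, I'm running the following command: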

scrapy runspider .\cinema_scrape.py -o movies.json

I'm also in the correct directory.

yqlxgs2m #1

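The page is loaded dynamically, so the movie titles are not in the HTML that Scrapy downloads and the CSS selector matches nothing. You have to use Scrapy together with JSON: request the endpoint that serves the listings and parse its JSON response.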
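As a rough illustration of why the selector comes back empty, here is a minimal sketch that uses only the standard library. Both HTML strings are hypothetical: `rendered` stands in for what the browser shows after JavaScript has run, and `downloaded` for the kind of empty shell a plain HTTP client might receive. `TitleParser` and `find_titles` are illustrative names, not part of Scrapy.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of <h2> elements nested in a div.movie-description."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # div nesting depth inside the target div (0 = outside)
        self.in_h2 = False  # True while we are between <h2> and </h2> in the target
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and (self.depth or "movie-description" in classes):
            self.depth += 1
        elif tag == "h2" and self.depth:
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2 and data.strip():
            self.titles.append(data.strip())


def find_titles(html):
    """Return every movie title found in the given HTML string."""
    parser = TitleParser()
    parser.feed(html)
    return parser.titles


# Hypothetical markup as seen in the browser's element inspector.
rendered = '<div class="col-md-12 movie-description"><h2>Minions: The Rise of Gru</h2></div>'
# Hypothetical markup as downloaded without running JavaScript.
downloaded = '<div id="movies"></div>'

print(find_titles(rendered))
print(find_titles(downloaded))
```

If the second call finds nothing, the data has to come from somewhere else, which is why the spider below requests the JSON endpoint directly.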

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    # Endpoint that returns the movie listings as JSON
    url = 'https://cabbtheatres.intensify-solutions.com/embed/ajaxGetRepertoire'

    # Session cookie copied from the browser; kept for reference, not sent below
    cookies = {
        'PHPSESSID': 'i8l12572hvd3a702d4nfj3vbg0',
    }

    # Request headers copied from the browser's network tab
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'en-US,en;q=0.9',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        # 'Cookie': 'PHPSESSID=i8l12572hvd3a702d4nfj3vbg0',
        'Origin': 'https://cabbtheatres.intensify-solutions.com',
        'Referer': 'https://cabbtheatres.intensify-solutions.com/embed?location=3663456',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
    }

    # Form fields sent in the POST body
    data = {
        'location': '3663456',
        'date': '2022-07-30',
        'lang': 'en',
        'soon': '',
    }

    def start_requests(self):
        # POST the form data to the JSON endpoint instead of fetching the HTML page
        yield scrapy.FormRequest(
            url=self.url,
            method='POST',
            formdata=self.data,
            headers=self.headers,
            callback=self.parse_item,
        )

    def parse_item(self, response):
        detail = response.json()
        for entry in detail['data']:
            title = entry['title']
            print(title)
            # Also yield an item so that `-o movies.json` exports the titles
            yield {'name': title}

Output:

Minions: The Rise of Gru
Thor Love and Thunder
DC League of Super-Pets
Elvis(2022)
Mrs. Harris Goes to Paris
Where the Crawdads Sing
Top Gun: Maverick
Nope
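The parsing step can also be tried without making a network request. The sketch below assumes only the shape used by the spider above, a top-level `data` list whose entries carry a `title` key; the sample body and the `extract_titles` helper are hypothetical and meant purely as an illustration.

```python
import json


def extract_titles(payload):
    """Return the 'title' of every entry under 'data', skipping entries without one."""
    return [entry["title"] for entry in payload.get("data", []) if "title" in entry]


# Hypothetical response body; only the "data" and "title" keys come from the spider.
sample_body = '{"data": [{"title": "Nope"}, {"title": "Elvis(2022)"}, {"id": 7}]}'

# Build items in the same {'name': ...} shape the question's spider yields.
items = [{"name": title} for title in extract_titles(json.loads(sample_body))]
print(json.dumps(items))
```

Keeping the extraction in a small pure function like this makes it easy to check against a saved response before wiring it into the spider's callback.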
