Scrapy crawls 0 pages even though the elements exist

ohfgkhjo · posted 2022-11-09

I'm trying to scrape transfermarkt.nl with the help of Scrapy. The site was giving a 404 error, so I changed the settings to:

HTTPERROR_ALLOWED_CODES = [404]
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
ROBOTSTXT_OBEY = False
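
These settings normally live in the project's settings.py. For reference, a minimal per-spider alternative would be Scrapy's custom_settings class attribute (a sketch, reusing the values above):

import scrapy

class TransferMarketScraper(scrapy.Spider):
    name = 'transfermarket'
    # per-spider overrides of the project-wide settings
    custom_settings = {
        'HTTPERROR_ALLOWED_CODES': [404],
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36',
        'ROBOTSTXT_OBEY': False,
    }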

Now when I run

import scrapy

class TransferMarketScraper(scrapy.Spider):
    name = 'transfermarket'
    starts_urls = ['https://www.transfermarkt.nl/heracles-almelo/kader/verein/1304/saison_id/2022/plus/1']

    def parse(self, response):
        for player in response.css('div.grid-view table.items tbody tr').get():
            #player number
            try:
                player_number = int(
                    player.css('div.rn_nummer::text').get().strip()
                )
            except ValueError:
                player_number = 'NA'
            except AttributeError:
                continue

            yield {'player_number': player_number}

I get "Crawled 0 pages", even though when I check with scrapy shell the response does return values. What could be the problem here?
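
The scrapy shell check looks roughly like this (run from inside the project so the settings above are applied; the output is shown only as a placeholder):

scrapy shell "https://www.transfermarkt.nl/heracles-almelo/kader/verein/1304/saison_id/2022/plus/1"
>>> response.css('div.grid-view table.items tbody tr')
[<Selector ...>, <Selector ...>, ...]
>>> response.css('div.rn_nummer::text').get()
'...'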

xxls0lw8 · answer 1

You are not sending any requests to be parsed; you need to add:

def start_requests(self):
    starts_urls = ['https://www.transfermarkt.nl/heracles-almelo/kader/verein/1304/saison_id/2022/plus/1']
    for url in starts_urls:
        # schedule each URL explicitly and route the response to parse()
        yield scrapy.Request(url=url, callback=self.parse)
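
For context, the built-in scrapy.Spider.start_requests() does roughly the following (a simplified sketch): it reads self.start_urls, and when that attribute is missing or empty the loop never runs, no requests are scheduled, and the log reports "Crawled 0 pages". Writing your own start_requests() as above sidesteps that.

# simplified sketch of Scrapy's default behaviour
def start_requests(self):
    # with no start_urls attribute this loop runs zero times,
    # so nothing is ever requested
    for url in self.start_urls:
        yield scrapy.Request(url, dont_filter=True)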
xoefb8l8 · answer 2

You have a typo: you wrote starts_urls instead of start_urls.

Edit:

Another thing you will probably need to change is to remove the .get() call on the row selector: with .get() you get back a single HTML string, and iterating over a string gives you characters instead of row selectors, so no items are ever yielded.

import scrapy

class TransferMarketScraper(scrapy.Spider):
    name = 'transfermarket'
    start_urls = ['https://www.transfermarkt.nl/heracles-almelo/kader/verein/1304/saison_id/2022/plus/1']

    def parse(self, response):
        # without .get(), this yields one Selector per table row
        for player in response.css('div.grid-view table.items tbody tr'):
            # player number
            try:
                player_number = int(
                    player.css('div.rn_nummer::text').get().strip()
                )
            except ValueError:
                player_number = 'NA'
            except AttributeError:
                # .get() returned None (no number cell in this row), skip it
                continue

            yield {'player_number': player_number}
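
With start_urls spelled correctly and the .get() removed, the spider yields one item per table row. Assuming a standard Scrapy project layout, you could run it and export the items with something like this (players.json is just an example filename):

scrapy crawl transfermarket -o players.json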
