Request fails when scraping store locations with Scrapy

pxq42qpu posted on 2022-11-09 in Other

I am having trouble scraping KFC locations with Scrapy in Python. The endpoint is https://api.kfc.de/find-a-kfc/allrestaurant. Here is my original code:

import datetime
import json

import scrapy


class KFCSpider(scrapy.Spider):
    name = 'kfc'
    allowed_domains = ['www.kfc.de']
    start_urls = ['https://api.kfc.de/find-a-kfc/allrestaurant']

    def parse(self, response):
        data_json = json.loads(response.body)

        shop_list = data_json

        for _ , store in enumerate(shop_list):
            shop = {'shop_id': store['id']}
            shop['name']= store['name']
            shop['disposition']=store['operatingHoursStore'][-2]['disposition']
            shop['lon']=  store['location']['longitude']
            shop['lat'] =  store['location']['latitude']
            shop['address'] =store['address']
            shop['city'] = store['city']
            shop['accessed']= datetime.date.today()

            yield shop

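It does not show any errors (it only reports that 0 pages were crawled), and it writes an empty .geojson file. If I add print(data_json) after json.loads(response.body), nothing is printed.
If I try curl on the command line, I get the following: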

$ curl 'https://api.kfc.de/find-a-kfc/allrestaurant'                                                                                             
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>

You don't have permission to access "http&#58;&#47;&#47;api&#46;kfc&#46;de&#47;find&#45;a&#45;kfc&#47;allrestaurant" on this server.<P>
Reference&#32;&#35;18&#46;17a02417&#46;1653923760&#46;34c85cb
</BODY>
</HTML>

The following does work:

curl --compressed 'https://api.kfc.de/find-a-kfc/allrestaurant' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0' -H 'Accept-Language: en-US,en;q=0.8,de-DE;q=0.5,de;q=0.3'

However, the same approach does not work in Scrapy:

class KFCSpider(scrapy.Spider):
    name = 'kfc'

    def start_requests(self):
        return [scrapy.Request('https://api.kfc.de/find-a-kfc/allrestaurant',
                               headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0',
                                        'Accept-Language': 'en-US,en;q=0.8,de-DE;q=0.5,de;q=0.3',
                                        'Host': 'api.kfc.de',
                                        'Accept': '*/*',
                                        'Accept-Encoding': 'deflate, gzip'
                                })
                ]

uhry853o #1

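1. Copy the headers from your browser (everything except the cookie) and build the request in the start_requests function. I don't know why your version didn't work, but you can test the headers to see which one causes the problem.
2. Optional: you can use response.json() instead of json.loads().
3. There is no need to call datetime.date.today() on every iteration.
4. Optional: add a download delay.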
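
One way to "test the headers", as point 1 suggests, is to drop them one at a time and see which removals break the request. The sketch below is only an illustration: `find_required_headers` and `fetch_status` are hypothetical names, and `fake_fetch_status` stands in for a real request so the example runs offline. It assumes the full header set succeeds with a 200.

```python
from typing import Callable, Dict, List


def find_required_headers(
    headers: Dict[str, str],
    fetch_status: Callable[[Dict[str, str]], int],
) -> List[str]:
    """Return the header names whose removal makes the request fail.

    `fetch_status` is a hypothetical caller-supplied function that sends
    the request with the given headers and returns the HTTP status code.
    """
    required = []
    for name in headers:
        # Send every header except `name`.
        trial = {k: v for k, v in headers.items() if k != name}
        if fetch_status(trial) != 200:
            required.append(name)
    return required


# Stand-in for a real request, for demonstration only. It pretends the
# server insists on User-Agent and Accept-Language (an assumption based
# on the curl command in the question, not a confirmed server rule).
def fake_fetch_status(headers: Dict[str, str]) -> int:
    needed = {"User-Agent", "Accept-Language"}
    return 200 if needed.issubset(headers) else 403


browser_headers = {
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Cache-Control": "no-cache",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0",
}

required = find_required_headers(browser_headers, fake_fetch_status)
print(required)
```

In practice `fetch_status` would wrap whatever client reproduces the problem. Headers that only matter in combination will not show up with this one-at-a-time check.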

import scrapy
import datetime

class KFCSpider(scrapy.Spider):
    name = 'kfc'
    allowed_domains = ['kfc.de']
    start_urls = ['https://api.kfc.de/find-a-kfc/allrestaurant']
    custom_settings = {'DOWNLOAD_DELAY': 0.4}

    def start_requests(self):
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.5",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "DNT": "1",
            "Host": "api.kfc.de",
            "Pragma": "no-cache",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-User": "?1",
            "Sec-GPC": "1",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36"
        }

        yield scrapy.Request(url=self.start_urls[0], headers=headers)

    def parse(self, response):
        data_json = response.json()
        shop_list = data_json
        today = datetime.date.today()
        for _, store in enumerate(shop_list):
            shop = dict()
            shop['shop_id'] = store['id']
            shop['name'] = store['name']
            shop['disposition'] = store['operatingHoursStore'][-2]['disposition']
            shop['lon'] = store['location']['longitude']
            shop['lat'] = store['location']['latitude']
            shop['address'] = store['address']
            shop['city'] = store['city']
            shop['accessed'] = today

            yield shop

Output:

{'shop_id': '303', 'name': 'KFC Wiesbaden / Mainz-Kastel', 'disposition': 'drivethru', 'lon': '8.288141', 'lat': '50.0191821', 'address': 'Boelckestraße 70', 'city': 'Wiesbaden / Mainz-Kastel', 'accessed': datetime.date(2022, 6, 7)}
{'shop_id': '304', 'name': 'KFC Wiesbaden', 'disposition': 'drivethru', 'lon': '8.2240346', 'lat': '50.0678552', 'address': 'Schiersteiner Straße 80', 'city': 'Wiesbaden', 'accessed': datetime.date(2022, 6, 7)}
{'shop_id': '305', 'name': 'KFC Augsburg', 'disposition': 'pickup', 'lon': '10.875068', 'lat': '48.3813095', 'address': 'Ulmer Str. 32-34', 'city': 'Augsburg', 'accessed': datetime.date(2022, 6, 7)}
...
...
...
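
If the server answers with an error status, Scrapy's default HttpErrorMiddleware filters the response out before it reaches parse, which would be consistent with print() showing nothing in the question. Should a denial page arrive with a 200 status instead, parsing it as JSON fails. The sketch below is a hypothetical standalone helper (`extract_shops` is not part of Scrapy) that builds the same dict as the spider, rejects an HTML body up front, and guards the `[-2]` index. The sample payload is hand-written from the fields the spider reads; the real response is assumed to carry more keys.

```python
import datetime
import json
from typing import Any, Dict, Iterator


def extract_shops(body: bytes, today: datetime.date) -> Iterator[Dict[str, Any]]:
    """Yield one flat dict per store from the raw response body."""
    text = body.decode("utf-8").lstrip()
    if text.startswith("<"):
        # An HTML page (such as "Access Denied") came back instead of JSON.
        raise ValueError("expected JSON but received HTML: " + text[:40])
    for store in json.loads(text):
        hours = store.get("operatingHoursStore") or []
        yield {
            "shop_id": store["id"],
            "name": store["name"],
            # The spider reads the second-to-last entry; guard short lists.
            "disposition": hours[-2]["disposition"] if len(hours) >= 2 else None,
            "lon": store["location"]["longitude"],
            "lat": store["location"]["latitude"],
            "address": store["address"],
            "city": store["city"],
            "accessed": today,
        }


# Hand-written sample shaped like the fields used above (assumption:
# the real payload has the same nesting).
sample = json.dumps([{
    "id": "305",
    "name": "KFC Augsburg",
    "operatingHoursStore": [{"disposition": "pickup"}, {"disposition": "drivethru"}],
    "location": {"longitude": "10.875068", "latitude": "48.3813095"},
    "address": "Ulmer Str. 32-34",
    "city": "Augsburg",
}]).encode("utf-8")

shops = list(extract_shops(sample, datetime.date(2022, 6, 7)))
print(shops[0]["name"], shops[0]["disposition"])
```

Inside the spider, the same function could be called as `yield from extract_shops(response.body, today)`.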
