Iterating over a list of API URLs with Scrapy

Posted by sh7euo9m on 2023-04-06 in Other

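I have the code below and I want to iterate over "list_of_urls", but I don't know how to use it for the "url" argument of the request. Is there a way to pass in this list and iterate over pageNumber?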

import scrapy
import json
 
list_of_urls = []
for i in range(1,3):
    url = 'https://api.yaencontre.com/v3/search?family=FLAT&lang=es&location=albacete-provincia&operation=RENT&pageNumber={}&pageSize=42'.format(i)
    to_append = [url]
    for j in to_append:
        list_of_urls.append(j)

print(list_of_urls)
class TestSpider(scrapy.Spider):
    name = "test"
       
    headers = {
        'USER_AGENT' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
        }
   
    def start_requests(self):
        yield scrapy.Request(
            url = 'https://api.yaencontre.com/v3/search?family=FLAT&lang=es&location=albacete-provincia&operation=RENT&pageNumber=7&pageSize=42', 
            callback= self.parse,
            method= "GET",
            headers= self.headers

        )
    def parse(self, response):
        pass
        json_response = json.loads(response.text)
        res = json_response["result"]["items"]
        for item in res:
            yield {
                'lat': item['realEstate']['address']['geoLocation']['lat'],
                'lon': item['realEstate']['address']['geoLocation']['lon'],
                'price': item['realEstate']['price']
            }

v440hwme1#

Yes, there are several ways to do this.
One way is to use a for loop and iterate over the list_of_urls variable inside the start_requests method.
Example:

...

list_of_urls = []
for i in range(1,3):
    url = 'https://api.yaencontre.com/v3/search?family=FLAT&lang=es&location=albacete-provincia&operation=RENT&pageNumber={}&pageSize=42'.format(i)
    list_of_urls.append(url)

print(list_of_urls)

...
...

    def start_requests(self):
        for url in list_of_urls:
            yield scrapy.Request(
                url = url, 
                callback= self.parse,
                method= "GET",
                headers= self.headers)

Another way is to move the code that builds list_of_urls into the start_requests method:

def start_requests(self):
    for i in range(1,3):
        url = 'https://api.yaencontre.com/v3/search?family=FLAT&lang=es&location=albacete-provincia&operation=RENT&pageNumber={}&pageSize=42'.format(i)
        yield scrapy.Request(url=url, headers=self.headers)
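Both approaches above build the URL with str.format. As a variation, here is a minimal sketch that assembles the same query string with urllib.parse.urlencode from the standard library. The names BASE_URL and build_page_urls are hypothetical and not part of the original answer; the query parameters are copied from the URL in the question.

```python
from urllib.parse import urlencode

# Endpoint taken from the question; the helper below is illustrative only.
BASE_URL = "https://api.yaencontre.com/v3/search"


def build_page_urls(first_page, last_page, page_size=42):
    """Return one search URL per page number from first_page to last_page inclusive."""
    urls = []
    for page in range(first_page, last_page + 1):
        params = {
            "family": "FLAT",
            "lang": "es",
            "location": "albacete-provincia",
            "operation": "RENT",
            "pageNumber": page,
            "pageSize": page_size,
        }
        # urlencode keeps the dict's insertion order and escapes values if needed.
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


if __name__ == "__main__":
    for url in build_page_urls(1, 2):
        print(url)
```

Note that range(1, 3) in the examples above yields pages 1 and 2 only, which is why this sketch takes an inclusive last_page. Inside start_requests, the returned list could be iterated in the same way as list_of_urls.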

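Some additional tips:
You can set USER_AGENT through custom_settings instead of passing it in the headers of every request.
As my first example shows, you were needlessly wrapping each url in a list and then iterating over that list to append it to list_of_urls, when you can simply append the url directly.
"GET" is the default method for Scrapy requests, so it does not need to be set explicitly. The same goes for the callback: when none is given, Scrapy uses self.parse by default.
In the parse method, you can simply call response.json() (available since Scrapy 2.2) instead of json_response = json.loads(response.text).
Putting all of the above together, your code might look like this.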

import scrapy

class TestSpider(scrapy.Spider):
    name = "test"
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
    }

    def start_requests(self):
        for i in range(1, 3):
            yield scrapy.Request('https://api.yaencontre.com/v3/search?family=FLAT&lang=es&location=albacete-provincia&operation=RENT&pageNumber={}&pageSize=42'.format(i))

    def parse(self, response):
        for item in response.json()["result"]["items"]:
            yield {
                'lat': item['realEstate']['address']['geoLocation']['lat'],
                'lon': item['realEstate']['address']['geoLocation']['lon'],
                'price': item['realEstate']['price']
            }
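The parse method above indexes straight into the nested JSON, so a listing without geoLocation would raise a KeyError. As a minimal sketch, assuming the response has the shape used in the question, the extraction can be pulled into a plain function that tolerates missing keys and can be tried without running a crawl. The function name extract_listings and the sample payload are made up for illustration.

```python
def extract_listings(payload):
    """Yield lat/lon/price dicts from an already decoded search response."""
    for item in payload.get("result", {}).get("items", []):
        real_estate = item.get("realEstate", {})
        geo = real_estate.get("address", {}).get("geoLocation", {})
        # Missing fields become None instead of raising KeyError.
        yield {
            "lat": geo.get("lat"),
            "lon": geo.get("lon"),
            "price": real_estate.get("price"),
        }


if __name__ == "__main__":
    # Made-up payload that mimics the structure assumed in the question.
    sample = {
        "result": {
            "items": [
                {
                    "realEstate": {
                        "address": {"geoLocation": {"lat": 1.5, "lon": -2.0}},
                        "price": 500,
                    }
                },
                {"realEstate": {"price": 650}},  # no address at all
            ]
        }
    }
    for row in extract_listings(sample):
        print(row)
```

In the spider, parse could then be reduced to `yield from extract_listings(response.json())`. Whether the real API ever omits these fields is not confirmed by the thread, so the defensive lookups are a precaution rather than a known requirement.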
