Scrapy: request URL change not working at runtime

Asked by sqxo8psd on 2022-11-09

I have written a script in Python with Scrapy. The script is supposed to run through all of the paginated pages. When Scrapy starts, it works fine on the first page load and fetches page 2 according to the script logic, but once page 2 has loaded I cannot get the xpath of the newly loaded page, so I cannot keep going and collect all the remaining page numbers.
Here is the code snippet:

import scrapy
from scrapy import Spider

class PostsSpider(Spider):

    name = "posts"
    start_urls = [
        'https://www.boston.com/category/news/'
    ]

    def parse(self, response):
        print("first time")
        print(response)
        results = response.xpath("//*[contains(@id, 'load-more')]/@data-next-page").extract_first()
        print(results)
        if results is not None:
            for result in results:
                page_number = 'page/' + result
                new_url = self.start_urls[0] + page_number
                print(new_url)
                yield scrapy.Request(url=new_url, callback=self.parse)
        else:
            print("last page")

Answer 1 (by pepwfjgg):

This is because the page does not issue a new GET request when it loads the next page; instead it makes an AJAX call to an API that returns JSON.
I made some adjustments to your code, so it should work properly now. I assumed that what you want to extract from each page is more than just the next page number, so I wrapped the html string in a scrapy Selector, which lets you keep using XPath on it. This script will crawl through a lot of pages very quickly, so you may want to adjust your settings to slow it down.
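One way to slow it down is per-spider custom_settings; the following is a minimal sketch using standard Scrapy settings, where the specific values are only illustrative starting points and not part of the original answer:

from scrapy import Spider

class PostsSpider(Spider):
    name = "posts"
    # All three are standard Scrapy settings; tune the values to taste.
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,                # pause between consecutive requests
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one in-flight request per domain
        "AUTOTHROTTLE_ENABLED": True,         # back off automatically under server load
    }

With throttling covered, here is the adjusted spider: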

import scrapy
from scrapy import Spider
from scrapy.selector import Selector

class PostsSpider(Spider):

    name = "posts"
    ajaxurl = "https://www.boston.com/wp-json/boston/v1/load-more?taxonomy=category&term_id=779&search_query=&author=&orderby=&page=%s&_wpnonce=f43ab1aae4&ad_count=4&redundant_ids=25129871,25130264,25129873,25129799,25128140,25126233,25122755,25121853,25124456,25129584,25128656,25123311,25128423,25128100,25127934,25127250,25126228,25126222"
    start_urls = [
        'https://www.boston.com/category/news/'
    ]

    def parse(self, response):
        new_url = None
        try:
            # After the first page, every response is JSON from the load-more API.
            json_result = response.json()

            # The rendered article markup is embedded in the JSON payload;
            # wrap it in a Selector so XPath queries still work on it.
            html = json_result['data']['html']
            selector = Selector(text=html, type="html")
            # ... do some xpath stuff with selector.xpath.....
            new_url = self.ajaxurl % json_result["data"]["nextPage"]
        except Exception:
            # The very first response is the regular HTML page, so read the
            # next page number from the load-more button instead.
            results = response.xpath("//*[contains(@id, 'load-more')]/@data-next-page").extract_first()
            if results is not None:
                # extract_first() returns the page number as a single string,
                # so use it directly rather than iterating over its characters.
                new_url = self.ajaxurl % results
        if new_url:
            print(new_url)
            yield scrapy.Request(url=new_url, callback=self.parse)
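To fill in the "do some xpath stuff" placeholder above, here is a hypothetical sketch of what the extraction could look like. The article/link structure below is an assumption about the markup inside json_result['data']['html'], not something taken from the answer; inspect the real payload and adjust the XPath accordingly:

from scrapy.selector import Selector

# Stand-in fragment so the sketch runs on its own; in the spider you would
# use the Selector built from json_result['data']['html'] instead.
html = '<article><a href="/news/example">Example headline</a></article>'
selector = Selector(text=html, type="html")

for article in selector.xpath("//article"):
    # Hypothetical fields; adjust to the site's real structure.
    print({
        "title": article.xpath(".//a/text()").get(),
        "link": article.xpath(".//a/@href").get(),
    })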
