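python: How can I scrape several successive pages with Scrapy?

I am trying to scrape the content of a website where I have to follow links. For example, on a job site you see a long list of jobs spread over several pages (page 1, page 2, and so on). My approach is to take the listing of all jobs (page 1), go into each job ad, and scrape the details. After that, I want to move on to page 2 and repeat, until the last page. So I have an initial request from start_urls, and then I have to issue follow-up requests: one request for each job ad, and another request for the next page.
Here is a minimal example: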

minimal_spider.py

class minimalSpider(scrapy.Spider):
    name = "example"
    start_urls = ["example-url.com"]

    def parse(self, response):
        for qva in response.css(".res-1234"):
            ad_item = minimalExampleItem()            

            # Scrape the job ad if there is an url.
            url = response.css("a:attr(href)").get()
            if url is not None:
                yield scrapy.Request(url=url, callback = 
                                     self.parse_ad, 
                                meta= {"item":ad_item})
            
            # If there is a next page, then crawl it also.
            next_page = response.css("next-page").get()
            if next_page is not None:
                yield scrapy.follow(url = next_page, 
                                callback = self.parse)

            yield ad_item


    def parse_ad(self, response, loader):
        # Here, extract information from website
        yield item

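My problem is that in my parse method the code seems to return after the first yield. As a result, next_page is never crawled, because the request with the next_page URL is never executed. In short, I can only scrape the first page (the start URL in this example), not the pages after it.
How can I solve this?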

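To answer your questions:
1. How do I get/collect all the items?

With the structure in your code, the items are collected in the parse_ad function. Each job ad page is passed to it, and that is where you extract the details you want to collect, such as date, job description, requirements, salary, and so on. This is done with Scrapy selectors, CSS or XPath, whichever you are more comfortable with.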

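2. What is the return value of scrapy.Request, and how do I control what is returned?

scrapy.Request returns a Request instance. For practical purposes you can think of it as what later arrives as the response argument of your callback, but the Request first has to go through some processing, since it is an actual HTTP request. All of that can be configured in your project settings, though it is not necessary for this example. A scrapy.Request does nothing in the spider unless you yield it; yield scrapy.Request(...) is the correct way. You pass the URL you want to request and define a callback, which is the function that handles the response to that request. You can also pass extra arguments to the callback with cb_kwargs (your code passes data along with meta in a similar way).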

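3. I don't understand how to return my item/itemloader in the parse_ad method

To return an item from an ItemLoader you would yield loader.load_item(), but if the loader has not processed the response, the resulting item will be empty. For this example I don't think a loader is necessary, although using one is generally recommended; for beginners the plain items Scrapy provides are simpler. To add an item to your project, create a class named JobAdvertisementItem(scrapy.Item) in the items.py file and declare the fields you want to extract, such as job_description = scrapy.Field(), requirements = scrapy.Field(), salary = scrapy.Field(), and so on. Then, in parse_ad, create an instance of this class, item = JobAdvertisementItem(), and fill in the fields item['job_description'], item['requirements'], etc. Finally you return it with yield item.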

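4. How does the second request, with the callback function self.parse, return values?

In the second request you pass the URL of the next listing page (page 2, page 3, page 4, and so on). The same function, parse, is given as the callback, so it again extracts all the job ads on that page and passes their URLs to parse_ad().

In any case, here is code that should work for you. You only need to change the selector that extracts the job ad URLs, the selector that extracts the URL of page 2, page 3, etc., and the data you want to collect.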
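If it helps to see why a yield does not end the callback, here is a toy simulation in plain Python with no Scrapy involved. The PAGES dictionary, the tuple format, and the crawl function are all invented for this sketch; Scrapy's real engine is asynchronous and does much more, but the generator behaviour is the same: the caller keeps pulling values until the callback is exhausted.

```python
from collections import deque

# Hypothetical in-memory "site": listing pages with ad links and an optional next page.
PAGES = {
    "/jobs?page=1": {"ads": ["/ad/1", "/ad/2"], "next": "/jobs?page=2"},
    "/jobs?page=2": {"ads": ["/ad/3"], "next": None},
}


def parse(url):
    """Listing callback: one request per ad, then one request for the next page."""
    page = PAGES[url]
    for ad_url in page["ads"]:
        yield ("request", ad_url, parse_ad)
    # Outside the loop, so it runs once per listing page.
    if page["next"] is not None:
        yield ("request", page["next"], parse)


def parse_ad(url):
    """Detail callback: yields the scraped item."""
    yield ("item", {"url": url})


def crawl(start_url):
    """Toy scheduler: drains each callback fully and queues every request it yields."""
    queue = deque([(start_url, parse)])
    items = []
    while queue:
        url, callback = queue.popleft()
        for kind, payload, *rest in callback(url):
            if kind == "request":
                queue.append((payload, rest[0]))
            else:
                items.append(payload)
    return items


items = crawl("/jobs?page=1")
print([item["url"] for item in items])
```

Every ad on both pages ends up in the result, because each yield hands one value to the scheduler and the callback then resumes where it left off.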

import scrapy

# JobAdvertisementItem is the item class from items.py described above;
# import it from your own project's items module (the path is project-specific).


class minimalSpider(scrapy.Spider):
    name = "example"
    # placeholder from the question; a real start URL needs a scheme such as https://
    start_urls = ["example-url.com"]

    def parse(self, response):
        # Collect the job advertisement elements on this listing page
        for qva in response.css(".res-1234"):
            # get the URL of this job ad, relative to the current element
            job_url = qva.css("a::attr(href)").get()
            if job_url is not None:
                # make the request and pass it to parse_ad to collect the data you need
                yield scrapy.Request(response.urljoin(job_url), callback=self.parse_ad)

        # get the URL of the next page (placeholder selector, adjust it to the site)
        next_page = response.css("next-page").get()
        # if next_page is None you are on the last page, so check before joining
        if next_page is not None:
            # this fetches the job ads of the next page (page 2)
            # and its link to the page after that (page 3), and so on
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_ad(self, response):
        # instantiate your item and fill it
        item = JobAdvertisementItem()

        # configure these selectors depending on the page
        item["job_description"] = response.css("div.job_description...").get()
        item["requirements"] = response.css("div.requirements...").get()
        item["salary"] = response.css("div.salary...").get()
        yield item
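One detail worth checking when following the next-page link: a selector's .get() returns None when nothing matches, so the None test only does its job if it happens before the link is made absolute. The sketch below uses the standard library's urllib.parse.urljoin, which as far as I know is what response.urljoin wraps; the absolute_link helper and the URLs are hypothetical.

```python
from urllib.parse import urljoin


def absolute_link(page_url, href):
    """Return an absolute URL for href, or None when the page had no such link."""
    if href is None:
        # No link found (for example, no "next" button on the last page).
        return None
    return urljoin(page_url, href)


page_url = "https://example.com/jobs?page=1"  # hypothetical listing page

print(absolute_link(page_url, "/ad/123"))   # site-relative link to a job ad
print(absolute_link(page_url, "?page=2"))   # query-only link to the next page
print(absolute_link(page_url, None))        # last page: nothing to follow
```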
