Scrapy: saving scraped data in different dictionaries using a filter condition

rjee0c15 posted on 2022-11-09 in Other

I am scraping two URLs from the same spider, as follows:

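def start_requests(self):
  # Request the Dawn category pages (Request is scrapy.Request).
  # The callback must match the method name defined on the spider, parseDawn.
  yield Request('https://www.dawn.com/business', callback=self.parseDawn, meta={'category': 'business', 'source': 'DAWN'})
  yield Request('https://www.dawn.com/sport', callback=self.parseDawn, meta={'category': 'sports', 'source': 'DAWN'})

Here self.parseDawn scrapes the news from each link, as follows: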

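def parseDawn(self, response):
  items = WebscrapingItem()

  # No trailing commas after these assignments: a trailing comma would
  # turn each value into a one-element tuple instead of a string.
  # default="" keeps .strip() from failing when no title is found.
  title = response.css("h2.story__title a.story__link::text").extract_first(default="").strip()
  author = response.css("span.story__byline a.story__byline__link::text").extract_first()
  # The category was attached to the request in start_requests.
  category = response.meta['category']

  items['title'] = title
  items['author'] = author
  items['category'] = category

  yield items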

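Now, in my pipelines.py file, I want to filter the news items with category == 'business' and category == 'sports' into two different dictionaries, so that the filtered news can be saved separately in my database. Is there a way to do this?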


nr9pn0ug #1

You can do this easily in your pipeline:

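class BotPipeline:
    def process_item(self, item, spider):
        if item['category'] == 'business':
            # insert the db operation for business news here
            pass
        elif item['category'] == 'sports':
            # insert the db operation for sports news here
            pass
        # Always return the item so that any later pipeline still receives it,
        # including items whose category matches neither branch.
        return item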
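
If you also want the two groups held in separate dictionaries, here is a minimal sketch of how the pipeline above could be fleshed out. It makes several assumptions that are not in the question: SQLite stands in for your database, and the class name CategoryPipeline, the db_path parameter, and the table names business_news and sports_news are all hypothetical. Treat it as an illustration rather than a drop-in implementation.

```python
import sqlite3


class CategoryPipeline:
    """Routes items into per-category buckets and per-category tables."""

    # Hypothetical mapping from category value to table name.
    TABLES = {"business": "business_news", "sports": "sports_news"}

    def __init__(self, db_path="news.db"):
        # Assumption: a local SQLite file stands in for the real database.
        self.db_path = db_path

    def open_spider(self, spider):
        # One bucket (a list of plain dicts) per category of interest.
        self.buckets = {category: [] for category in self.TABLES}
        self.conn = sqlite3.connect(self.db_path)
        for table in self.TABLES.values():
            self.conn.execute(
                f"CREATE TABLE IF NOT EXISTS {table} (title TEXT, author TEXT)"
            )

    def process_item(self, item, spider):
        category = item["category"]
        if category in self.TABLES:
            # Keep a copy in the matching bucket and store it in its own table.
            self.buckets[category].append(dict(item))
            self.conn.execute(
                f"INSERT INTO {self.TABLES[category]} (title, author) VALUES (?, ?)",
                (item["title"], item["author"]),
            )
        # Return the item so any later pipeline still receives it.
        return item

    def close_spider(self, spider):
        # Persist everything written during the crawl, then release the connection.
        self.conn.commit()
        self.conn.close()
```

The pipeline only runs once it is listed in the ITEM_PIPELINES setting of settings.py, for example {'myproject.pipelines.CategoryPipeline': 300}, where myproject is a placeholder for your own project name. Using a lookup table instead of two if branches means adding a third category only requires one more entry in TABLES.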
