scrapy: yield scrapy.Request does not call the parse function on each iteration

eagi6jfj · asked 2023-01-17 in Other · 1 answer · 136 views

In my code, the start_requests method of my Scrapy spider class reads data from an Excel workbook and assigns each value to the plate_num_xlsx variable.

    def start_requests(self):
        df = pd.read_excel('data.xlsx')
        columnA_values = df['PLATE']
        for row in columnA_values:
            global plate_num_xlsx
            plate_num_xlsx = row
            print("+", plate_num_xlsx)
            base_url = f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
            url = base_url
            yield scrapy.Request(url, callback=self.parse)

I expected parse() to be called on each iteration, so that inside it the plate_num_xlsx value of the current iteration could be compared against the parsed value. But as far as I can tell from the print statements, it first assigns all the values and only then calls parse, using just the last assigned value. For my spider to work, parse() needs to receive the value that was current at the time of each yield. The full code is below:

    import scrapy
    from scrapy.crawler import CrawlerProcess
    import pandas as pd

    itemList = []

    class plateScraper(scrapy.Spider):
        name = 'scrapePlate'
        allowed_domains = ['dvlaregistrations.dvla.gov.uk']

        def start_requests(self):
            df = pd.read_excel('data.xlsx')
            columnA_values = df['PLATE']
            for row in columnA_values:
                global plate_num_xlsx
                plate_num_xlsx = row
                print("+", plate_num_xlsx)
                base_url = f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
                url = base_url
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            for row in response.css('div.resultsstrip'):
                plate = row.css('a::text').get()
                price = row.css('p::text').get()
                a = plate.replace(" ", "").strip()
                print(plate_num_xlsx, a, a == plate_num_xlsx)
                if plate_num_xlsx == plate.replace(" ", "").strip():
                    item = {"plate": plate.strip(), "price": price.strip()}
                    itemList.append(item)
                    yield item
                else:
                    item = {"plate": plate_num_xlsx, "price": "-"}
                    itemList.append(item)
                    yield item

    with pd.ExcelWriter('output_res.xlsx', mode='r+', if_sheet_exists='overlay') as writer:
        df_output = pd.DataFrame(itemList)
        df_output.to_excel(writer, sheet_name='result', index=False, header=True)

    process = CrawlerProcess()
    process.crawl(plateScraper)
    process.start()
Answer 1 (jw5wzhpr):

Using a global variable this way does not work in Scrapy because of its asynchronous runtime behaviour: the loop in start_requests has already overwritten the global before any callback runs. Instead, pass the plate_num_xlsx value to the request object itself as a callback keyword argument.
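To see why the global only ever shows the last value, here is a Scrapy-free stand-in (hypothetical names, purely illustrative). Scrapy's real engine is asynchronous rather than a simple drain, but the effect on a module-level global is the same: all the assignments happen before any callback fires.

    plate_num_xlsx = None

    def start_requests():
        global plate_num_xlsx
        for row in ["AB12CDE", "FG34HIJ", "KL56MNO"]:
            plate_num_xlsx = row
            # Stand-in for scrapy.Request: the callback reads the global later.
            yield lambda: print("parse sees:", plate_num_xlsx)

    callbacks = list(start_requests())   # all assignments happen up front
    for cb in callbacks:
        cb()                             # prints "parse sees: KL56MNO" three times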
With cb_kwargs, the fix looks like this:

            plate_num_xlsx = row
            base_url = f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
            url = base_url
            yield scrapy.Request(url, callback=self.parse, cb_kwargs={'plate_num_xlsx': plate_num_xlsx})

    def parse(self, response, plate_num_xlsx=None):
        for row in response.css('div.resultsstrip'):
            plate = row.css('a::text').get()
            price = row.css('p::text').get()
            ...

Each request now carries its own value, and it arrives in parse as the plate_num_xlsx argument.
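Putting it all together, here is a minimal sketch of the corrected spider, with cb_kwargs applied and the Excel write moved after process.start() so that itemList is already populated when it runs. File names, selectors, and the matching logic are carried over from the question; treat it as an illustrative sketch rather than something tested against the live site.

    import scrapy
    from scrapy.crawler import CrawlerProcess
    import pandas as pd

    itemList = []

    class plateScraper(scrapy.Spider):
        name = 'scrapePlate'
        allowed_domains = ['dvlaregistrations.dvla.gov.uk']

        def start_requests(self):
            df = pd.read_excel('data.xlsx')
            for plate_num_xlsx in df['PLATE']:
                url = f"https://dvlaregistrations.dvla.gov.uk/search/results.html?search={plate_num_xlsx}&action=index&pricefrom=0&priceto=&prefixmatches=&currentmatches=&limitprefix=&limitcurrent=&limitauction=&searched=true&openoption=&language=en&prefix2=Search&super=&super_pricefrom=&super_priceto="
                # Each request carries its own copy of the plate value.
                yield scrapy.Request(url, callback=self.parse,
                                     cb_kwargs={'plate_num_xlsx': plate_num_xlsx})

        def parse(self, response, plate_num_xlsx=None):
            for row in response.css('div.resultsstrip'):
                plate = row.css('a::text').get()
                price = row.css('p::text').get()
                if plate_num_xlsx == plate.replace(" ", "").strip():
                    item = {"plate": plate.strip(), "price": price.strip()}
                else:
                    item = {"plate": plate_num_xlsx, "price": "-"}
                itemList.append(item)
                yield item

    process = CrawlerProcess()
    process.crawl(plateScraper)
    process.start()

    # The crawl has finished by this point, so itemList is complete.
    pd.DataFrame(itemList).to_excel('output_res.xlsx', sheet_name='result',
                                    index=False, header=True)

An equivalent, older pattern is to stash the value in the request's meta dict (yield scrapy.Request(url, callback=self.parse, meta={'plate_num_xlsx': plate_num_xlsx}) and read response.meta['plate_num_xlsx'] in the callback); cb_kwargs, available since Scrapy 1.7, is the cleaner option because the value arrives as a named argument.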
