No data is scraped when running the following Scrapy code in Python

Posted by nuypyhwy on 2022-11-09 in Python

This is the spider I use to scrape email addresses and restaurant names from the web:

    import scrapy


    class RestaurantSpider(scrapy.Spider):
        name = 'tripadvisorbot'
        start_urls = [
            'https://www.tripadvisor.com/Restaurants-g188633-The_Hague_South_Holland_Province.html#EATERY_OVERVIEW_BOX'
        ]

        def parse(self, response):
            for listing in response.xpath('//div[contains(@class,"__cellContainer--")]'):
                link = listing.xpath('.//a[contains(@class,"__restaurantName--")]/@href').get()
                text = listing.xpath('.//a[contains(@class,"__restaurantName--")]/text()').get()
                complete_url = response.urljoin(link)
                yield scrapy.Request(
                    url=complete_url,
                    callback=self.parse_listing,
                    meta={'link': complete_url, 'text': text}
                )

            next_url = response.xpath('//*[contains(@class,"pagination")]/*[contains(@class,"next")]/@href').get()
            if next_url:
                yield scrapy.Request(response.urljoin(next_url), callback=self.parse)

        def parse_listing(self, response):
            link = response.meta['link']
            text = response.meta['text']
            email = response.xpath('//a[contains(@href, "mailto:")]/@href').get()
            yield {'Link': link, 'Text': text, 'Email': email}

I run the command below in the Anaconda prompt to run the spider above and save the output as a JSON file:

    scrapy crawl tripadvisorbot -O tripadvisor.json

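No data is scraped. A JSON file is created, but it is empty.
I don't know what the problem is. I am new to web scraping and to Python in general. Any help would be much appreciated.
Thanks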

Answer 1, by xcitsw88

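On my computer, the HTML contains no "__cellContainer--" or "__restaurantName--" classes.
The page uses random characters as class names.
However, every item sits directly in a div inside <div data-test-target="restaurants-list">, so I use that to get all the items.
Then I take the first <a> in each item (it wraps the image, not the name). I skip the text and complete_url variables and call response.follow(link) directly.
When I reach the page with the details, I use response.url as complete_url and the h1 as text.
You can put all the code in one file and run python script.py, without creating a project.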

    import scrapy


    class RestaurantSpider(scrapy.Spider):
        name = 'tripadvisorbot'
        start_urls = [
            'https://www.tripadvisor.com/Restaurants-g188633-The_Hague_South_Holland_Province.html#EATERY_OVERVIEW_BOX'
        ]

        def parse(self, response):
            for listing in response.xpath('//div[@data-test-target="restaurants-list"]/div'):
                url = listing.xpath('.//a/@href').get()
                print('link:', url)
                if url:
                    yield response.follow(url, callback=self.parse_listing)

            next_url = response.xpath('//*[contains(@class,"pagination")]/*[contains(@class,"next")]/@href').get()
            if next_url:
                yield response.follow(next_url)

        def parse_listing(self, response):
            print('url:', response.url)
            link = response.url
            text = response.xpath('//h1[@data-test-target]/text()').get()
            email = response.xpath('//a[contains(@href, "mailto:")]/@href').get()
            yield {'Link': link, 'Text': text, 'Email': email}


    # --- run without project and save data in `output.json` ---

    from scrapy.crawler import CrawlerProcess

    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'FEEDS': {'output.json': {'format': 'json'}},  # new in 2.1
    })
    c.crawl(RestaurantSpider)
    c.start()
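
The difference between the two selectors can be checked offline on a small sample. The sketch below is only an illustration: the markup is hypothetical and heavily simplified (the class names and hrefs are made up), and it uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors so that it runs without network access or extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the class names mimic generated ones,
# and the hrefs are shortened placeholders, not real page content.
SAMPLE = """
<html><body>
  <div data-test-target="restaurants-list">
    <div class="YtrWs"><a href="/Restaurant_Review-1.html"><img src="1.jpg" /></a></div>
    <div class="bhDlF"><a href="/Restaurant_Review-2.html"><img src="2.jpg" /></a></div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Class-based lookup, as in the question: no class contains "__cellContainer--".
by_class = [div for div in root.iter("div")
            if "__cellContainer--" in div.get("class", "")]

# Attribute-based lookup, as in the answer: direct child divs of the list container.
by_attr = root.findall(".//div[@data-test-target='restaurants-list']/div")

# First <a> inside each item, mirroring listing.xpath('.//a/@href').get().
links = [item.find(".//a").get("href") for item in by_attr]

print(len(by_class), len(by_attr), links)
```

With generated class names the class-based lookup matches nothing, so the loop in parse never runs and the output file stays empty, while the data-test-target lookup still finds each item.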

Partial results from running the spider:

    {"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d4766834-Reviews-Bab_mansour-The_Hague_South_Holland_Province.html", "Text": "Bab mansour", "Email": null},
    {"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d3935897-Reviews-Milos-The_Hague_South_Holland_Province.html", "Text": "Milos", "Email": null},
    {"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d10902380-Reviews-Nefeli_deli-The_Hague_South_Holland_Province.html", "Text": "Nefeli deli", "Email": "mailto:info@foodloversnl.com?subject=?"},
    {"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d8500914-Reviews-Waterkant-The_Hague_South_Holland_Province.html", "Text": "Waterkant", "Email": "mailto:alles@dewaterkant.nl?subject=?"},
    {"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d4481254-Reviews-Salero_Minang-The_Hague_South_Holland_Province.html", "Text": "Salero Minang", "Email": null},
    {"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d6451334-Reviews-Du_Passage-The_Hague_South_Holland_Province.html", "Text": "Du Passage", "Email": "mailto:info@dupassage.nl?subject=?"},
    {"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d4451714-Reviews-Lee_s_Garden-The_Hague_South_Holland_Province.html", "Text": "Lee's Garden", "Email": null},
    {"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d2181693-Reviews-Warunee-The_Hague_South_Holland_Province.html", "Text": "Warunee", "Email": "mailto:info@warunee.nl?subject=?"},
    {"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d8064876-Reviews-Sallo_s-The_Hague_South_Holland_Province.html", "Text": "Sallo's", "Email": "mailto:info@sallos.nl?subject=?"},
    {"Link": "https://www.tripadvisor.com/Restaurant_Review-g188633-d16841532-Reviews-Saravanaa_Bhavan_Den_Haag-The_Hague_South_Holland_Province.html", "Text": "Saravanaa Bhavan Den Haag", "Email": "mailto:hsbamsterdam@saravanabhavan.com?subject=?"},
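
The Email values above are raw href attributes, so they still carry the mailto: prefix and a ?subject=? query. If only the address is wanted, something like the following could be applied before yielding the item. This is a minimal sketch: clean_mailto is a hypothetical helper name, not part of Scrapy or of the answer's code.

```python
from typing import Optional
from urllib.parse import unquote, urlsplit


def clean_mailto(href: Optional[str]) -> Optional[str]:
    """Return the bare address from a mailto: link, or None if there is none."""
    if not href:
        return None
    parts = urlsplit(href)
    if parts.scheme != "mailto":
        return None
    # The address is the path part; anything after "?" (e.g. subject=?) is the query.
    address = unquote(parts.path).strip()
    return address or None


print(clean_mailto("mailto:info@warunee.nl?subject=?"))  # info@warunee.nl
print(clean_mailto(None))                                # None
```

In parse_listing this would amount to yielding clean_mailto(email) in place of email, which keeps null for restaurants that list no address.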