How do I paginate a website with dynamically loaded content (a "load more" button) using Scrapy?

vwoqyblh · published 2022-11-09 in: Other

Can someone guide me on how to paginate gelbeseiten? I'm stuck on the pagination part. In the Network tab there is an ajaxsuche request, but I don't know what to do from there. Thanks for any guidance. My code is below:

import scrapy
from scrapy_selenium import SeleniumRequest

class Data2Spider(scrapy.Spider):
    name = 'data2'

    def start_requests(self):
        yield SeleniumRequest(
                url="https://www.gelbeseiten.de/suche/hotels/n%c3%bcrnberg",
                callback=self.parse,
            )

    def parse(self, response):
        temp = []

        for i in response.xpath("//article/@data-realid").getall():
            temp.append(i)

        for r in temp:
            yield SeleniumRequest(
                url=f"https://www.gelbeseiten.de/gsbiz/{r}",
                callback=self.parse_data,
            )

    def parse_data(self, response):
        yield {
                'URL': response.url,
                'Title': response.xpath("//div[@class='a']/h1/text()").get(),
                'Phone': response.xpath("//a[@class='nolink-black']/span/text()").get(),
                'Fax': response.xpath("//div[@class='mod-Kontaktdaten__list-item contains-icon-fax']/span/text()").get(),
                'email': response.xpath("normalize-space(//div[@class='mod-Kontaktdaten__list-item contains-icon-email']/a/text())").get(),
                'Website': response.xpath("normalize-space(//div[@class='mod-Kontaktdaten__list-item contains-icon-homepage']/a/text())").get()
            }

bnl4lu3b1#

I'll show how to fetch these hotels' profile links using Requests.

import requests
from bs4 import BeautifulSoup as bs
from tqdm import tqdm  # in a Jupyter notebook, use: from tqdm.notebook import tqdm


headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)

url = 'https://www.gelbeseiten.de/ajaxsuche'
hotels_urls = []
# first page: 50 results starting at position 0
payload = {
    'umkreis': '-1', 'WAS': 'hotels', 'WO': 'nürnberg',
    'position': '0', 'anzahl': '50', 'sortierung': 'relevanz'
}
r = s.post(url, data=payload)
html_data = bs(r.json()['html'], 'html.parser')
hotels = html_data.select('article')
for h in hotels:
    hotels_urls.append('https://www.gelbeseiten.de/gsbiz/' + h.get('data-realid'))

# remaining results: 10 at a time, starting at position 50
for x in tqdm(range(50, 200, 10)):
    payload = {
        'umkreis': '-1', 'WAS': 'hotels', 'WO': 'nürnberg',
        'position': f'{x}', 'anzahl': '10', 'sortierung': 'relevanz'
    }
    r = s.post(url, data=payload)
    html_data = bs(r.json()['html'], 'html.parser')
    hotels = html_data.select('article')
    for h in hotels:
        hotels_urls.append('https://www.gelbeseiten.de/gsbiz/' + h.get('data-realid'))
print('got', len(set(hotels_urls)), 'hotels')
print(set(hotels_urls))

Output printed in the terminal:

100% 15/15 [00:05<00:00, 3.33it/s]
got 185 hotels
{'https://www.gelbeseiten.de/gsbiz/5359beb3-4426-42a1-aacc-b385b47dd714', 'https://www.gelbeseiten.de/gsbiz/f799c238-283a-4228-802b-9d60ffef407e', 'https://www.gelbeseiten.de/gsbiz/6f3d3371-3fb5-43b6-a6b5-fe65c230e54d', 'https://www.gelbeseiten.de/gsbiz/3252f849-b33d-4902-a077-9c21f0c190d4', 'https://www.gelbeseiten.de/gsbiz/d616a81c-b48d-414e-b094-1fdafb5d1dae', [...]}

This logic could, for example, be folded into the spider's start_requests() method.
For tqdm, see https://pypi.org/project/tqdm/
