I'm trying to scrape Google Flights using Scrapy and scrapy-playwright. There is a date-selection input field; I want to read the selected date range, collect the other data on the page, then change the date and collect the data again, and so on. Right now I have a script that works, but not quite the way I want it to.
Here is the latest code:
import scrapy
from scrapy_playwright.page import PageCoroutine
from bs4 import BeautifulSoup


class PwExSpider(scrapy.Spider):
    name = "pw_ex"

    headers = {
        "authority": "www.google.com",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-language": "en,ru;q=0.9",
        "cache-control": "max-age=0",
        # Requests sorts cookies= alphabetically
        # 'cookie': 'ANID=AHWqTUmN_Nw2Od2kmVHB-V-BPMn7lUDKjrsMYy6hJGcTF6v7U8u5YjJPArPDJI4K; SEARCH_SAMESITE=CgQIhpUB; CONSENT=YES+shp.gws-20220509-0-RC1.en+FX+229; OGPC=19022519-1:19023244-1:; SID=LwgAuUOC2U32iRLEjSQUdzx-18XGenx489M7BtkpBNDmZ_obyU799NLH7re0HlcH0tGNpg.; __Secure-1PSID=LwgAuUOC2U32iRLEjSQUdzx-18XGenx489M7BtkpBNDmZ_obMMyHAVo5IhVZXcHbzyERTw.; __Secure-3PSID=LwgAuUOC2U32iRLEjSQUdzx-18XGenx489M7BtkpBNDmZ_obxoNZznCMM25HAO4zuDeNTw.; HSID=A24bEjBTX5lo_2EDh; SSID=AXpmgSwtU6fitqkBi; APISID=PhBKYPpLmXydAQyJ/AzHdHtibgwX2VeVmr; SAPISID=bR71_zlABgKzGVWh/Ae0bo1S1RV74H5p0z; __Secure-1PAPISID=bR71_zlABgKzGVWh/Ae0bo1S1RV74H5p0z; __Secure-3PAPISID=bR71_zlABgKzGVWh/Ae0bo1S1RV74H5p0z; OTZ=6574663_36_36__36_; 1P_JAR=2022-07-02-19; NID=511=V3Tw5Rz0i058NG-nDiH7T8ePoRgiQTzp1MzxA-fzgJxrMiyJmXPbOtsbbIGWUZSY47b9zRw5E_CupzMBaUwWxUfxduldltqHJ8KDFsbW4F_WbUTzaHCFnwoQqEbckzWXG-12Sj94-L-Q8AIFd9UTpOzgi1jglT2pmEUzAdJ2uvO70QZ577hdlROJ4RMxl-FMefvoSJOhJOBEsW2_8H5vffLkJX-PNvl8U9gq_vyUqb_FYGx7zFBfZ5v8YPmQFFia523NrlK_J9VhdyEwGw5B3eaicpWZ8BPTEBFlYyPlnKr5PBhKeHCBL1jjc5N9WOrXHIko0hSPuQLAV8hIaiAwjHdt9ISJM3Lv7-MTiFhz7DJhCH7l72wxJtjpjw2p4gpDA5ewL5EfnhXss6sd; SIDCC=AJi4QfEvHIMmVfhjcEMP5ngU_yyfA1iSDYNmmbNKnGq3w0EspvCZaZ8Hd1oobxtDOIsY1LjJDS8; __Secure-1PSIDCC=AJi4QfEB_vOMIx2aSaNP7YGkLcpMBxMMJQLwZ5MuHjcFPrWipfycBV4V4yjT9dtifeYHAXLU_1I; __Secure-3PSIDCC=AJi4QfFhA4ftN_yWMxTXryTwMwdIdfLZzsAyzZM0lPkjhUrrRYnQwHzg87pPFf12QdgLEvpEFFc',
        "referer": "https://www.google.com/",
        "sec-ch-ua": '" Not A;Brand";v="99", "Chromium";v="100", "Yandex";v="22"',
        "sec-ch-ua-arch": '"x86"',
        "sec-ch-ua-bitness": '"64"',
        "sec-ch-ua-full-version": '"22.5.0.1879"',
        "sec-ch-ua-full-version-list": '" Not A;Brand";v="99.0.0.0", "Chromium";v="100.0.4896.143", "Yandex";v="22.5.0.1879"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-model": '""',
        "sec-ch-ua-platform": '"Linux"',
        "sec-ch-ua-platform-version": '"5.4.0"',
        "sec-ch-ua-wow64": "?0",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.143 Safari/537.36",
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://www.google.com/travel/flights/search?tfs=CBwQAhooagwIAxIIL20vMDE3N3oSCjIwMjItMDctMDNyDAgDEggvbS8wNmM2MhooagwIAxIIL20vMDZjNjISCjIwMjItMDctMjJyDAgDEggvbS8wMTc3enABggELCP___________wFAAUgBmAEB&tfu=EgYIARABGAA&curr=EUR",
            headers=self.headers,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_coroutines=[
                    PageCoroutine("wait_for_selector", "h3.zBTtmb.ZSxxwc"),
                ],
            ),
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        for i in range(0, 5):
            html = response.text
            # print(html)
            soup = BeautifulSoup(html, "html.parser")
            search_date = soup.find_all("input")[-6]["value"]
            await page.click(
                "#yDmH0d > c-wiz.zQTmif.SSPGKf > div > div:nth-child(2) > c-wiz > div > c-wiz > div.PSZ8D.EA71Tc > div.Ep1EJd > div > div.rIZzse > div.bgJkKe.K0Tsu > div > div > div.dvO2xc.k0gFV > div > div > div:nth-child(1) > div > div.oSuIZ.YICvqf.kStSsc.ieVaIb > div > div.WViz0c.CKPWLe.U9gnhd.Xbfhhd > button"
            )
            yield {
                "search_date": search_date,
            }
The script above only ever yields "Sun, Jul 3", instead of all the dates in the range:
[
{"search_date": "Sun, Jul 3"},
{"search_date": "Sun, Jul 3"},
{"search_date": "Sun, Jul 3"},
{"search_date": "Sun, Jul 3"},
{"search_date": "Sun, Jul 3"}
]
Desired output:
[
{"search_date": "Sun, Jul 3"},
{"search_date": "Mon, Jul 4"},
{"search_date": "Tue, Jul 5"},
{"search_date": "Wed, Jul 6"},
{"search_date": "Thu, Jul 7"}
]
Could anyone here give me a hand? I'm very new to Scrapy and scrapy-playwright. Thanks.
1 Answer
The logic for extracting all the dates in the range is not correct. Here is what your spider actually does:
1. You make a request to the flights page.
2. You get a single response back.
3. In the parse method, you try to extract the search date from that response.
The extraction line returns only a single date, not a list. Beyond that, the logic behind the for loop does not add up: this code never issues any further requests, even though that is what you are trying to achieve with the extracted date (as mentioned in the question).
The thing to note here is that you receive the whole HTML response in one go: response.text is a static snapshot, and clicking a button via Playwright does not update it. You can use css selectors or soup to extract all the selected dates from that single response. Running the for loop 5 times solves nothing, because you just yield the same information 5 times.
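If you do want to keep the click-driven loop, the usual fix is to re-read the rendered HTML from the Playwright page after each click instead of reusing the static response.text. A minimal sketch, assuming the long next-day button selector from your question still matches the live page and that a fixed delay is enough for it to re-render:

    async def parse(self, response):
        page = response.meta["playwright_page"]
        for _ in range(5):
            # response.text is a snapshot taken before any clicks,
            # so re-read the live DOM on every iteration instead.
            html = await page.content()
            soup = BeautifulSoup(html, "html.parser")
            search_date = soup.find_all("input")[-6]["value"]
            yield {"search_date": search_date}
            # Advance the departure date by one day, then give the
            # page time to re-render before the next read.
            await page.click("<<next-day button selector>>")
            await page.wait_for_timeout(2000)
        await page.close()

A wait_for_selector on some element that changes would be more robust than the fixed timeout, but the idea is the same: the HTML must be re-fetched from the page object after every click.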
Alternatively, using

response.css('<<Path to the dates in the select>>').getall()

you can get all the dates you are looking for from the single response, and process that information further. **Improvising on the logic:** you could rework the approach; I don't see why you extract a date range at all when you could just extract the page's departure and return dates and use them to make requests. Or extract only the departure date, increment it, and use that date to make another request for the further information.
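As a minimal sketch of that increment idea, here is how the extracted date string could be stepped forward with datetime (the next_search_date helper and the year are assumptions of mine, since the page displays dates without a year):

    from datetime import datetime, timedelta

    def next_search_date(search_date: str, year: int = 2022) -> str:
        # Parse e.g. "Sun, Jul 3"; the year is assumed because the
        # page omits it.
        parsed = datetime.strptime(f"{search_date} {year}", "%a, %b %d %Y")
        # "%-d" drops zero-padding on Linux/macOS; use "%#d" on Windows.
        return (parsed + timedelta(days=1)).strftime("%a, %b %-d")

    print(next_search_date("Sun, Jul 3"))  # -> "Mon, Jul 4"

You would then feed the incremented date into however you build the next request.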