Scrapy + Playwright: infinite scroll

abithluo, posted on 2024-01-09 in Other

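I have the code below. It opens a headless browser and I can see the page scrolling, but the response object in the parse method contains no HTML. When I don't use auto-scrolling, the spider works fine.
The code should extract only the product name and product price from this website.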

import scrapy
import re
from scrapy_playwright.page import PageMethod
from bs4 import BeautifulSoup

def should_abort_request(req):
    # Skip images and POST requests to speed up page loads
    if req.resource_type == "image":
        return True
    if req.method.lower() == 'post':
        return True
    return False

scrolling_script = """
  const numScrolls = 8
  let scrollCount = 0

  // scroll down and then wait for 5s
  const scrollInterval = setInterval(() => {
    window.scrollTo(0, document.body.scrollHeight)
    scrollCount++

    if (scrollCount === numScrolls) {
      clearInterval(scrollInterval)
    }
  }, 5000)
  """

class AuchanSpider(scrapy.Spider):
  name = 'auchan'
  custom_settings = {
    'PLAYWRIGHT_ABORT_REQUEST': should_abort_request
  }
  start_urls = ['https://zakupy.auchan.pl/shop/list/8029?shType=id']

  def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            callback=self.parse,
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("evaluate", scrolling_script),
                    #PageMethod("wait_for_timeout", 30000),
                    PageMethod("wait_for_selector", "._1E5b _2I59 _1wkJ _3YFw igxN _7Zx6 Eb4X _390_"),
                    PageMethod("wait_for_selector", "._1E5b _2I59 _1wkJ _3YFw igxN _7Zx6 Eb4X _390_:nth-child(60)")
                ],
            },
            errback=self.close_page,
            cb_kwargs=dict(main_url=url, page_number=0),
        )

  async def parse(self, response, main_url, page_number):
    # playwright_include_page=True hands us the page, so close it when done
    page = response.meta["playwright_page"]
    await page.close()
    soup = BeautifulSoup(response.text, 'html.parser')
    product_containers = soup.find_all('div', class_='_1E5b _2I59 _1wkJ _3YFw igxN _7Zx6 Eb4X _390_')
    for product_container in product_containers:
        price = product_container.find(class_='_1-UB _1Evs').get_text()
        price = re.sub(r"[\n\t\s]*", "", price)
        yield {
            'productName': product_container.find(class_='_1DGZ').get_text(),
            'price': price
        }

  async def close_page(self, failure):
    # Errback: close the Playwright page if the request failed
    page = failure.request.meta["playwright_page"]
    await page.close()


Answer #1

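I would approach this more directly than you have. There's no need for BeautifulSoup, since Playwright can already select elements on the live page. I'm also not sure Scrapy is necessary, but if you like, you can adapt the following Playwright code to Scrapy: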

import re
from playwright.sync_api import sync_playwright  # 1.37.0
from time import sleep

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    url = "https://zakupy.auchan.pl/shop/list/8029?shType=id"
    page.goto(url)
    page.click("#onetrust-accept-btn-handler")
    page.click("._3YI0")
    text = page.locator("._3MDH").text_content().strip()
    expected = int(re.search(r"\d+$", text).group())
    records = {}

    while len(records) < expected:
        page.keyboard.press("PageDown")
        sleep(0.2)  # save a bit of CPU
        items = page.eval_on_selector_all(
            "._1DGZ",
            """els => els.map(e => ({
              href: e.href,
              text: e.textContent,
            }))""",
        )

        for x in items:
            # assume hrefs are unique
            records[x["href"]] = x

    print(records)
    browser.close()

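This code dismisses the cookie and ad banners, then presses PageDown until there are no more records to fetch. I'm only extracting the titles and links from the DOM, but you can add more information if needed.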
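Note that I'm using simpler selectors. The more assumptions a selector makes, the more likely it is to fail when any one of them doesn't hold. In your case, the actual problem was using spaces instead of . to match multiple classes on a single element (a space is the descendant combinator, so ".a .b" matches a .b element inside a .a ancestor), but not using so many classes in the first place avoids that confusion. Check your selectors in the browser console first, keeping in mind that this doesn't guarantee they'll work in Playwright in a different environment. The browser can generate sample selectors. These are usually overly specific, but they're at least valid and can be refined to be more reliable.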
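Also, I realize it's probably better to use the text at the bottom of the page, "Załadowano 361 produkt(y) na 361", to determine when all records have been scraped, but I'll leave that as an exercise.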
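As a starting point for that exercise, here is a minimal sketch of parsing the footer text. It assumes the text always contains exactly two integers, the loaded count first and the total second, as in the example above. `parse_progress` and `all_loaded` are hypothetical helper names, not part of Playwright.

```python
import re


def parse_progress(text):
    """Return (loaded, total) from text like 'Załadowano 361 produkt(y) na 361'.

    Assumption: the footer contains exactly two integers, loaded then total.
    """
    numbers = re.findall(r"\d+", text)
    if len(numbers) != 2:
        raise ValueError(f"unexpected footer text: {text!r}")
    loaded, total = map(int, numbers)
    return loaded, total


def all_loaded(text):
    """True once the loaded count has reached the total."""
    loaded, total = parse_progress(text)
    return loaded >= total
```

In the scrolling loop, you could read the footer with `text_content()` on each pass and stop once `all_loaded` returns True. The footer's selector isn't given here, so you would need to find it in the browser's dev tools.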
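Another approach is to intercept the requests rather than scrape the document, which gives you a lot more data (about 2 MB for the page provided):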

import json
from playwright.sync_api import sync_playwright
from time import sleep

def scrape(page):
    url = "https://zakupy.auchan.pl/shop/list/8029?shType=id"
    items = []
    done = False

    def handle_response(response):
        nonlocal done
        api_url = "https://zakupy.auchan.pl/api/v2/cache/products"

        if response.url.startswith(api_url):
            data = response.json()
            items.append(data)

            if data["pageCount"] == data["currentPage"]:
                with open("out.json", "w") as f:
                    json.dump(items, f)
                    done = True

    page.on("response", handle_response)
    page.goto(url)
    page.click("#onetrust-accept-btn-handler")
    page.click("._3YI0")

    while not done:
        page.keyboard.press("PageDown")
        sleep(0.2)  # save a bit of CPU

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    scrape(browser.new_page())
    browser.close()


You can then loop over the JSON with jq to extract whatever information you want, for example the names:

jq '.[].results | .[] | .defaultVariant.name' < out.json


In Python:

for y in items:
    for x in y["results"]:
        print(x["defaultVariant"]["name"])


Or with a list comprehension:

[x["defaultVariant"]["name"] for y in items for x in y["results"]]


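Note that the version above misses the first page of records, which can be scraped from the DOM or fetched with a separate request using headers copied from one of the other API requests.
However, once you're in request interception territory, you can hijack a request to their API and rewire it to return 500 items, collecting all of the data more quickly and easily: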

import json
from playwright.sync_api import sync_playwright
from time import sleep

def scrape(page):
    url = "https://zakupy.auchan.pl/shop/list/8029?shType=id"
    api_url = "https://zakupy.auchan.pl/api/v2/cache/products"
    new_url = "https://zakupy.auchan.pl/api/v2/cache/products?listId=8029&itemsPerPage=500&page=1&cacheSegmentationCode=019_DEF&hl=pl"
    done = False

    def handle(route, request):
        route.continue_(url=new_url)

    page.route("https://zakupy.auchan.pl/api/v2/cache/products*", handle)

    def handle_response(response):
        nonlocal done

        if response.url.startswith(api_url):
            with open("out1.json", "w") as f:
                json.dump(response.json(), f)
                done = True

    page.on("response", handle_response)
    page.goto(url)
    page.click("#onetrust-accept-btn-handler")
    page.click("._3YI0")

    while not done:
        page.keyboard.press("PageDown")
        sleep(0.2)  # save a bit of CPU

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    scrape(browser.new_page())
    browser.close()
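
The `new_url` in the script above is hardcoded. If you'd rather derive it from whatever URL the page actually requested, a small standard-library helper along these lines may work. This is only a sketch: `with_query` is a hypothetical name, and the `itemsPerPage=15&page=2` values in the usage line are made-up examples of what an intercepted URL might contain.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query(url, **overrides):
    """Return url with the given query parameters replaced or added."""
    parts = urlsplit(url)
    # dict preserves insertion order, so existing parameters keep their position
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update({key: str(value) for key, value in overrides.items()})
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical intercepted URL; the page size and page number are assumed values
intercepted = (
    "https://zakupy.auchan.pl/api/v2/cache/products"
    "?listId=8029&itemsPerPage=15&page=2&cacheSegmentationCode=019_DEF&hl=pl"
)
rewritten = with_query(intercepted, itemsPerPage=500, page=1)
```

Inside the route handler, the same call could be applied to `request.url` before passing the result to `route.continue_(url=...)`.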


This structure can be processed as follows, again using names as the example:

jq '.results | .[] | .defaultVariant.name' < out1.json


In Python:

# data is the parsed contents of out1.json
for x in data["results"]:
    print(x["defaultVariant"]["name"])

Answer #2

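I figured it out. The problem was with wait_for_selector. There shouldn't be any spaces between the div's class names in the selector. Each space should be replaced with a '.' instead. This is what the wait_for_selector call should look like:
PageMethod("wait_for_selector", "._1E5b._2I59._1wkJ._3YFw.igxN._7Zx6.Eb4X._390_"),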
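
To avoid making that conversion by hand, here is a minimal sketch of a helper that turns a class attribute value copied from the dev tools into a compound class selector. `classes_to_selector` is a hypothetical name, and the sketch assumes the class names need no CSS escaping (true for the ones in this question).

```python
def classes_to_selector(class_attr):
    """Turn a class attribute value into a compound CSS class selector.

    'a b c' (three classes on one element) becomes '.a.b.c'.
    Assumption: the class names contain no characters that need CSS escaping.
    """
    names = class_attr.split()
    if not names:
        raise ValueError("class attribute is empty")
    return "".join("." + name for name in names)


selector = classes_to_selector("_1E5b _2I59 _1wkJ _3YFw igxN _7Zx6 Eb4X _390_")
```

The resulting string can be passed to `wait_for_selector`, though a selector built from one or two stable classes is usually less brittle than one that lists all eight.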
