selenium 数据只能交替地从网站正确获取(不一致地获取

我试着从一个网站上获取数据，下面是我所做的代码：
这些模块

import bs4
import pandas as pd
import numpy as np
import random
import requests
from lxml import etree
import time
from tqdm.notebook import tqdm

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementNotInteractableException
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager

下面是获取每个目标产品的URL：

driver = webdriver.Chrome(ChromeDriverManager().install())

for page in tqdm(range(5, 10)):
    driver.get("https://shopee.ph/Makeup-Fragrances-cat.11021036?facet=100664&page="+str(page)+"&sortBy=pop")
    
    skincare = driver.find_elements(By.XPATH, '//div[@class="col-xs-2-4 shopee-search-item-result__item"]//a[@data-sqe="link"]')

    for _skincare in tqdm(skincare):
        urls.append({"url":_skincare.get_attribute('href')})
driver.quit()

它被成功地获取了。接下来我做了以下事情：

data_final = pd.DataFrame(urls)

driver = webdriver.Chrome(ChromeDriverManager().install())
skincares = []

for product in tqdm(data_final["url"]):
    driver.get(product)
    try:
        company = driver.find_element(By.XPATH,"//div[@class='CKGyuW']//div[@class='_1Yaflp page-product__shop']//div[@class='_1YY3XU']//div[@class='zYQ1eS']//div[@class='_3LoNDM']").text
    except:
        company = 'none'
    try:
        product_name = driver.find_element(By.XPATH,"//div[@class='flex flex-auto eTjGTe']//div[@class='flex-auto flex-column  _1Kkkb-']//div[@class='_2rQP1z']//span").text
    except:
        product_name = 'none'
    try:
        rating = driver.find_element(By.XPATH,"//div[@class='flex _3tkSsu']//div[@class='flex _3T9OoL']//div[@class='_3y5XOB _14izon']").text
    except:
        rating = 'none'
    try:
        number_of_ratings = driver.find_element(By.XPATH,"//div[@class='flex _3tkSsu']//div[@class='flex _3T9OoL']//div[@class='_3y5XOB']").text
    except:
        number_of_ratings = 'none'
    try:
        sold = driver.find_element(By.XPATH,"//div[@class='flex _3tkSsu']//div[@class='flex _3EOMd6']//div[@class='HmRxgn']").text
    except:
        sold = 'none'
    try:
        price = driver.find_element(By.XPATH,"//div[@class='_2Shl1j']").text
    except:
        price = 'none'
    try:
        description = driver.find_element(By.XPATH,"//div[@class='_1MqcWX']//p[@class='_2jrvqA']").text
    except:
        description = 'none'
        
    
    skincares.append({
        "url": product,
        "company": company,
        "product name": product_name,
        "rating": rating,
        "number of ratings": number_of_ratings,
        "sold": sold,
        "price": price,
        "description": description,

        })
    time.sleep(5)

我把time.sleep（x）放在这里以避免被阻塞，我尝试了x = 1，1.5，2，5，15。上面的代码得到的结果不一致。调用

skincares_data = pd.DataFrame(skincares)
skincares_data

我得到enter image description here
这是一堆空白或未正确提取的数据。但有一件事是，如果我重新运行代码，我会得到另一组数据，其中一些空白数据现在有数据，而一些正确提取的数据现在是空白的。再次运行它，同样的问题会发生。
我认为被网站“屏蔽”并不是这里的问题（我只是用time.sleep（）来确保）。
有何评论？
我试着从一个网站上获取数据，我成功地得到了网址，但每个产品的详细信息都没有被正确获取。有很多空白数据。交替地，它们要么是空白的，要么是正确获取的。

当您向下卷动页面时，页面会动态载入。下列程式码应该可以解决您的问题：

[..]
wait = WebDriverWait(driver, 15)
url='https://shopee.ph/Makeup-Fragrances-cat.11021036?facet=100664&page=1&sortBy=pop'
driver.get(url)
rows= wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[contains(@class, "shopee-search-item-result__item")]')))
for r in rows:
    r.location_once_scrolled_into_view
t.sleep(5)
products = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@data-sqe="item"]')))
for p in products:
    name = p.find_element(By.XPATH, './/div[@data-sqe="name"]').text.strip()
    some_id = p.find_element(By.XPATH, './/a[@data-sqe="link"]').get_attribute('href').split('?sp_atk=')[0].split('-i.')[1]
    print(name, some_id)

所有项目将在终端打印：

ORIG M.Q. Cosmetics MACAROON LIP THERAPY LIPBALM WITH SPATULA | MQ
wholesale 10092844.9115684791
Magic Lip Therapy Balm in 10g jar (FREE Spatula) Rebranding NO STICKER! 286498185.11511633880
BIOAQUA COLLAGEN Nourish Lips Membrane Moisturizing Lip Mask moisture nourishing skin care soft 295464315.8585504678
Lip therapy Cosmetic Potion lipbalm
₱5 off
Free Gift 11055729.11663828134
VASELINE Rosy Lip Stick 4.8g 92328166.8130605004
Collagen Crystal lip mask lips plump gel personal care hydrating lip whitening a smacker wrinkle gel 386726777.2925165359
blk cosmetics fresh lip scrub coco crush 62677292.5532509493
[...]

Selenium文档可在here中找到

selenium 数据只能交替地从网站正确获取(不一致地获取

1条答案

相关问题

热门标签

最新问答