I am trying to scrape data from a website. Here is what I did.

The imports:
```python
import bs4
import pandas as pd
import numpy as np
import random
import requests
from lxml import etree
import time
from tqdm.notebook import tqdm
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementNotInteractableException
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager
```
Below is how I collect the URL of each target product (note: `urls` must be initialized before the loop, which was missing from my original snippet):

```python
urls = []
driver = webdriver.Chrome(ChromeDriverManager().install())
for page in tqdm(range(5, 10)):
    driver.get("https://shopee.ph/Makeup-Fragrances-cat.11021036?facet=100664&page=" + str(page) + "&sortBy=pop")
    skincare = driver.find_elements(By.XPATH, '//div[@class="col-xs-2-4 shopee-search-item-result__item"]//a[@data-sqe="link"]')
    for _skincare in tqdm(skincare):
        urls.append({"url": _skincare.get_attribute('href')})
driver.quit()
```
The URLs were collected successfully. Next I did the following:
```python
data_final = pd.DataFrame(urls)
driver = webdriver.Chrome(ChromeDriverManager().install())
skincares = []
for product in tqdm(data_final["url"]):
    driver.get(product)
    try:
        company = driver.find_element(By.XPATH, "//div[@class='CKGyuW']//div[@class='_1Yaflp page-product__shop']//div[@class='_1YY3XU']//div[@class='zYQ1eS']//div[@class='_3LoNDM']").text
    except:
        company = 'none'
    try:
        product_name = driver.find_element(By.XPATH, "//div[@class='flex flex-auto eTjGTe']//div[@class='flex-auto flex-column _1Kkkb-']//div[@class='_2rQP1z']//span").text
    except:
        product_name = 'none'
    try:
        rating = driver.find_element(By.XPATH, "//div[@class='flex _3tkSsu']//div[@class='flex _3T9OoL']//div[@class='_3y5XOB _14izon']").text
    except:
        rating = 'none'
    try:
        number_of_ratings = driver.find_element(By.XPATH, "//div[@class='flex _3tkSsu']//div[@class='flex _3T9OoL']//div[@class='_3y5XOB']").text
    except:
        number_of_ratings = 'none'
    try:
        sold = driver.find_element(By.XPATH, "//div[@class='flex _3tkSsu']//div[@class='flex _3EOMd6']//div[@class='HmRxgn']").text
    except:
        sold = 'none'
    try:
        price = driver.find_element(By.XPATH, "//div[@class='_2Shl1j']").text
    except:
        price = 'none'
    try:
        description = driver.find_element(By.XPATH, "//div[@class='_1MqcWX']//p[@class='_2jrvqA']").text
    except:
        description = 'none'
    skincares.append({
        "url": product,
        "company": company,
        "product name": product_name,
        "rating": rating,
        "number of ratings": number_of_ratings,
        "sold": sold,
        "price": price,
        "description": description,
    })
    time.sleep(5)
```
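As an aside, the seven near-identical `try`/`except` blocks in the loop above can be factored into a single helper. A minimal sketch (the `first_text` name is my own; it takes any zero-argument finder so it stays independent of Selenium and falls back to the same `'none'` default):

```python
def first_text(find, default="none"):
    """Run `find()` (e.g. a lambda wrapping driver.find_element) and
    return the located element's .text; fall back to `default` if the
    lookup raises (NoSuchElementException in the Selenium case)."""
    try:
        return find().text
    except Exception:
        return default
```

Inside the loop each field then becomes a one-liner, e.g. `company = first_text(lambda: driver.find_element(By.XPATH, "//div[@class='_3LoNDM']"))`.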
I put `time.sleep(x)` there to avoid being blocked; I tried x = 1, 1.5, 2, 5, and 15. The code above gives inconsistent results. Calling
```python
skincares_data = pd.DataFrame(skincares)
skincares_data
```
I get [screenshot of the resulting DataFrame], which is full of blank or incorrectly extracted values. The odd thing is that if I re-run the code, I get a different set of results: some previously blank fields now contain data, while some fields that were extracted correctly before are now blank. Running it again, the same thing happens.
I don't think being "blocked" by the site is the problem here (I used `time.sleep()` just to make sure of that).

Any comments?
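Incidentally, a fixed sleep makes requests arrive on a perfectly regular beat, which is easy for a site to spot; randomizing the delay (the `random` module is already imported above) is a common precaution. A small sketch, with `polite_pause` being a name of my own choosing:

```python
import random
import time

def polite_pause(lo=2.0, hi=5.0):
    """Sleep for a random interval in [lo, hi] seconds so successive
    page loads do not hit the server at a fixed rhythm."""
    time.sleep(random.uniform(lo, hi))
```

The `time.sleep(5)` at the end of the scraping loop could be replaced with `polite_pause()`.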
1 Answer
The page loads dynamically as you scroll down. The following code should solve your problem:
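The answer's original code block did not survive extraction. A minimal sketch of the idea it describes, scrolling until `document.body.scrollHeight` stops growing so that lazily loaded tiles are rendered before `find_elements` runs (the function name and parameters are my own):

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll to the bottom of the page repeatedly until the page
    height stops growing, forcing lazy-loaded content to render."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the new tiles time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height stable: nothing more to load
        last_height = new_height
```

Calling `scroll_to_bottom(driver)` right after each `driver.get(...)` and before the `find_element` calls should make the extracted fields much more consistent.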
All items will be printed in the terminal.

The Selenium documentation can be found here.