Python web scraping of JavaScript-rendered data with Selenium

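I want to scrape data from a website that is rendered with JavaScript (https://nextgenftl.com/leagues/ftl-main-2022/game-weeks/week-30/players). I want to get every player, along with each player's badge, price, and price change. How can I get all of the data from the site after it has rendered?
I tried to render the whole page (including scripts) before scraping it.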

from requests_html import HTMLSession
from bs4 import BeautifulSoup

# Assign the URL,
# create the HTMLSession object,
# and run the "get" method to retrieve information from the URL
week = 30
url = f'https://nextgenftl.com/leagues/ftl-main-2022/game-weeks/week-{week}/players'
session = HTMLSession()
response = session.get(url)

# Check that the HTTP status code was 200
# (page retrieved successfully)
res_code = response.status_code
print(res_code)
if res_code == 200:
    # This is the critical line: render() executes the page's JavaScript in headless Chromium
    response.html.render()

    # Get the page content
    soup = BeautifulSoup(response.content, 'lxml')
    print(soup.prettify())
    
else:
    print("Could not reach web page!")

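I can't use BS4 on its own, because the page source does not contain the body (the body is rendered entirely by JavaScript). I also looked in the browser's Network tab to see which APIs deliver the data, but that did not work. I tried Selenium as well, but I still can't work out how to scrape the data from the site.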

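Here is one way to get this information with Selenium. It is not fast, but it is reliable and returns all 725 players. The Selenium setup here is chromedriver on Linux; adapt it to your own setup, and just note the imports and the code that follows the driver definition.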

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-notifications")
chrome_options.add_argument("--window-size=1280,720")

webdriver_service = Service("chromedriver/chromedriver")  # path to where you saved the chromedriver binary
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(driver, 25)
url = 'https://nextgenftl.com/leagues/ftl-main-2022/game-weeks/week-30/players'
big_list = []
driver.get(url)

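# Trigger the infinite scroll 10 times so that every player row gets loaded
for _ in range(10):
    players = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//ion-list[not(@id="menu-list")]//ion-item')))
    for p in players:
        p.location_once_scrolled_into_view  # accessing this property scrolls the row into view
    wait.until(EC.presence_of_element_located((By.TAG_NAME, 'ion-infinite-scroll'))).location_once_scrolled_into_view

    t.sleep(1)  # give the next batch of rows time to load

# Re-query the list now that all rows are present
players = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//ion-list[not(@id="menu-list")]//ion-item')))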
for p in players:
    try:
        p.location_once_scrolled_into_view
        badge = p.find_element(By.XPATH, './/ion-badge').text
        name = p.find_element(By.XPATH, './/ion-label').text
        current_price = p.find_element(By.XPATH, './/div[@title="Current Price"]').text
        price_change = p.find_element(By.XPATH, './/div[@title="Price Change"]').text
        average_points = p.find_element(By.XPATH, './/div[@title="3-Week Average Points"]').text
        events_played = p.find_element(By.XPATH, './/div[@title="Events Played"]').text
        
        big_list.append((badge, name, current_price, price_change, average_points, events_played))
    except Exception as e:
        # Report why the row failed instead of printing a bare 'error'
        print(f'error: {e}')
        continue
driver.quit()  # the text of every row is already stored in big_list, so the browser can be closed
print(len(big_list))
df = pd.DataFrame(big_list, columns = ['badge', 'name', 'current_price', 'price_change', 'average_points', 'events_played'])
print(df)
df.to_csv('fantasy_tennis.csv')

This prints the DataFrame in the terminal and saves it as a CSV file:

725
     badge  name                 current_price  price_change  average_points  events_played
0    ATP    Novak Djokovic       $19.864m       --            116.97          7
1    ATP    Rafael Nadal         $19.295m       ↓ 1.137       53.92           9
2    WTA    Iga Swiatek          $17.835m       ↓ 0.074       72.70           13
3    WTA    Ashleigh Barty       $16.800m       --            169.50          1
4    ATP    Carlos Alcaraz       $15.587m       ↑ 0.494       74.14           14
...  ...    ...                  ...            ...           ...             ...
720  WTA    Dayana Yastremska    $1.450m        ↓ 0.068       3.75            14
721  WTA    Xiaodi You           $1.450m        --            3.77            1
722  WTA    Eleana Yu            $1.450m        --            2.90            1
723  WTA    Anastasia Zakharova  $1.450m        --            1.77            1
724  ATP    Kacper Zuk           $1.450m        --            4.16            1
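
The price columns come back as display strings (`$19.864m`, `↓ 1.137`, `--`), so they need converting before any sorting or arithmetic. Below is a minimal sketch of one way to do that. The helper names `parse_price` and `parse_price_change` are hypothetical, and the sketch assumes that prices are always written as `$<number>m` and that `--` means "no change"; neither assumption is confirmed by the site.

```python
import re


def parse_price(text):
    """Convert a display string such as '$19.864m' to a float (19.864).

    Assumes the format is always '$<number>m'.
    """
    match = re.fullmatch(r"\$(\d+(?:\.\d+)?)m", text.strip())
    if match is None:
        raise ValueError(f"unexpected price format: {text!r}")
    return float(match.group(1))


def parse_price_change(text):
    """Convert an arrow-prefixed change to a signed float.

    Assumes '↑' marks a rise, '↓' marks a fall, and '--' means no change.
    """
    text = text.strip()
    if text == "--":
        return 0.0
    arrow, _, amount = text.partition(" ")
    sign = {"↑": 1.0, "↓": -1.0}.get(arrow)
    if sign is None:
        raise ValueError(f"unexpected price change format: {text!r}")
    return sign * float(amount)
```

With the DataFrame from the script, these could be applied column-wise, for example `df['current_price'].map(parse_price)` and `df['price_change'].map(parse_price_change)`.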

See the Selenium documentation at https://www.selenium.dev/documentation/
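
The script scrolls a fixed 10 times, which happens to be enough for 725 players but could fall short if the list grows. One possible alternative is to keep scrolling until the number of rows stops increasing. The sketch below is an assumption-laden illustration, not part of the original answer: `scroll_until_stable`, `count_items`, and `scroll_once` are hypothetical names, and the function takes plain callables so that it does not depend on Selenium itself.

```python
import time


def scroll_until_stable(count_items, scroll_once, max_rounds=50, pause=1.0,
                        sleep=time.sleep):
    """Scroll until the item count stops growing and return the final count.

    count_items: callable returning the number of rows currently loaded.
    scroll_once: callable that triggers one more batch to load.
    max_rounds:  safety cap so the loop always terminates.
    """
    previous = count_items()
    for _ in range(max_rounds):
        scroll_once()
        sleep(pause)  # give the page time to load the next batch
        current = count_items()
        if current == previous:
            break  # nothing new appeared, so assume the list is complete
        previous = current
    return previous


# With the driver from the script, the two callables might look like this
# (not executed here):
#   xpath = '//ion-list[not(@id="menu-list")]//ion-item'
#   count_items = lambda: len(driver.find_elements(By.XPATH, xpath))
#   scroll_once = lambda: driver.find_element(
#       By.TAG_NAME, 'ion-infinite-scroll').location_once_scrolled_into_view
```

A slow connection could make the count look stable before the next batch has arrived, so the `pause` value may need to be raised, or the stop condition changed to require several unchanged rounds in a row.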
