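This is my code for scraping the tables from these two links. It does not crash.
https://racing.hkjc.com/racing/information/English/Jockey/JockeyRanking.aspx
https://racing.hkjc.com/racing/information/English/Trainers/TrainerRanking.aspx
However, when I run it, the two tables seem to overwrite each other and end up on the same sheet instead of on separate sheets. Is there a way to fix this?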
import pandas as pd
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def scrape_ranking(url, sheet_name):
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        soup = BeautifulSoup(page.content(), "html.parser")
        table = soup.select_one(".table_bd")
        if table is None:
            print("Table not found.")
        else:
            df = pd.read_html(str(table))[0]
            df.to_excel("hkjc.xlsx", sheet_name=sheet_name, index=True)

# Scrape TrainerRanking page
url_trainer = "https://racing.hkjc.com/racing/information/English/Trainers/TrainerRanking.aspx"
scrape_ranking(url_trainer, "TrainerRanking")

# Scrape JockeyRanking page
url_jockey = "https://racing.hkjc.com/racing/information/English/Jockey/JockeyRanking.aspx"
scrape_ranking(url_jockey, "JockeyRanking")
print("done")
1 Answer

abithluo #1
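Try using ExcelWriter in append mode (mode='a'). Calling df.to_excel with a file name creates a new workbook on every call, so the second call replaces the file written by the first. With if_sheet_exists='replace', a sheet that already has the name sheet_name is overwritten; with if_sheet_exists='overlay', the data is written into the existing sheet without clearing it first, so it may land on top of existing cells.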
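The answer's original code did not survive in this copy of the thread, so the following is only a minimal sketch of the append-mode idea, not the answerer's exact code. The helper name write_sheet is hypothetical. It assumes the openpyxl engine is installed and a pandas version that supports if_sheet_exists (1.3 or later, as far as I know). Append mode needs an existing file, and if_sheet_exists is only accepted in append mode, so the first call falls back to write mode.

```python
import os

import pandas as pd


def write_sheet(df, path, sheet_name):
    """Write df to its own sheet in path, leaving other sheets intact.

    Hypothetical helper for illustration; not part of the original answer.
    """
    if os.path.exists(path):
        # Append to the existing workbook; replace the sheet if it is
        # already there so that re-running the script does not fail.
        writer_kwargs = {"mode": "a", "if_sheet_exists": "replace"}
    else:
        # First call: the file does not exist yet, so create it.
        # if_sheet_exists must not be passed outside append mode.
        writer_kwargs = {"mode": "w"}

    with pd.ExcelWriter(path, engine="openpyxl", **writer_kwargs) as writer:
        df.to_excel(writer, sheet_name=sheet_name, index=True)
```

In the question's scrape_ranking, the df.to_excel("hkjc.xlsx", ...) line would become write_sheet(df, "hkjc.xlsx", sheet_name), so the TrainerRanking and JockeyRanking calls each land on their own sheet of the same workbook.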