Python: save/edit scraped URLs to a directory

ttp71kqs  posted on 2023-05-05  in Python

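I have successfully scraped the links from a website, and I want to save the pages to an existing local folder named "HerHoops" so I can parse them later. I have done this successfully before, but this site's links need a bit more cleanup.
Here is my code so far. I want to keep everything after "box_score" in each link, so that the saved file name includes the date and the teams playing. The files should also be saved in write mode ("w+").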

url = f"https://herhoopstats.com/stats/wnba/schedule_date/2004/6/1/"
data = requests.get(url)
soup = BeautifulSoup(data.text)
matchup_table = soup.find_all("div", {"class": "schedule"})[0]

links = matchup_table.find_all('a')
links = [l.get("href") for l in links]
links = [l for l in links if '/box_score/' in l]

box_scores_urls = [f"https://herhoopstats.com{l}" for l in links]

for box_scores_url in box_scores_urls:
    data = requests.get(box_scores_url)
    # within loop opening up page and saving to folder in write mode
    with open("HerHoops/{}".format(box_scores_url[46:]), "w+") as f:
        # write to the files
        f.write(data.text)
    time.sleep(3)

The error is:

FileNotFoundError: [Errno 2] No such file or directory: 'HerHoops/2004/06/01/new-york-liberty-vs-charlotte-sting/'

tyky79it #1

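The error itself makes it clear that you are trying to write to the file "HerHoops/2004/06/01/new-york-liberty-vs-charlotte-sting/", but some of the directories in that path do not exist. The path also ends in a slash, so it names a directory rather than a file. We can use os.makedirs() to create the necessary directories before writing, and build a flat file name from the date and team parts of the URL.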
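As a minimal sketch of the os.makedirs() idea on its own, the helper below creates any missing parent directories before opening the file. The name write_text and the example path are hypothetical, chosen for illustration only, and the demo runs in a temporary directory so it does not touch a real "HerHoops" folder.

```python
import os
import tempfile


def write_text(path, text):
    """Write text to path, creating missing parent directories first."""
    parent = os.path.dirname(path)
    if parent:
        # exist_ok=True makes this a no-op when the directory already exists
        os.makedirs(parent, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Demonstration in a throwaway directory (hypothetical nested path)
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(
        root, "HerHoops", "2004", "06", "01",
        "new-york-liberty-vs-charlotte-sting.html",
    )
    write_text(target, "<html></html>")
    created = os.path.exists(target)
```

Without the os.makedirs() call, the open() in this demo would raise the same FileNotFoundError as in the question, because the nested directories do not exist yet.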
Full code:

import os
import re
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup

url = "https://herhoopstats.com/stats/wnba/schedule_date/2004/6/1/"
data = requests.get(url)
# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(data.text, "html.parser")
matchup_table = soup.find_all("div", {"class": "schedule"})[0]

links = matchup_table.find_all('a')
links = [l.get("href") for l in links]
# skip anchors without an href, then keep only box score links
links = [l for l in links if l and '/box_score/' in l]

box_scores_urls = [f"https://herhoopstats.com{l}" for l in links]

# create the directory once if it doesn't exist
directory = "HerHoops"
os.makedirs(directory, exist_ok=True)

for box_scores_url in box_scores_urls:
    # extract date and teams from the box_scores_url, which is assumed to end in
    # /box_score/<year>/<month>/<day>/<teams>/ (as in the error message)
    match = re.search(r"/box_score/(\d{4})/(\d{1,2})/(\d{1,2})/([^/]+)", box_scores_url)
    if match is None:
        continue  # skip links that don't follow the expected pattern
    year, month, day, teams_str = match.groups()
    date_str = datetime(int(year), int(month), int(day)).strftime("%Y-%m-%d")
    data = requests.get(box_scores_url)
    # within loop opening up page and saving to folder in write mode
    with open(os.path.join(directory, f"{date_str}-{teams_str}.html"), "w+", encoding="utf-8") as f:
        # write to the file
        f.write(data.text)
    time.sleep(3)
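
If you would rather not use a regular expression, here is a rough alternative sketch for building the file name with urllib.parse. The function name box_score_filename is hypothetical, and the example URL is inferred from the error message and the [46:] slice in the question, so treat the URL shape as an assumption rather than something confirmed against the site.

```python
from urllib.parse import urlparse


def box_score_filename(box_score_url):
    """Build a flat file name from everything after '/box_score/' in the URL."""
    path = urlparse(box_score_url).path
    _, sep, tail = path.partition("/box_score/")
    if not sep:
        raise ValueError(f"not a box score URL: {box_score_url}")
    # Drop empty segments left by leading or trailing slashes
    parts = [p for p in tail.split("/") if p]
    return "-".join(parts) + ".html"


# Example URL assumed from the error message in the question
example = (
    "https://herhoopstats.com/stats/wnba/box_score/"
    "2004/06/01/new-york-liberty-vs-charlotte-sting/"
)
print(box_score_filename(example))
# prints: 2004-06-01-new-york-liberty-vs-charlotte-sting.html
```

Joining the segments with "-" keeps every file directly inside "HerHoops", so no per-date subdirectories are needed.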
