早上好!现在我正在制作NBA统计数据框架来分析球员的赛季统计数据。现在,我成功地编写了代码,可以自动按季节生成 Dataframe 。然而,我正在解决的问题是,我可以将一季的 Dataframe 保存为单个xlsx文件(例如:data_2001.xlsx,data_2002.xlsx,..... data_2022.xlsx),但我想使用for循环将其转换为包含许多季节的**“one”**xlsx文件。(例如:data_2001~2022.xlsx)
years = [2001,2002]
for y in years:
# Season
url = 'https://www.basketball-reference.com/leagues/NBA_{}_advanced.html'.format(y)
html = urlopen(url)
soup = BeautifulSoup(html)
soup.findAll('tr',limit=2)
# use getText()to extract the text we need into a list
headers = [th.getText() for th in soup.findAll('tr',limit=2)[0].findAll('th')]
# exclude the first column as we will not need the ranking order from Basketball Reference for the analysis
headers = headers[1:]
rows = soup.findAll('tr')[1:]
player_stats = [[td.getText() for td in rows[i].findAll('td')] for i in range (len(rows))]
stats = pd.DataFrame(player_stats, columns = headers)
stats['Year'] = y #Added column ['Year'] to recognize season in the dataframe
stats['Season_type'] = 'RS'. #Added column ['Season_type'] to recognize season type in the dataframe
stats = stats.apply(pd.to_numeric,errors='ignore') # changed object data types to float to maniupulate data
stats = stats[stats['G']>=57] #Only players who played more than 70% of games
stats = stats.drop(stats.columns[[18,23]],axis=1) # drop NAN columns
print(f'Finished scraping data for the {y}.')
lag = np.random.uniform(low=5,high=10)
print(f'...waiting {round(lag,1)} seconds')
time.sleep(lag)
y_str = str(y)
stats.to_excel('data_'+y_str+'.xlsx', index=False). # save as xlsx file
这段代码的输出为data2001.xlsx和data2002.xlsx
data_2001 = pd.read_excel('data2001.xlsx')
data_2002 = pd.read_excel('data2002.xlsx')
For data_2001For data_2002
但我想做的是一个包含很多季节的xlsx文件。我应该给我的for循环做什么修改?谢谢你帮我!!
1条答案
按热度按时间yzuktlbb1#
你可以简单地替换最后一行,在那里你用一个append保存每个dataframe,所以每个循环都会将每个包含季节的dataframe(在你的循环中)附加到一个列表中,然后传递给pd.concat()。在循环之前,你可以这样做: