pandas 我如何使用for循环来创建自动NBA统计网页抓取代码多个赛季?

oxosxuxt  于 2023-06-20  发布在  其他
关注(0)|答案(1)|浏览(98)

早上好!现在我正在制作NBA统计数据框架来分析球员的赛季统计数据。现在,我成功地编写了代码,可以自动按季节生成 Dataframe 。然而,我正在解决的问题是,我可以将一季的 Dataframe 保存为单个xlsx文件(例如:data_2001.xlsx,data_2002.xlsx,..... data_2022.xlsx),但我想使用for循环将其转换为包含许多季节的**“one”**xlsx文件。(例如:data_2001~2022.xlsx)

years = [2001,2002]

for y in years:
    # Season
    url = 'https://www.basketball-reference.com/leagues/NBA_{}_advanced.html'.format(y)
    html = urlopen(url)
    soup = BeautifulSoup(html)

    soup.findAll('tr',limit=2)
    # use getText()to extract the text we need into a list
    headers = [th.getText() for th in soup.findAll('tr',limit=2)[0].findAll('th')]
    # exclude the first column as we will not need the ranking order from Basketball Reference for the analysis
    headers = headers[1:]
    
    rows = soup.findAll('tr')[1:]
    player_stats = [[td.getText() for td in rows[i].findAll('td')] for i in range (len(rows))]
    stats = pd.DataFrame(player_stats, columns = headers)

    
    stats['Year'] = y             #Added column ['Year'] to recognize season in the dataframe  
    stats['Season_type'] = 'RS'.  #Added column ['Season_type'] to recognize season type in the dataframe  
    stats = stats.apply(pd.to_numeric,errors='ignore') # changed object data types to float to maniupulate data
    stats = stats[stats['G']>=57]                #Only players who played more than 70% of games
    stats = stats.drop(stats.columns[[18,23]],axis=1) # drop NAN columns

  
    print(f'Finished scraping data for the {y}.')
    lag = np.random.uniform(low=5,high=10)
    print(f'...waiting {round(lag,1)} seconds')
    time.sleep(lag)
    
    y_str = str(y)
    stats.to_excel('data_'+y_str+'.xlsx', index=False). # save as xlsx file

这段代码的输出为data2001.xlsx和data2002.xlsx

data_2001 = pd.read_excel('data2001.xlsx')
data_2002 = pd.read_excel('data2002.xlsx')

For data_2001For data_2002
但我想做的是一个包含很多季节的xlsx文件。我应该给我的for循环做什么修改?谢谢你帮我!!

yzuktlbb

yzuktlbb1#

你可以简单地替换最后一行,在那里你用一个append保存每个dataframe,所以每个循环都会将每个包含季节的dataframe(在你的循环中)附加到一个列表中,然后传递给pd.concat()。在循环之前,你可以这样做:

season_df = []
for y in years:
   #no need to change anything here until
   #...
   print(f'Finished scraping data for the {y}.')
   lag = np.random.uniform(low=5,high=10)
   print(f'...waiting {round(lag,1)} seconds')
   time.sleep(lag)
   season_df.append(stats)
#Outside the for loop
all_seasons_df = pd.concat(season_df)

相关问题