Best practice for computing differences between dates by category with Pandas

q3qa4bjr  posted on 2023-03-06  in  Other

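I am trying to rank the differences between values in a Pandas DataFrame that are grouped by date and category. What matters in the end is the names of the categories with the lowest and highest growth between two dates, together with the corresponding growth.
I believe my code works, but it looks overly complicated. I would like to find the best Pandas way (fastest, most standard, most straightforward, etc.) to do this. Here is my code: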

import pandas as pd
import numpy as np

# Creation of random data
size = 1_000
df = pd.DataFrame()
df['Borough'] = np.random.choice(['Brooklyn', 'Manhattan', 'Bronx', 'Queens', 'Staten Island'], size)
df['Date'] = pd.to_datetime(np.random.randint(2011, 2021, size), format="%Y")
df['Nbr_permits'] = np.random.randint(0, 300, size)

# Calculation of the sorted differences in the number of permits per boroughs between 2011 and 2020
res = (df[(df['Date'].dt.year == 2020)].groupby('Borough')['Nbr_permits'].sum() - df[(df['Date'].dt.year == 2011)].groupby('Borough')['Nbr_permits'].sum()).sort_values().dropna()

#Lowest progression of nbr_permits between 2011 and 2020:
print(res.idxmin(), res[res.idxmin()])

#Highest progression of nbr_permits between 2011 and 2020:
print(res.idxmax(), res[res.idxmax()])

Can I do this better with Pandas?

8yoxcaq7  #1

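There is a better way: pivot the data so that each year becomes a column, then subtract the two year columns.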

# extract year from Date column
df['Year'] = df['Date'].dt.year

# Reshape into wide format and agg 
# to sum permits per Borough and Year
df1 = df.pivot_table(index='Borough', 
                     columns='Year', 
                     values='Nbr_permits', 
                     aggfunc='sum')

# Calculate the diff
diff = df1[2020] - df1[2011]
# Show the boroughs with the lowest and highest difference
print(diff[[diff.idxmin(), diff.idxmax()]])

Result (the data is random and unseeded, so the values differ from run to run)

Borough
Queens     -1236
Brooklyn    2097
dtype: int64
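
Because the example data is random, the output above cannot be reproduced exactly. As a minimal sketch, the same pivot approach can be wrapped in a function and checked against a small hand-built table. The function name `permit_growth_extremes` and the toy values are hypothetical and not part of the original answer; the `dropna()` call is carried over from the question's code so that a borough missing one of the two years is ignored.

```python
import pandas as pd

def permit_growth_extremes(df, start_year, end_year):
    """Return the boroughs with the lowest and highest change in permits
    between start_year and end_year (hypothetical helper for illustration)."""
    # Sum permits per borough and year, one column per year
    totals = (
        df.assign(Year=df["Date"].dt.year)
          .pivot_table(index="Borough", columns="Year",
                       values="Nbr_permits", aggfunc="sum")
    )
    # Difference between the two years; drop boroughs missing either year
    diff = (totals[end_year] - totals[start_year]).dropna()
    return diff.loc[[diff.idxmin(), diff.idxmax()]]

# Toy data with known totals (made-up values)
toy = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Queens", "Queens",
                "Brooklyn", "Brooklyn", "Bronx"],
    "Date": pd.to_datetime(["2011-01-01", "2020-01-01", "2011-01-01",
                            "2020-01-01", "2011-01-01", "2020-01-01",
                            "2020-01-01"]),
    "Nbr_permits": [10, 25, 40, 5, 7, 7, 5],
})

extremes = permit_growth_extremes(toy, 2011, 2020)
print(extremes)
```

In the toy table, Queens goes from 40 to 5 permits and Bronx from 10 to 30, so those two boroughs are returned as the lowest and highest progression.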
