Pandas,如何提高性能,迭代太慢了

bbuxkriu  于 2023-10-14  发布在  其他
关注(0)|答案(1)|浏览(90)

我有一个数据集
| ID|开放月|月底|数量|
| --|--|--|--|
| 001 |2023-01-31 - 2023-01-31| 2023-02-28 2023-02-28 2023-02-28| 1 |
| 002 |2023-01-31 - 2023-01-31| 2023-03-31| 5 |
| 003 |2023-01-31 - 2023-01-31| 2023-04-30 - 2023-04-30| 4 |
| 004 |2023-02-28 2023-02-28 2023-02-28| 2023-02-28 2023-02-28 2023-02-28| 2 |
| 005 |2023-02-28 2023-02-28 2023-02-28| 2023-03-31| 3 |
| 006 |2023-02-28 2023-02-28 2023-02-28| 2023-04-30 - 2023-04-30| 6 |
| 007 |2023-03-31| 2023-03-31| 7 |
| 008 |2023-03-31| 2023-04-30 - 2023-04-30| 9 |

x = pd.DataFrame({
  'id': ['001', '002', '003', '004', '005', '006', '007', '008'],
  'open_month': ['2023-01-31', '2023-01-31', '2023-01-31', '2023-02-28', '2023-02-28', '2023-02-28', '2023-03-31', '2023-03-31'],
  'end_month': ['2023-02-28', '2023-03-31', '2023-04-30', '2023-02-28', '2023-03-31', '2023-04-30', '2023-03-31', '2023-04-30'],
  'quantity': [1, 5, 4, 2, 3, 6, 7, 9]
}
)

我需要得到一个表格,其中行是从0开始的月份数,列是开放的月份,值是结束项目的数量。
期望结果
| | 2023-02-28 2023-02-28 2023-02-28| 2023-03-31| 2023-03-31 |
| --|--|--|--|
| 0 | 0 | 2 | 7 |
| 1 | 1 | 5 | 16 |
| 2 | 6 | 11 | 16 |

table = pd.DataFrame()
for i in range(len(x['open_month'].unique())):
    for month in x['open_month'].unique():
        date = month + pd.offsets.MonthEnd(i)
        table.at[i, month] = x.query('open_month == @month and end_month <= @date')['quantity'].sum()

我的代码工作如我所料,但太慢的真实的数据(> 2百万ID的)

njthzxwz

njthzxwz1#

尝试构建一个基于列表解析的DataFrame:

open_months = x['open_month'].unique()
df = pd.DataFrame(np.array([[x[x['open_month'].eq(m) 
                               & x['end_month'].le(m + pd.offsets.MonthEnd(i))]['quantity'].sum() 
                             for i in range(len(open_months))]
                             for m in open_months]).T, columns=open_months)
2023-01-31  2023-02-28  2023-03-31
0           0           2           7
1           1           5          16
2           6          11          16

相关问题