使用Python将格式笨拙的Excel数据转换为表格格式

2guxujil  于 2023-05-19  发布在  Python
关注(0)|答案(1)|浏览(126)

我有一个Excel电子表格,其中包含每月每一天的记录。不幸的是,数据集的格式很笨拙,很难进行分析。我想将数据重组成表格格式,每个标题下的日期、地点和相应数量列。我已经附上了当前格式的图像以及所需的输出。

当前格式

0    1       2    3       4    5       6    7       8    9
0   01/01/2023  NaN     NaN  NaN     NaN  NaN     NaN  NaN     NaN  NaN
1       Venue1  QTY  Venue2  QTY  Venue3  QTY  Venue4  QTY  Venue5  QTY
2            A    0       A    0       A    1       A    0       A    0
3            B   17       B    3       B   11       B    3       B    0
4            C    0       C    0       C    1       C    0       C    0
5            D    0       D    0       D   29       D    0       D    0
6            E    0       E    0       E    0       E    0       E    0
7            F    0       F    0       F    0       F    0       F    0
8            G    0       G    0       G    0       G    0       G    0
9            H    0       H    0       H    0       H    0       H    0
10  02/01/2023  NaN     NaN  NaN     NaN  NaN     NaN  NaN     NaN  NaN
11      Venue1  QTY  Venue2  QTY  Venue3  QTY  Venue4  QTY  Venue5  QTY
12           A    0       A    0       A    1       A    0       A    0
13           B   11       B    3       B    0       B    6       B    2
14           C    0       C    0       C    0       C    0       C    0
15           D   20       D    0       D   28       D    0       D   24
16           E    0       E    0       E    0       E    0       E    0
17           F    0       F    0       F    0       F    0       F    0
18           G    0       G    0       G    0       G    0       G    0
19           H    0       H    0       H    0       H    0       H    0

必填格式

我试过在pandas中进行操作,但我不确定如何准确地进行操作以获得预期的结果。任何建议或示例代码将不胜感激。谢谢你!

lskq00tm

lskq00tm1#

下面是使用pandas reshaping的一种方法:

tmp = pd.read_excel("file.xlsx", header=None)

m = pd.to_datetime(tmp[0], errors="coerce").notnull()

blocks = {n: g.set_axis(g.iloc[0], axis=1).iloc[1:]
          for n,g in tmp.loc[~m].groupby(tmp[0].where(m).ffill())}
df = (
    pd.concat(blocks, names=["Date"])
        .assign(Venues= lambda x: x["Venue1"])
        .pipe(lambda x: x.set_axis(
            [f"{col}" if i%2 == 0 else f"QTY_{x.columns[i-1]}"
             for i, col in enumerate(x.columns)], axis=1))
        .filter(regex="QTY.+|Venues").set_index("Venues", append=True)
     .rename(lambda x: x.split("_")[1], axis=1).droplevel(1).unstack(1)
     .stack(0).reset_index(names=["Date", "Venues"]).rename_axis(columns=None)
)

输出:

print(df)

         Date  Venues  A   B  C   D  E  F  G  H
0  01/01/2023  Venue1  0  17  0   0  0  0  0  0
1  01/01/2023  Venue2  0   3  0   0  0  0  0  0
2  01/01/2023  Venue3  1  11  1  29  0  0  0  0
3  01/01/2023  Venue4  0   3  0   0  0  0  0  0
4  01/01/2023  Venue5  0   0  0   0  0  0  0  0
5  02/01/2023  Venue1  0  11  0  20  0  0  0  0
6  02/01/2023  Venue2  0   3  0   0  0  0  0  0
7  02/01/2023  Venue3  1   0  0  28  0  0  0  0
8  02/01/2023  Venue4  0   6  0   0  0  0  0  0
9  02/01/2023  Venue5  0   2  0  24  0  0  0  0

相关问题