通过反向聚合转换的棘手长透视(Pandas)

ezykj2lf  于 2023-04-10  发布在  其他
关注(0)|答案(4)|浏览(78)

我有一个数据集,我想在其中将值解聚合到它们自己的唯一行中,并执行透视,按类别分组。

数据已更新

Period      Date        Area    BB stat AA stat CC stat DD stat BB test AA test CC test DD test BB re   AA re   CC re BB test2  AA test2 CC test2  DD test2                                                
8/1/2016    9/1/2016    NY      5       5       5               1       1       1               0       0       0     0          0       0         0
9/1/2016    10/1/2016   NY      6       6       6               4       4       4               0       0       0     0          0       0         0
8/1/2016    9/1/2016    CA      2       2       2               4       4       4               0       0       0     0          0       0         0
9/1/2016    10/1/2016   CA      1       1       1              -2      -2      -2               0       0       0     0          0       0         0

期望

Period      Date            Area    stat    test    type    re  test2
8/1/2016    9/1/2016        NY      5       1       BB      0   0
9/1/2016    10/1/2016       NY      6       4       BB      0   0
8/1/2016    9/1/2016        NY      5       1       AA      0   0   
9/1/2016    10/1/2016       NY      6       4       AA      0   0   
8/1/2016    9/1/2016        NY      5       1       CC      0   0
9/1/2016    10/1/2016       NY      6       4       CC      0   0   
8/1/2016    9/1/2016        NY      0       0       DD      0   0
9/1/2016    10/1/2016       NY      0       0       DD      0   0
8/1/2016    9/1/2016        CA      2       4       BB      0   0
9/1/2016    10/1/2016       CA      1       -2      BB      0   0
8/1/2016    9/1/2016        CA      2       4       AA      0   0
9/1/2016    10/1/2016       CA      1       -2      AA      0   0
8/1/2016    9/1/2016        CA      2       4       CC      0   0
9/1/2016    10/1/2016       CA      1       -2      CC      0   0
8/1/2016    9/1/2016        CA      0       0       DD      0   0
9/1/2016    10/1/2016       CA      0       0       DD      0   0

value_vars = ["BB stat",    "AA stat",  "CC stat",  "DD stat",  "BB test",
"AA test",  "CC test",  "DD test",  "BB re",    "AA re",    "CC re"]
df = df.melt(id_vars=["Period", "Date", "Area"], value_vars=value_vars)

temp_df = df.variable.str.split("_", 1, expand=True)
df["type"] = temp_df[0]
df["name"] = temp_df[1]
df = df.drop(columns=["variable"])
first_half = df.iloc[:len(df)//2]
second_half = df.iloc[len(df)//2:]
df = pd.merge(first_half, second_half, on=["Period", "Date", "Area", "type"], suffixes=("_1", "_2"))

df.rename(columns = {'value_3':'stat''value_2':'test', 'value_1':'re'}, inplace = True)
df.drop(columns=["name_1", "name_2"], inplace=True)
df = df[[ "Period",     "Date",         "Area", "stat", "test", "type", "re"    ]]


df.sort_values(["Area", "type"], ascending=False, inplace=True)
df.to_markdown()

以下代码无法捕获所有输出列。欢迎提出任何建议。

dly7yett

dly7yett1#

尝试pd.wide_to_long

pd.wide_to_long(df, 
                stubnames=['AA', 'BB','CC','DD'],
                i=['Period','Date','Area'],
                j='',
                sep=' ',
                suffix='(test|re|stat)'
).unstack(level=-1, fill_value=0).stack(level=0).reset_index()

输出:

Period       Date Area type   re  stat  test
0   8/1/2016   9/1/2016   CA   AA  0.0   2.0   4.0
1   8/1/2016   9/1/2016   CA   BB  0.0   2.0   4.0
2   8/1/2016   9/1/2016   CA   CC  0.0   2.0   4.0
3   8/1/2016   9/1/2016   CA   DD  NaN   0.0   0.0
4   8/1/2016   9/1/2016   NY   AA  0.0   5.0   1.0
5   8/1/2016   9/1/2016   NY   BB  0.0   5.0   1.0
6   8/1/2016   9/1/2016   NY   CC  0.0   5.0   1.0
7   8/1/2016   9/1/2016   NY   DD  NaN   0.0   0.0
8   9/1/2016  10/1/2016   CA   AA  0.0   1.0  -2.0
9   9/1/2016  10/1/2016   CA   BB  0.0   1.0  -2.0
10  9/1/2016  10/1/2016   CA   CC  0.0   1.0  -2.0
11  9/1/2016  10/1/2016   CA   DD  NaN   0.0   0.0
12  9/1/2016  10/1/2016   NY   AA  0.0   6.0   4.0
13  9/1/2016  10/1/2016   NY   BB  0.0   6.0   4.0
14  9/1/2016  10/1/2016   NY   CC  0.0   6.0   4.0
15  9/1/2016  10/1/2016   NY   DD  NaN   0.0   0.0
vcirk6k6

vcirk6k62#

问得好。
我认为我们需要做的是使用a pivot table

import pandas as pd

# OP data
data = {'Period': ['8/1/2016', '9/1/2016', '8/1/2016', '9/1/2016'],
        'Date': ['9/1/2016', '10/1/2016', '9/1/2016', '10/1/2016'],
        'Area': ['NY', 'NY', 'CA', 'CA'],
        'BB stat': [5, 6, 2, 1],
        'AA stat': [5, 6, 2, 1],
        'CC stat': [5, 6, 2, 1],
        'DD stat': [0, 0, 0, 0],
        'BB test': [1, 4, 4, -2],
        'AA test': [1, 4, 4, -2],
        'CC test': [1, 4, 4, -2],
        'DD test': [0, 0, 0, 0],
        'BB re': [0, 0, 0, 0],
        'AA re': [0, 0, 0, 0],
        'CC re': [0, 0, 0, 0]}

# Convert the dictionary to a DataFrame and process it
df = pd.DataFrame(data) \
    .melt(id_vars=["Period", "Date", "Area"]) \
    .assign(**pd.DataFrame(df.melt(id_vars=["Period", "Date", "Area"])['variable'].str.extract(r'([A-Z]{2})\s([a-z]{2,4})', expand=True).values, columns=['type', 'category'])) \
    .drop(columns=['variable']) \
    .pivot_table(index=["Period", "Date", "Area", "type"], columns="category", values="value").reset_index() \
    .reindex(columns=['Period', 'Date', 'Area', 'stat', 'test', 'type', 're']) \
    .sort_values(['Area', 'type'], ascending=False)
ffscu2ro

ffscu2ro3#

一个选项是使用pivot_longer from pyjanitor-在这种情况下,我们使用特殊的占位符.value来标识列中我们希望保留为标题的部分,而其余部分则被整理到新列中:

# pip install pyjanitor
import pandas as pd
import janitor

(df
.pivot_longer(
    index = ['Period', 'Date', 'Area'], 
    names_to = ('type', '.value'), 
    names_sep = " ",
    sort_by_appearance=True)
.fillna({"re":0}, downcast='infer')
) 
      Period       Date Area type  stat  test  re  test2
0   8/1/2016   9/1/2016   NY   BB     5     1   0      0
1   8/1/2016   9/1/2016   NY   AA     5     1   0      0
2   8/1/2016   9/1/2016   NY   CC     5     1   0      0
3   8/1/2016   9/1/2016   NY   DD     0     0   0      0
4   9/1/2016  10/1/2016   NY   BB     6     4   0      0
5   9/1/2016  10/1/2016   NY   AA     6     4   0      0
6   9/1/2016  10/1/2016   NY   CC     6     4   0      0
7   9/1/2016  10/1/2016   NY   DD     0     0   0      0
8   8/1/2016   9/1/2016   CA   BB     2     4   0      0
9   8/1/2016   9/1/2016   CA   AA     2     4   0      0
10  8/1/2016   9/1/2016   CA   CC     2     4   0      0
11  8/1/2016   9/1/2016   CA   DD     0     0   0      0
12  9/1/2016  10/1/2016   CA   BB     1    -2   0      0
13  9/1/2016  10/1/2016   CA   AA     1    -2   0      0
14  9/1/2016  10/1/2016   CA   CC     1    -2   0      0
15  9/1/2016  10/1/2016   CA   DD     0     0   0      0

另一个选项,使用pd.stack

index = ['Period', 'Date', 'Area']
temp = df.set_index(index)
# default separator is white space
temp.columns = temp.columns.str.split(expand = True)
temp.columns.names = ['type', None]
(temp
.stack("type")
.fillna({"re":0}, downcast='infer')
.reset_index()
)

      Period       Date Area type  re  stat  test  test2
0   8/1/2016   9/1/2016   NY   AA   0     5     1      0
1   8/1/2016   9/1/2016   NY   BB   0     5     1      0
2   8/1/2016   9/1/2016   NY   CC   0     5     1      0
3   8/1/2016   9/1/2016   NY   DD   0     0     0      0
4   9/1/2016  10/1/2016   NY   AA   0     6     4      0
5   9/1/2016  10/1/2016   NY   BB   0     6     4      0
6   9/1/2016  10/1/2016   NY   CC   0     6     4      0
7   9/1/2016  10/1/2016   NY   DD   0     0     0      0
8   8/1/2016   9/1/2016   CA   AA   0     2     4      0
9   8/1/2016   9/1/2016   CA   BB   0     2     4      0
10  8/1/2016   9/1/2016   CA   CC   0     2     4      0
11  8/1/2016   9/1/2016   CA   DD   0     0     0      0
12  9/1/2016  10/1/2016   CA   AA   0     1    -2      0
13  9/1/2016  10/1/2016   CA   BB   0     1    -2      0
14  9/1/2016  10/1/2016   CA   CC   0     1    -2      0
15  9/1/2016  10/1/2016   CA   DD   0     0     0      0
4si2a6ki

4si2a6ki4#

另一种可能的解决方案:

cols = ['Period', 'Date', 'Area']
out = df.melt(cols)
out = (pd.concat([out, 
                  out['variable'].str.split(' ', expand=True)
                  .set_axis(['type', 'aux'], axis=1)], axis=1)
       .drop(['variable'], axis=1))
(out.pivot(index = cols + ['type'], columns='aux').droplevel(0, axis=1)
 .reset_index().rename_axis(None, axis=1).fillna(0))

输出:

Period       Date Area type   re  stat  test
0   8/1/2016   9/1/2016   CA   AA  0.0   2.0   4.0
1   8/1/2016   9/1/2016   CA   BB  0.0   2.0   4.0
2   8/1/2016   9/1/2016   CA   CC  0.0   2.0   4.0
3   8/1/2016   9/1/2016   CA   DD  0.0   0.0   0.0
4   8/1/2016   9/1/2016   NY   AA  0.0   5.0   1.0
5   8/1/2016   9/1/2016   NY   BB  0.0   5.0   1.0
6   8/1/2016   9/1/2016   NY   CC  0.0   5.0   1.0
7   8/1/2016   9/1/2016   NY   DD  0.0   0.0   0.0
8   9/1/2016  10/1/2016   CA   AA  0.0   1.0  -2.0
9   9/1/2016  10/1/2016   CA   BB  0.0   1.0  -2.0
10  9/1/2016  10/1/2016   CA   CC  0.0   1.0  -2.0
11  9/1/2016  10/1/2016   CA   DD  0.0   0.0   0.0
12  9/1/2016  10/1/2016   NY   AA  0.0   6.0   4.0
13  9/1/2016  10/1/2016   NY   BB  0.0   6.0   4.0
14  9/1/2016  10/1/2016   NY   CC  0.0   6.0   4.0
15  9/1/2016  10/1/2016   NY   DD  0.0   0.0   0.0

相关问题