pandas panda分类不排序多索引

k7fdbhmy  于 2023-02-17  发布在  其他
关注(0)|答案(1)|浏览(219)

我从SQL中提取了一些CSV格式的数据:

Year,Decision,Residency,Class,Count
2019,Applied,Resident,Freshmen,1143
2019,Applied,Resident,Transfer,404
2019,Applied," ",Grad/Postbacc,418
2019,Applied,Non-Resident,Freshmen,1371
2019,Applied,Non-Resident,Transfer,371
2019,Admitted,Resident,Freshmen,918
2019,Admitted,Resident,Transfer,358
2019,Admitted," ",Grad/Postbacc,311
2019,Admitted,Non-Resident,Freshmen,1048
2019,Admitted,Non-Resident,Transfer,313
2020,Applied,Resident,Freshmen,1094
2020,Applied,Resident,Transfer,406
2020,Applied," ",Grad/Postbacc,374
2020,Applied,Non-Resident,Freshmen,1223
2020,Applied,Non-Resident,Transfer,356
2020,Admitted,Resident,Freshmen,1003
2020,Admitted,Resident,Transfer,354
2020,Admitted," ",Grad/Postbacc,282
2020,Admitted,Non-Resident,Freshmen,1090
2020,Admitted,Non-Resident,Transfer,288

我编写了一个如下的转换:

data = pd.read_csv("Data.csv")
#Categorize the rows
data["Class"]     = pd.Categorical(data["Class"],["Freshmen","Transfer","Grad/Postbacc","Grand"],ordered=True)
data["Decision"]  = pd.Categorical(data["Decision"],["Applied","Admitted"],ordered=True)
data["Residency"] = pd.Categorical(data["Residency"],["Resident","Non-Resident"],ordered=True)

#Subtotal classes
tmp = data.groupby(["Year","Class","Decision"],sort=False).sum("Count")
tmp["Residency"] = "Total"
tmp.reset_index(inplace=True)
tmp = pd.concat([data,tmp],ignore_index=True)

#Grand total
tmp2 = data.groupby(["Year","Decision"],sort=False).sum("Count")
tmp2["Class"]     = "Grand"
tmp2["Residency"] = "Total"
tmp2.reset_index(inplace=True)
tmp = pd.concat([tmp,tmp2],ignore_index=True)

#Crosstab it
tmp = pd.crosstab(index=[tmp["Year"],tmp["Class"],tmp["Residency"]],
                  columns=[tmp["Decision"]],
                  values=tmp["Count"],
                  aggfunc="sum")
tmp = tmp.loc[~(tmp==0).all(axis=1)]
tmp["%"] = np.round(100*tmp["Admitted"]/tmp["Applied"],1)
tmp = tmp.stack().unstack(["Year","Decision"])
print(tmp)

其输出如下:

Year                          2019                   2020               
Decision                   Applied Admitted     % Applied Admitted     %
Class         Residency                                                 
Freshmen      Non-Resident  1371.0   1048.0  76.4  1223.0   1090.0  89.1
              Resident      1143.0    918.0  80.3  1094.0   1003.0  91.7
              Total         2514.0   1966.0  78.2  2317.0   2093.0  90.3
Grad/Postbacc Total          418.0    311.0  74.4   374.0    282.0  75.4
Grand         Total         3707.0   2948.0  79.5  3453.0   3017.0  87.4
Transfer      Non-Resident   371.0    313.0  84.4   356.0    288.0  80.9
              Resident       404.0    358.0  88.6   406.0    354.0  87.2
              Total          775.0    671.0  86.6   762.0    642.0  84.3

预期输出为

Year                          2019                   2020               
Decision                   Applied Admitted     % Applied Admitted     %
Class         Residency                                                 
Freshmen      Resident      1143.0    918.0  80.3  1094.0   1003.0  91.7
              Non-Resident  1371.0   1048.0  76.4  1223.0   1090.0  89.1
              Total         2514.0   1966.0  78.2  2317.0   2093.0  90.3
Transfer      Resident       404.0    358.0  88.6   406.0    354.0  87.2
              Non-Resident   371.0    313.0  84.4   356.0    288.0  80.9
              Total          775.0    671.0  86.6   762.0    642.0  84.3
Grad/Postbacc Total          418.0    311.0  74.4   374.0    282.0  75.4
Grand         Total         3707.0   2948.0  79.5  3453.0   3017.0  87.4

分类成功地正确地自我排序,直到我把 Dataframe 扔到pd.crosstab中,在这一点上,它都福尔斯了。

9avjhtql

9avjhtql1#

我无法修复您的代码,但我得到了预期的结果这样做:

import pandas as pd

df = pd.read_csv("Data.csv")

df["Class"]     = pd.Categorical(df["Class"],["Freshmen","Transfer","Grad/Postbacc","Grand"],ordered=True)
df["Decision"]  = pd.Categorical(df["Decision"],["Applied","Admitted","%"],ordered=True)
df["Residency"] = pd.Categorical(df["Residency"],["Resident","Non-Resident"," "],ordered=True)

df_grouped = df.groupby(['Year', 'Decision', 'Class', 'Residency'],as_index=False)['Count'].sum()

df_pivot = df_grouped.pivot_table(columns=["Year","Decision"],index=["Class","Residency"], values="Count",aggfunc='sum')

#Create subtotal for rows
df_totals = pd.concat([y.append(y.sum().rename((x, 'Total'))) for x, y in df_pivot.groupby(level=0)]).append(df_pivot.sum().rename(('Grand', 'Total')))

#Drop not wanted rows
df_totals = df_totals[~(df_totals.values == 0).all(axis=1)].drop_duplicates(keep="last")

#Calculate "%" columns
for year in df_totals.columns.get_level_values('Year').unique():
    df_totals[year, '%'] = df_totals[year, 'Admitted'] / df_totals[year, 'Applied']

df_totals

输出:

Year                          2019                   2020               
Decision                   Applied Admitted     % Applied Admitted     %
Class         Residency                                                 
Freshmen      Resident      1143.0    918.0  80.3  1094.0   1003.0  91.7
              Non-Resident  1371.0   1048.0  76.4  1223.0   1090.0  89.1
              Total         2514.0   1966.0  78.2  2317.0   2093.0  90.3
Transfer      Resident       404.0    358.0  88.6   406.0    354.0  87.2
              Non-Resident   371.0    313.0  84.4   356.0    288.0  80.9
              Total          775.0    671.0  86.6   762.0    642.0  84.3
Grad/Postbacc Total          418.0    311.0  74.4   374.0    282.0  75.4
Grand         Total         3707.0   2948.0  79.5  3453.0   3017.0  87.4

注意:我收到了关于df.append()的警告

相关问题