我从SQL中提取了一些CSV格式的数据:
Year,Decision,Residency,Class,Count
2019,Applied,Resident,Freshmen,1143
2019,Applied,Resident,Transfer,404
2019,Applied," ",Grad/Postbacc,418
2019,Applied,Non-Resident,Freshmen,1371
2019,Applied,Non-Resident,Transfer,371
2019,Admitted,Resident,Freshmen,918
2019,Admitted,Resident,Transfer,358
2019,Admitted," ",Grad/Postbacc,311
2019,Admitted,Non-Resident,Freshmen,1048
2019,Admitted,Non-Resident,Transfer,313
2020,Applied,Resident,Freshmen,1094
2020,Applied,Resident,Transfer,406
2020,Applied," ",Grad/Postbacc,374
2020,Applied,Non-Resident,Freshmen,1223
2020,Applied,Non-Resident,Transfer,356
2020,Admitted,Resident,Freshmen,1003
2020,Admitted,Resident,Transfer,354
2020,Admitted," ",Grad/Postbacc,282
2020,Admitted,Non-Resident,Freshmen,1090
2020,Admitted,Non-Resident,Transfer,288
我编写了一个如下的转换:
data = pd.read_csv("Data.csv")
#Categorize the rows
data["Class"] = pd.Categorical(data["Class"],["Freshmen","Transfer","Grad/Postbacc","Grand"],ordered=True)
data["Decision"] = pd.Categorical(data["Decision"],["Applied","Admitted"],ordered=True)
data["Residency"] = pd.Categorical(data["Residency"],["Resident","Non-Resident"],ordered=True)
#Subtotal classes
tmp = data.groupby(["Year","Class","Decision"],sort=False).sum("Count")
tmp["Residency"] = "Total"
tmp.reset_index(inplace=True)
tmp = pd.concat([data,tmp],ignore_index=True)
#Grand total
tmp2 = data.groupby(["Year","Decision"],sort=False).sum("Count")
tmp2["Class"] = "Grand"
tmp2["Residency"] = "Total"
tmp2.reset_index(inplace=True)
tmp = pd.concat([tmp,tmp2],ignore_index=True)
#Crosstab it
tmp = pd.crosstab(index=[tmp["Year"],tmp["Class"],tmp["Residency"]],
columns=[tmp["Decision"]],
values=tmp["Count"],
aggfunc="sum")
tmp = tmp.loc[~(tmp==0).all(axis=1)]
tmp["%"] = np.round(100*tmp["Admitted"]/tmp["Applied"],1)
tmp = tmp.stack().unstack(["Year","Decision"])
print(tmp)
其输出如下:
Year 2019 2020
Decision Applied Admitted % Applied Admitted %
Class Residency
Freshmen Non-Resident 1371.0 1048.0 76.4 1223.0 1090.0 89.1
Resident 1143.0 918.0 80.3 1094.0 1003.0 91.7
Total 2514.0 1966.0 78.2 2317.0 2093.0 90.3
Grad/Postbacc Total 418.0 311.0 74.4 374.0 282.0 75.4
Grand Total 3707.0 2948.0 79.5 3453.0 3017.0 87.4
Transfer Non-Resident 371.0 313.0 84.4 356.0 288.0 80.9
Resident 404.0 358.0 88.6 406.0 354.0 87.2
Total 775.0 671.0 86.6 762.0 642.0 84.3
预期输出为
Year 2019 2020
Decision Applied Admitted % Applied Admitted %
Class Residency
Freshmen Resident 1143.0 918.0 80.3 1094.0 1003.0 91.7
Non-Resident 1371.0 1048.0 76.4 1223.0 1090.0 89.1
Total 2514.0 1966.0 78.2 2317.0 2093.0 90.3
Transfer Resident 404.0 358.0 88.6 406.0 354.0 87.2
Non-Resident 371.0 313.0 84.4 356.0 288.0 80.9
Total 775.0 671.0 86.6 762.0 642.0 84.3
Grad/Postbacc Total 418.0 311.0 74.4 374.0 282.0 75.4
Grand Total 3707.0 2948.0 79.5 3453.0 3017.0 87.4
分类成功地正确地自我排序,直到我把 Dataframe 扔到pd.crosstab
中,在这一点上,它都福尔斯了。
1条答案
按热度按时间9avjhtql1#
我无法修复您的代码,但我得到了预期的结果这样做:
输出:
注意:我收到了关于
df.append()
的警告