pandas panda分类不排序多索引

k7fdbhmy  于 2023-02-17  发布在  其他
关注(0)|答案(1)|浏览(253)

我从SQL中提取了一些CSV格式的数据:

  1. Year,Decision,Residency,Class,Count
  2. 2019,Applied,Resident,Freshmen,1143
  3. 2019,Applied,Resident,Transfer,404
  4. 2019,Applied," ",Grad/Postbacc,418
  5. 2019,Applied,Non-Resident,Freshmen,1371
  6. 2019,Applied,Non-Resident,Transfer,371
  7. 2019,Admitted,Resident,Freshmen,918
  8. 2019,Admitted,Resident,Transfer,358
  9. 2019,Admitted," ",Grad/Postbacc,311
  10. 2019,Admitted,Non-Resident,Freshmen,1048
  11. 2019,Admitted,Non-Resident,Transfer,313
  12. 2020,Applied,Resident,Freshmen,1094
  13. 2020,Applied,Resident,Transfer,406
  14. 2020,Applied," ",Grad/Postbacc,374
  15. 2020,Applied,Non-Resident,Freshmen,1223
  16. 2020,Applied,Non-Resident,Transfer,356
  17. 2020,Admitted,Resident,Freshmen,1003
  18. 2020,Admitted,Resident,Transfer,354
  19. 2020,Admitted," ",Grad/Postbacc,282
  20. 2020,Admitted,Non-Resident,Freshmen,1090
  21. 2020,Admitted,Non-Resident,Transfer,288

我编写了一个如下的转换:

  1. data = pd.read_csv("Data.csv")
  2. #Categorize the rows
  3. data["Class"] = pd.Categorical(data["Class"],["Freshmen","Transfer","Grad/Postbacc","Grand"],ordered=True)
  4. data["Decision"] = pd.Categorical(data["Decision"],["Applied","Admitted"],ordered=True)
  5. data["Residency"] = pd.Categorical(data["Residency"],["Resident","Non-Resident"],ordered=True)
  6. #Subtotal classes
  7. tmp = data.groupby(["Year","Class","Decision"],sort=False).sum("Count")
  8. tmp["Residency"] = "Total"
  9. tmp.reset_index(inplace=True)
  10. tmp = pd.concat([data,tmp],ignore_index=True)
  11. #Grand total
  12. tmp2 = data.groupby(["Year","Decision"],sort=False).sum("Count")
  13. tmp2["Class"] = "Grand"
  14. tmp2["Residency"] = "Total"
  15. tmp2.reset_index(inplace=True)
  16. tmp = pd.concat([tmp,tmp2],ignore_index=True)
  17. #Crosstab it
  18. tmp = pd.crosstab(index=[tmp["Year"],tmp["Class"],tmp["Residency"]],
  19. columns=[tmp["Decision"]],
  20. values=tmp["Count"],
  21. aggfunc="sum")
  22. tmp = tmp.loc[~(tmp==0).all(axis=1)]
  23. tmp["%"] = np.round(100*tmp["Admitted"]/tmp["Applied"],1)
  24. tmp = tmp.stack().unstack(["Year","Decision"])
  25. print(tmp)

其输出如下:

  1. Year 2019 2020
  2. Decision Applied Admitted % Applied Admitted %
  3. Class Residency
  4. Freshmen Non-Resident 1371.0 1048.0 76.4 1223.0 1090.0 89.1
  5. Resident 1143.0 918.0 80.3 1094.0 1003.0 91.7
  6. Total 2514.0 1966.0 78.2 2317.0 2093.0 90.3
  7. Grad/Postbacc Total 418.0 311.0 74.4 374.0 282.0 75.4
  8. Grand Total 3707.0 2948.0 79.5 3453.0 3017.0 87.4
  9. Transfer Non-Resident 371.0 313.0 84.4 356.0 288.0 80.9
  10. Resident 404.0 358.0 88.6 406.0 354.0 87.2
  11. Total 775.0 671.0 86.6 762.0 642.0 84.3

预期输出为

  1. Year 2019 2020
  2. Decision Applied Admitted % Applied Admitted %
  3. Class Residency
  4. Freshmen Resident 1143.0 918.0 80.3 1094.0 1003.0 91.7
  5. Non-Resident 1371.0 1048.0 76.4 1223.0 1090.0 89.1
  6. Total 2514.0 1966.0 78.2 2317.0 2093.0 90.3
  7. Transfer Resident 404.0 358.0 88.6 406.0 354.0 87.2
  8. Non-Resident 371.0 313.0 84.4 356.0 288.0 80.9
  9. Total 775.0 671.0 86.6 762.0 642.0 84.3
  10. Grad/Postbacc Total 418.0 311.0 74.4 374.0 282.0 75.4
  11. Grand Total 3707.0 2948.0 79.5 3453.0 3017.0 87.4

分类成功地正确地自我排序,直到我把 Dataframe 扔到pd.crosstab中,在这一点上,它都福尔斯了。

9avjhtql

9avjhtql1#

我无法修复您的代码,但我得到了预期的结果这样做:

  1. import pandas as pd
  2. df = pd.read_csv("Data.csv")
  3. df["Class"] = pd.Categorical(df["Class"],["Freshmen","Transfer","Grad/Postbacc","Grand"],ordered=True)
  4. df["Decision"] = pd.Categorical(df["Decision"],["Applied","Admitted","%"],ordered=True)
  5. df["Residency"] = pd.Categorical(df["Residency"],["Resident","Non-Resident"," "],ordered=True)
  6. df_grouped = df.groupby(['Year', 'Decision', 'Class', 'Residency'],as_index=False)['Count'].sum()
  7. df_pivot = df_grouped.pivot_table(columns=["Year","Decision"],index=["Class","Residency"], values="Count",aggfunc='sum')
  8. #Create subtotal for rows
  9. df_totals = pd.concat([y.append(y.sum().rename((x, 'Total'))) for x, y in df_pivot.groupby(level=0)]).append(df_pivot.sum().rename(('Grand', 'Total')))
  10. #Drop not wanted rows
  11. df_totals = df_totals[~(df_totals.values == 0).all(axis=1)].drop_duplicates(keep="last")
  12. #Calculate "%" columns
  13. for year in df_totals.columns.get_level_values('Year').unique():
  14. df_totals[year, '%'] = df_totals[year, 'Admitted'] / df_totals[year, 'Applied']
  15. df_totals

输出:

  1. Year 2019 2020
  2. Decision Applied Admitted % Applied Admitted %
  3. Class Residency
  4. Freshmen Resident 1143.0 918.0 80.3 1094.0 1003.0 91.7
  5. Non-Resident 1371.0 1048.0 76.4 1223.0 1090.0 89.1
  6. Total 2514.0 1966.0 78.2 2317.0 2093.0 90.3
  7. Transfer Resident 404.0 358.0 88.6 406.0 354.0 87.2
  8. Non-Resident 371.0 313.0 84.4 356.0 288.0 80.9
  9. Total 775.0 671.0 86.6 762.0 642.0 84.3
  10. Grad/Postbacc Total 418.0 311.0 74.4 374.0 282.0 75.4
  11. Grand Total 3707.0 2948.0 79.5 3453.0 3017.0 87.4

注意:我收到了关于df.append()的警告

展开查看全部

相关问题