numpy: Fastest way to assign rows to an object inside a pandas groupby loop

Posted by dauxcl2d on 2023-10-19 in Other

I have two DataFrames:

    import pandas as pd

    # First frame: transpose so that the breeds become the column labels,
    # then drop the row that held them.
    df = pd.DataFrame({'A': ['German Shepherd', 'Border Collie', 'Golden Retriever', 'Beagle', 'Daschund']})
    df = df.T
    df.columns = df.iloc[0]
    df = df.drop(df.index[0])

    A  German Shepherd  Border Collie  Golden Retriever  Beagle  Daschund

    # Second frame: one row per (ID, Breed) pair.
    df2 = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'C', 'C', 'C', 'C', 'C'],
                        'Breed': ['German Shepherd', 'Beagle', 'Dashung', 'Border Collie',
                                  'German Shepherd', 'Border Collie', 'Golden Retriever', 'Beagle', 'Daschund']})

      ID             Breed
    0  A   German Shepherd
    1  A            Beagle
    2  A           Dashung
    3  B     Border Collie
    4  C   German Shepherd
    5  C     Border Collie
    6  C  Golden Retriever
    7  C            Beagle
    8  C          Daschund

For each ID in df2, I want to find its dog breeds and then update df, marking every breed that is present for that ID:

    import numpy as np

    all_dogs = list(df.columns)  # assumed definition; not shown in the original post
    dogs_grouped = df2.groupby('ID')
    missing_dogs = []
    vals = [np.nan for i in df.columns]
    for group_name, df_group in dogs_grouped:
        print(f'Cluster: {group_name}')
        breeds = sorted(set(df_group['Breed']))
        # Breeds that are not columns of df (e.g. the misspelled 'Dashung').
        # This must be computed before filtering, otherwise it is always empty.
        weird_dogs = [i for i in breeds if i not in all_dogs]
        cluster_dogs = [i for i in breeds if i in all_dogs]
        missing_dogs.append(weird_dogs)
        # DataFrame.append was removed in pandas 2.0; add the row with .loc instead.
        df.loc[group_name] = vals
        # A single .loc call avoids chained assignment.
        df.loc[group_name, cluster_dogs] = 1
    df = df.fillna(0)

My code works, but it is very slow on large datasets. I am iterating over a dataset of 500,000 rows, and building a 4,000 x 30,000 matrix takes several hours.

    A  German Shepherd  Border Collie  Golden Retriever  Beagle  Daschund
    A                1              0                 0       1         0
    B                0              1                 0       0         0
    C                1              1                 1       1         1

Surely there is a more Pythonic/pandas way to handle this?

jhdbpxl9 #1

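I think you just want pd.crosstab. If some values (columns) are missing, you can reindex the columns using the values from the first DataFrame (df).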

    x = pd.crosstab(df2["ID"], df2["Breed"])
    print(x)

This prints:

    Breed  Beagle  Border Collie  Daschund  Dashung  German Shepherd  Golden Retriever
    ID
    A           1              0         0        1                1                 0
    B           0              1         0        0                0                 0
    C           1              1         1        0                1                 1
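
One thing to keep in mind is that pd.crosstab counts occurrences, so an ID that lists the same breed twice would get a 2 rather than a 1. The following is a minimal sketch, assuming a 0/1 presence indicator is wanted (as in the question's expected output). The frame `dup` and the names `counts` and `indicator` are made up for illustration and do not come from the original post.

```python
import pandas as pd

# Hypothetical data: ID 'A' lists 'Beagle' twice.
dup = pd.DataFrame({
    "ID": ["A", "A", "B"],
    "Breed": ["Beagle", "Beagle", "Border Collie"],
})

counts = pd.crosstab(dup["ID"], dup["Breed"])  # number of occurrences per (ID, Breed)
indicator = counts.clip(upper=1)               # cap every count at 1: present / absent

print(counts)
print(indicator)
```

If every (ID, Breed) pair is already unique, as in the sample data above, the clip step changes nothing and can be skipped.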

Then use .reindex:

    x = x.reindex(
        columns=[
            "Some New Breed",
            "German Shepherd",
            "Border Collie",
            "Golden Retriever",
            "Beagle",
            "Daschund",
        ],
        fill_value=0,
    )
    print(x)

This prints:

    Breed  Some New Breed  German Shepherd  Border Collie  Golden Retriever  Beagle  Daschund
    ID
    A                   0                1              0                 0       1         0
    B                   0                0              1                 0       0         0
    C                   0                1              1                 1       1         1
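
Rather than typing the column list by hand, the target columns can presumably be taken from the question's df (for example `list(df.columns)`), and the breeds that occur in df2 but not in df (what the question collects in `missing_dogs`) can be read off without a loop. The following is a sketch under those assumptions; `known_breeds`, `unknown_breeds`, `unknown_by_id`, and `result` are names introduced here, and `known_breeds` is written out literally only so the snippet runs on its own.

```python
import pandas as pd

# Stand-in for list(df.columns) from the question.
known_breeds = ["German Shepherd", "Border Collie", "Golden Retriever", "Beagle", "Daschund"]

df2 = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "C", "C", "C", "C", "C"],
    "Breed": ["German Shepherd", "Beagle", "Dashung", "Border Collie",
              "German Shepherd", "Border Collie", "Golden Retriever", "Beagle", "Daschund"],
})

x = pd.crosstab(df2["ID"], df2["Breed"])

# Breeds seen in df2 that are not among the known columns.
unknown_breeds = x.columns.difference(known_breeds).tolist()

# The same information broken down per ID.
unknown_by_id = (
    df2.loc[~df2["Breed"].isin(known_breeds)]
    .groupby("ID")["Breed"]
    .agg(list)
    .to_dict()
)

# Keep only the known breeds, in the order given; absent breeds become 0.
result = x.reindex(columns=known_breeds, fill_value=0)

print(result)
print(unknown_breeds)
print(unknown_by_id)
```

This replaces the per-group loop with a single crosstab call followed by one reindex, which is where the speed-up over the row-by-row version would be expected to come from.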