pandas / Python: adding a weight associated with each column value

zlwx9yxi, posted 2023-01-28 in Python

I'm working with a very large database. Here is a sample:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        'ID': ['A', 'A', 'A', 'X', 'X', 'Y'],
    })

      ID
    0  A
    1  A
    2  A
    3  X
    4  X
    5  Y

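Now, given the frequency of each value in the column "ID", I want to compute weights with the function below and add a column holding the weight associated with each value in "ID".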

    def get_weights_inverse_num_of_samples(label_counts, power=1.):
        no_of_classes = len(label_counts)
        weights_for_samples = 1.0 / np.power(np.array(label_counts), power)
        weights_for_samples = weights_for_samples / np.sum(weights_for_samples) * no_of_classes
        return weights_for_samples

    freq = df.value_counts()
    print(freq)
    # ID
    # A     3
    # X     2
    # Y     1

    weights = get_weights_inverse_num_of_samples(freq)
    print(weights)
    # [0.54545455 0.81818182 1.63636364]
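To see where these numbers come from, the arithmetic can be traced by hand for the counts 3, 2, 1 with power=1. The inverse counts are 1/3, 1/2 and 1, which sum to 11/6. Dividing each by that sum and multiplying by the 3 classes gives 6/11, 9/11 and 18/11. Below is a minimal sketch using exact fractions; the variable names are illustrative and not part of the original post.

```python
from fractions import Fraction

counts = [3, 2, 1]  # frequencies of A, X, Y from the example above
power = 1

inverse = [Fraction(1, c ** power) for c in counts]   # 1/3, 1/2, 1
total = sum(inverse)                                   # 11/6
weights = [w / total * len(counts) for w in inverse]   # 6/11, 9/11, 18/11

print([str(w) for w in weights])
print([round(float(w), 8) for w in weights])
```

Because of the final multiplication by the number of classes, the weights sum to the number of classes (3 here) rather than to 1.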

So I'm looking for an efficient way to get a DataFrame like this, given the weights above:

      ID  sample_weight
    0  A     0.54545455
    1  A     0.54545455
    2  A     0.54545455
    3  X     0.81818182
    4  X     0.81818182
    5  Y     1.63636364
8yparm6h (answer #1)

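If you lean more on duck typing, you can rewrite the function so that it returns the same type it receives as input.
That saves you from having to reach back to the .index explicitly before calling .map.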

    import pandas as pd

    df = pd.DataFrame({'ID': ['A', 'A', 'A', 'X', 'X', 'Y']})

    def get_weights_inverse_num_of_samples(label_counts, power=1):
        """Using object methods here instead of coercing to numpy ndarray"""
        no_of_classes = len(label_counts)
        weights_for_samples = 1 / (label_counts ** power)
        return weights_for_samples / weights_for_samples.sum() * no_of_classes

    # select the column before using `.value_counts()`
    # this saves us from ending up with a `MultiIndex` Series
    freq = df['ID'].value_counts()
    weights = get_weights_inverse_num_of_samples(freq)
    print(weights)
    # A    0.545455
    # X    0.818182
    # Y    1.636364

    # note that now our weights are still a `pd.Series`
    # that we can align directly against our `"ID"` column
    df['sample_weight'] = df['ID'].map(weights)
    print(df)
    #   ID  sample_weight
    # 0  A       0.545455
    # 1  A       0.545455
    # 2  A       0.545455
    # 3  X       0.818182
    # 4  X       0.818182
    # 5  Y       1.636364
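One way to sanity-check the mapped column (a sketch, not part of the original answer): with power=1 each weight is proportional to the inverse of its class count, so the weights of all rows sharing an ID should add up to the same total for every ID. The name `per_class_total` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'A', 'X', 'X', 'Y']})

freq = df['ID'].value_counts()
inv = 1 / freq                          # power=1 case of the function above
weights = inv / inv.sum() * len(freq)   # still a Series indexed by ID
df['sample_weight'] = df['ID'].map(weights)

# Total weight carried by each ID; with power=1 these should all match.
per_class_total = df.groupby('ID')['sample_weight'].sum()
print(per_class_total)
```

For other values of `power` the per-ID totals generally differ, so this check only applies to the default case.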
6za6bjd0 (answer #2)

You can map the values:

    df['sample_weight'] = df['ID'].map(dict(zip(freq.index.get_level_values(0), weights)))

Note: here freq comes from calling value_counts on the whole DataFrame, which returns a single-level MultiIndex, hence the need for get_level_values.

As @ScottBoston pointed out, a better approach is:

    freq = df['ID'].value_counts()
    df['sample_weight'] = df['ID'].map(dict(zip(freq.index, weights)))

Output:

      ID  sample_weight
    0  A       0.545455
    1  A       0.545455
    2  A       0.545455
    3  X       0.818182
    4  X       0.818182
    5  Y       1.636364
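If the weights come back as a bare NumPy array, as they do with the question's function, a possible alternative to building a dict is to wrap the array in a Series labelled with `freq.index` and map with that. This is only a sketch under that assumption; the name `weight_by_id` is illustrative and the weight computation is inlined for power=1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'A', 'X', 'X', 'Y']})
freq = df['ID'].value_counts()

# Same computation as the question's function with power=1; the result is an ndarray.
weights = 1.0 / np.asarray(freq, dtype=float)
weights = weights / weights.sum() * len(freq)

# Label the array with the IDs it was computed from, then map by label.
weight_by_id = pd.Series(weights, index=freq.index)
df['sample_weight'] = df['ID'].map(weight_by_id)
print(df)
```

Either way, the pairing relies on `weights` being in the same order as `freq`, which holds here because the array is computed directly from `freq`.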