Change some values in a column of a Pandas DataFrame if a condition is true (without a loop)

Posted by 7gcisfzg on 2022-12-17 in Other

I have the following DataFrame:

import pandas as pd

# Sample data: arbitrary labels plus a cluster id per row
d_test = {
    'random_staff' : ['gfda', 'fsd','gec', 'erw', 'gd', 'kjhk', 'fd', 'kui'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)

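The cluster_number column contains values from 1 to n. Some values may be repeated, but no value in that range is missing. For example, in the data above the values are 1, 2, 3 and 4.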
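The "1 to n with no gaps" property can be checked directly. This is only a minimal sketch using the same cluster values as the sample data; `has_no_gaps` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

def has_no_gaps(clusters: pd.Series) -> bool:
    """Return True when the distinct values are exactly 1, 2, ..., n."""
    n = int(clusters.max())
    # Compare the set of distinct values with the full range 1..n
    return set(clusters.unique().tolist()) == set(range(1, n + 1))

clusters = pd.Series([1, 2, 3, 3, 2, 1, 4, 2], name="cluster_number")
print(has_no_gaps(clusters))  # True
```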
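I want to be able to pick a value from the cluster_number column and change its occurrences into a set of unique values, without introducing any gaps. For example, if we pick the value 2, the desired result for cluster_number is [1, 2, 3, 3, 5, 1, 4, 6]. Note that there are three 2s in the column. We keep the first one as 2, change the next occurrence to 5, and change the last occurrence to 6.
I wrote code for the logic above, and it works fine: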

cluster_number_to_change = 2
max_cluster = max(df_test['cluster_number'])
first_iter = True
i = cluster_number_to_change
for index, row in df_test.iterrows():
    if row['cluster_number'] == cluster_number_to_change:
        df_test.loc[index, 'cluster_number'] = i
        if first_iter:
            i = max_cluster + 1
            first_iter = False
        else:
            i += 1

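But it is written as a for loop, and I am trying to understand whether it can be converted to use the pandas .apply method (or any other efficient vectorized solution).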

Answer #1 (1wnzp6jl)

Use boolean indexing:

# get cluster #2
m1 = df_test['cluster_number'].eq(2)
# identify duplicates
m2 = df_test['cluster_number'].duplicated()

# increment duplicates using the max as reference
df_test.loc[m1&m2, 'cluster_number']  = (
 m2.where(m1).cumsum()
   .add(df_test['cluster_number'].max())
   .convert_dtypes()
)

print(df_test)

Output:

  random_staff  cluster_number
0         gfda               1
1          fsd               2
2          gec               3
3          erw               3
4           gd               5
5         kjhk               1
6           fd               4
7          kui               6
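
If the same operation is needed for different cluster values, the idea can be wrapped in a function. The following is only a sketch under my own assumptions, not the answerer's code: `split_cluster` is a hypothetical name, and it counts occurrences with an integer `cumsum` on the target mask instead of using `duplicated()`, so the column keeps its integer dtype.

```python
import pandas as pd

def split_cluster(clusters: pd.Series, target: int) -> pd.Series:
    """Keep the first occurrence of `target`; give each later one a new id above the max."""
    out = clusters.copy()
    is_target = out.eq(target)
    # On target rows: 0 for the first occurrence, 1 for the second, and so on
    occurrence = is_target.cumsum() - 1
    repeats = is_target & occurrence.gt(0)
    # New ids continue from the current maximum: max + 1, max + 2, ...
    out.loc[repeats] = out.max() + occurrence[repeats]
    return out

clusters = pd.Series([1, 2, 3, 3, 2, 1, 4, 2], name="cluster_number")
print(split_cluster(clusters, 2).tolist())  # [1, 2, 3, 3, 5, 1, 4, 6]
```

The function returns a new Series rather than modifying the input, so it could be assigned back with something like df_test['cluster_number'] = split_cluster(df_test['cluster_number'], 2).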
