Create a unique ID in a pandas DataFrame: the UID should be the same when the group ID matches and the similarity score is 100, and different otherwise

Posted by zwghvu4y on 2023-03-28 in Other

I have the following DataFrame:

GID    Similarity Score
71      100
71      100
132     100
132      40
132     100
132      40
104      35
104      35
112       0
114       0

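I want to create a new column called Unique ID (UID) in the pandas DataFrame. Rows that have the same GID and a similarity score of 100 must be assigned the same unique ID. Rows that do not meet these conditions must each be assigned a different unique ID. I would like the output in the following format: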

GID    Similarity Score    UID
71      100                 900
71      100                 900
132     100                 901
132      40                 902
132     100                 901
132      40                 903
104      35                 904
104      35                 905
112       0                 906
114       0                 907
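
For anyone who wants to reproduce this, here is a minimal sketch that builds the sample data. The variable name `df` is an assumption; it matches the name the answers below use.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "GID": [71, 71, 132, 132, 132, 132, 104, 104, 112, 114],
    "Similarity Score": [100, 100, 100, 40, 100, 40, 35, 35, 0, 0],
})
print(df)
```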

sczxawaw #1

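IIUC, you can use a two-step process: get a group number for the rows whose score is 100, and separately compute a counter for the remaining rows: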

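import pandas as pd

cols = ['GID', 'Similarity Score']

# True for rows whose score is not 100
m = df['Similarity Score'].ne(100)
# rows scoring 100: one group number per (GID, score) pair
s1 = df[~m].groupby(cols).ngroup()
# remaining rows: a running counter that continues after the last group number
s2 = pd.Series(range(m.sum()), index=m.index[m]).add(s1.max()+1)

# combine both parts (aligned on the index) and start numbering at 900
df['UID'] = pd.concat([s1, s2]).add(900)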

Output:

   GID  Similarity Score  UID
0   71               100  900
1   71               100  900
2  132               100  901
3  132                40  902
4  132               100  901
5  132                40  903
6  104                35  904
7  104                35  905
8  112                 0  906
9  114                 0  907
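
One thing to watch with the snippet above: if no row scores 100, `s1` is empty and `s1.max()` is NaN, so the offset breaks. Below is a hedged sketch of the same two-step idea wrapped in a function that avoids that case. It swaps `ngroup` for `pd.factorize`, which numbers GIDs by first appearance rather than in sorted order. The function name `add_uid` and its parameters are hypothetical, and it assumes the index is unique and GID has no missing values.

```python
import pandas as pd


def add_uid(df, start=900, gid_col="GID", score_col="Similarity Score"):
    """Return a copy of df with a UID column (hypothetical helper)."""
    is_full = df[score_col].eq(100)

    # Rows scoring 100: one code per distinct GID, numbered by first appearance
    codes, uniques = pd.factorize(df.loc[is_full, gid_col])
    shared = pd.Series(codes, index=df.index[is_full], dtype="int64")

    # All other rows: consecutive numbers continuing after the shared codes
    rest = df.index[~is_full]
    first = len(uniques)
    single = pd.Series(range(first, first + len(rest)), index=rest, dtype="int64")

    out = df.copy()
    # Assignment aligns on the index, so the original row order is kept
    out["UID"] = pd.concat([shared, single]).add(start)
    return out


sample = pd.DataFrame({
    "GID": [71, 71, 132, 132, 132, 132, 104, 104, 112, 114],
    "Similarity Score": [100, 100, 100, 40, 100, 40, 35, 35, 0, 0],
})
print(add_uid(sample))
```

As in the original snippet, all shared IDs are numbered first and the per-row IDs follow, so the numbers do not necessarily increase with row order.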

yks3o0rb #2

@mozway's solution is good; here is an alternative:

from itertools import count, islice

import pandas as pd

counter = count(900)

def assign_uid(group: pd.DataFrame):
    # each group holds a single (GID, score) pair, so this checks that score
    if group["Similarity Score"].eq(100).all():
        # same UID for all rows
        uids = next(counter)
    else:
        # unique UID for all rows
        uids = list(islice(counter, len(group)))
    return group.assign(UID=uids)

cols = ["GID", "Similarity Score"]
df.groupby(cols, group_keys=False, sort=False).apply(assign_uid)

Output:

   GID  Similarity Score  UID
0   71               100  900
1   71               100  900
2  132               100  901
3  132                40  902
4  132               100  901
5  132                40  903
6  104                35  904
7  104                35  905
8  112                 0  906
9  114                 0  907
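
If you would rather avoid `groupby.apply` and a module-level counter, the same rule can be written as a plain single pass over the rows. This is only a sketch of that idea, not part of the answer above. The function name `uid_single_pass` and its parameters are hypothetical. A dict remembers the UID already handed out to each GID's score-100 rows, and every other row simply takes the next number.

```python
from itertools import count

import pandas as pd


def uid_single_pass(df, start=900, gid_col="GID", score_col="Similarity Score"):
    """Return a copy of df with a UID column (hypothetical helper)."""
    counter = count(start)
    shared = {}  # GID -> UID shared by that GID's rows scoring 100
    uids = []
    for gid, score in zip(df[gid_col], df[score_col]):
        if score == 100:
            # First score-100 row of this GID draws a new UID; later ones reuse it
            if gid not in shared:
                shared[gid] = next(counter)
            uids.append(shared[gid])
        else:
            # Any other row always gets a fresh UID
            uids.append(next(counter))
    return df.assign(UID=uids)


sample = pd.DataFrame({
    "GID": [71, 71, 132, 132, 132, 132, 104, 104, 112, 114],
    "Similarity Score": [100, 100, 100, 40, 100, 40, 35, 35, 0, 0],
})
print(uid_single_pass(sample))
```

Because IDs are drawn in row order, each new ID is larger than every ID handed out before it. The loop runs in Python rather than vectorized pandas, so it will likely be slower on very large frames.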

68bkxrlz #3

If you don't care about row order, you can sort the rows and apply cumsum to a condition to create the unique IDs.

import pandas as pd

data = pd.DataFrame({
    "GID": [71, 71, 132, 132, 132, 132, 104, 104, 112, 114],
    "Similarity Score": [100, 100, 100, 40, 100, 40, 35, 35, 0, 0],
})

# sort so that rows with the same GID and score are adjacent
data = data.sort_values(by=["GID", "Similarity Score"])
# start a new ID when the GID changes, the score changes, or the score is not 100
data["condition"] = (
    (data["GID"] != data["GID"].shift(1))
    | (data["Similarity Score"] != data["Similarity Score"].shift(1))
    | (data["Similarity Score"] != 100)
)
# the running count of new-ID flags is the ID; adding 899 makes the first ID 900
data["UID"] = data["condition"].cumsum() + 899
data = data.drop(columns=["condition"])

You can also re-sort by index afterwards to restore the original row order (although the ID numbers will not follow the row order):

data = data.sort_index()
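
Since the approaches in this thread number the rows differently, a quick check of the rule itself can be handy. The sketch below is an addition, not part of any answer. The function name `uids_are_valid` and its parameters are hypothetical. It tests that score-100 rows share exactly one UID per GID and that every other row has a UID of its own.

```python
import pandas as pd


def uids_are_valid(df, gid_col="GID", score_col="Similarity Score", uid_col="UID"):
    """Check the UID rule from the question (hypothetical helper)."""
    is_full = df[score_col].eq(100)
    full, rest = df[is_full], df[~is_full]

    # Rows scoring 100: exactly one UID per GID, and no UID shared across GIDs
    one_uid_per_gid = full.groupby(gid_col)[uid_col].nunique().eq(1).all()
    one_gid_per_uid = full.groupby(uid_col)[gid_col].nunique().eq(1).all()

    # Other rows: every UID is unique and not reused by a score-100 row
    rest_unique = rest[uid_col].is_unique
    no_overlap = not rest[uid_col].isin(full[uid_col]).any()

    return bool(one_uid_per_gid and one_gid_per_uid and rest_unique and no_overlap)


expected = pd.DataFrame({
    "GID": [71, 71, 132, 132, 132, 132, 104, 104, 112, 114],
    "Similarity Score": [100, 100, 100, 40, 100, 40, 35, 35, 0, 0],
    "UID": [900, 900, 901, 902, 901, 903, 904, 905, 906, 907],
})
print(uids_are_valid(expected))
```

The check ignores the actual numbers, so it accepts any of the numbering orders produced above as long as the sharing rule holds.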
