pandas: drop duplicates across multiple columns, ignoring direction [duplicate]

Posted by blmhpbnm on 2023-08-01 in Other

This question already has answers here:

drop duplicates on multiple columns irrespective of the order (a/b == b/a) [duplicate] (1 answer)
Efficient way in Pandas for removing columns with duplicate values in different columns (1 answer)
Closed 19 hours ago.
I have the following DataFrame:

import pandas as pd

data = {
    'person1_name': ['John_Ethan_Wayne', 'John_Ethan_Wayne', 'Michael_Wayne', 'Michael_Wayne', 'Patrick_Wayne', 'Patrick_Wayne'],
    'family1_name': ['Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne'],
    'person2_name': ['Michael_Wayne', 'Patrick_Wayne', 'Patrick_Wayne', 'John_Ethan_Wayne', 'John_Ethan_Wayne', 'Michael_Wayne'],
    'family2_name': ['Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne']
}

df = pd.DataFrame(data)

     person1_name family1_name      person2_name family2_name
 John_Ethan_Wayne        Wayne     Michael_Wayne        Wayne
 John_Ethan_Wayne        Wayne     Patrick_Wayne        Wayne
    Michael_Wayne        Wayne     Patrick_Wayne        Wayne
    Michael_Wayne        Wayne  John_Ethan_Wayne        Wayne
    Patrick_Wayne        Wayne  John_Ethan_Wayne        Wayne
    Patrick_Wayne        Wayne     Michael_Wayne        Wayne

I want to drop duplicates of the pair (person1_name, family1_name), (person2_name, family2_name), ignoring the direction of the relationship.
The final result should be:

     person1_name family1_name      person2_name family2_name
 John_Ethan_Wayne        Wayne     Michael_Wayne        Wayne
    Michael_Wayne        Wayne     Patrick_Wayne        Wayne
    Patrick_Wayne        Wayne  John_Ethan_Wayne        Wayne

nmpmafwu #1

For the example you gave, the following is enough:

df[df.person1_name < df.person2_name]

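This works because, given the two rows
A, B
B, A
it drops B, A, since B < A evaluates to False. Note that this relies on every pair appearing in both directions, and that it compares the person names only.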
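The comparison filter can silently lose data when a pair is present in one direction only. The following is a minimal sketch of that case and of one possible order-independent alternative. The extra row (`Zoe_Wayne`, `Adam_Wayne`) and the names `df2`, `key`, `filtered`, and `deduped` are hypothetical and not part of the original question.

```python
import pandas as pd

data = {
    'person1_name': ['John_Ethan_Wayne', 'John_Ethan_Wayne', 'Michael_Wayne', 'Michael_Wayne', 'Patrick_Wayne', 'Patrick_Wayne'],
    'family1_name': ['Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne'],
    'person2_name': ['Michael_Wayne', 'Patrick_Wayne', 'Patrick_Wayne', 'John_Ethan_Wayne', 'John_Ethan_Wayne', 'Michael_Wayne'],
    'family2_name': ['Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne'],
}
df = pd.DataFrame(data)

# Hypothetical extra row: this pair exists in one direction only, as (B, A).
extra = pd.DataFrame({
    'person1_name': ['Zoe_Wayne'], 'family1_name': ['Wayne'],
    'person2_name': ['Adam_Wayne'], 'family2_name': ['Wayne'],
})
df2 = pd.concat([df, extra], ignore_index=True)

# The plain comparison drops the Zoe/Adam pair entirely,
# because 'Zoe_Wayne' < 'Adam_Wayne' is False and no mirrored row exists.
filtered = df2[df2.person1_name < df2.person2_name]

# Alternative: build a key from the two (person, family) tuples in sorted order,
# so (A, B) and (B, A) map to the same key, then keep the first row per key.
left = zip(df2.person1_name, df2.family1_name)
right = zip(df2.person2_name, df2.family2_name)
key = pd.Series([tuple(sorted(pair)) for pair in zip(left, right)], index=df2.index)
deduped = df2[~key.duplicated()]

print(filtered)
print(deduped)
```

Here `deduped` keeps one row per unordered pair, including the one-directional pair, at the cost of a Python-level loop over the rows.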

zyfwsgd6 #2

import pandas as pd

data = {
    'person1_name': ['John_Ethan_Wayne', 'John_Ethan_Wayne', 'Michael_Wayne', 'Michael_Wayne', 'Patrick_Wayne', 'Patrick_Wayne'],
    'family1_name': ['Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne'],
    'person2_name': ['Michael_Wayne', 'Patrick_Wayne', 'Patrick_Wayne', 'John_Ethan_Wayne', 'John_Ethan_Wayne', 'Michael_Wayne'],
    'family2_name': ['Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne']
}

df = pd.DataFrame(data)

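# Build an order-independent key: a frozenset of the two (person, family) tuples,
# so that (A, B) and (B, A) produce the same key.
df['combined'] = df.apply(lambda row: frozenset({(row['person1_name'], row['family1_name']), (row['person2_name'], row['family2_name'])}), axis=1)

# Keep the first row for each key.
df = df.drop_duplicates(subset=['combined'])

# Sort, renumber the index, and drop the helper column.
df = df.sort_values(by=['person1_name', 'person2_name'])
df = df.reset_index(drop=True)
df = df.drop(columns='combined')

print(df)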


35g0bw71 #3

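# Build a hashable, order-independent key per row. The sorted values are wrapped
# in a tuple because drop_duplicates cannot hash the lists that sorted() returns.
df['combined_names'] = df[['person1_name', 'family1_name', 'person2_name', 'family2_name']].apply(lambda row: tuple(sorted(row)), axis=1)

unique_combinations = df.drop_duplicates(subset='combined_names').drop(columns='combined_names')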

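For larger frames, a vectorized variant of the same idea may be preferable to a row-wise `apply`. The sketch below sorts the two name columns within each row using NumPy and then flags repeated rows. It assumes that a person name alone identifies a person, so the family columns are ignored when building the key. The names `names`, `mask`, and `result` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'person1_name': ['John_Ethan_Wayne', 'John_Ethan_Wayne', 'Michael_Wayne', 'Michael_Wayne', 'Patrick_Wayne', 'Patrick_Wayne'],
    'family1_name': ['Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne'],
    'person2_name': ['Michael_Wayne', 'Patrick_Wayne', 'Patrick_Wayne', 'John_Ethan_Wayne', 'John_Ethan_Wayne', 'Michael_Wayne'],
    'family2_name': ['Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne', 'Wayne'],
})

# Sort the two person names within each row, so (A, B) and (B, A) become identical.
# Assumption: family names are not needed to tell people apart.
names = np.sort(df[['person1_name', 'person2_name']].to_numpy(), axis=1)

# Flag every row whose sorted pair has already been seen, then keep the rest.
mask = pd.DataFrame(names, index=df.index).duplicated()
result = df[~mask]

print(result)
```

As with the other answers, this keeps the first occurrence of each pair, so the orientation of the surviving rows follows the original row order.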
