pandas: find rows whose value in one column is duplicated across the given groups of another column

7nbnzgx9  posted on 2023-06-20 in Other

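For a dataset df, I want to treat the two values in column B, foo and bar, as groups and flag the rows whose value in column A appears in both groups. How can I do this?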

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 1],
                   'B': ['foo', 'bar', 'foo', 'bar', 'foo', 'foo']})
df = df.sort_values('B')
df
Out[15]: 
   A    B
1  2  bar
3  3  bar
0  1  foo
2  2  foo
4  3  foo
5  1  foo

Expected result:

   A    B  Indicator
1  2  bar  True  # value 2 also present in foo, so returns True
3  3  bar  True  # value 3 also present in foo, so returns True
0  1  foo  False  # value 1 only present in foo, so returns False
2  2  foo  True  # value 2 also present in bar, so returns True
4  3  foo  True  # value 3 also present in bar, so returns True
5  1  foo  False  # value 1 only present in foo, so returns False

Update:

Suppose column B has more than two categories. The sample data df then looks like this:

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 2, 1],  'B': ['foo', 'bar', 'foo', 'bar', 'foo', 'baz', 'baz']})
df = df.sort_values('B')
df
Out[30]: 
   A    B
1  2  bar
3  3  bar
5  2  baz
6  1  baz
0  1  foo
2  2  foo
4  3  foo

In that case, the expected result is as follows (a value is flagged only if it occurs in every category):

   A    B  Indicator
1  2  bar  True  # The value 2 occurs in categories baz, bar, and foo, so returns True.
3  3  bar  False  # The value 3 only occurs in categories bar and foo, so returns False.
5  2  baz  True  # The value 2 occurs in categories baz, bar, and foo, so returns True.
6  1  baz  False  # The value 1 only occurs in categories baz and foo, so returns False.
0  1  foo  False  # The value 1 only occurs in categories baz and foo, so returns False.
2  2  foo  True  # The value 2 occurs in categories baz, bar, and foo, so returns True.
4  3  foo  False  # The value 3 only occurs in categories bar and foo, so returns False.
bfhwhh0e  #1

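Since you have more than two groups, you can compare the number of distinct B categories each A value occurs in against the total number of B categories: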

import pandas as pd

data = {'A': [2, 3, 2, 1, 1, 2, 3],
        'B': ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo']}
df = pd.DataFrame(data).sort_values('B')

df['Indicator'] = df.groupby('A')['B'].transform('nunique') == df['B'].nunique()

Output:

>>> df
   A    B  Indicator
0  2  bar       True
1  3  bar      False
2  2  baz       True
3  1  baz      False
4  1  foo      False
5  2  foo       True
6  3  foo      False
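
To make the one-liner above easier to follow, here is a minimal sketch that splits it into its two halves. The intermediate names n_categories and categories_per_value are only illustrative and do not appear in the answer; the data is the same three-category sample.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 2, 1, 1, 2, 3],
                   'B': ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo']})

# Total number of distinct categories in column B (bar, baz, foo).
n_categories = df['B'].nunique()

# For every row, the number of distinct B categories in which that row's
# A value occurs. transform() broadcasts the per-group count back onto
# the original rows, so the result is aligned with df.
categories_per_value = df.groupby('A')['B'].transform('nunique')

# A row is flagged only when its A value occurs in every category.
df['Indicator'] = categories_per_value == n_categories
print(df)
```

Because the comparison is against df['B'].nunique(), the same line also covers the original two-category case without changes.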
pbpqsu0x  #2

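If you need the A values that are common to all B groups, there are two options.
The first idea is to use crosstab to find the A values that are present in every group, then flag the matching rows with Series.isin (output shown for the first, two-category DataFrame):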

df1 = pd.crosstab(df.A, df.B).astype(bool)

df['Indicator'] = df['A'].isin(df1.index[df1.all(axis=1)])
print(df)
   A    B  Indicator
1  2  bar       True
3  3  bar       True
0  1  foo      False
2  2  foo       True
4  3  foo       True
5  1  foo      False
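
As a rough illustration of what the crosstab step produces, the sketch below keeps the intermediate presence table in a variable so it can be inspected. The names presence and common are illustrative only; the data is the two-category sample from the question, left unsorted.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 1],
                   'B': ['foo', 'bar', 'foo', 'bar', 'foo', 'foo']})

# Count how often each (A, B) pair occurs, then reduce the counts to
# presence flags: True means "this A value occurs in this B group".
presence = pd.crosstab(df['A'], df['B']).astype(bool)
print(presence)

# Keep only the A values whose row is True in every B column.
common = presence.index[presence.all(axis=1)]

# Flag the rows of df whose A value is one of those common values.
df['Indicator'] = df['A'].isin(common)
print(df)
```

Here the value 1 is missing from bar, so its row in presence contains a False and it is dropped from common.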

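Alternatively, build one set of A values per B group and take the intersection of those sets, shown here for the last (three-category) DataFrame: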

setlist = df.groupby('B')['A'].agg(set)
print(setlist)
B
bar       {2, 3}
baz       {1, 2}
foo    {1, 2, 3}
Name: A, dtype: object

u = set.intersection(*setlist)
print(u)
{2}

df['Indicator'] = df['A'].isin(u)
print(df)
   A    B  Indicator
1  2  bar       True
3  3  bar      False
5  2  baz       True
6  1  baz      False
0  1  foo      False
2  2  foo       True
4  3  foo      False
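
If the set-intersection approach is needed in more than one place, it could be wrapped in a small helper. This is only a sketch: flag_common_values and its parameters are hypothetical names, not part of pandas, and it assumes the frame contains at least one group (set.intersection raises a TypeError when called with no sets).

```python
import pandas as pd


def flag_common_values(frame, value_col='A', group_col='B'):
    """Return a boolean Series: True where the value occurs in every group."""
    # One set of values per group, e.g. bar -> {2, 3}.
    sets = frame.groupby(group_col)[value_col].agg(set)
    # Values shared by all groups.
    common = set.intersection(*sets)
    return frame[value_col].isin(common)


# Two-category sample from the question.
two = pd.DataFrame({'A': [1, 2, 2, 3, 3, 1],
                    'B': ['foo', 'bar', 'foo', 'bar', 'foo', 'foo']})
# Three-category sample from the update.
three = pd.DataFrame({'A': [1, 2, 2, 3, 3, 2, 1],
                      'B': ['foo', 'bar', 'foo', 'bar', 'foo', 'baz', 'baz']})

two_flags = flag_common_values(two)
three_flags = flag_common_values(three)
print(two.assign(Indicator=two_flags))
print(three.assign(Indicator=three_flags))
```

The helper returns the flags instead of modifying the frame, so the caller decides whether to assign them to a new column.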
