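For a dataset df, I want to group by the two groups foo and bar in column B and identify the duplicated rows that are present in both groups. How can I do this?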
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 1],
                   'B': ['foo', 'bar', 'foo', 'bar', 'foo', 'foo']})
df = df.sort_values('B')
df
Out[15]:
   A    B
1  2  bar
3  3  bar
0  1  foo
2  2  foo
4  3  foo
5  1  foo
Expected result:
   A    B  Indicator
1  2  bar       True  # value 2 also present in foo, so returns True
3  3  bar       True  # value 3 also present in foo, so returns True
0  1  foo      False  # value 1 only present in foo, so returns False
2  2  foo       True  # value 2 also present in bar, so returns True
4  3  foo       True  # value 3 also present in bar, so returns True
5  1  foo      False  # value 1 only present in foo, so returns False
Update:
Suppose column B has more than two categories, with the sample data df as follows:
df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 2, 1], 'B': ['foo', 'bar', 'foo', 'bar', 'foo', 'baz', 'baz']})
df = df.sort_values('B')
df
Out[30]:
   A    B
1  2  bar
3  3  bar
5  2  baz
6  1  baz
0  1  foo
2  2  foo
4  3  foo
In this case, the expected result is as follows:
   A    B  Indicator
1  2  bar       True  # The value 2 occurs in categories baz, bar, and foo, so returns True.
3  3  bar      False  # The value 3 only occurs in categories bar and foo, so returns False.
5  2  baz       True  # The value 2 occurs in categories baz, bar, and foo, so returns True.
6  1  baz      False  # The value 1 only occurs in categories baz and foo, so returns False.
0  1  foo      False  # The value 1 only occurs in categories baz and foo, so returns False.
2  2  foo       True  # The value 2 occurs in categories baz, bar, and foo, so returns True.
4  3  foo      False  # The value 3 only occurs in categories bar and foo, so returns False.
2 Answers
bfhwhh0e1#
Since you have multiple groups, you can use:
Output:
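One way to handle multiple groups is sketched below. It is an assumption about the kind of approach this answer describes, not necessarily the answerer's exact code: for each value of A, count the distinct B categories it appears in and compare that count with the total number of categories. The name n_groups is introduced here purely for illustration.

```python
import pandas as pd

# Sample data from the updated question (three categories in B).
df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 2, 1],
                   'B': ['foo', 'bar', 'foo', 'bar', 'foo', 'baz', 'baz']})
df = df.sort_values('B')

# Total number of distinct categories in B (bar, baz, foo).
n_groups = df['B'].nunique()

# For each row, count the distinct B categories in which its A value occurs,
# then flag rows whose A value occurs in every category.
df['Indicator'] = df.groupby('A')['B'].transform('nunique').eq(n_groups)

print(df)
```

With this data only the value 2 occurs in all three categories, so only its rows are flagged. The same comparison should also cover the original two-category case, since n_groups is computed from the data rather than hard-coded.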
pbpqsu0x2#
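If you need the A values that occur in every group of B, use one of the following. The first idea is to use crosstab to find the A values that exist in each group, then filter the A values with Series.isin:
Or use the intersection of the sets of A values per group of B to build the final DataFrame: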
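Both ideas can be sketched as follows. This is a reconstruction under assumptions rather than the answerer's original code. The column names Indicator_crosstab and Indicator_sets and the intermediate variable names are hypothetical, chosen only so the two results can be compared side by side.

```python
import pandas as pd

# Sample data from the updated question (three categories in B).
df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 2, 1],
                   'B': ['foo', 'bar', 'foo', 'bar', 'foo', 'baz', 'baz']})
df = df.sort_values('B')

# Approach 1: crosstab counts how often each A value (rows)
# occurs in each B category (columns).
ct = pd.crosstab(df['A'], df['B'])
# Keep the A values with a non-zero count in every B column.
common_ct = ct[ct.gt(0).all(axis=1)].index
df['Indicator_crosstab'] = df['A'].isin(common_ct)

# Approach 2: build one set of A values per B category,
# then intersect all of those sets.
sets_per_group = df.groupby('B')['A'].apply(set)
common_sets = set.intersection(*sets_per_group)
df['Indicator_sets'] = df['A'].isin(list(common_sets))

print(df)
```

The two indicator columns should agree with each other. The crosstab variant builds a full A-by-B count table, which is convenient for inspection, while the set variant keeps only the distinct values per group.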