pandas: how to filter values in one column that share a common value in another column, applying two conditions in the filter

Posted by r3i60tvu on 2023-04-04 in Other


I have a DataFrame like this:

import pandas as pd

df = pd.DataFrame({'patient': ['patient1', 'patient1', 'patient1', 'patient2', 'patient2', 'patient3', 'patient3', 'patient4'],
                   'gene': ['TYR', 'TYR', 'TYR', 'TYR', 'TYR', 'TYR', 'TYR', 'TYR'],
                   'variant': ['buu', 'luu', 'stm', 'lol', 'bla', 'buu', 'lol', 'buu'],
                   'genotype': ['hom', 'het', 'hom', 'het', 'hom', 'het', 'het', 'het']})
df

    patient gene variant genotype
0  patient1  TYR     buu      hom
1  patient1  TYR     luu      het
2  patient1  TYR     stm      hom
3  patient2  TYR     lol      het
4  patient2  TYR     bla      hom
5  patient3  TYR     buu      het
6  patient3  TYR     lol      het
7  patient4  TYR     buu      het

I want to identify the patients that have buu and at least one other variant, but not luu.

 patient gene variant genotype
patient3  TYR     buu      het
patient3  TYR     lol      het

How can I do this?

pandas

Source: https://stackoverflow.com/questions/75914405/how-to-filter-values-in-a-column-with-common-values-in-another-common-and-apply

3 Answers


tuwxkamq1#

Using `set` operations:

# aggregate the variants as sets
g = df.groupby('patient')['variant'].agg(set)

# keep the patients with more than "buu" but not "luu"
keep = g.index[g.gt({'buu'}) & ~g.ge({'luu'})]
# ['patient3']

# index rows of the above patients
out = df[df['patient'].isin(keep)]

Output:

    patient gene variant genotype
5  patient3  TYR     buu      het
6  patient3  TYR     lol      het

Intermediates:

g

patient
patient1    {buu, stm, luu}
patient2         {bla, lol}
patient3         {buu, lol}
patient4              {buu}
Name: variant, dtype: object

g.gt({'buu'})  # g > {'buu'}

patient
patient1     True
patient2    False
patient3     True
patient4    False
Name: variant, dtype: bool

~g.ge({'luu'}) # ~(g>={'luu'})

patient
patient1    False
patient2     True
patient3     True
patient4     True
Name: variant, dtype: bool
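
The two masks above rely on Python's built-in set comparison operators, which are evaluated element by element on the column of sets: `>` tests for a proper superset and `>=` for a superset. A minimal plain-Python sketch of the same checks, with the sets copied from `g` (the `variants` dict is only an illustration, not part of the answer):

```python
# Variant sets per patient, copied from the intermediate `g` above
variants = {
    'patient1': {'buu', 'stm', 'luu'},
    'patient2': {'bla', 'lol'},
    'patient3': {'buu', 'lol'},
    'patient4': {'buu'},
}

# s > {'buu'}  : s contains 'buu' and at least one other variant (proper superset)
# s >= {'luu'} : s contains 'luu' (superset)
keep = [p for p, s in variants.items() if s > {'buu'} and not s >= {'luu'}]
print(keep)  # ['patient3']
```

patient4 is dropped because `{'buu'} > {'buu'}` is False: a set is not a proper superset of itself.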

Using `groupby.agg`:

m = (
 df.assign(has_buu=df['variant'].eq('buu'),
           not_luu=df['variant'].ne('luu'),
           other=~df['variant'].isin(['buu', 'luu'])
          )
   .groupby('patient')
   .agg({'has_buu': 'any', 'not_luu': 'all', 'other': 'any'})
)

out = df[df['patient'].isin(m.index[m.all(axis=1)])]

Output:

    patient gene variant genotype
5  patient3  TYR     buu      het
6  patient3  TYR     lol      het

Intermediates:

m

          has_buu  not_luu  other
patient                          
patient1     True    False   True
patient2    False     True   True
patient3     True     True   True
patient4     True     True  False
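
If the same check is needed for other variants, the `groupby.agg` approach could be wrapped in a small function. This is only a sketch: `patients_matching` and its parameters `required` and `excluded` are hypothetical names that do not appear in the answer.

```python
import pandas as pd

def patients_matching(df, required, excluded):
    """Rows of patients that have `required`, at least one other variant,
    and none of the `excluded` variants (hypothetical helper)."""
    variant = df['variant']
    flags = (
        df.assign(has_required=variant.eq(required),
                  not_excluded=~variant.isin(excluded),
                  other=~variant.isin([required, *excluded]))
          .groupby('patient')
          .agg({'has_required': 'any', 'not_excluded': 'all', 'other': 'any'})
    )
    # keep only the patients for which all three flags are True
    keep = flags.index[flags.all(axis=1)]
    return df[df['patient'].isin(keep)]

df = pd.DataFrame({'patient': ['patient1', 'patient1', 'patient1', 'patient2',
                               'patient2', 'patient3', 'patient3', 'patient4'],
                   'gene': ['TYR'] * 8,
                   'variant': ['buu', 'luu', 'stm', 'lol', 'bla', 'buu', 'lol', 'buu'],
                   'genotype': ['hom', 'het', 'hom', 'het', 'hom', 'het', 'het', 'het']})

out = patients_matching(df, required='buu', excluded=['luu'])
print(out)
```

With the question's data this selects the two patient3 rows, as in the output above.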

2023-04-04

e0bqpujr2#

You can also try the following solution:

import pandas as pd

# First we filter out the groups that have only 1 observation
g = df.groupby('patient').filter(lambda x: len(x) > 1)

# Then we apply both conditions: the group has 'buu' and does not have 'luu'
# (`and not` on the two booleans avoids bitwise `~` on a plain bool)
m = g.groupby('patient')['variant'].transform(lambda x: x.eq('buu').any() and not x.eq('luu').any())

g.loc[m]

    patient gene variant genotype
5  patient3  TYR     buu      het
6  patient3  TYR     lol      het

2023-04-04

mqkwyuun3#

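There may be a one-line solution, but I would rather build it up step by step so you can follow the logic. You want all patients that do not have the 'luu' variant. In the database world, the simpler approach is to get all patients that *do* have 'luu' and separate them out of the original data.

1. Get the patients with the 'luu' variant: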

patients = list(df[df['variant'] == 'luu']['patient'])

This returns a list of the patients that have 'luu' as a variant.

2. Get all entries belonging to the other patients:

df = df[~df.patient.isin(patients)]

For your input, you will get:

df
    patient gene variant genotype
3  patient2  TYR     lol      het
4  patient2  TYR     bla      hom
5  patient3  TYR     buu      het
6  patient3  TYR     lol      het
7  patient4  TYR     buu      het

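From this point, I am not entirely sure how to get the expected output from "I want to identify the patients that have buu and another variant, but not luu". But if you want to apply several conditions at once, you can combine them into a single boolean mask:

df[~df.patient.isin(patients) & (df.genotype == 'het')]

This returns: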

patient gene variant genotype
3  patient2  TYR     lol      het
5  patient3  TYR     buu      het
6  patient3  TYR     lol      het
7  patient4  TYR     buu      het
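
One possible way to continue from step 2 to the output the question asks for is sketched below. It assumes that "buu and other variants" means the patient has 'buu' plus at least one other distinct variant; step 3 and the names `luu_patients`, `rest`, `has_buu` and `n_variants` are illustrative additions, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'patient': ['patient1', 'patient1', 'patient1', 'patient2',
                               'patient2', 'patient3', 'patient3', 'patient4'],
                   'gene': ['TYR'] * 8,
                   'variant': ['buu', 'luu', 'stm', 'lol', 'bla', 'buu', 'lol', 'buu'],
                   'genotype': ['hom', 'het', 'hom', 'het', 'hom', 'het', 'het', 'het']})

# Step 1: patients that have the 'luu' variant
luu_patients = df.loc[df['variant'] == 'luu', 'patient'].unique()

# Step 2: drop every row belonging to those patients
rest = df[~df['patient'].isin(luu_patients)]

# Step 3 (assumed continuation): per remaining patient, require 'buu'
# and more than one distinct variant
has_buu = rest['variant'].eq('buu').groupby(rest['patient']).transform('any')
n_variants = rest.groupby('patient')['variant'].transform('nunique')

out = rest[has_buu & (n_variants > 1)]
print(out)
```

Here patient2 is dropped for lacking 'buu' and patient4 for having only one variant, which leaves the two patient3 rows.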

2023-04-04
