Is there a pandas method to filter one column based off the value in another column?

Posted by 8yoxcaq7 on 2024-01-04 in Other


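Is there a pandas method to filter one column based off the value in another column?
I have a table of patient visits, and I want to look at the time between each patient's first visit and second visit, but only for new clients. To find the new patients, I have to find every patient who has a visit type of "Intake". In code, that means flagging all rows for a patient ID as True if an Intake appears anywhere in that patient's visit types. Right now I build a list of the patients that have an Intake and then create a mask that checks whether each patient is in that list. I'm wondering whether there is a single method, or a better approach, that I can apply to the original mask.
Current approach: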

import pandas as pd

df = pd.DataFrame({
    "ClientID": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4],
    "VisitType": ["Regular Session", "Regular Session", "Regular Session", "Intake", "Regular Session", "Intake", "Regular Session", "Regular Session", "Regular Session", "Regular Session", "Regular Session"],
})
# build the list of new patients: every ClientID that has an Intake visit
newPatientList = list(df.loc[df['VisitType'] == "Intake", 'ClientID'])
# check, row by row, whether the patient is in the new patient list
mask = df['ClientID'].map(lambda id: True if id in newPatientList else False)
# filter the dataframe down to the new patients' rows
df = df[mask].copy()

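Input:
| ClientID | VisitType |
| -- | -- |
| 1 | Regular Session |
| 1 | Regular Session |
| 1 | Regular Session |
| 2 | Intake |
| 2 | Regular Session |
| 3 | Intake |
| 3 | Regular Session |
| 3 | Regular Session |
| 4 | Regular Session |
| 4 | Regular Session |
| 4 | Regular Session |

Desired output, ideally from one line of code:
| ClientID | VisitType |
| -- | -- |
| 2 | Intake |
| 2 | Regular Session |
| 3 | Intake |
| 3 | Regular Session |
| 3 | Regular Session |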

pandas

Source: https://stackoverflow.com/questions/77694519/is-there-a-pandas-method-to-filter-one-column-based-off-the-value-in-another-col

5 Answers

5gfr0r5j1#

You can build a Series of the client IDs that have an Intake visit

df[df["VisitType"]=="Intake"].ClientID

and then filter the original frame down to the rows whose ID is in that Series.

df[df.ClientID.isin(df[df["VisitType"]=="Intake"].ClientID)]

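The two steps above can be put together on the sample data from the question. This is only a minimal sketch: the names `intake_ids` and `new_clients` are mine, not the answerer's, and `.eq("Intake")` is just another spelling of `== "Intake"`.

```python
import pandas as pd

df = pd.DataFrame({
    "ClientID": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4],
    "VisitType": ["Regular Session"] * 3
                 + ["Intake", "Regular Session"]
                 + ["Intake", "Regular Session", "Regular Session"]
                 + ["Regular Session"] * 3,
})

# Step 1: the ClientID of every row whose visit is an Intake.
intake_ids = df.loc[df["VisitType"].eq("Intake"), "ClientID"]

# Step 2: keep every row whose ClientID appears among those IDs.
new_clients = df[df["ClientID"].isin(intake_ids)]
print(new_clients)
```

On this data only the rows for clients 2 and 3 remain, which matches the desired output in the question.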

Posted 2024-01-04

cbeh67ev2#

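Since this seemed like an interesting question, I ran timeit on all of the answers, including the OP's original code, using this DataFrame of 1,000,000 rows, with ClientID values drawn at random from 200,000 possible IDs and a 1-in-100 chance that VisitType == Intake: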

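import random
import pandas as pd

random.seed('filtering')  # fixed seed so every run sees the same data
df = pd.DataFrame({
    "ClientID": random.choices(range(200_000), k=1_000_000),
    # cum_weights=[99, 100] gives "Intake" a 1-in-100 chance
    "VisitType": random.choices(["Regular Session", "Intake"], cum_weights=[99, 100], k=1_000_000)
})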

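The results were:
| Name | Algorithm | Time (s) |
| -- | -- | -- |
| ziying35 | filter (any) | 26.05 |
| Nick | transform/max | 0.098 |
| Nick | cummax | 0.104 |
| tdelaney | isin | 0.05 |
| OP | map | 93.82 |
| ombk | isin | 0.381 |

As the table shows, @tdelaney's answer is about 2x faster than the next two solutions, both proposed by @Nick. @ombk's answer (a more complicated version of @tdelaney's) is about 4x slower again, while the solutions from @ziying35 and the OP are more than 500x slower than @tdelaney's.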
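One plausible reason the OP's `map` version is so slow, which the benchmark itself does not measure, is that every `id in newPatientList` test scans a Python list from the start. The sketch below is my own illustration on much smaller data: the names `as_list`, `as_set` and the three `mask_*` variables are hypothetical, and the row counts are scaled down so it runs quickly.

```python
import random
import pandas as pd

random.seed("filtering")
n = 10_000  # far smaller than the 1,000,000 rows benchmarked above
df = pd.DataFrame({
    "ClientID": random.choices(range(2_000), k=n),
    "VisitType": random.choices(["Regular Session", "Intake"], cum_weights=[99, 100], k=n),
})

intake_ids = df.loc[df["VisitType"] == "Intake", "ClientID"]

# OP's approach: each `in` test walks the list element by element.
as_list = list(intake_ids)
mask_list = df["ClientID"].map(lambda cid: cid in as_list)

# Same logic with a set: each `in` test is a hash lookup instead of a scan.
as_set = set(intake_ids)
mask_set = df["ClientID"].map(lambda cid: cid in as_set)

# Vectorized membership test, as in the fastest answer.
mask_isin = df["ClientID"].isin(intake_ids)

# All three masks select the same rows; only the cost of building them differs.
print((mask_list == mask_isin).all(), (mask_set == mask_isin).all())
```

Swapping the list for a set keeps the OP's structure while removing the linear scan, but `isin` avoids the per-row Python call altogether.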
Full test code:

from timeit import timeit

# ziying35: groupby().filter() with any()
timeit(setup='''
import pandas as pd
import random
random.seed('filtering')
df = pd.DataFrame({"ClientID":random.choices(range(200_000), k=1_000_000), "VisitType": random.choices(["Regular Session", "Intake"], cum_weights=[99,100], k=1_000_000) })
''', stmt='''
out = df.groupby("ClientID").filter(lambda g: g['VisitType'].eq("Intake").any())
''', number=1)

# Nick: groupby().transform(max)
timeit(setup='''
import pandas as pd
import random
random.seed('filtering')
df = pd.DataFrame({"ClientID":random.choices(range(200_000), k=1_000_000), "VisitType": random.choices(["Regular Session", "Intake"], cum_weights=[99,100], k=1_000_000) })
''', stmt='''
out = df[(df['VisitType'] == 'Intake').groupby(df['ClientID']).transform(max)]
''', number=1)

# Nick: groupby().cummax()
timeit(setup='''
import pandas as pd
import random
random.seed('filtering')
df = pd.DataFrame({"ClientID":random.choices(range(200_000), k=1_000_000), "VisitType": random.choices(["Regular Session", "Intake"], cum_weights=[99,100], k=1_000_000) })
''', stmt='''
out = df[(df['VisitType'] == 'Intake').groupby(df['ClientID']).cummax()]
''', number=1)

# tdelaney: isin
timeit(setup='''
import pandas as pd
import random
random.seed('filtering')
df = pd.DataFrame({"ClientID":random.choices(range(200_000), k=1_000_000), "VisitType": random.choices(["Regular Session", "Intake"], cum_weights=[99,100], k=1_000_000) })
''', stmt='''
out = df[df.ClientID.isin(df[df["VisitType"]=="Intake"].ClientID)]
''', number=1)

# OP: map over a Python list
timeit(setup='''
import pandas as pd
import random
random.seed('filtering')
df = pd.DataFrame({"ClientID":random.choices(range(200_000), k=1_000_000), "VisitType": random.choices(["Regular Session", "Intake"], cum_weights=[99,100], k=1_000_000) })
''', stmt='''
newPatientList = list(df.loc[df['VisitType'] == "Intake", 'ClientID'])
out = df[df['ClientID'].map(lambda id: id in newPatientList)]
''', number=1)

# ombk: isin with str.contains and unique()
timeit(setup='''
import pandas as pd
import random
random.seed('filtering')
df = pd.DataFrame({"ClientID":random.choices(range(200_000), k=1_000_000), "VisitType": random.choices(["Regular Session", "Intake"], cum_weights=[99,100], k=1_000_000) })
''', stmt='''
out = df[df["ClientID"].isin(df[df["VisitType"].str.lower().str.contains("intake")]["ClientID"].unique())]
''', number=1)


Posted 2024-01-04

iqjalb3h3#

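The most fun one-liner. I could simplify it and still make it work, but here you go: this is my brain after 10 hours of work.
I group by ClientID and sum all the words in VisitType; summing strings is the same as concatenating them with no spaces in between.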

df[df["ClientID"].isin(df.groupby("ClientID",as_index=False).sum().query('VisitType.str.contains("Intake")')["ClientID"].tolist())]

Here is a cleaner version.

df[df["ClientID"].isin(df[df["VisitType"].str.lower().str.contains("intake")]["ClientID"].unique())]

Nested query:

df.query(f"""ClientID in {df.query("VisitType.str.contains('Intake')")["ClientID"].to_list()}""")

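To make the group-and-sum step of the first one-liner easier to follow, here is a small sketch on a cut-down frame of my own. The names `joined` and `intake_ids` are hypothetical, and it assumes a pandas version in which summing a string column within a group concatenates the strings, as the answer describes.

```python
import pandas as pd

df = pd.DataFrame({
    "ClientID": [1, 1, 2, 2],
    "VisitType": ["Regular Session", "Regular Session", "Intake", "Regular Session"],
})

# Summing a string column within each group joins the strings end to end.
joined = df.groupby("ClientID")["VisitType"].sum()
print(joined)

# A substring test on the joined text then finds the clients with an Intake.
intake_ids = joined[joined.str.contains("Intake")].index.tolist()
print(intake_ids)
```

The resulting ID list is what the one-liner passes to `isin`. The cleaner version skips the concatenation and tests each row's VisitType directly.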

Posted 2024-01-04

dfuffjeb4#

Try this:

df.groupby("ClientID").filter(lambda g: g['VisitType'].eq("Intake").any())
>>>
   ClientID        VisitType
3         2           Intake
4         2  Regular Session
5         3           Intake
6         3  Regular Session
7         3  Regular Session


Posted 2024-01-04

mkh04yzy5#

You can create a mask where VisitType == 'Intake', groupby ClientID, and then use transform with max to broadcast the mask to all rows that share the same ClientID value:

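# per client: True on every row if any of that client's visits is an Intake
# (the string 'max' is used because recent pandas versions warn when the builtin max is passed)
mask = (df['VisitType'] == 'Intake').groupby(df['ClientID']).transform('max')
out = df[mask]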

Output for the sample data:

   ClientID        VisitType
3         2           Intake
4         2  Regular Session
5         3           Intake
6         3  Regular Session
7         3  Regular Session

Note that if Intake is always the first VisitType for a client, you can simply use cummax:

mask = (df['VisitType'] == 'Intake').groupby(df['ClientID']).cummax()

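The condition attached to cummax matters. The sketch below uses made-up data of my own in which client 2's Intake is recorded after a regular session, to show how the two masks differ in that case. The names `by_client`, `broadcast` and `running` are hypothetical.

```python
import pandas as pd

# Hypothetical data: client 2's Intake is NOT the first visit.
df = pd.DataFrame({
    "ClientID": [1, 2, 2, 2],
    "VisitType": ["Regular Session", "Regular Session", "Intake", "Regular Session"],
})

is_intake = df["VisitType"] == "Intake"
by_client = is_intake.groupby(df["ClientID"])

broadcast = by_client.transform("max")  # True on every row of a client with an Intake
running = by_client.cummax()            # True only from the Intake row onward

print(df.assign(transform_max=broadcast, cummax=running))
```

Here the cummax mask leaves out client 2's first row, so it only reproduces the transform/max result when each client's Intake really does come first.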

Posted 2024-01-04
