pandas: How to filter rows in Python based on values in other rows?

7jmck4yq  posted on 2023-04-28 in Python

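I have a dataset like the one below, only much larger. Each row is a shot in a football match, and the DataFrame contains many different matches (identified by Match_ID).
I want to filter for shots that happen within 30 minutes after a Goal in the same match. In other words, I want to keep the rows for which, within the same Match_ID, there is a row with a Shot_Outcome of Goal in the 30 minutes before the current row.
I want to do this for every Match_ID in the whole dataset, so here I would want to keep rows 3 and 6. How can I do this?
| |Match_ID|Minute|Shot_Outcome|
| --------------|--------------|--------------|--------------|
| 0|3857257|3|Blocked|
| 1|3857257|23|Goal|
| 2|4857254|30|Goal|
| 3|4857254|45|Off T|
| 4|4857254|89|Saved|
| 5|6789234|34|Goal|
| 6|6789234|47|Goal|

I am new to Python, so I don't know how to approach this.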
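
To make the question reproducible, here is a minimal sketch that rebuilds the sample table as a DataFrame. The variable names `df` and `expected` are assumptions for illustration; the values are copied from the table in the question.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "Match_ID": [3857257, 3857257, 4857254, 4857254, 4857254, 6789234, 6789234],
    "Minute": [3, 23, 30, 45, 89, 34, 47],
    "Shot_Outcome": ["Blocked", "Goal", "Goal", "Off T", "Saved", "Goal", "Goal"],
})

# Rows the question asks to keep: each one follows a Goal in the
# same match by no more than 30 minutes (45 - 30 = 15, 47 - 34 = 13)
expected = df.loc[[3, 6]]
print(expected)
```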

bnl4lu3b1#

There may be more efficient ways to solve this, but here is one solution:

Sample data

import numpy as np
import pandas as pd

np.random.seed(234)
df = pd.DataFrame({
    'ID': np.random.choice(list(range(50)), size= 500, replace=True)
    , 'time': np.random.randint(1, 90, size= 500)
    , 'action': np.random.choice(['Goal', 'Blocked', 'Save'], size= 500, replace=True
                                 , p= [0.05, 0.5, 0.45])
})

Code

def shot_filter(df, interval=29):
    """
    For the input `df`, filter for goals + non-goal actions that occur w/in `interval` of the
    goal's time
    Return the subset
    """
    df.sort_values(by='time', inplace=True)
    actions = df['action'].tolist()
    ts = df['time'].tolist()
    idx = []
    i, goal, goal_time = 0, False, None
    while i < df.shape[0]:
        if goal and ts[i] < goal_time + interval and actions[i] != 'Goal':
            idx.append(i)
        elif actions[i] == 'Goal':
            idx.append(i)
            goal = True
            goal_time = ts[i]
        i += 1
    return df.iloc[idx, :]

# Apply per match ID
subset = []
for j in np.unique(df['ID']):
    sub_df = df.loc[df['ID'] == j].copy(deep=True)
    sub_df = shot_filter(sub_df, 29)
    subset.append(sub_df)

subset = pd.concat(subset)

A cheap but incomplete check

from collections import Counter

Counter(df['action'])
# Counter({'Blocked': 254, 'Save': 224, 'Goal': 22})

Counter(subset['action'])
# Counter({'Blocked': 35, 'Save': 27, 'Goal': 22})
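
Counting actions only shows that rows were dropped, not that the right rows were kept. As a hedged sketch of a stricter check, the helper below tests, for every non-goal row, whether the same `ID` has a goal at or before it and fewer than `interval` minutes earlier. The function name `has_recent_goal`, the helper column `row`, and the `demo` frame are assumptions for illustration, not part of the answer above.

```python
import pandas as pd

def has_recent_goal(subset: pd.DataFrame, interval: int = 29) -> pd.Series:
    """For each non-goal row, report whether the same ID has a goal
    at or before it, fewer than `interval` minutes earlier."""
    goals = (
        subset.loc[subset["action"] == "Goal", ["ID", "time"]]
        .rename(columns={"time": "goal_time"})
    )
    others = subset.loc[subset["action"] != "Goal", ["ID", "time"]]
    others = others.assign(row=others.index)      # remember the original row label
    pairs = others.merge(goals, on="ID", how="left")
    gap = pairs["time"] - pairs["goal_time"]      # NaN when the ID has no goal
    pairs["ok"] = (gap >= 0) & (gap < interval)   # comparisons with NaN are False
    return pairs.groupby("row")["ok"].any()

# Small hand-built example (hypothetical data)
demo = pd.DataFrame({
    "ID": [1, 1, 1, 2],
    "time": [10, 20, 60, 5],
    "action": ["Goal", "Save", "Blocked", "Save"],
})
check = has_recent_goal(demo)
print(check)
```

Applied to the `subset` built earlier, `has_recent_goal(subset).all()` would be expected to be `True` if the filter behaves as its docstring describes.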

5q4ezhmt2#

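Given the input DataFrame (assigned to input_df), the following keeps the shots that fall within 30 minutes after the earliest goal of each match: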

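import pandas as pd

df_list = []

# Process one match at a time, with its shots in chronological order
for match_id in set(input_df["Match_ID"]):
    filtered_df = input_df[input_df["Match_ID"] == match_id].sort_values("Minute")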

    if len(filtered_df) > 1 and "Goal" in set(filtered_df["Shot_Outcome"]):
        minute_of_earliest_goal = filtered_df[filtered_df["Shot_Outcome"] == "Goal"][
            "Minute"
        ].min()
        max_minute_mark = 30 + minute_of_earliest_goal
        minutes_mask = (filtered_df["Minute"] > minute_of_earliest_goal) & (
            filtered_df["Minute"] < max_minute_mark
        )
        temp_df = filtered_df[minutes_mask]
        if not temp_df.empty:
            df_list.append(temp_df)

res_df = pd.concat(df_list)

res_df will show the following:
| |Match_ID|Minute|Shot_Outcome|
| --------------|--------------|--------------|--------------|
| 3|4857254|45|Off T|
| 6|6789234|47|Goal|
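
The same "earliest goal per match" logic can also be sketched without an explicit loop. This is an alternative formulation, not the answer's own code: it computes the first goal minute per match with `groupby`, aligns it to every row with `Series.map`, and applies the same strict bounds. The names `first_goal` and `ref` are assumptions for illustration, and `input_df` is rebuilt from the question's table.

```python
import pandas as pd

# Sample data copied from the table in the question
input_df = pd.DataFrame({
    "Match_ID": [3857257, 3857257, 4857254, 4857254, 4857254, 6789234, 6789234],
    "Minute": [3, 23, 30, 45, 89, 34, 47],
    "Shot_Outcome": ["Blocked", "Goal", "Goal", "Off T", "Saved", "Goal", "Goal"],
})

# Minute of the earliest goal in each match (index: Match_ID)
first_goal = (
    input_df.loc[input_df["Shot_Outcome"] == "Goal"]
    .groupby("Match_ID")["Minute"]
    .min()
)

# Align that minute to every row; matches without a goal get NaN
ref = input_df["Match_ID"].map(first_goal)

# Strictly after the earliest goal and less than 30 minutes later
mask = (input_df["Minute"] > ref) & (input_df["Minute"] < ref + 30)
res_df = input_df[mask]
print(res_df)
```

Unlike the loop, this keeps the rows in their original order, and it returns an empty DataFrame rather than raising when no match contains a goal.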

ejk8hzay3#

Here is a modified version of the code that should do what you expect. Hope this helps :)
Input data
| |Match_ID|Minute|Shot_Outcome|
| --------------|--------------|--------------|--------------|
| 0|3857257|3|Blocked|
| 1|3857257|23|Goal|
| 2|4857254|30|Goal|
| 3|4857254|45|Off T|
| 4|4857254|89|Saved|
| 5|6789234|34|Goal|
| 6|6789234|47|Goal|
| 7|6789234|49|Goal|
| 8|6789234|67|Goal|

Code

# Get the Matches unique IDs
unique_ID = df['Match_ID'].unique()
# Create an empty list to store the indices of the shots that have taken place up to 30 minutes after a Goal but in the same match
indices_list = []

# For each unique Match...
for ID in unique_ID:
    # ... Create a subset of the DataFrame
    df_ID = df[df['Match_ID']==ID]
    # If a Goal is scored during the match and at least two shots have been registered
    if len(df_ID)>1 and 'Goal' in df_ID['Shot_Outcome'].values:
        # Create a subset of the match shots where Shot_Outcome == "Goal"
        df_goal_subset = df_ID[df_ID['Shot_Outcome']=='Goal']
        # Loop through each goal in the subset and add all shots within 30 minutes to the indices_list
        for _, row in df_goal_subset.iterrows():
            earliest_goal = row['Minute']
            filtered_df = df_ID[(df_ID['Minute']>earliest_goal) & (df_ID['Minute'] <= (earliest_goal + 30))]
            indices_list.extend(list(filtered_df.index))

# Filter the original DataFrame on the indices of the shots that have taken place up to 30 minutes after a Goal but in the same match
df[df.index.isin(indices_list)]

Output
| |Match_ID|Minute|Shot_Outcome|
| --------------|--------------|--------------|--------------|
| 3|4857254|45|Off T|
| 6|6789234|47|Goal|
| 7|6789234|49|Goal|
| 8|6789234|67|Goal|
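
For larger datasets, the nested loop over matches and goals can be replaced by a self-merge. The sketch below is an alternative under the same rule (a shot is kept if any goal in the same match precedes it by more than 0 and at most 30 minutes); the helper names `goals`, `pairs`, `row` and `Goal_Minute` are assumptions for illustration.

```python
import pandas as pd

# Input data copied from the table above
df = pd.DataFrame({
    "Match_ID": [3857257, 3857257, 4857254, 4857254, 4857254,
                 6789234, 6789234, 6789234, 6789234],
    "Minute": [3, 23, 30, 45, 89, 34, 47, 49, 67],
    "Shot_Outcome": ["Blocked", "Goal", "Goal", "Off T", "Saved",
                     "Goal", "Goal", "Goal", "Goal"],
})

# One row per goal, with its minute under a separate column name
goals = (
    df.loc[df["Shot_Outcome"] == "Goal", ["Match_ID", "Minute"]]
    .rename(columns={"Minute": "Goal_Minute"})
)

# Pair every shot with every goal of the same match
pairs = df.assign(row=df.index).merge(goals, on="Match_ID")

# Keep shots that come after a goal by at most 30 minutes
gap = pairs["Minute"] - pairs["Goal_Minute"]
keep = pairs.loc[(gap > 0) & (gap <= 30), "row"].unique()

result = df.loc[df.index.isin(keep)]
print(result)
```

The intermediate `pairs` frame has one row per (shot, goal) combination within a match, so memory use grows with the number of goals per match.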

oprakyz74#

This should also work:

df = df.sort_values('Shot_Outcome',key = lambda x: x.ne('Goal')).sort_values('Minute',kind='mergesort')

(df.loc[(df.groupby(['Match_ID',
                     df['Shot_Outcome'].eq('Goal').groupby(df['Match_ID']).cumsum().loc[lambda x: x.gt(0)]])['Minute'].transform(lambda x: x.diff().cumsum()).le(30)) | 
                     df['Shot_Outcome'].eq('Goal')]
                     .sort_values(['Match_ID','Minute']))
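
The chained expression above is compact but hard to follow. As a more readable alternative sketch (a different technique, not a rewrite of that expression), `pd.merge_asof` can look up the most recent earlier goal in the same match for each shot. The helper columns `row` and `Goal_Minute` are assumptions for illustration; `allow_exact_matches=False` excludes a goal at the very same minute, and `tolerance=30` limits the look-back window (I believe the bound is inclusive, which is worth confirming on your pandas version).

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "Match_ID": [3857257, 3857257, 4857254, 4857254, 4857254, 6789234, 6789234],
    "Minute": [3, 23, 30, 45, 89, 34, 47],
    "Shot_Outcome": ["Blocked", "Goal", "Goal", "Off T", "Saved", "Goal", "Goal"],
})

# merge_asof requires both frames to be sorted by the "on" key
shots = df.assign(row=df.index).sort_values("Minute")
goals = (
    df.loc[df["Shot_Outcome"] == "Goal", ["Match_ID", "Minute"]]
    .assign(Goal_Minute=lambda d: d["Minute"])
    .sort_values("Minute")
)

# For each shot, find the latest strictly earlier goal in the same match,
# ignoring goals that lie outside the tolerance window
merged = pd.merge_asof(
    shots, goals,
    on="Minute", by="Match_ID",
    direction="backward",
    tolerance=30,
    allow_exact_matches=False,
)

# Shots without a qualifying goal have NaN in Goal_Minute
kept_rows = merged.loc[merged["Goal_Minute"].notna(), "row"].sort_values().tolist()
result = df.loc[kept_rows]
print(result)
```

Because each shot is matched to at most one goal, this avoids building every (shot, goal) pair.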
