Pandas: group events by date, then test whether other values are equal

Posted by 9rbhqvlz on 2023-03-28 in Other

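Based on the date and address at which each case of a disease occurred, I am looking for outbreaks: cases that occur at the same location within a specified time window. The DataFrame is large, about 300K rows.
There is a nice solution by jezrael that, for each row, matches dates within a specified number of days before or after that row's date (I am not sure whether it can handle 300K rows):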

import pandas as pd

df = pd.DataFrame(
    [
        ['2020-01-01 10:00', '1', 'A'],
        ['2020-01-01 10:01', '2', 'A'],
        ['2020-01-01 10:02', '3a', 'A'],
        ['2020-01-01 10:02', '3b', 'B'],
        ['2020-01-01 10:30', '4', 'B'],
        ['2020-01-01 10:50', '5', 'B'],
        ['2020-01-01 10:54', '6', 'B'],
        ['2020-01-01 10:55', '7', 'B'],
    ], columns=['event_time', 'event_id', 'Address']
)

# solution matching dates within range of date in row by jezrael
df['event_time'] = pd.to_datetime(df['event_time'])

td = pd.Timedelta("1m")
f = lambda x, y: df.loc[df['event_time'].between(y - td, y + td),
                        'event_id'].drop(x).tolist()
df['related_event_id_list'] = [f(k, v) for k, v in df['event_time'].items()]
print (df)
           event_time event_id related_event_id_list  Address
0 2020-01-01 10:00:00        1                   [2]     A
1 2020-01-01 10:01:00        2           [1, 3a, 3b]     A
2 2020-01-01 10:02:00       3a               [2, 3b]     A
3 2020-01-01 10:02:00       3b               [2, 3a]     B
4 2020-01-01 10:30:00        4                    []     B
5 2020-01-01 10:50:00        5                    []     B
6 2020-01-01 10:54:00        6                   [7]     B
7 2020-01-01 10:55:00        7                   [6]     B

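I tried to include the address in the original comparison, without success. I am not sure how to compare Address values across the whole related_event_id_list, or whether it would be better to match on address first (reducing the number of rows) and then adapt jezrael's solution to that output.
The output should let me count events by start date, end date, and address. Adapting jezrael's solution, as a first step it would be: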

event_time event_id related_event_id_list  Address
0 2020-01-01 10:00:00        1                   [2]     A
1 2020-01-01 10:01:00        2               [1, 3a]     A
2 2020-01-01 10:02:00       3a                   [2]     A
3 2020-01-01 10:02:00       3b                    []     B
4 2020-01-01 10:30:00        4                    []     B
5 2020-01-01 10:50:00        5                    []     B
6 2020-01-01 10:54:00        6                   [7]     B
7 2020-01-01 10:55:00        7                   [6]     B

However, since the first three rows (and the last two rows) each represent one continuous outbreak, the solution is really more like:

event_time_start  event_time_end     events_and_related_event_id_list  Address
0 2020-01-01 10:00:00  2020-01-01 10:02:00        [1, 2, 3a]     A
6 2020-01-01 10:54:00  2020-01-01 10:55:00        [6, 7]         B

#1 e0bqpujr

You can use NumPy broadcasting to do the aggregation:

import numpy as np
import pandas as pd

def find_related_event(df):
    evt = df['event_time'].values
    # Pairwise |t_i - t_j| via broadcasting; True where two events are within 1 minute
    out = np.abs(evt[:, None] - evt) <= pd.Timedelta('1m')
    # An event does not count as related to itself
    out[np.diag_indices(out.shape[0])] = False
    # Keep only the events that have at least one neighbour inside the window
    df1 = df.loc[out.any(axis=1)]
    return pd.Series({'index': df1.index[0],
        'event_time_start': df1['event_time'].iloc[0],
        'event_time_stop': df1['event_time'].iloc[-1],
        'events_and_related_event_id_list': df1['event_id'].tolist()
    })

out = (df.groupby('Address', as_index=False).apply(find_related_event)
         .set_index('index').rename_axis(None))

Output:

>>> out
  Address    event_time_start     event_time_stop events_and_related_event_id_list
0       A 2020-01-01 10:00:00 2020-01-01 10:02:00                       [1, 2, 3a]
6       B 2020-01-01 10:54:00 2020-01-01 10:55:00                           [6, 7]
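
To make the broadcasting step above more concrete, here is a small sketch restricted to the three address-A events from the question's sample data. The names `diff`, `mask` and `has_neighbour` are illustrative and do not come from the answer, and `np.fill_diagonal` is swapped in for the `np.diag_indices` assignment, which should have the same effect here.

```python
import numpy as np
import pandas as pd

# Event times for address A only (taken from the sample data in the question).
evt = pd.to_datetime(
    ['2020-01-01 10:00', '2020-01-01 10:01', '2020-01-01 10:02']
).values

# evt[:, None] has shape (3, 1) and evt has shape (3,), so the subtraction
# broadcasts to a (3, 3) matrix of pairwise time differences.
diff = np.abs(evt[:, None] - evt)
mask = diff <= np.timedelta64(1, 'm')

# Every event is trivially within the window of itself, so clear the diagonal.
np.fill_diagonal(mask, False)

# True for each event that has at least one other event within one minute.
has_neighbour = mask.any(axis=1)
print(mask)
print(has_neighbour)
```

Row 0 and row 2 are two minutes apart, so they are not directly related, but both are related to row 1, which is why all three events end up flagged. Note that the matrix has one cell per pair of events in the group, so memory grows quadratically with the number of events at a single address.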

Alternative:

def find_related_event(evt):
    out = np.abs(evt.values[:, None] - evt.values) <= pd.Timedelta('1m')
    out[np.diag_indices(out.shape[0])] = False
    return out.any(axis=1)

m = df.groupby('Address')['event_time'].transform(find_related_event)
out = df.loc[m].groupby('Address', as_index=False).agg(
            event_time_start=('event_time', 'first'),
            event_time_stop=('event_time', 'last'),
            events_and_related_event_id_list=('event_id', list)
      )
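
Since the question mentions about 300K rows, the pairwise matrix could become large if a single address has many events. As a hedged alternative, and not part of the answer above, the following sketch sorts by address and time, measures the gap to the previous event at the same address, and starts a new chain whenever that gap exceeds the window. It assumes an outbreak is a chain of events at one address in which each consecutive gap is at most the window; the names `find_outbreaks`, `window` and `outbreak_id` are made up for illustration.

```python
import pandas as pd

def find_outbreaks(df, window=pd.Timedelta(minutes=1)):
    """Chain events per address while consecutive gaps stay <= window."""
    df = df.sort_values(['Address', 'event_time'])
    # Gap to the previous event at the same address (NaT for the first one).
    gap = df.groupby('Address')['event_time'].diff()
    # A new chain starts at the first event of an address or after a large gap.
    new_chain = gap.isna() | (gap > window)
    outbreak_id = new_chain.cumsum().rename('outbreak_id')
    out = (df.groupby(['Address', outbreak_id], sort=False)
             .agg(event_time_start=('event_time', 'min'),
                  event_time_end=('event_time', 'max'),
                  events_and_related_event_id_list=('event_id', list))
             .reset_index())
    # A chain of a single event is not an outbreak.
    out = out[out['events_and_related_event_id_list'].map(len) > 1]
    return out.drop(columns='outbreak_id').reset_index(drop=True)

# Sample data from the question.
df = pd.DataFrame(
    [
        ['2020-01-01 10:00', '1', 'A'],
        ['2020-01-01 10:01', '2', 'A'],
        ['2020-01-01 10:02', '3a', 'A'],
        ['2020-01-01 10:02', '3b', 'B'],
        ['2020-01-01 10:30', '4', 'B'],
        ['2020-01-01 10:50', '5', 'B'],
        ['2020-01-01 10:54', '6', 'B'],
        ['2020-01-01 10:55', '7', 'B'],
    ], columns=['event_time', 'event_id', 'Address']
)
df['event_time'] = pd.to_datetime(df['event_time'])

result = find_outbreaks(df)
print(result)
```

On the sample data this yields one row for address A (events 1, 2, 3a) and one for address B (events 6, 7). Unlike the broadcast version, it can report more than one outbreak per address, and it only needs a sort plus a few vectorised passes.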

#2 vsaztqbk

You can apply this solution per group:

# solution matching dates within range of date in row by jezrael
df['event_time'] = pd.to_datetime(df['event_time'])

def f(g):
    td = pd.Timedelta("1m")
    f = lambda x, y: g.loc[g['event_time'].between(y - td, y + td),
                            'event_id'].drop(x).tolist()
    g['related_event_id_list'] = [f(k, v) for k, v in g['event_time'].items()]
    return g

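# group_keys=False keeps the original flat index (pandas >= 2.0 would otherwise prepend Address to it)
df = df.groupby('Address', group_keys=False).apply(f)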
print (df)
           event_time event_id Address related_event_id_list
0 2020-01-01 10:00:00        1       A                   [2]
1 2020-01-01 10:01:00        2       A               [1, 3a]
2 2020-01-01 10:02:00       3a       A                   [2]
3 2020-01-01 10:02:00       3b       B                    []
4 2020-01-01 10:30:00        4       B                    []
5 2020-01-01 10:50:00        5       B                    []
6 2020-01-01 10:54:00        6       B                   [7]
7 2020-01-01 10:55:00        7       B                   [6]

For the next step, use GroupBy.agg, with groups formed by runs of consecutive non-empty lists in the related_event_id_list column:

m = df['related_event_id_list'].astype(bool)

f1 = lambda x: list(dict.fromkeys([z for y in x for z in y]))

df = (df[m].groupby([(~m).cumsum(),'Address'])
           .agg(event_time_start=('event_time','min'),
                event_time_end=('event_time','max'),
                events_and_related_event_id_list=('related_event_id_list',f1))
           .droplevel(0)
           .reset_index())
print (df)
  Address    event_time_start      event_time_end  \
0       A 2020-01-01 10:00:00 2020-01-01 10:02:00   
1       B 2020-01-01 10:54:00 2020-01-01 10:55:00   

  events_and_related_event_id_list  
0                       [2, 1, 3a]  
1                           [7, 6]
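
The grouping key `(~m).cumsum()` used above may be easier to follow in isolation. The sketch below hard-codes the boolean mask that the eight sample rows would produce (an assumption based on the printed related_event_id_list column) and shows which label each kept row receives; `run_label` is an illustrative name.

```python
import pandas as pd

# True where related_event_id_list is non-empty, mirroring the mask `m`
# for the eight sample rows.
m = pd.Series([True, True, True, False, False, False, True, True])

# Every False value increments the counter, so each run of consecutive
# True values shares one label that no other run has.
run_label = (~m).cumsum()

# Only the True rows are kept by df[m]; these are their group labels.
print(run_label[m].tolist())  # [0, 0, 0, 3, 3]
```

Rows 0 to 2 share label 0 and rows 6 to 7 share label 3, which gives the two groups that GroupBy.agg then collapses into one outbreak each.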
