python-3.x: How do I check for duplicates 24 hours apart in a Pandas DataFrame?

klsxnrf1  posted on 2023-10-21  in Python

I have a DataFrame that comes from a SQL query and looks roughly like this setup:

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 3, 1, 2, 2, 2, 3, 1, 3, 4, 4, 5, 6, 7],
    'ID_Type': ['A', 'C', 'A', 'B', 'B', 'B', 'C', 'A', 'C', 'A', 'A', 'B', 'C', 'C'],
    'Type_A_Value': [10, None, 10, None, None, None, None, 30, None, 55, 40, None, None, None],
    'Type_B_Value': [None, None, None, 10, 11, 26, None, None, None, None, None, 19, None, None],
    'Type_C_Value': [None, 4.3, None, None, None, None, 89, None, 12.3, None, None, None, 27, 55],
    'Datetime': ['2022-10-03 08:00:00', '2022-10-01 09:00:00', '2022-10-02 08:00:00',
                 '2022-10-01 11:00:00', '2022-10-02 11:00:00', '2022-10-02 13:00:00',
                 '2022-10-01 14:00:00', '2022-10-01 15:00:00', '2022-10-02 07:00:00',
                 '2022-10-01 14:00:00', '2022-10-01 15:00:00', '2022-10-02 07:00:00',
                 '2022-09-01 14:00:00', '2022-10-01 01:00:00'] })

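I am looking for a way to check for duplicated information that is exactly 24 hours apart. Duplicates are generally fine, but if the data from 10 AM on the 10th is identical to the data from 10 AM on the 11th, I want to flag it, perhaps with a boolean column.
To summarize what I have tried so far: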

  • Sort and index the DataFrame by datetime
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['ID_Type', 'ID', 'Datetime'])
df = df.set_index('Datetime')
  • Use df.shift (this does not work when Datetime is the index, and raises NotImplementedError when it is not the index)
df['24HoursAgo'] = df['Datetime'].shift(-1, freq='24H')
  • Create a boolean column that checks whether the values 24 hours apart are the same (flawed, because it also ignores the device ID)
grouped_df['SameValue24hoursApart'] = grouped_df['Type_A_Value'] == grouped_df['Type_A_Value'].shift(-1, freq='24H')
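
For context on the attempts above, here is a minimal sketch of what shift with freq does for a single device. With freq set, shift moves the index labels rather than the values, so it needs a DatetimeIndex, and the shifted result has to be realigned to the original timestamps before it can be compared. The Series `s` and the names `moved`, `previous_day` and `same_as_24h_ago` are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Hypothetical readings for one device, indexed by timestamp.
s = pd.Series(
    [10.0, 10.0, 30.0],
    index=pd.to_datetime(['2022-10-02 08:00', '2022-10-03 08:00', '2022-10-03 09:00']),
)

# shift(freq=...) keeps the values and moves every index label forward by 24 h.
moved = s.shift(freq=pd.Timedelta(hours=24))

# Realign to the original timestamps: each row now sees the value recorded
# exactly 24 h earlier, or NaN when there is no such reading.
# (reindex needs the timestamps in `moved` to be unique.)
previous_day = moved.reindex(s.index)

same_as_24h_ago = s.eq(previous_day)
print(same_as_24h_ago.tolist())
```

This only covers one device at a time; with several devices in one frame the same steps would have to run per ID, which is why a groupby or merge is probably the simpler route.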

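In essence, I need to group by device type and ID (or just by ID, since an ID cannot have more than one type) and then check whether that ID has duplicated data, in its respective type column, exactly 24 hours apart.
(Edited to include a duplicate example)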

yfwxisqw1#

There is no such data in your example, but you can indeed compare consecutive rows after sorting:

df['Datetime'] = pd.to_datetime(df['Datetime'])

df['24HoursAgo'] = (df
   .sort_values(['ID_Type', 'ID', 'Datetime'])
   .groupby(['ID_Type', 'ID'])['Datetime']
   .diff().eq(pd.Timedelta('24h'))
)

Output:

    ID ID_Type  Type_A_Value  Type_B_Value  Type_C_Value            Datetime  24HoursAgo
0    1       A          10.0           NaN           NaN 2022-10-03 08:00:00       False
1    3       C           NaN           NaN           4.3 2022-10-01 09:00:00       False
2    1       A          10.0           NaN           NaN 2022-10-02 10:00:00       False
3    2       B           NaN          10.0           NaN 2022-10-01 11:00:00       False
4    2       B           NaN          18.0           NaN 2022-10-01 12:00:00       False
5    2       B           NaN          26.0           NaN 2022-10-02 13:00:00       False
6    3       C           NaN           NaN          89.0 2022-10-01 14:00:00       False
7    1       A          30.0           NaN           NaN 2022-10-01 15:00:00       False
8    3       C           NaN           NaN          12.3 2022-10-02 07:00:00       False
9    4       A          55.0           NaN           NaN 2022-10-01 14:00:00       False
10   4       A          40.0           NaN           NaN 2022-10-01 15:00:00       False
11   5       B           NaN          19.0           NaN 2022-10-02 07:00:00       False
12   6       C           NaN           NaN          27.0 2022-09-01 14:00:00       False
13   7       C           NaN           NaN          55.0 2022-10-01 01:00:00       False
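
The flag above only says that the previous reading of the same device is exactly 24 hours older; it does not compare the values. Below is a minimal sketch of one way to add that check, assuming every row fills exactly one of the Type_*_Value columns, as in the question. The helper column `Value`, the flag name `SameValue24hApart` and the small sample frame are illustrative and not part of the original answer.

```python
import pandas as pd

# Small sample in the shape of the question's data (devices 1 and 2 only).
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2, 1],
    'ID_Type': ['A', 'A', 'B', 'B', 'B', 'A'],
    'Type_A_Value': [10, 10, None, None, None, 30],
    'Type_B_Value': [None, None, 10, 11, 26, None],
    'Datetime': pd.to_datetime([
        '2022-10-03 08:00:00', '2022-10-02 08:00:00', '2022-10-01 11:00:00',
        '2022-10-02 11:00:00', '2022-10-02 13:00:00', '2022-10-01 15:00:00']),
})

# Collapse the per-type columns into one value per row (assumes one is filled).
value_cols = ['Type_A_Value', 'Type_B_Value']
df['Value'] = df[value_cols].bfill(axis=1).iloc[:, 0]

ordered = df.sort_values(['ID_Type', 'ID', 'Datetime'])
groups = ordered.groupby(['ID_Type', 'ID'])

# The previous reading of the same device is exactly 24 h older ...
gap_is_24h = groups['Datetime'].diff().eq(pd.Timedelta('24h'))
# ... and carries the same value.
same_value = ordered['Value'].eq(groups['Value'].shift())

# Assignment aligns on the index, so the original row order is kept.
df['SameValue24hApart'] = gap_is_24h & same_value
print(df[['ID', 'Value', 'Datetime', 'SameValue24hApart']])
```

Note that this compares adjacent readings only: if a device has another reading between the two that are 24 hours apart, the pair is missed.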

Alternatively, if you want to use all columns in the comparison, you can add 24h to the datetime and perform a merge:

df['Datetime'] = pd.to_datetime(df['Datetime'])

out = df.merge(df.assign(Datetime=df['Datetime'].add(pd.Timedelta('24h')))
                 .reset_index(), how='left')

If there is any duplicate, this adds the index of the matching row in the "index" column:

    ID ID_Type  Type_A_Value  Type_B_Value  Type_C_Value            Datetime  index
0    1       A          10.0           NaN           NaN 2022-10-03 08:00:00    NaN
1    3       C           NaN           NaN           4.3 2022-10-01 09:00:00    NaN
2    1       A          10.0           NaN           NaN 2022-10-02 10:00:00    NaN
3    2       B           NaN          10.0           NaN 2022-10-01 11:00:00    NaN
4    2       B           NaN          18.0           NaN 2022-10-01 12:00:00    NaN
5    2       B           NaN          26.0           NaN 2022-10-02 13:00:00    NaN
6    3       C           NaN           NaN          89.0 2022-10-01 14:00:00    NaN
7    1       A          30.0           NaN           NaN 2022-10-01 15:00:00    NaN
8    3       C           NaN           NaN          12.3 2022-10-02 07:00:00    NaN
9    4       A          55.0           NaN           NaN 2022-10-01 14:00:00    NaN
10   4       A          40.0           NaN           NaN 2022-10-01 15:00:00    NaN
11   5       B           NaN          19.0           NaN 2022-10-02 07:00:00    NaN
12   6       C           NaN           NaN          27.0 2022-09-01 14:00:00    NaN
13   7       C           NaN           NaN          55.0 2022-10-01 01:00:00    NaN
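
To turn the merge result into the boolean column the question asks for, one option is to test whether the index column was filled. Below is a minimal sketch on a small made-up frame; the column names `MatchIndex` and `DuplicateOf24hAgo` are illustrative. It relies on pandas matching NaN keys with each other in a merge, which is what lets the empty Type_*_Value columns take part in the comparison.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'ID_Type': ['A', 'A', 'B', 'B'],
    'Type_A_Value': [10, 10, None, None],
    'Type_B_Value': [None, None, 10, 11],
    'Datetime': pd.to_datetime([
        '2022-10-03 08:00:00', '2022-10-02 08:00:00',
        '2022-10-01 11:00:00', '2022-10-02 11:00:00']),
})

# Copy of the data moved 24 h forward; the old row labels become 'MatchIndex'.
shifted = (df.assign(Datetime=df['Datetime'] + pd.Timedelta('24h'))
             .rename_axis('MatchIndex')
             .reset_index())

# Left merge on all shared columns (ID, type, values and Datetime).
out = df.merge(shifted, how='left')

# A filled 'MatchIndex' means an identical row exists exactly 24 h earlier.
out['DuplicateOf24hAgo'] = out['MatchIndex'].notna()
print(out[['ID', 'Datetime', 'MatchIndex', 'DuplicateOf24hAgo']])
```

If the frame can contain several identical rows with the same timestamp, the left merge will repeat the matching rows, so it may be worth dropping duplicates from the shifted copy first.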
