pandas: how to calculate the date difference between two columns on different rows, with a condition?

yrefmtwq, posted on 2023-02-20 in Other

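Based on the sample DataFrame below, I want to calculate the difference between two datetimes for a given index, along with its cumulative value. The expected result is shown in the `diff_days` and `cum_diff_days` columns.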
| index | date_a | date_b | diff_days | cum_diff_days |
| ------ | ------ | ------ | ------ | ------ |
| 1 | 1/1/2023 | NaT | NaT | - |
| 1 | NaT | NaT | NaT | - |
| 1 | NaT | 3/1/2023 | 2 | 2 |
| 2 | 4/1/2023 | NaT | NaT | - |
| 2 | NaT | NaT | NaT | - |
| 2 | NaT | 6/1/2023 | 2 | 4 |
| 3 | 7/1/2023 | NaT | NaT | - |
| 3 | NaT | 8/1/2023 | 1 | 5 |
| 3 | 9/1/2023 | NaT | NaT | - |
| 3 | NaT | NaT | NaT | - |
| 3 | NaT | 11/1/2023 | 2 | 7 |
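I have checked another post that calculates the difference between two dates, but unfortunately it covers the case where both dates are on the same row. In my case, I want to understand how to calculate the difference when the dates sit on different rows in different columns, because simply subtracting them with `df['diff_days'] = df['date_a']-df['date_b']` produces NaT. I would appreciate any pointers on this.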
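To illustrate why the plain subtraction returns only NaT, here is a minimal sketch using the first three rows of the sample. The dates are assumed to be day/month/year, and the variable name `naive` is only illustrative. pandas subtracts the two columns row by row, and on every row at least one of the two values is missing.

```python
import pandas as pd

# First three rows of the sample data; dates assumed to be day/month/year.
df = pd.DataFrame({
    "date_a": pd.to_datetime(["1/1/2023", None, None], format="%d/%m/%Y"),
    "date_b": pd.to_datetime([None, None, "3/1/2023"], format="%d/%m/%Y"),
})

# Subtraction is row-aligned: each row has a missing operand,
# so every element of the result is NaT.
naive = df["date_b"] - df["date_a"]
print(naive)
```

Any working approach therefore has to bring the matching `date_a` onto the same row as `date_b` (or the other way round) before subtracting.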

qybjjes1 #1

Try this:

# python 3.10.6
from io import StringIO
import pandas as pd  # 1.5.1

string = """index   date_a  date_b  diff_days   cum_diff_days
1   1/1/2023    NaT NaT -
1   NaT NaT NaT -
1   NaT 3/1/2023    2   2
2   4/1/2023    NaT NaT -
2   NaT NaT NaT -
2   NaT 6/1/2023    2   4
3   7/1/2023    NaT NaT -
3   NaT 8/1/2023    1   5
3   9/1/2023    NaT NaT -
3   NaT NaT NaT -
3   NaT 11/1/2023   2   7"""

# the sample text above is whitespace-separated, so split on runs of whitespace
df = pd.read_csv(StringIO(string), sep=r"\s+")

# convert to datetime
df["date_a"] = pd.to_datetime(df.date_a, format="%d/%m/%Y")
df["date_b"] = pd.to_datetime(df.date_b, format="%d/%m/%Y")

# forward-fill `df.date_a` and subtract from `df.date_b`
# then get `.days` attribute to convert to numeric
df["diff_days"] = df.date_b.sub(df.date_a.ffill()).dt.days

# cumulative sum the differences
df["cum_diff_days"] = df.diff_days.cumsum()

# optionally fill the nulls with "-"
df[["diff_days", "cum_diff_days"]] = df[
    ["diff_days", "cum_diff_days"]
].fillna("-")

print(df)
    index     date_a     date_b diff_days cum_diff_days
0       1 2023-01-01        NaT         -             -
1       1        NaT        NaT         -             -
2       1        NaT 2023-01-03       2.0           2.0
3       2 2023-01-04        NaT         -             -
4       2        NaT        NaT         -             -
5       2        NaT 2023-01-06       2.0           4.0
6       3 2023-01-07        NaT         -             -
7       3        NaT 2023-01-08       1.0           5.0
8       3 2023-01-09        NaT         -             -
9       3        NaT        NaT         -             -
10      3        NaT 2023-01-11       2.0           7.0
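One thing to keep in mind: the forward fill above runs over the whole column, so a `date_a` from one `index` value would be carried into the next one if that group had a `date_b` but no `date_a` of its own. The sample data does not hit this case. If it matters for your data, one possible variant (an assumption, not something the question asks for) is to forward-fill within each `index` group. The data below is hypothetical and only built to show the difference.

```python
import pandas as pd

# Hypothetical data: index 2 has a date_b but no date_a of its own.
df = pd.DataFrame({
    "index": [1, 1, 2, 2],
    "date_a": pd.to_datetime(["1/1/2023", None, None, None], format="%d/%m/%Y"),
    "date_b": pd.to_datetime([None, "3/1/2023", None, "6/1/2023"], format="%d/%m/%Y"),
})

# Forward-fill date_a only among rows that share the same "index" value.
filled_a = df.groupby("index")["date_a"].ffill()

df["diff_days"] = (df["date_b"] - filled_a).dt.days
df["cum_diff_days"] = df["diff_days"].cumsum()
print(df)
```

With a plain `df["date_a"].ffill()`, the last row would instead be measured against the `date_a` of index 1.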


tjvv9vkg #2

You can use `to_datetime`, then `where` + `bfill` to form a grouper, and finally `groupby.agg` and `join`:

# ensure datetime
df[['date_a', 'date_b']] = df[['date_a', 'date_b']].apply(pd.to_datetime, dayfirst=True)

# form grouper based on backfilled date_b
# and use the index as group value
grp = df.index.to_series().where(df['date_b'].notna()).bfill()

# get the first date_a / last date_b (you can also get min/max, first/first…)
# compute the sum and cumsum
# join to original DataFrame
out = df.join(
 df.groupby(grp).agg({'date_a': 'first', 'date_b': 'last'})
   .assign(diff_days=lambda d: d['date_b'].sub(d['date_a']).dt.days,
           cum_diff_days=lambda d: d['diff_days'].cumsum()
          )[['diff_days', 'cum_diff_days']]
)

print(out)

Output:

      index     date_a     date_b  diff_days  cum_diff_days
0.0       1 2023-01-01        NaT        NaN            NaN
1.0       1        NaT        NaT        NaN            NaN
2.0       1        NaT 2023-01-03        2.0            2.0
3.0       2 2023-01-04        NaT        NaN            NaN
4.0       2        NaT        NaT        NaN            NaN
5.0       2        NaT 2023-01-06        2.0            4.0
6.0       3 2023-01-07        NaT        NaN            NaN
7.0       3        NaT 2023-01-08        1.0            5.0
8.0       3 2023-01-09        NaT        NaN            NaN
9.0       3        NaT        NaT        NaN            NaN
10.0      3        NaT 2023-01-10        1.0            6.0
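To make the grouper step easier to follow, here is a small sketch that builds it for the first six rows of the sample only. Dates are assumed to be day/month/year, and the name `labels` is illustrative.

```python
import pandas as pd

# date_b for the first six sample rows (day/month/year assumed).
date_b = pd.Series(pd.to_datetime(
    [None, None, "3/1/2023", None, None, "6/1/2023"], format="%d/%m/%Y"
))

# Keep the row label only where date_b is present, then back-fill upward,
# so every row is tagged with the label of the next row that has a date_b.
labels = date_b.index.to_series()
grp = labels.where(date_b.notna()).bfill()
print(grp.tolist())  # [2.0, 2.0, 2.0, 5.0, 5.0, 5.0]
```

Each run of rows ending in a `date_b` gets one group key. The keys become floats because `where` introduces NaN, which is presumably why the joined output above shows a float index.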
of1yzvn4 #3

Proposed script (for testing)

import pandas as pd

df = pd.DataFrame({'date_a': ["1/1/2023", pd.NaT, pd.NaT, "4/1/2023", pd.NaT, pd.NaT,
                   "7/1/2023", pd.NaT, "9/1/2023", pd.NaT, pd.NaT],
                   'date_b': [pd.NaT, pd.NaT, "3/1/2023", pd.NaT, pd.NaT, "6/1/2023",
                   pd.NaT, "8/1/2023", pd.NaT, pd.NaT, "11/1/2023"],
})

r = df.drop_duplicates(keep=False).copy()
r['date_a'] = r['date_a'].shift(1)
r = r.drop_duplicates(keep=False)

r['diff_days'] = (pd.to_datetime(r['date_b'], dayfirst=True) 
                  - pd.to_datetime(r['date_a'], dayfirst=True)).dt.days
r['cum_diff_days'] = r['diff_days'].cumsum()

df = df.join(r[['diff_days', 'cum_diff_days']], how='left')
df['cum_diff_days'] = df['cum_diff_days'].fillna('-') # optional

print(df)

Result

      date_a     date_b  diff_days cum_diff_days
0   1/1/2023        NaT        NaN             -
1        NaT        NaT        NaN             -
2        NaT   3/1/2023        2.0           2.0
3   4/1/2023        NaT        NaN             -
4        NaT        NaT        NaN             -
5        NaT   6/1/2023        2.0           4.0
6   7/1/2023        NaT        NaN             -
7        NaT   8/1/2023        1.0           5.0
8   9/1/2023        NaT        NaN             -
9        NaT        NaT        NaN             -
10       NaT  11/1/2023        2.0           7.0

Note that `date_a` and `date_b` keep their original types for further computation.
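A quick way to check this, sketched on two sample rows only (the name `parsed_b` is illustrative): `pd.to_datetime` returns a new Series, so the column stored in the DataFrame is not converted unless you assign the result back.

```python
import pandas as pd

# Two sample rows; the raw columns hold strings and NaT, as in the script above.
df = pd.DataFrame({
    "date_a": ["1/1/2023", pd.NaT],
    "date_b": [pd.NaT, "3/1/2023"],
})

# Parsing returns a new Series; df itself is left untouched.
parsed_b = pd.to_datetime(df["date_b"], dayfirst=True)

print(pd.api.types.is_datetime64_any_dtype(parsed_b))      # parsed copy
print(pd.api.types.is_datetime64_any_dtype(df["date_b"]))  # original column
```

If you would rather work with real datetimes afterwards, assign the parsed Series back to the column instead.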
