I would like to create an array of datetime durations with pandas. For example:

import pandas as pd
import itertools
import numpy as np

df = pd.DataFrame({
    'ts':   [1, 5, 10, 12],
    'dur':  [1, 2,  6, 6],
})
print(df)

     ts  dur
 0   1    1
 1   5    2
 2  10    6
 3  12    6

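I need to use buckets of 3, so:

[Figure: plot of the durations as lines along the time axis (image not preserved)]

The data: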

bin1, 1
bin2, 2
bin3, 0
bin4, 4
bin5, 6
bin6, 2
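
To make the expected numbers concrete, here is a brute-force sketch of how I read the requirement: every unit of time covered by an interval adds 1 to the bin it falls in, with bins of width 3 counted from the earliest `ts`. The names `origin` and `counts` are only for illustration, and this is a check of the expected output, not a proposed solution.

```python
ts = [1, 5, 10, 12]
dur = [1, 2, 6, 6]
span = 3

origin = min(ts)                                  # bins start at the earliest ts
last_end = max(t + d for t, d in zip(ts, dur))    # latest end, exclusive
n_bins = -(-(last_end - origin) // span)          # ceiling division

counts = [0] * n_bins
for t, d in zip(ts, dur):
    for tick in range(t, t + d):                  # each covered unit of time
        counts[(tick - origin) // span] += 1

print(counts)  # [1, 2, 0, 4, 6, 2]
```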

What is the right way to do this with pandas?

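Update: following up on @Pieree D's answer, it works well for the example above. If I use ns as the unit and the data is larger, I run into the following problem: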

import pandas as pd
import itertools
import numpy as np
import io

def bucket(df,ts,dur,span):
    freq = 'ns'  # could be 's' if preferred
    dfs = pd.to_timedelta(df[ts], unit=freq)  # start, inclusive
    dfe = df[ts] + df[dur]
    dfe = pd.to_timedelta(dfe, unit=freq)  # end, exclusive
    dfout = pd.concat([
        pd.Series(1, index=dfs),  # start: +1
        pd.Series(-1, index=dfe),  # end: -1
    ]).resample(freq).sum().cumsum().resample(
        f'{span}{freq}', origin='start',
    ).sum().reset_index(drop=True).rename_axis('bin')
    dfout = dfout.to_frame()
    dfout.columns= ['sum']
    dfout[ts] = df[ts].min() + dfout.index*span
    return dfout

csvdata = '''ts,dur
19318744574,391823
21320087699,527291
23322650667,345208
25325015510,355729
27327401707,356354
29329792123,464531
31332296861,408802
32596494257,1131354
32738075298,416459'''

df = pd.read_csv(io.StringIO(csvdata))
df = bucket(df,'ts','dur',50)
print(df)

Python reports an error:

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 100. GiB for an array with shape (13419747184,) and data type int64
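
For what it's worth, the size in the message is consistent with resampling the whole range at 1 ns, which needs one int64 per nanosecond between the earliest start and the latest end. A quick check of the arithmetic on the CSV above (the variable names are just for illustration):

```python
ts_min = 19318744574               # earliest ts in the CSV
end_max = 32738075298 + 416459     # latest ts + dur in the CSV
n_slots = end_max - ts_min + 1     # one slot per nanosecond, both endpoints included
n_bytes = n_slots * 8              # an int64 takes 8 bytes

print(n_slots)                     # 13419747184, the shape in the error
print(round(n_bytes / 2**30, 2))   # 99.98, i.e. about 100 GiB
```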

Maybe a sparse solution should be considered?

Update for the large case

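My original answer (below) needs O(max(ts + dur) - min(ts)) space (before the final resampling). When that is too large, use the sparse version below instead. Note that it skips the bins whose answer is 0. (Also note that, if speed is an issue as well, we could still do better in pure NumPy.)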

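import numpy as np

def bin_k(t, dur, binsize=3):
    # indices of the bins overlapped by the interval [t, t + dur)
    i, j = t // binsize, 1 + (t + dur - 1) // binsize
    return np.arange(i, j)

def bin_w(t, dur, binsize=3):
    # number of time units that [t, t + dur) contributes to each of those bins
    n = 1 + (t + dur - 1) // binsize - t // binsize
    a = np.repeat(binsize, n)
    adj0 = t % binsize  # part of the first bin before the interval starts
    adj1 = n * binsize - adj0 - dur  # part of the last bin after the interval ends
    a[0] -= adj0
    a[-1] -= adj1
    return a

def bucket(df, ts='ts', dur='dur', span=3):
    # shift the start column (whatever its name) so the earliest start is 0
    z = df.assign(**{ts: df[ts] - df[ts].min()})
    out = z.assign(
        bin=z[[ts, dur]].apply(lambda r: bin_k(*r, binsize=span), axis=1),
        val=z[[ts, dur]].apply(lambda r: bin_w(*r, binsize=span), axis=1),
    ).explode(['bin', 'val']).groupby('bin')['val'].sum()
    return out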
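
To see what `bin_k` and `bin_w` compute, here is the same arithmetic written out by hand for a single interval, the last row of the small example after shifting (t = 11, dur = 6, bins of width 3). The variable names `first_bin`, `stop_bin` and `weights` are mine, chosen for readability.

```python
t, dur, binsize = 11, 6, 3                 # interval [11, 17)

first_bin = t // binsize                   # 3: bin [9, 12) holds the start
stop_bin = 1 + (t + dur - 1) // binsize    # 6: one past the bin holding t = 16
bins = list(range(first_bin, stop_bin))

n = stop_bin - first_bin                   # 3 bins are touched
weights = [binsize] * n                    # start by assuming each bin is full
adj0 = t % binsize                         # 2 units of bin 3 lie before t = 11
adj1 = n * binsize - adj0 - dur            # 1 unit of bin 5 lies after t = 17
weights[0] -= adj0
weights[-1] -= adj1

print(bins)     # [3, 4, 5]
print(weights)  # [1, 3, 2]
```

The weights always add up to `dur`, which is a handy sanity check.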

On the OP's large example:

>>> bucket(df, span=50)
bin
0            50
1            50
2            50
3            50
4            50
             ..
268394939    50
268394940    50
268394941    50
268394942    50
268394943    33
Name: val, Length: 87961, dtype: object
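
As for doing better in pure NumPy, one possible sketch is below. It is not benchmarked, the function name `bucket_sparse_np` is hypothetical, and it assumes integer inputs with `dur > 0`. It builds one entry per (interval, bin) pair with `np.repeat`, computes the overlap of each interval with each bin it touches, then sums per distinct bin. Like the version above, it skips empty bins.

```python
import numpy as np

def bucket_sparse_np(ts, dur, span=3):
    """Hypothetical vectorized variant; assumes integer inputs and dur > 0."""
    ts = np.asarray(ts, dtype=np.int64)
    dur = np.asarray(dur, dtype=np.int64)
    t = ts - ts.min()                      # shift so the earliest start is 0
    first = t // span                      # first bin touched by each interval
    n = (t + dur - 1) // span - first + 1  # number of bins touched by each interval

    # One entry per (interval, bin) pair.
    row = np.repeat(np.arange(len(t)), n)
    offset = np.arange(n.sum()) - np.repeat(np.cumsum(n) - n, n)
    bins = first[row] + offset

    # Overlap of [t, t + dur) with the bin [bins * span, (bins + 1) * span).
    lo = np.maximum(t[row], bins * span)
    hi = np.minimum(t[row] + dur[row], (bins + 1) * span)

    # Sum the overlaps per distinct bin; empty bins never appear.
    ubins, inv = np.unique(bins, return_inverse=True)
    sums = np.zeros(len(ubins), dtype=np.int64)
    np.add.at(sums, inv, hi - lo)
    return ubins, sums

ubins, sums = bucket_sparse_np([1, 5, 10, 12], [1, 2, 6, 6], span=3)
print(ubins.tolist())  # [0, 1, 3, 4, 5]
print(sums.tolist())   # [1, 2, 4, 6, 2]
```

Memory is proportional to the total number of (interval, bin) pairs, the same order as the exploded frame in the pandas version.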


How does it work?

Let's take the OP's small example:

>>> z = df.assign(ts=df['ts'] - df['ts'].min())
>>> z.assign(
...     bin=z[['ts', 'dur']].apply(lambda r: bin_k(*r), axis=1),
...     val=z[['ts', 'dur']].apply(lambda r: bin_w(*r), axis=1),
... )
   ts  dur        bin        val
0   0    1        [0]        [1]
1   4    2        [1]        [2]
2   9    6     [3, 4]     [3, 3]
3  11    6  [3, 4, 5]  [1, 3, 2]

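In other words, using simple arithmetic we build the bins and the values for each row, each as a NumPy array.
After that, we .explode() those columns, then group by bin and sum.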
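
The explode-then-group step can be tried on its own with the per-row lists from the table above typed in by hand. This is only a sketch: exploding several columns at once needs pandas 1.3 or later, and the cast to int64 is my addition so the sums come out as integers rather than objects.

```python
import pandas as pd

# Per-row bins and values, copied from the small example.
per_row = pd.DataFrame({
    'bin': [[0], [1], [3, 4], [3, 4, 5]],
    'val': [[1], [2], [3, 3], [1, 3, 2]],
})

# One row per (bin, val) pair, then add up the values that share a bin.
exploded = per_row.explode(['bin', 'val']).astype({'bin': 'int64', 'val': 'int64'})
result = exploded.groupby('bin')['val'].sum()

print(result.index.tolist())  # [0, 1, 3, 4, 5]
print(result.tolist())        # [1, 2, 4, 6, 2]
```

Bin 2 is absent because no interval touches it, which is the "skips bins containing 0" behaviour mentioned earlier.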

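Original answer

The plot was key for me to understand what you are looking for. You want to define the coverage of the intervals, then group it into bins.
First the code, then the explanation below.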

freq = 'D'  # could be 's' if preferred
ts = pd.to_timedelta(df['ts'], unit=freq)  # start, inclusive
te = pd.to_timedelta(df['ts'] + df['dur'], unit=freq)  # end, exclusive
z = pd.concat([
    pd.Series(1, index=ts),  # start: +1
    pd.Series(-1, index=te),  # end: -1
]).resample(freq).sum().cumsum().resample(
    f'3{freq}', origin='start',
).sum().reset_index(drop=True).rename_axis('bin')

>>> z
bin
0    1
1    2
2    0
3    4
4    6
5    2


Explanation

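We encode each (ts, dur) pair as (ts, te): start (inclusive) and end (exclusive). At each start we add 1, and at each end we subtract 1. At this point, we can resample at 1 unit of freq: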

z = pd.concat([
    pd.Series(1, index=ts),  # start: +1
    pd.Series(-1, index=te),  # end: -1
]).resample(freq).sum()

>>> z.to_frame('change').assign(cover=z.cumsum())
         change  cover
1 days        1      1
2 days       -1      0
3 days        0      0
4 days        0      0
5 days        1      1
6 days        0      1
7 days       -1      0
8 days        0      0
9 days        0      0
10 days       1      1
11 days       0      1
12 days       1      2
13 days       0      2
14 days       0      2
15 days       0      2
16 days      -1      1
17 days       0      1
18 days      -1      0

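The +1 and -1 at the starts and ends represent changes in cover, and their .cumsum() is the cover itself. As you can see, it corresponds to the number of blue lines above each point on the axis. Finally, we resample to the bins we want.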
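
The same +1/-1 idea can be written without resample, as a plain NumPy difference array. This is a sketch for the small example only; like the pandas version it is dense, so it needs one slot per unit of time over the whole range. The names `change`, `cover` and `bins` mirror the explanation above and are otherwise my own.

```python
import numpy as np

ts = np.array([1, 5, 10, 12])
dur = np.array([1, 2, 6, 6])
span = 3

start = ts - ts.min()                 # inclusive starts, shifted so the first is 0
end = start + dur                     # exclusive ends

change = np.zeros(end.max() + 1, dtype=np.int64)
np.add.at(change, start, 1)           # +1 where an interval starts
np.add.at(change, end, -1)            # -1 where an interval ends
cover = change.cumsum()               # number of intervals covering each unit

# Pad to a multiple of span, then sum each group of `span` units.
pad = -len(cover) % span
bins = np.pad(cover, (0, pad)).reshape(-1, span).sum(axis=1)

print(cover.tolist())  # [1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 2, 2, 2, 2, 1, 1, 0]
print(bins.tolist())   # [1, 2, 0, 4, 6, 2]
```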
