python-3.x: Grouping by time interval and additional attributes

rta7y2nd  posted on 2022-12-20  in Python

I have this data:

import pandas as pd

data = {
    'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33', '2022-11-03 00:00:35', '2022-11-03 00:00:46', '2022-11-03 00:01:21', '2022-11-03 00:01:30'],
    'from': ['A', 'A', 'A', 'A', 'B', 'C'],
    'to': ['B', 'B', 'B', 'C', 'C', 'B'],
    'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']
}

df = pd.DataFrame(data)

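I want to create two sets of CSVs:
1. One CSV per vehicle type (8 in total), with rows grouped/aggregated by timestamp (15-minute intervals over the whole day) and the "from" column; there is no "to" column here.
2. One CSV per vehicle type (8 in total), with rows grouped/aggregated by timestamp (15-minute intervals over the whole day), the "from" column and the "to" column.
The difference between the two sets is that one counts all FROM items, while the other groups them and counts by FROM and TO pairs.
The output would be the total number of vehicles of a given type in each 15-minute interval, aggregated by the "from" column and by the combination of the "from" and "to" columns.
The first output for each vehicle type would look like this: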

Second output:

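I tried using Pandas groupby() and resample(), but with my limited knowledge I had no success. I can do this in Excel, but it is very inefficient. I want to learn more Python and work more efficiently, so I would like to code this in Pandas.
I tried df.groupby(['FROM', 'TO']).count(), but I lack the knowledge to turn it into what I need. I keep getting errors when I do something I shouldn't, or the output is not what I need.
I tried df.groupby(pd.Grouper(freq='15Min', )).count(), but it seems I may have an incorrect data type.
I don't know whether this is even applicable.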
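
Two things are likely behind the errors described above. The sample columns are lowercase (`from`, `to`), so `df.groupby(['FROM', 'TO'])` raises a `KeyError`, and `pd.Grouper(freq=...)` needs real datetimes, either in the index or in a column named via `key`. A minimal sketch on a shortened copy of the data (the variable names `per_interval` and `per_pair` are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33', '2022-11-03 00:20:21'],
    'from': ['A', 'A', 'B'],
    'to': ['B', 'C', 'C'],
})

# The timestamps are plain strings at this point; convert them to datetimes first.
df['timestamp'] = pd.to_datetime(df['timestamp'])

# `key` tells Grouper which column holds the datetimes
# (without it, Grouper looks at the index, which is a plain RangeIndex here).
per_interval = df.groupby(pd.Grouper(key='timestamp', freq='15min'))['from'].count()
print(per_interval)

# Column names are case-sensitive: use 'from'/'to', not 'FROM'/'TO'.
per_pair = df.groupby(['from', 'to']).size()
print(per_pair)
```

Here `'15min'` is used instead of `'15Min'` only because it is the alias spelling that current pandas documents.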


5anewei6  #1

If I understand correctly, one approach could be:

Data

import pandas as pd

# IIUC, you want e.g. '2022-11-03 00:00:06' to be in the `00:15` bucket, we need `to_offset`
from pandas.tseries.frequencies import to_offset

# adjusting last 2 timestamps to get a diff interval group
data = {'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33', 
                      '2022-11-03 00:00:35', '2022-11-03 00:00:46', 
                      '2022-11-03 00:20:21', '2022-11-03 00:21:30'], 
        'from': ['A', 'A', 'A', 'A', 'B', 'C'],
        'to': ['B', 'B', 'B', 'C', 'C', 'B'],
        'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']}

df = pd.DataFrame(data)

print(df)

             timestamp from to type
0  2022-11-03 00:00:06    A  B  Car
1  2022-11-03 00:00:33    A  B  Car
2  2022-11-03 00:00:35    A  B  Van
3  2022-11-03 00:00:46    A  C  Car
4  2022-11-03 00:20:21    B  C  HGV
5  2022-11-03 00:21:30    C  B  Van

# e.g. for FROM we want:        `A`, `4` (COUNT), `00:15` (TIME-END)
# e.g. for FROM-TO we want:     `A-B`, 3 (COUNT), `00:15` (TIME-END)
#                               `A-C`, 1 (COUNT), `00:15` (TIME-END)
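
To make the bucketing in the comments above concrete, here is a small sketch, assuming each row should be labelled with the end of the 15-minute interval it falls into. Flooring a timestamp to 15 minutes and adding 15 minutes gives that label. This is only a check of the idea, not the approach used in the code further down; `bucket_end` and `labels` are illustrative names.

```python
import pandas as pd

ts = pd.to_datetime(pd.Series([
    '2022-11-03 00:00:06',
    '2022-11-03 00:00:46',
    '2022-11-03 00:20:21',
]))

# Floor to the start of the 15-minute interval, then move to its end.
bucket_end = ts.dt.floor('15min') + pd.Timedelta(minutes=15)

labels = bucket_end.dt.strftime('%H:%M').tolist()
print(labels)  # ['00:15', '00:15', '00:30']
```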

Code

# convert time strings to datetime and set column as index
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)

# add a 15-minute offset to the datetime vals
# (`15min` is the current alias; the older `15T` is deprecated in recent pandas)
df.index = df.index + to_offset('15min')

# create `dict` for conversion of `col names`
cols = {'timestamp': 'TIME-END', 'from': 'FROM', 'to': 'TO'}

# we're doing basically the same for both outputs, so let's use a for loop on a nested list
nested_list = [['from'],['from','to']]

for item in nested_list:
    # groupby `item` (i.e. `['from']` and `['from','to']`)
    # use `.agg` to create named output (`COUNT`), applied to `item[0]`, so 2x  on: `from`
    # and get the `count`. Finally, reset the index
    out = df.groupby(item).resample('15min').agg(COUNT=(item[0],'count')).reset_index()
    
    # rename the columns using our `cols` dict
    out = out.rename(columns=cols)
    
    # convert timestamps like `2022-11-03 00:15:00` to `00:15:00`
    out['TIME-END'] = out['TIME-END'].dt.strftime('%H:%M:%S')
    
    # rearrange order of columns; for second `item` we need to include `to` (now: `TO`)
    if 'TO' in out.columns:
        out = out.loc[:, ['FROM', 'TO', 'COUNT', 'TIME-END']]
    else:
        out = out.loc[:, ['FROM', 'COUNT', 'TIME-END']]
        
    # write output to `csv file`; e.g. use an `f-string` to customize file name
    out.to_csv(f'output_{"_".join(item)}.csv', index=False) # i.e. 'output_from.csv', 'output_from_to.csv'
    # `index=False` avoids writing the index to the file

Output (loaded in Excel)
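
The loop above writes one file per grouping, while the question also asks for one CSV per vehicle type. A hedged sketch of how that split could be added follows. It swaps in a plain `groupby` on a precomputed interval label instead of `resample`, so only intervals that actually contain rows appear in the output (no zero-count rows). The output folder, the file-name pattern and the `results` dict are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

data = {'timestamp': ['2022-11-03 00:00:06', '2022-11-03 00:00:33',
                      '2022-11-03 00:00:35', '2022-11-03 00:00:46',
                      '2022-11-03 00:20:21', '2022-11-03 00:21:30'],
        'from': ['A', 'A', 'A', 'A', 'B', 'C'],
        'to': ['B', 'B', 'B', 'C', 'C', 'B'],
        'type': ['Car', 'Car', 'Van', 'Car', 'HGV', 'Van']}
df = pd.DataFrame(data)
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Label each row with the end of its 15-minute interval, e.g. 00:00:06 -> '00:15'.
interval_end = df['timestamp'].dt.floor('15min') + pd.Timedelta(minutes=15)
df['TIME-END'] = interval_end.dt.strftime('%H:%M')

# Hypothetical output folder; replace with a real path as needed.
out_dir = Path(tempfile.mkdtemp())

results = {}
for vehicle_type, part in df.groupby('type'):
    for keys in (['from'], ['from', 'to']):
        out = (part.groupby(keys + ['TIME-END'])
                   .size()                      # number of rows per group
                   .reset_index(name='COUNT')
                   .rename(columns={'from': 'FROM', 'to': 'TO'}))
        results[(vehicle_type, tuple(keys))] = out
        # e.g. output_Car_from.csv, output_Car_from_to.csv
        out.to_csv(out_dir / f'output_{vehicle_type}_{"_".join(keys)}.csv', index=False)

print(results[('Car', ('from', 'to'))])
```

With the six sample rows this yields six files (three vehicle types times two groupings). If rows for empty intervals are needed as well, the `resample` approach from the code above would have to be kept and applied per vehicle type instead.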

