pandas: How can I retrieve other columns from a DataFrame after having gone through a set of groupby operations?

Posted by ncgqoxb0 on 2023-06-04 in Other


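I have been working through some pandas problems to build a better foundation, and there is one issue I keep running into. Whenever I perform a groupby operation on a DataFrame and then chain a series of other operations to get what I need, I lose important information about my output (the other descriptive columns), and going back to retrieve it becomes a new, often more complicated, problem.
Take this problem as an example:
Given a phone log table that has information about callers' call history, find the callers whose first and last calls on a given day were to the same person. Output the caller ID, recipient ID, and the date called.
Input: caller_history: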

caller_id: int64
recipient_id: int64
date_called: datetime64[ns]

Here is the code I wrote for it:

import pandas as pd

caller_history['date_called'] = pd.to_datetime(caller_history['date_called'], format='%Y-%m-%d %H:%M:%S')

caller_history = caller_history.sort_values(by=['caller_id', 'date_called'])

grouping = caller_history.groupby(['caller_id', caller_history['date_called'].apply(lambda x: x.day)])
grouping.apply(lambda x: x['recipient_id'].iloc[0] == x['recipient_id'].iloc[-1]).reset_index()

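My thought process was:
1. Convert the date_called column to a format Python can understand.
2. Sort caller_history, first by caller_id (to bring all calls by a particular user together) and then by date_called.
3. Create groups in caller_history, first by caller_id to isolate users and then by day (this is what the call to apply does).
4. Finally, check whether the recipient_id of the first call and the last call within each group match.
This gives me "basically" the right solution. But now I have no way to retrieve recipient_id and date_called. The call to reset_index() is a last-ditch effort to recover as much as possible; in this case I can recover caller_id and the day part of date_called.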
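
A side note on step 3: grouping by the day of the month alone would put calls made on, say, January 5 and February 5 into the same group. The following is a minimal sketch of that difference, assuming two made-up timestamps; the names `dates`, `day_of_month`, and `calendar_day` are introduced here only for illustration.

```python
import pandas as pd

# Hypothetical timestamps: same day of the month, different months.
dates = pd.Series(pd.to_datetime(["2023-01-05 08:00:00", "2023-02-05 10:00:00"]))

# The day of the month is 5 for both rows, so they would share a group.
day_of_month = dates.dt.day

# Normalizing keeps the full calendar date (time set to midnight),
# so the two rows stay in separate groups.
calendar_day = dates.dt.normalize()

print(day_of_month.nunique(), calendar_day.nunique())  # prints: 1 2
```

Grouping on the normalized date also keeps the full date available after reset_index(), rather than only the day number.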

pandas

Source: https://stackoverflow.com/questions/76348336/how-can-i-retrieve-other-columns-from-a-dataframe-after-having-gone-through-a

2 Answers


#1 zbdgwd5y

Instead of converting to the day, you can group by date with pd.Grouper, which preserves the whole date information:

caller_history.groupby(
    ["caller_id", pd.Grouper(key="date_called", freq="D")])\
    .agg(first=("recipient_id", "first"),
         match=("recipient_id", "last"))\
        .diff(axis=1)["match"].eq(0)\
            .reset_index(drop=False)

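Line by line:
1. Group by "caller_id" and the date.
2. Apply the first and last aggregations, naming the columns "first" and "match".
3. Take the difference between the two and check whether it equals 0.
4. Reset the index to turn the group keys back into columns.
If you also want to return the list of recipient_id values for each date, you can change the code to: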

caller_history.groupby(
    ["caller_id", pd.Grouper(key="date_called", freq="D")])\
    .agg(first=("recipient_id", "first"),
         last=("recipient_id", "last"),
         ids=("recipient_id", list))\
        .assign(match=(lambda x: x["first"] - x["last"] == 0))\
            [["ids", "match"]].reset_index(drop=False)

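Here .assign adds a new column: "match" is a boolean that is True when the "first" and "last" columns are equal, and "ids" holds the recipient IDs for that group as a list.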
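
To try the pd.Grouper approach on concrete data and get the recipient ID back as a regular column, here is a minimal sketch. The sample values are made up, and the names `per_day` and `matches` are introduced only for this example; it keeps the groups whose first and last recipient agree and reuses the "first" column as recipient_id.

```python
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
caller_history = pd.DataFrame({
    "caller_id": [1, 1, 1, 2, 2, 1],
    "recipient_id": [10, 11, 10, 20, 21, 12],
    "date_called": pd.to_datetime([
        "2023-01-05 08:00:00",
        "2023-01-05 12:00:00",
        "2023-01-05 20:00:00",
        "2023-01-05 09:00:00",
        "2023-01-05 18:00:00",
        "2023-02-05 10:00:00",
    ]),
}).sort_values(["caller_id", "date_called"])

# One row per (caller, calendar day) with the first and last recipient.
per_day = (
    caller_history
    .groupby(["caller_id", pd.Grouper(key="date_called", freq="D")])
    .agg(first=("recipient_id", "first"), last=("recipient_id", "last"))
)

# Keep the days where both agree; the shared value is the recipient ID.
matches = (
    per_day[per_day["first"] == per_day["last"]]
    .rename(columns={"first": "recipient_id"})
    .drop(columns="last")
    .reset_index()
)
print(matches)
```

Note that a day with a single call also counts as a match in this sketch, because its first and last call are the same row. Adding a count aggregation and filtering on it would exclude those days.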

Posted 2023-06-04

#2 xe55xuns

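Consider using groupby().transform() to filter the call data into two sets: each caller's first call and last call of each day. Then join the two sets and compare their respective recipients. (In SQL terms, this is similar to using window functions followed by a self-join.)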

# ADD HELPER COLUMNS
# (uses the caller_history DataFrame from the question)
caller_history = (
    caller_history.assign(
        call_date = lambda df: df["date_called"].dt.normalize(),
        first_call = lambda df: df.groupby(
            ["caller_id", "call_date"]
        )["date_called"].transform("min"),
        last_call = lambda df: df.groupby(
            ["caller_id", "call_date"]
        )["date_called"].transform("max"),
        num_calls = lambda df: df.groupby(
            ["caller_id", "call_date"]
        )["date_called"].transform("count")
    )
)

# JOIN FIRST AND LAST CALL FILTERED DATASETS
# (filter on the first_call / last_call helper columns created above)
call_compare_df = (
    caller_history
        .query("date_called == first_call")
        .merge(
            caller_history.query("date_called == last_call"),
            on = ["caller_id", "call_date", "num_calls"],
            suffixes = ("_first", "_last")
        ).query("recipient_id_first == recipient_id_last")
        .query("num_calls > 1")
)
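
Because transform() returns a result aligned with the original rows, no columns are lost along the way. A more compact variant of the same idea is sketched below: it broadcasts the first and last recipient of each caller-day back onto every row and skips the self-join. The sample values are made up, and the names `calls`, `first_recipient`, `last_recipient`, and `result` are assumptions introduced for this example, not part of the answer above.

```python
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
caller_history = pd.DataFrame({
    "caller_id": [1, 1, 1, 2, 2, 1],
    "recipient_id": [10, 11, 10, 20, 21, 12],
    "date_called": pd.to_datetime([
        "2023-01-05 08:00:00",
        "2023-01-05 12:00:00",
        "2023-01-05 20:00:00",
        "2023-01-05 09:00:00",
        "2023-01-05 18:00:00",
        "2023-02-05 10:00:00",
    ]),
})

# Sort so that "first" and "last" within a group follow call time.
calls = caller_history.sort_values(["caller_id", "date_called"]).assign(
    call_date=lambda df: df["date_called"].dt.normalize()
)

# Broadcast per-group values back onto every original row.
grouped = calls.groupby(["caller_id", "call_date"])["recipient_id"]
calls = calls.assign(
    first_recipient=grouped.transform("first"),
    last_recipient=grouped.transform("last"),
    num_calls=grouped.transform("count"),
)

# Keep caller-days with more than one call whose first and last recipient match.
keep = (calls["first_recipient"] == calls["last_recipient"]) & (calls["num_calls"] > 1)
result = (
    calls.loc[keep, ["caller_id", "first_recipient", "call_date"]]
    .drop_duplicates()
    .rename(columns={"first_recipient": "recipient_id"})
    .reset_index(drop=True)
)
print(result)
```

The `num_calls > 1` filter mirrors the answer's choice to exclude days with a single call; dropping it would also return those days.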

Posted 2023-06-04
