Adjusting a pandas groupby transform

wmomyfyw  posted on 2023-10-14  in Other

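Below is my pandas code. When I group by on the masked frame, only the masked rows get the correct transformed value. However, I want every cell in value_transform to be filled. Is there a way to do this? See the expected output below.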

import pandas as pd

data = {
    'dates': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
    'flag': [True, True, False, False, False, True, True, False, True],
    'value': [10, 10, 20, 30, 40, 50, 50, 50, 60],
}

df = pd.DataFrame(data)
mask = df['flag'] == True
df['value_transform'] = df[mask].groupby('dates')['value'].transform(lambda x: max(x) - min(x))

Expected:
        dates   flag  value  value_transform
0  2023-01-01   True     10             40.0
1  2023-01-02   True     10              0.0
2  2023-01-03  False     20             10.0
3  2023-01-01  False     30             40.0
4  2023-01-02  False     40              0.0
5  2023-01-03   True     50             10.0
6  2023-01-01   True     50             40.0
7  2023-01-02  False     50              0.0
8  2023-01-03   True     60             10.0
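
For context, the gaps in the original attempt appear to come from index alignment: the transform runs only on the flagged rows, so its result carries only their index labels, and assigning it back leaves every other row as NaN. A minimal sketch using the question's data (the names `result` and `missing` are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'dates': ['2023-01-01', '2023-01-02', '2023-01-03'] * 3,
    'flag': [True, True, False, False, False, True, True, False, True],
    'value': [10, 10, 20, 30, 40, 50, 50, 50, 60],
})

mask = df['flag']

# The transform only sees the flagged rows, so its index is a subset of df.index
result = df[mask].groupby('dates')['value'].transform(lambda x: x.max() - x.min())
print(result.index.tolist())  # [0, 1, 5, 6, 8]

# Assignment aligns on the index; rows absent from `result` become NaN
df['value_transform'] = result
missing = df.index[df['value_transform'].isna()].tolist()
print(missing)  # [2, 3, 4, 7]
```

The answers below all work around this by making the per-date result available to every row, flagged or not.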

bakd9h0s1#

import pandas as pd

data = {
    'dates': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
    'flag': [True, True, False, False, False, True, True, False, True],
    'value': [10, 10, 20, 30, 40, 50, 50, 50, 60],
}

df = pd.DataFrame(data)

# Calculate max - min values for each 'dates' group where 'flag' is True
mask = df['flag'] == True
grouped = df[mask].groupby('dates')['value'].agg(lambda x: max(x) - min(x)).reset_index()
grouped.columns = ['dates', 'value_transform']

# Merge the per-date values back onto every row (flagged or not) of the original DataFrame
df = df.merge(grouped, on='dates', how='left')

# Dates with no flagged rows have no match in the merge and come back as NaN; fill those with 0
df['value_transform'] = df['value_transform'].fillna(0)

print(df)

        dates   flag  value  value_transform
0  2023-01-01   True     10               40
1  2023-01-02   True     10                0
2  2023-01-03  False     20               10
3  2023-01-01  False     30               40
4  2023-01-02  False     40                0
5  2023-01-03   True     50               10
6  2023-01-01   True     50               40
7  2023-01-02  False     50                0
8  2023-01-03   True     60               10

lvjbypge2#

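Use the flag column to run a where clause on the value column, group by dates, compute the minimum and maximum of each group, and assign the result back to the original DataFrame (note that df.assign returns a new DataFrame, so reassign it if you want to keep the column):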

grouped = df.value.where(df.flag).groupby(df.dates)
df.assign(value_transform = grouped.transform('max') - grouped.transform('min'))
        dates   flag  value  value_transform
0  2023-01-01   True     10             40.0
1  2023-01-02   True     10              0.0
2  2023-01-03  False     20             10.0
3  2023-01-01  False     30             40.0
4  2023-01-02  False     40              0.0
5  2023-01-03   True     50             10.0
6  2023-01-01   True     50             40.0
7  2023-01-02  False     50              0.0
8  2023-01-03   True     60             10.0
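
To see why this fills every row, it may help to look at the intermediate Series. The following is a minimal sketch using the question's data; the names `masked` and `value_transform` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'dates': ['2023-01-01', '2023-01-02', '2023-01-03'] * 3,
    'flag': [True, True, False, False, False, True, True, False, True],
    'value': [10, 10, 20, 30, 40, 50, 50, 50, 60],
})

# where() keeps the full index but replaces unflagged values with NaN
masked = df['value'].where(df['flag'])

# Every row still belongs to a 'dates' group, and 'max'/'min' skip NaN,
# so unflagged rows receive the result computed from the flagged ones
grouped = masked.groupby(df['dates'])
value_transform = grouped.transform('max') - grouped.transform('min')

print(value_transform.tolist())
# [40.0, 0.0, 10.0, 40.0, 0.0, 10.0, 40.0, 0.0, 10.0]
```

Because no rows are dropped before grouping, the result has the same index as df and can be assigned directly. A date with no flagged rows at all would come out as NaN.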

6mw9ycah3#

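IIUC, you want to fill the NaNs in value_transform according to the date. The simplest way is to sort and then bfill. This relies on every date having at least one flagged row; otherwise the fill would pull a value from the next date.
Code: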

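# Sorting by flag puts each date's False rows before its True rows,
# so bfill copies the value from the flagged rows; assign the result back
df = df.sort_values(['dates', 'flag']).bfill().sort_index()
print(df)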

Output:

        dates   flag  value  value_transform
0  2023-01-01   True     10             40.0
1  2023-01-02   True     10              0.0
2  2023-01-03  False     20             10.0
3  2023-01-01  False     30             40.0
4  2023-01-02  False     40              0.0
5  2023-01-03   True     50             10.0
6  2023-01-01   True     50             40.0
7  2023-01-02  False     50              0.0
8  2023-01-03   True     60             10.0
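
A plain bfill runs across date boundaries, so a date without any flagged row would inherit the next date's value. One possible variation, sketched here rather than taken from the answer, is to back-fill within each date group. The extra row for '2022-12-31' is hypothetical and added only to show the edge case.

```python
import pandas as pd

df = pd.DataFrame({
    'dates': ['2023-01-01', '2023-01-02', '2023-01-03'] * 3 + ['2022-12-31'],
    'flag': [True, True, False, False, False, True, True, False, True, False],
    'value': [10, 10, 20, 30, 40, 50, 50, 50, 60, 5],  # last row is hypothetical
})

# Reproduce the partially filled column from the question
mask = df['flag']
df['value_transform'] = (
    df[mask].groupby('dates')['value'].transform(lambda x: x.max() - x.min())
)

# Sort so each date's unflagged rows come first, then back-fill per date;
# the result keeps the original index labels, so assignment realigns it
filled = (
    df.sort_values(['dates', 'flag'])
      .groupby('dates')['value_transform']
      .bfill()
)
df['value_transform'] = filled

print(df)
```

With this variation the hypothetical '2022-12-31' row stays NaN instead of picking up 40.0 from '2023-01-01', which makes dates without flagged rows easy to spot.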

u3r8eeie4#

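If you can't rely on the order of the returned results, you can compute an aggregated view and use Series.map to create the new column (note that this is effectively a reduction + {merge, join}):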

import pandas as pd

data = {
    'dates': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
    'flag': [True, True, False, False, False, True, True, False, True],
    'value': [10, 10, 20, 30, 40, 50, 50, 50, 60],
}

df = pd.DataFrame(data)

mask = df['flag'] == True
aggregations = (
    df.loc[mask].groupby('dates')['value']
    .agg(['max', 'min']) # this will be faster than passing a lambda
    .eval('max - min')
)

df['value_transform'] = df['dates'].map(aggregations)

print(df)
#         dates   flag  value  value_transform
# 0  2023-01-01   True     10               40
# 1  2023-01-02   True     10                0
# 2  2023-01-03  False     20               10
# 3  2023-01-01  False     30               40
# 4  2023-01-02  False     40                0
# 5  2023-01-03   True     50               10
# 6  2023-01-01   True     50               40
# 7  2023-01-02  False     50                0
# 8  2023-01-03   True     60               10
>>> import pandas as pd
>>> pd.__version__
'1.5.3'
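
One behaviour worth knowing with the map approach: a date that has no flagged rows is missing from the aggregated Series, so map leaves NaN for it. The sketch below uses a small made-up frame (not the question's data) and plain column subtraction in place of eval; the name `spread` is only for illustration.

```python
import pandas as pd

# Hypothetical data: '2023-01-02' has no flagged rows
df = pd.DataFrame({
    'dates': ['2023-01-01', '2023-01-01', '2023-01-02'],
    'flag': [True, True, False],
    'value': [10, 50, 30],
})

# Per-date max and min over flagged rows only
flagged = df.loc[df['flag']].groupby('dates')['value'].agg(['max', 'min'])
spread = flagged['max'] - flagged['min']

# Look up each row's date in the aggregated Series; unknown dates map to NaN
df['value_transform'] = df['dates'].map(spread)

print(df)
```

Whether to keep that NaN or replace it (for example with fillna(0), as in the merge-based answer) depends on what a date without flagged rows should mean in your data.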
