Pandas:跨多索引列应用不同函数的简洁方法

hgncfbus  于 2023-02-07  发布在  其他
关注(0)|答案(3)|浏览(94)

我有一个多索引 Dataframe 。我想创建一个新列,它的值是其他列的函数。问题是函数在少数级别上是不同的。
为了做到这一点,我不得不手动定义层次数据集中每个叶级别的计算。这是不可取的,因为大多数级别使用相同的计算。
下面是我正在做的事情的一个例子,以及我目前是如何做的。数据和函数是为了简单而设计的--实际的用例要笨拙得多。

import pandas as pd
from io import StringIO

testdata = """
level1,level2,value1,value2
root1,child1,10,20
root1,child2,30,40
root1,child3,50,60
root1,child4,70,80
root1,child5,90,100
"""

df = pd.read_csv(StringIO(testdata), index_col=[0,1], header=[0])
print('Starting Point:')
print(df)

df = df.unstack('level2')
print('Unstacked Version allowing me to define a different function for each level.')
print(df)

# This is the bit I'd like to make simpler. Imagine there was 20 of these child levels and only
# the last 2 were special cases.
df[('derived', 'child1')] = df[('value1', 'child1')] + df[('value2', 'child1')] 
df[('derived', 'child2')] = df[('value1', 'child2')] + df[('value2', 'child2')] 
df[('derived', 'child3')] = df[('value1', 'child3')] + df[('value2', 'child3')] 
df[('derived', 'child4')] = 0.0
df[('derived', 'child5')] = df[('value1', 'child5')] * df[('value2', 'child5')]

print('Desired outcome:')
df = df.stack()
print(df)

输出:

Starting Point:
               value1  value2
level1 level2
root1   child1      10      20
       child2      30      40
       child3      50      60
       child4      70      80
       child5      90     100
Unstacked Version allowing me to define a different function for each level.
       value1                             value2
level2 child1 child2 child3 child4 child5 child1 child2 child3 child4 child5
level1
root1       10     30     50     70     90     20     40     60     80    100
Desired outcome:
               value1  value2  derived
level1 level2
root1   child1      10      20     30.0
       child2      30      40     70.0
       child3      50      60    110.0
       child4      70      80      0.0
       child5      90     100   9000.0
wvmv3b1j

wvmv3b1j1#

我们可以使用原始df而不进行堆叠:

from io import StringIO

testdata = """
level1,level2,value1,value2
root1,child1,10,20
root1,child2,30,40
root1,child3,50,60
root1,child4,70,80
root1,child5,90,100
"""

df = pd.read_csv(StringIO(testdata), index_col=[0,1], header=[0])

level2 = df.index.get_level_values('level2')

cond = [level2 == 'child5', level2 == 'child4']

result = [df.prod(axis=1), 0]

derived = np.select(cond, result, default = df.sum(axis=1))

df.assign(derived = derived)
               value1  value2  derived
level1 level2                         
root1  child1      10      20       30
       child2      30      40       70
       child3      50      60      110
       child4      70      80        0
       child5      90     100     9000
w6mmgewl

w6mmgewl2#

由于 "只有最后2个是特殊情况",您可以重置索引,对切片执行矢量化计算,然后恢复索引:

df = df.reset_index()
df.loc[df.index[:-2], 'derived'] = df['value1'] + df['value2']
df.loc[df.index[-2], 'derived'] = 0
df.loc[df.index[-1], 'derived'] = df.loc[df.index[-1], 'value1'] * df.loc[df.index[-1], 'value2']
df.set_index(['level1', 'level2'], inplace=True)

print(df)
value1  value2  derived
level1 level2                         
root   child1      10      20     30.0
       child2      30      40     70.0
       child3      50      60    110.0
       child4      70      80      0.0
       child5      90     100   9000.0
yqkkidmi

yqkkidmi3#

使用服装函数和lambda:

def func1(cols):
    return cols["value1"] + cols["value2"]
    
def func2(cols):
    return 0.0
    
def func3(cols):
    return cols["value1"] * cols["value2"]

df["derived"] = df.apply(lambda cols: func1(cols) if cols.name[1] != "child4" 
                            and cols.name[1] != "child5" else (func2(cols) 
                            if cols.name[1] == "child4" 
                            else func3(cols)), axis=1)
print(df)

你也可以选择使用一个预定义的字典来简化lambda函数:
x一个一个一个一个x一个一个二个x

相关问题