pandas / Python - efficient calculation where the end value of one row is the start value of another row

b09cbbtk  posted on 2023-04-10  in  Python

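I want to perform a simple calculation on a rolling basis, but I run into severe performance problems when I try to solve it with nested for loops. I need to run this kind of operation on very large data, but I have to use standard Python (including pandas). The values are floats and can be negative, zero, or positive.
I have a pd.DataFrame (df1) that contains, structured by some dimensions (let's call them key1 and key2), a start column, an end column, and some operation columns in between that are used to calculate the end column from the start column.
Basically, the simple logic is: start + plus - minus = end, where the end value of each row is the start value of the next row.
This needs to be done per combination of the two keys, i.e. separately for AX, AY and BX.
df2 shows the desired result, but I don't know how to get there efficiently, without exhausting my memory, when this task runs on a much larger table.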

import pandas as pd 
import numpy as np

df1 = pd.DataFrame(np.array([["A", "X", 3,6,4,0], ["A", "X", 0,2,10,0], ["A", "X", 0,9,3,0], ["A", "Y", 8,3,1,0], ["A", "Y", 0,2,3,0], ["B", "X", 4,4,2,0], ["B", "X", 0,1,0,0]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

>>> df1
  key1 key2 start plus minus end
0    A    X     3    6     4   0
1    A    X     0    2    10   0
2    A    X     0    9     3   0
3    A    Y     8    3     1   0
4    A    Y     0    2     3   0
5    B    X     4    4     2   0
6    B    X     0    1     0   0
    

df2 = pd.DataFrame(np.array([["A", "X", 3,6,4,5], ["A", "X", 5,2,10,-3], ["A", "X", -3,9,3,3], ["A", "Y", 8,3,1,10], ["A", "Y", 10,2,3,9], ["B", "X", 4,4,2,6], ["B", "X", 6,1,0,7]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

>>> df2
  key1 key2 start plus minus end
0    A    X     3    6     4   5
1    A    X     5    2    10  -3
2    A    X    -3    9     3   3
3    A    Y     8    3     1  10
4    A    Y    10    2     3   9
5    B    X     4    4     2   6
6    B    X     6    1     0   7
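
One detail about the sample code is worth spelling out: a NumPy array holds a single dtype, so mixing strings and numbers in np.array turns every value into a string, and the numeric columns then need converting before any arithmetic. Below is a minimal sketch that builds the same df1 values column by column instead. The float dtype is an assumption, chosen because the values are described as floats.

```python
import pandas as pd

# Same sample values as df1, built per column so each column keeps its own dtype
df1 = pd.DataFrame({
    "key1":  ["A", "A", "A", "A", "A", "B", "B"],
    "key2":  ["X", "X", "X", "Y", "Y", "X", "X"],
    "start": [3.0, 0.0, 0.0, 8.0, 0.0, 4.0, 0.0],
    "plus":  [6.0, 2.0, 9.0, 3.0, 2.0, 4.0, 1.0],
    "minus": [4.0, 10.0, 3.0, 1.0, 3.0, 2.0, 0.0],
    "end":   [0.0] * 7,
})

# The key columns stay as strings, the other four are float64
print(df1.dtypes)
```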

voj3qocg #1

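Code (+ some math)
keys = ['key1', 'key2']
num_cols = ['start', 'plus', 'minus', 'end']

# np.array stored every value as a string, so convert the numeric columns first
df1[num_cols] = df1[num_cols].apply(pd.to_numeric)

# Running totals of plus and minus within each (key1, key2) group
cs = df1.groupby(keys)[['plus', 'minus']].cumsum()
# First start value of each group, broadcast to every row of that group
start = df1.groupby(keys)['start'].transform('first')

df1['end'] = start + cs['plus'] - cs['minus']
Result (here rows 1 and 2 of the input have plus/minus values of 2/1 and 5/7)
  key1 key2  start  plus  minus  end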
0    A    X      3     6      4    5
1    A    X      0     2      1    6
2    A    X      0     5      7    4
3    A    Y      8     3      1   10
4    A    Y      0     2      3    9
5    B    X      4     4      2    6
6    B    X      0     1      0    7
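Explanation

Let's work out the value of each row using the following formulas

end1 = start1 + plus1 - minus1
end2 = end1 + plus2 - minus2
     = start1 + (plus1 + plus2) - (minus1 + minus2)
end3 = end2 + plus3 - minus3
     = start1 + (plus1 + plus2 + plus3) - (minus1 + minus2 + minus3)
...

If you look at the formulas, a clear pattern emerges: the end value equals the group's starting value, plus the cumulative sum of the plus column, minus the cumulative sum of the minus column.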
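
To make the pattern concrete, here is a small check on a single group. It is only an illustrative sketch: the numbers are the A/X rows from the question's df1, and the variable names are made up for the example. It compares the row-by-row recurrence with the cumulative-sum shortcut.

```python
import numpy as np

# Values for one group (the A/X rows of the question's df1)
start1 = 3.0
plus = np.array([6.0, 2.0, 9.0])
minus = np.array([4.0, 10.0, 3.0])

# Row-by-row recurrence: each end becomes the next start
ends_loop = []
current = start1
for p, m in zip(plus, minus):
    current = current + p - m
    ends_loop.append(float(current))

# Shortcut: first start + cumulative plus - cumulative minus
ends_cumsum = start1 + np.cumsum(plus) - np.cumsum(minus)

print(ends_loop)             # [5.0, -3.0, 3.0]
print(ends_cumsum.tolist())  # [5.0, -3.0, 3.0]
```

With real float data the two approaches add the numbers in a different order, so the results may differ in the last few bits. Comparing them with a tolerance (for example np.allclose) is safer than ==.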

v6ylcynt #2

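You can create a helper Series by subtracting the minus column from the plus column, take its cumulative sum per group, and add each group's first start value to get the final end column. Then, for the start column, use DataFrameGroupBy.shift and replace the first value of each group with the original value via Series.fillna:

# np.array stored every value as a string, so convert the numeric columns first
num_cols = ['start', 'plus', 'minus', 'end']
df1[num_cols] = df1[num_cols].apply(pd.to_numeric)

# Net change per row
plusminus = df1['plus'].sub(df1['minus'])
df1 = df1.assign(plusminus = plusminus)
g = df1.groupby(['key1','key2'])

# end = first start of the group + running net change
df1['end'] = g['plusminus'].cumsum().add(g['start'].transform('first'))
# start = previous end within the group; the first row keeps its original start
df1['start'] = g['end'].shift().fillna(df1['start']).astype(int)
df1 = df1.drop('plusminus', axis=1)
print (df1)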
  key1 key2  start  plus  minus  end
0    A    X      3     6      4    5
1    A    X      5     2      1    6
2    A    X      6     5      7    4
3    A    Y      8     3      1   10
4    A    Y     10     2      3    9
5    B    X      4     4      2    6
6    B    X      6     1      0    7

mrwjdhj3 #3

You can do this with a combination of astype, df.iterrows() and a for loop:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([["A", "X", 3,6,4,0], ["A", "X", 0,2,1,0], ["A", "X", 0,5,7,0], ["A", "Y", 8,3,1,0], ["A", "Y", 0,2,3,0], ["B", "X", 4,4,2,0], ["B", "X", 0,1,0,0]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

# Convert columns to integer
df[['start', 'plus', 'minus', 'end']] = df[['start', 'plus', 'minus', 'end']].astype(int)

# Start the row iterator
row_iterator = df.iterrows()
# take first item from row_iterator
_, last = next(row_iterator)
# Modify the first element
last['end'] = last['start'] + last['plus'] - last['minus']
df.loc[0, :] = last
# Iterate through the rest of the rows
for i, row in row_iterator:
    # Check the keys match
    if row['key1'] == last['key1'] and row['key2'] == last['key2']:
        # Add the end of last to the start of the next row
        row['start'] = last['end']
    # Calculate new end for row
    row['end'] = row['start'] + row['plus'] - row['minus']
    # Ensure the changes are shown in the original dataframe
    df.loc[i, :] = row
    # Last row is now the current row
    last = row

After running this, df is now:

  key1 key2  start  plus  minus  end
0    A    X      3     6      4    5
1    A    X      5     2      1    6
2    A    X      6     5      7    4
3    A    Y      8     3      1   10
4    A    Y     10     2      3    9
5    B    X      4     4      2    6
6    B    X      6     1      0    7

Note: your df2 has an error. Following the logic you describe, the start entry of row 5 should be 4, not 0.
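
If an explicit loop is preferred, iterating over plain Python lists is usually much cheaper than iterrows() combined with df.loc assignments, because no Series is built for each row. The following is only a sketch of that idea, not part of the answer above. The helper name fill_start_end is made up for illustration, and it assumes that rows with the same key pair are adjacent, as in the sample data.

```python
import pandas as pd

def fill_start_end(df):
    """Return a copy of df with start/end filled in by a single pass over the rows."""
    starts, ends = [], []
    last_key, last_end = None, None
    # Zipping column lists avoids creating a Series for every row
    for k1, k2, s, p, m in zip(df["key1"].tolist(), df["key2"].tolist(),
                               df["start"].tolist(), df["plus"].tolist(),
                               df["minus"].tolist()):
        # Same key pair as the previous row: carry the previous end forward
        if (k1, k2) == last_key:
            s = last_end
        e = s + p - m
        starts.append(s)
        ends.append(e)
        last_key, last_end = (k1, k2), e
    return df.assign(start=starts, end=ends)

# Same values as the df defined in this answer, with numeric dtypes
df = pd.DataFrame({
    "key1":  ["A", "A", "A", "A", "A", "B", "B"],
    "key2":  ["X", "X", "X", "Y", "Y", "X", "X"],
    "start": [3, 0, 0, 8, 0, 4, 0],
    "plus":  [6, 2, 5, 3, 2, 4, 1],
    "minus": [4, 1, 7, 1, 3, 2, 0],
    "end":   [0, 0, 0, 0, 0, 0, 0],
})
result = fill_start_end(df)
print(result)
```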

e4yzc0pl #4

Since the apply method works row by row, you can use it as follows, without any for loop:
import pandas as pd
import numpy as np

# create your df1
df1 = pd.DataFrame(np.array([["A", "X", 3,6,4,0], ["A", "X", 0,2,1,0], ["A", "X", 0,5,7,0], ["A", "Y", 8,3,1,0], ["A", "Y", 0,2,3,0], ["B", "X", 4,4,2,0], ["B", "X", 0,1,0,0]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

df1[['start','plus','minus','end']] = df1[['start','plus','minus','end']].astype(np.int32)

# a dictionary to follow-up keys and start value
d = {'AX': None,
     'AY': None,
     'BX': None}

def helper(row):
    # modify d inside this function
    global d
    # get key by concatenating key1+key2
    key = row.key1+row.key2
    # if key is already seen, use the stored value as start value
    # (compare with None so that a stored end value of 0 is not mistaken for "unseen")
    if d[key] is not None:
        start = d[key]
    # if key is unseen, use the df1 start value
    else:
        start=row.start
    
    # calculate end value
    end = start + row.plus - row.minus
    
    # store the end value in dictionary
    # so that it can be used as start in next corresponding row
    d[key] = end
    # update
    return start,end

# update df1 start and end row-wise
df1[['start','end']] = df1.apply(helper,axis=1,result_type='expand')

Finally, the updated df1 is equivalent to your df2.
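
A variation on the same idea, offered only as a sketch: the running totals can live in a local dict keyed by the (key1, key2) tuple, which avoids listing the key combinations up front and avoids the global. The function name rolling_start_end is hypothetical. Looking keys up with `in` rather than testing truthiness matters when a running total is exactly 0, which the sample below triggers on purpose by using -1 as the placeholder start. Like any stateful function passed to apply, it relies on each row being visited exactly once and in order, which I believe holds for recent pandas versions.

```python
import pandas as pd

def rolling_start_end(df):
    """Return a copy of df with start/end computed via a per-key running total."""
    running = {}  # (key1, key2) -> last end value seen for that key pair

    def helper(row):
        key = (row["key1"], row["key2"])
        # `in` distinguishes "unseen key" from a stored end value of 0
        start = running[key] if key in running else row["start"]
        end = start + row["plus"] - row["minus"]
        running[key] = end
        return pd.Series({"start": start, "end": end})

    out = df.copy()
    out[["start", "end"]] = df.apply(helper, axis=1)
    return out

# Small frame where the running total hits 0 in the middle of a group;
# -1 marks start values that are placeholders to be overwritten
df = pd.DataFrame({
    "key1":  ["A", "A", "A"],
    "key2":  ["X", "X", "X"],
    "start": [3, -1, -1],
    "plus":  [1, 2, 0],
    "minus": [4, 0, 1],
    "end":   [0, 0, 0],
})
result = rolling_start_end(df)
print(result)
```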
