pandas 如何有效地将函数应用于 Dataframe 中的每一行

snz8szmq 于 2023-04-19 发布在其他

关注(0)|答案(5)|浏览(192)

根据下表：

df = pd.DataFrame({'code':['100M','60M10N40M','5S99M','1S25I100M','1D1S1I200M']})

看起来像这样

code
0   100M
1   60M10N40M
2   5S99M
3   1S25I100M
4   1D1S1I200M

我想将code列字符串转换为数字，其中M、N、D分别等于（乘以1），I等于（乘以-1），S等于（乘以0）。
结果应该如下所示：

code       Val
0   100M        100     This is (100*1)
1   60M10N40M   110     This is (60*1)+(10*1)+(40*1)
2   5S99M       99      This is (5*0)+(99*1)
3   1S25I100M   75      This is (1*0)+(25*-1)+(100*1)
4   1D1S1I200M  200     This is (1*1)+(1*0)+(1*-1)+(200*1)

我写了下面的函数：

def String2Val(String):
    # Generate substrings
    sstrings = re.findall('.[^A-Z]*.', String)

    KeyDict = {'M':'*1','N':'*1','I':'*-1','S':'*0','D':'*1'}

    newlist = []
    for key, value in KeyDict.items():
        for i in sstrings:
            if key in i:
                p = i.replace(key, value)
                lp = eval(p)
                newlist.append(lp)

    OutputVal = sum(newlist)
    return OutputVal

df['Val'] = df.apply(lambda row: String2Val(row['code']), axis = 1)

在将此函数应用于表之后，我意识到它效率不高，并且在应用于大型数据集时需要很长时间。我如何优化此过程？

pandas

来源：https://stackoverflow.com/questions/75935256/how-to-efficiently-apply-a-function-to-every-row-in-a-dataframe

5条答案

按热度按时间

xzv2uavs1#

由于pandas的字符串方法没有优化（尽管pandas 2.0似乎不再是这样），如果你追求性能，最好在循环中使用Python字符串方法（这是在C中编译的）。似乎对每个字符串进行直接循环可能会给予最佳性能。

def evaluater(s):
    total, curr = 0, ''
    for e in s:
        # if a number concatenate to the previous number
        if e.isdigit():
            curr += e
        # if a string, look up its value in KeyDict
        # and multiply the currently collected number by it
        # and add to the total
        else:
            total += int(curr) * KeyDict[e]
            curr = ''
    return total

KeyDict = {'M': 1, 'N': 1, 'I': -1, 'S': 0, 'D': 1}
df['val'] = df['code'].map(evaluater)

性能：

KeyDict1 = {'M':'*1+','N':'*1+','I':'*-1+','S':'*0+','D':'*1+'}
df = pd.DataFrame({'code':['100M','60M10N40M','5S99M','1S25I100M','1D1S1I200M']*1000})

%timeit df.assign(val=df['code'].map(evaluater))
# 12.2 ms ± 579 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.assign(val=df['code'].apply(String2Val))    # @Marcelo Paco
# 61.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(val=df['code'].replace(KeyDict1, regex=True).str.rstrip('+').apply(pd.eval))   # @Ynjxsjmh
# 4.86 s ± 155 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

注意：你已经实现了类似的东西，但外部循环（for key, value in KeyDict.items()）是不必要的;因为KeyDict是一个字典，所以将其用作查找表;不要循环。另外，当只有一列相关时，.apply(axis=1)是一种非常糟糕的循环方式。选择该列并调用apply()。

赞(0）回复(0）举报 2023-04-19

q5lcpyga2#

您可以将加法符号添加到KeyDict的值，然后将code列的值替换为KeyDict，最后调用pd.eval进行计算。

KeyDict = {'M':'*1+','N':'*1+','I':'*-1+','S':'*0+','D':'*1+'}

df['val'] = (df['code'].replace(KeyDict, regex=True)
             .str.rstrip('+').apply(pd.eval))
# or you can use native python for loop since Series.apply is not efficient
df['val'] = [pd.eval(val) for val in df['code'].replace(KeyDict, regex=True).str.rstrip('+')]

print(df)

         code  val
0        100M  100
1   60M10N40M  110
2       5S99M   99
3   1S25I100M   75
4  1D1S1I200M  200

赞(0）回复(0）举报 2023-04-19

yyhrrdl83#

您可以尝试以下使用replace()的解决方案：

import pandas as pd

def String2Val(row):
    # Use replace to find an replace characters according to your KeyDict definition
    val = row.replace('M', '*1+').replace('N', '*1+').replace('I', '*-1+').replace('S', '*0+').replace('D', '*1+')
    # Ensure the last part of the string isn't a +
    if val[-1] == "+":
        # If it is, remove the + from the end
        val = val[:-1]
    # Return the evaluated value
    return eval(val)

df = pd.DataFrame({'code':['100M','60M10N40M','5S99M','1S25I100M','1D1S1I200M']})
# Modify it to use apply only on the code column. Which removes the need to use lambda and axis=1
df['Val'] = df['code'].apply(String2Val)

df：

code  Val
0        100M  100
1   60M10N40M  110
2       5S99M   99
3   1S25I100M   75
4  1D1S1I200M  200

赞(0）回复(0）举报 2023-04-19

58wvjzkj4#

另一种可能的解决方案是用相应的乘法因子替换字母，然后用eval计算字符串：

df['val'] = (df['code'].str.replace('M|N|D', '*1+', regex=True)
             .str.replace('I', '*(-1)+', regex=True)
             .str.replace('S', '*0+', regex=True)
             .str.replace(r'\+$', '', regex=True).map(eval))

输出：

code  val
0        100M  100
1   60M10N40M  110
2       5S99M   99
3   1S25I100M   75
4  1D1S1I200M  200

赞(0）回复(0）举报 2023-04-19

bwntbbo35#

谢谢大家。如果这可能对其他人有帮助，我检查了每个人的解决方案在不同大小的数据上的性能，并获得了以下结果。非常有教育意义，在处理大型数据集时应用的伟大技巧。

赞(0）回复(0）举报 2023-04-19

我来回答

pandas 如何有效地将函数应用于 Dataframe 中的每一行

5条答案

相关问题

热门标签

最新问答