How to apply a function while filtering with the loc method in pandas?

Posted by ibrsph3r on 2023-03-28 in Other


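I have two pandas DataFrames, A and B, and I want to update a column of A when certain conditions for a row match in B. One of those conditions should apply a function named "similar" to the current row, as shown below: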

import re
from difflib import SequenceMatcher

def similar(a, b):
    match_ratio = SequenceMatcher(None, a, b).ratio()
    if match_ratio > 0.6:
        return True
    else:
        return False

def updateLabel(repo_name, str_a):

    str_to_check = re.sub('[^a-zA-Z0-9]+', '', str_a)        
    data = B.loc[(B['repo_name'] == repo_name) & similar(B["sanitized_str_b"], str_to_check)]
    
    if len(data) > 0:
        return "TP"
    
    return "FP"
    

A["label"] = A[["repo_name", "str_a"]].apply(lambda x: updateLabel(x.repo_name, x.str_a), axis = 1)

However, it throws an error. I then tried the following, but it is very slow.

def updateLabel(repo_name, str_a):
    
    str_to_check = re.sub('[^a-zA-Z0-9]+', '', str_a)
    
    def similar(a):
        match_ratio = SequenceMatcher(None, a, str_to_check).ratio()
        if match_ratio > 0.6:
            return True
        else:
            return False
    
    data = B.loc[(B['repo_name'] == repo_name)]
    data = data.loc[data.sanitized_str_b.apply(similar)]
    
    if len(data) > 0:
        return "TP"
    
    return "FP"
    

A["label"] = A[["repo_name", "str_a"]].apply(lambda x: updateLabel(x.repo_name, x.str_a), axis = 1)

What is the correct way in pandas to filter a DataFrame on multiple conditions when one of them calls a function?

pandas

Source: https://stackoverflow.com/questions/75808155/how-to-apply-a-function-while-filtering-using-loc-method-in-pandas

2 Answers


af7jpaap1#

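It would be easier to offer optimization hints if you included a minimal reproducible example. Still, I can see one way to speed this up: if you only check whether the length of data is greater than 0, you don't need to filter data with apply(similar). Use any, which stops at the first match: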

def updateLabel(repo_name, str_a):
    
    str_to_check = re.sub('[^a-zA-Z0-9]+', '', str_a)
    
    def similar(a):
        match_ratio = SequenceMatcher(None, a, str_to_check).ratio()
        if match_ratio > 0.6:
            return True
        else:
            return False
    
    data = B.loc[(B['repo_name'] == repo_name)]
    
    if any(similar(x) for x in data.sanitized_str_b):
        return "TP"
    
    return "FP"
    

A["label"] = A[["repo_name", "str_a"]].apply(lambda x: updateLabel(x.repo_name, x.str_a), axis = 1)
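One further speed-up in the same spirit, offered as a sketch rather than as part of the original answer: difflib's SequenceMatcher provides real_quick_ratio() and quick_ratio(), which are cheaper upper bounds on ratio(). Checking them first lets most non-matching pairs be rejected without the full comparison. The helper name any_similar and the threshold parameter are hypothetical names introduced here for illustration.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.6):
    """Return True if the SequenceMatcher ratio of a and b exceeds threshold."""
    matcher = SequenceMatcher(None, a, b)
    # real_quick_ratio() and quick_ratio() are upper bounds on ratio(),
    # so if either is not above the threshold, ratio() cannot be either.
    return (
        matcher.real_quick_ratio() > threshold
        and matcher.quick_ratio() > threshold
        and matcher.ratio() > threshold
    )


def any_similar(candidates, target, threshold=0.6):
    """Hypothetical helper: True as soon as one candidate is similar to target."""
    return any(similar(c, target, threshold) for c in candidates)
```

Inside updateLabel, the final check could then read `any_similar(data.sanitized_str_b, str_to_check)`, assuming the same B and column names as in the question.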

2023-03-28

4nkexdtk2#

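It looks like you want to "fuzzy match" the str columns.
There are faster options than SequenceMatcher, such as RapidFuzz; there are various pandas/rapidfuzz examples on Stack Overflow.
Since you also want an equality match on repo_name, you could use something that supports both operations natively, such as duckdb: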

import duckdb
import pandas as pd

A = pd.DataFrame({"repo_name": [1, 2, 3, 4], "str_a": ["foo", "bar", "baz", "omg"]}).reset_index()
B = pd.DataFrame({"repo_name": [2, 3, 4, 2, 3], "sanitized_str_b": ["car", "daz", "gmo", "far", "ggg"]}).reset_index()

>>> A
   index  repo_name str_a
0      0          1   foo
1      1          2   bar
2      2          3   baz
3      3          4   omg

>>> B
   index  repo_name sanitized_str_b
0      0          2             car
1      1          3             daz
2      2          4             gmo
3      3          2             far
4      4          3             ggg

duckdb can read from pandas DataFrames:

duckdb.sql("""
from A left join B
on 
   A.repo_name = B.repo_name
   and
   jaro_winkler_similarity(str_a, sanitized_str_b) > .6
select 
   distinct on (A.repo_name) *
order by A.index
""")

The TP/FP result can then be assigned based on the NULL values.

┌───────┬───────────┬─────────┬───────┬───────────┬─────────────────┐
│ index │ repo_name │  str_a  │ index │ repo_name │ sanitized_str_b │
│ int64 │   int64   │ varchar │ int64 │   int64   │     varchar     │
├───────┼───────────┼─────────┼───────┼───────────┼─────────────────┤
│     0 │         1 │ foo     │  NULL │      NULL │ NULL            │
│     1 │         2 │ bar     │     0 │         2 │ car             │
│     2 │         3 │ baz     │     1 │         3 │ daz             │
│     3 │         4 │ omg     │  NULL │      NULL │ NULL            │
└───────┴───────────┴─────────┴───────┴───────────┴─────────────────┘

jaro_winkler_similarity() is used as the scorer.
.df() can be used to convert the result to a pandas DataFrame.
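For comparison, the same "equality match on repo_name plus fuzzy match on the strings" idea can be sketched in pandas alone. This swaps duckdb's join for DataFrame.merge and its jaro_winkler_similarity scorer for the question's SequenceMatcher with the 0.6 threshold, so scores can differ from the duckdb query on other data. It also assumes str_a is already sanitized. The column name "match" and the variable names pairs and matched_idx are illustrative.

```python
from difflib import SequenceMatcher

import pandas as pd

A = pd.DataFrame({"repo_name": [1, 2, 3, 4], "str_a": ["foo", "bar", "baz", "omg"]})
B = pd.DataFrame({"repo_name": [2, 3, 4, 2, 3],
                  "sanitized_str_b": ["car", "daz", "gmo", "far", "ggg"]})

# Equality match: pair each row of A with the rows of B that share its repo_name.
# reset_index() keeps A's row label in a column named "index".
pairs = A.reset_index().merge(B, on="repo_name", how="inner")

# Fuzzy match: one SequenceMatcher call per candidate pair.
pairs["match"] = [
    SequenceMatcher(None, b, a).ratio() > 0.6
    for a, b in zip(pairs["str_a"], pairs["sanitized_str_b"])
]

# A row of A is "TP" if at least one of its pairs matched, otherwise "FP".
matched_idx = pairs.loc[pairs["match"], "index"].unique()
A["label"] = "FP"
A.loc[matched_idx, "label"] = "TP"
```

On this sample data the rows for "bar" and "baz" are labelled "TP", the same rows that have non-NULL matches in the duckdb output above. The merge materializes every same-repo pair, so memory grows with the number of candidate pairs per repo_name.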

2023-03-28
