Pandas - Case when & default in pandas

Posted by qyuhtwio on 2023-03-06 in Other

I have the following case statement in Python:

pd_df['difficulty'] = 'Unknown'
pd_df['difficulty'][(pd_df['Time']<30) & (pd_df['Time']>0)] = 'Easy'
pd_df['difficulty'][(pd_df['Time']>=30) & (pd_df['Time']<=60)] = 'Medium'
pd_df['difficulty'][pd_df['Time']>60] = 'Hard'

But when I run the code, it throws an error:

A value is trying to be set on a copy of a slice from a DataFrame

zqdjd7g9 #1

    • Option 1

For performance, use nested np.where conditions. For each condition you can simply use pd.Series.between, and the default value is filled in by the innermost call.

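import numpy as np

# inclusive='neither'/'both' needs pandas 1.3+; older versions took a bool here
pd_df['difficulty'] = np.where(
    pd_df['Time'].between(0, 30, inclusive='neither'),   # 0 < Time < 30
    'Easy',
    np.where(
        pd_df['Time'].between(30, 60, inclusive='both'),  # 30 <= Time <= 60
        'Medium',
        np.where(pd_df['Time'] > 60, 'Hard', 'Unknown')   # 'Unknown' is the default
    )
)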
    • Option 2

Similarly, use np.select, which leaves more room for adding conditions:

pd_df['difficulty'] = np.select(
    [
        pd_df['Time'].between(0, 30, inclusive='neither'),
        pd_df['Time'].between(30, 60, inclusive='both'),
        pd_df['Time'] > 60
    ],
    [
        'Easy',
        'Medium',
        'Hard'
    ],
    default='Unknown'
)
    • Option 3

Another performant solution uses loc:

pd_df['difficulty'] = 'Unknown'
pd_df.loc[pd_df['Time'].between(0, 30, inclusive='neither'), 'difficulty'] = 'Easy'
pd_df.loc[pd_df['Time'].between(30, 60, inclusive='both'), 'difficulty'] = 'Medium'
pd_df.loc[pd_df['Time'] > 60, 'difficulty'] = 'Hard'
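
As a quick sanity check, the three options can be run side by side on a small frame. This is only a sketch: the sample values are hypothetical, and the masks are written with plain comparisons instead of between so that it does not depend on which form of the inclusive argument the installed pandas accepts.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data covering every branch, including the boundaries.
pd_df = pd.DataFrame({'Time': [-5, 0, 10, 30, 45, 60, 90]})
time = pd_df['Time']

# The three conditions from the question, as boolean masks.
is_easy = (time > 0) & (time < 30)
is_medium = (time >= 30) & (time <= 60)
is_hard = time > 60

# Option 1: nested np.where, with 'Unknown' as the innermost fallback.
via_where = np.where(is_easy, 'Easy',
                     np.where(is_medium, 'Medium',
                              np.where(is_hard, 'Hard', 'Unknown')))

# Option 2: np.select, first matching condition wins.
via_select = np.select([is_easy, is_medium, is_hard],
                       ['Easy', 'Medium', 'Hard'],
                       default='Unknown')

# Option 3: default column first, then overwrite matching rows with loc.
via_loc = pd_df.copy()
via_loc['difficulty'] = 'Unknown'
via_loc.loc[is_easy, 'difficulty'] = 'Easy'
via_loc.loc[is_medium, 'difficulty'] = 'Medium'
via_loc.loc[is_hard, 'difficulty'] = 'Hard'

print(via_loc)
```

On this input all three should label the rows Unknown, Unknown, Easy, Medium, Medium, Medium, Hard.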

eqqqjvef #2

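OP's code only needed loc so that [] calls the __setitem__() method correctly. In particular, they had already used parentheses () properly to evaluate the &-chained conditions individually.
The basic idea of this approach is to initialize the column with some default value (e.g. "Unknown") and then update rows depending on the conditions (e.g. "Easy" if 0<Time<30), and so on.
When I timed the options given on this page, the loc approach was the fastest for large frames (4-5 times faster than np.select and nested np.where; see note 1).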

pd_df['difficulty'] = 'Unknown'
pd_df.loc[(pd_df['Time']<30) & (pd_df['Time']>0), 'difficulty'] = 'Easy'
pd_df.loc[(pd_df['Time']>=30) & (pd_df['Time']<=60), 'difficulty'] = 'Medium'
pd_df.loc[pd_df['Time']>60, 'difficulty'] = 'Hard'
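
To see why the original chained form and the loc form differ, it can help to trace which special methods Python calls for each. The sketch below does not use pandas at all. Traced is a hypothetical stand-in class that only records calls, and "mask" is a placeholder string standing in for the boolean mask.

```python
calls = []

class Traced:
    """Hypothetical stand-in that records which dunder methods get called."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, key):
        calls.append(f"{self.name}.__getitem__({key!r})")
        # Indexing hands back a new intermediate object.
        return Traced(f"{self.name}[{key!r}]")

    def __setitem__(self, key, value):
        calls.append(f"{self.name}.__setitem__({key!r}, {value!r})")

# Chained form: __getitem__ first, then __setitem__ on the intermediate object.
df = Traced("df")
df["difficulty"]["mask"] = "Easy"
chained = list(calls)

# loc form: a single __setitem__ that receives both the rows and the column.
calls.clear()
loc = Traced("df.loc")
loc["mask", "difficulty"] = "Easy"
single = list(calls)

print(chained)
print(single)
```

In pandas the intermediate object returned by the first [] may be a copy, so the second step can end up assigning to that copy rather than to the original frame, which is what the message in the question warns about. With loc there is no intermediate step.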

1: Code used for the benchmark.

import numpy as np
import pandas as pd

def loc(pd_df):
    pd_df['difficulty'] = 'Unknown'
    pd_df.loc[(pd_df['Time']<30) & (pd_df['Time']>0), 'difficulty'] = 'Easy'
    pd_df.loc[(pd_df['Time']>=30) & (pd_df['Time']<=60), 'difficulty'] = 'Medium'
    pd_df.loc[pd_df['Time']>60, 'difficulty'] = 'Hard'
    return pd_df

def np_select(pd_df):
    pd_df['difficulty'] = np.select([pd_df['Time'].between(0, 30, inclusive='neither'), pd_df['Time'].between(30, 60, inclusive='both'), pd_df['Time']>60], ['Easy', 'Medium', 'Hard'], 'Unknown')
    return pd_df

def nested_np_where(pd_df):
    pd_df['difficulty'] = np.where(pd_df['Time'].between(0, 30, inclusive='neither'), 'Easy', np.where(pd_df['Time'].between(30, 60, inclusive='both'), 'Medium', np.where(pd_df['Time'] > 60, 'Hard', 'Unknown')))
    return pd_df

df = pd.DataFrame({'Time': np.random.default_rng().choice(120, size=15_000_000)-30})

%timeit loc(df.copy())
# 891 ms ± 6.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np_select(df.copy())
# 3.93 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit nested_np_where(df.copy())
# 4.82 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 10 loops each)
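
%timeit is an IPython magic, so the snippet above only runs in IPython or Jupyter. Outside a notebook, a rough equivalent can be sketched with the standard timeit module. The function names, the fixed seed and the smaller frame size below are assumptions made for illustration, and the conditions are written as plain comparisons, so the timings will not match the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd

def label_loc(df):
    df['difficulty'] = 'Unknown'
    df.loc[(df['Time'] > 0) & (df['Time'] < 30), 'difficulty'] = 'Easy'
    df.loc[(df['Time'] >= 30) & (df['Time'] <= 60), 'difficulty'] = 'Medium'
    df.loc[df['Time'] > 60, 'difficulty'] = 'Hard'
    return df

def label_select(df):
    t = df['Time']
    df['difficulty'] = np.select(
        [(t > 0) & (t < 30), (t >= 30) & (t <= 60), t > 60],
        ['Easy', 'Medium', 'Hard'],
        default='Unknown',
    )
    return df

def label_where(df):
    t = df['Time']
    df['difficulty'] = np.where(
        (t > 0) & (t < 30), 'Easy',
        np.where((t >= 30) & (t <= 60), 'Medium',
                 np.where(t > 60, 'Hard', 'Unknown')))
    return df

# Same value range as the benchmark above (-30 to 89), but a much smaller frame.
rng = np.random.default_rng(0)
df = pd.DataFrame({'Time': rng.integers(-30, 90, size=200_000)})

for func in (label_loc, label_select, label_where):
    # Each run gets a fresh copy; keep the best of three single runs.
    seconds = min(timeit.repeat(lambda: func(df.copy()), number=1, repeat=3))
    print(f"{func.__name__}: {seconds * 1000:.1f} ms")
```

Relative timings may vary with the pandas and NumPy versions and with the frame size, so it is worth measuring on data shaped like your own.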

jucafojl #3

import numpy as np
import pandas as pd

def case_when(*args):
    return np.select(
        condlist = [args[i] for i in range(0, len(args), 2)],
        choicelist = [args[i] for i in range(1, len(args), 2)],
        default=pd.NA
    )

df = pd.DataFrame({"cola":["a","b","a","a","c","d","d","e","c"],
                   "colb":range(9)})

df["newcol"] = case_when(df["cola"] == "a","ap",
                         df["colb"] == 0, "x", # Not taken because it's after the first line
                         df["colb"] == 1, "y",
                         True, df["cola"]
                         )

df["newcolb"] = case_when(df["cola"] == "e",1,
                          df["colb"] == 8, 2
                          )

df

#   cola  colb newcol newcolb
# 0    a     0     ap    <NA>
# 1    b     1      y    <NA>
# 2    a     2     ap    <NA>
# 3    a     3     ap    <NA>
# 4    c     4      c    <NA>
# 5    d     5      d    <NA>
# 6    d     6      d    <NA>
# 7    e     7      e       1
# 8    c     8      c       2

This is for readability's sake, just like in R.
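
Applied to the question's Time column, the same helper might be used as follows. This is a sketch with hypothetical sample values. The helper is repeated so the snippet runs on its own, and a final True condition supplies the 'Unknown' default instead of pd.NA.

```python
import numpy as np
import pandas as pd

def case_when(*args):
    # Same helper as above: even positions are conditions, odd positions are values.
    return np.select(
        condlist=[args[i] for i in range(0, len(args), 2)],
        choicelist=[args[i] for i in range(1, len(args), 2)],
        default=pd.NA,
    )

# Hypothetical sample data for illustration.
pd_df = pd.DataFrame({'Time': [-5, 10, 30, 60, 90]})
t = pd_df['Time']

pd_df['difficulty'] = case_when(
    (t > 0) & (t < 30), 'Easy',
    (t >= 30) & (t <= 60), 'Medium',
    t > 60, 'Hard',
    True, 'Unknown',   # catch-all branch, plays the role of ELSE
)

print(pd_df)
```

Because np.select takes the first matching condition, the catch-all True pair has to come last.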
