Python: replace outliers with the mean value

Posted by d7v8vwbk on 2023-11-15 in Python


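I have the function below, which removes outliers, but I would like to replace them with the mean value of the same column instead.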

def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    # Keep only the rows that lie strictly between the two fences
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
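To make the fence arithmetic concrete, here is a small worked sketch on made-up data (the series `s` is illustrative and not part of the question). It assumes pandas' default linear interpolation for `quantile`.

```python
import pandas as pd

# Illustrative data: nine ordinary values and one obvious outlier
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = s.quantile(0.25)          # 3.25
q3 = s.quantile(0.75)          # 7.75
iqr = q3 - q1                  # 4.5
fence_low = q1 - 1.5 * iqr     # -3.5
fence_high = q3 + 1.5 * iqr    # 14.5

# Values on or outside the fences are treated as outliers,
# matching the strict comparisons used in remove_outlier
outliers = s[(s <= fence_low) | (s >= fence_high)]
print(outliers.tolist())       # [100]
```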


Source: https://stackoverflow.com/questions/65961241/replace-outlier-with-mean-value

2 Answers

jucafojl1#

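Let's try this. Identify the outliers based on your criteria, then assign to them the mean of the column, computed only from the records that are not outliers.
Some test data: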

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': range(10), 'b': np.random.randn(10)})
# These will be our two outlier points
df.iloc[0] = -5
df.iloc[9] = 5
>>> df
   a         b
0 -5 -5.000000
1  1  1.375111
2  2 -1.004325
3  3 -1.326068
4  4  1.689807
5  5 -0.181405
6  6 -1.016909
7  7 -0.039639
8  8 -0.344721
9  5  5.000000
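def replace_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.copy()  # copy the argument rather than the global df
    # inclusive='neither' replaces inclusive=False, which pandas 2.0 no longer accepts
    outliers = ~df_out[col_name].between(fence_low, fence_high, inclusive='neither')
    # Overwrite the outliers with the mean of the remaining values
    df_out.loc[outliers, col_name] = df_out.loc[~outliers, col_name].mean()
    return df_out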
>>> replace_outlier(df, 'b')
   a         b
0 -5 -0.106019
1  1  1.375111
2  2 -1.004325
3  3 -1.326068
4  4  1.689807
5  5 -0.181405
6  6 -1.016909
7  7 -0.039639
8  8 -0.344721
9  5 -0.106019

We can check that the fill value equals the mean of all the other values in the column:

>>> df.iloc[1:9]['b'].mean()
-0.10601866399896176
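As an alternative to assigning through `.loc`, the same replacement can be sketched with `Series.mask`, which returns a new Series and leaves the input untouched. The function name `replace_outliers_with_mean`, the parameter `k`, and the sample data are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

def replace_outliers_with_mean(s, k=1.5):
    """Return a copy of `s` with IQR outliers replaced by the mean of the inliers."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    # Values on or outside the fences count as outliers
    is_outlier = (s <= q1 - k * iqr) | (s >= q3 + k * iqr)
    # mask() replaces entries where the condition is True
    return s.mask(is_outlier, s[~is_outlier].mean())

# Illustrative float data with one obvious outlier
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
result = replace_outliers_with_mean(values)
print(result.iloc[-1])  # 5.0, the mean of 1.0 through 9.0
```

To apply it to a DataFrame column, one could write something like `df['b'] = replace_outliers_with_mean(df['b'])`.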


qmb5sa222#

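Nice function! However, when I passed in my arguments and ran it, the following warning was raised at df_out.loc[outliers, col_name] = df_out.loc[~outliers, col_name].mean():
"FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas."
I simply stored the mean in a new variable, ave, and assigned that to df_out.loc[outliers, col_name]; after that it worked.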

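def replace_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.copy()  # copy the argument rather than the global df
    # inclusive='neither' replaces inclusive=False, which pandas 2.0 no longer accepts
    outliers = ~df_out[col_name].between(fence_low, fence_high, inclusive='neither')
    # Compute the mean of the non-outliers first, then assign it
    ave = df_out.loc[~outliers, col_name].mean()
    df_out.loc[outliers, col_name] = ave
    return df_out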

My pandas version is 2.1.0.
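As far as I can tell, this FutureWarning usually shows up when the column has an integer dtype and the mean being written into it is a non-integer float. If that is the cause, one possible workaround is to cast the column to float before assigning. The sketch below assumes pandas 1.3 or later (for inclusive='neither'); the name `replace_outlier_float` and the demo data are hypothetical and only for illustration.

```python
import pandas as pd

def replace_outlier_float(df_in, col_name):
    df_out = df_in.copy()
    # Cast to float first so a fractional mean fits the column dtype
    df_out[col_name] = df_out[col_name].astype(float)
    q1 = df_out[col_name].quantile(0.25)
    q3 = df_out[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    outliers = ~df_out[col_name].between(fence_low, fence_high, inclusive='neither')
    df_out.loc[outliers, col_name] = df_out.loc[~outliers, col_name].mean()
    return df_out

# Illustrative integer column whose inlier mean (46/9) is not a whole number
df_demo = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 10, 100]})
cleaned = replace_outlier_float(df_demo, 'a')
print(cleaned['a'].dtype)  # float64
```

The trade-off is that the returned column is always float, even when the input was integer and contained no outliers.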
