scipy: How to write a while-loop function in Python for winsorizing

vmdwslir  posted on 2024-01-09  in Python

I have the following function:

from scipy.stats.mstats import winsorize
import numpy as np   # needed for np.percentile below
import pandas as pd

# winsorize function
def winsor_try1(var, lower, upper):
    '''
    Winsorize var, then report outliers using the IQR rule
    '''
    var = winsorize(var,limits=[lower,upper])
    q1, q3= np.percentile(var, [25, 75])                 # q1,q3 calc
    iqr = q3 - q1                                        # iqr calc
    lower_bound = round(q1 - (1.5 * iqr),3)              # lower bound
    upper_bound = round(q3 + (1.5 * iqr),3)              # upper bound
    outliers = [x for x in var if x < lower_bound or x > upper_bound]  
    print('These would be the outliers:', set(outliers),'\n',
          'Total:', len(outliers),'.Upper bound & Lower bound:', lower_bound,'&',upper_bound)

# the variable 
df = pd.DataFrame({
    'age': [1,1,2,5,5,2,5,4,8,2,5,1,41,2,1,4,4,1,1,4,1,2,15,21,5,1,8,22,1,5,2,5,256,5,6,2,2,8,452]})

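I want to write a while-loop function that applies winsor_try1 to the variable df['age'], starting from lower = .01 and upper = .01, until len(outliers) = 0.
My reasoning: as long as len(outliers) > 0, I want to keep repeating the function until I find the limit at which the number of outliers in the age distribution drops to 0.
The desired output should look like this: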

print('At limit =', i, 'there is no more outliers presented in the age variable.')


where i is the limit at which len(outliers) = 0.

Answer #1 (brvekthn)

You can treat this as a scalar root-finding problem and use scipy.optimize.root_scalar instead of writing the while loop yourself.

import numpy as np
from scipy.stats.mstats import winsorize
from scipy.optimize import root_scalar 

# winsorize function
def winsor_try1(var, lower, upper):
    ''' 
    Compute the number of IQR outliers
    ''' 
    var = winsorize(var,limits=[lower,upper])
    q1, q3= np.percentile(var, [25, 75])                 # q1,q3 calc
    iqr = q3 - q1                                        # iqr calc
    lower_bound = round(q1 - (1.5 * iqr),3)              # lower bound
    upper_bound = round(q3 + (1.5 * iqr),3)              # upper bound
    outliers = [x for x in var if x < lower_bound or x > upper_bound]  
    return len(outliers)

# the variable 
var = np.asarray([1,1,2,5,5,2,5,4,8,2,5,1,41,2,1,4,4,1,1,4,1,2,15,21,5,1,8,22,1,5,2,5,256,5,6,2,2,8,452])

def fun(i):
    # try to find `i` at which there is half an outlier
    # it doesn't exist, but this should get closer to the transition
    return winsor_try1(var, i, i) - 0.5

# root_scalar tries to find the argument `i` that makes `fun` return zero
res = root_scalar(fun, bracket=(0, 0.5))

eps = 1e-6
print(winsor_try1(var, res.root + eps, res.root + eps))  # 0
print(winsor_try1(var, res.root - eps, res.root - eps))  # 6
print(res.root)  # 0.15384615384656308

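There may be better ways to solve this problem, but I tried to answer it in a way that is close to writing a while loop. If you want to know how the while loop itself would work, there are many references on the bisection method and other scalar root-finding algorithms.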
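For anyone who wants the explicit loops, here is a minimal sketch of both ideas: a fixed-step while loop like the one described in the question, and a hand-written bisection. The names `count_iqr_outliers`, `scan_limit`, and `bisect_limit`, and the defaults `step`, `max_limit`, and `tol`, are made up for illustration and are not part of scipy. The bisection also assumes that the outlier count stays at zero once it first reaches zero, which appears to hold for this sample but is not guaranteed for arbitrary data.

```python
import numpy as np
from scipy.stats.mstats import winsorize


def count_iqr_outliers(values, limit):
    """Winsorize both tails by `limit`, then count points outside the 1.5*IQR fences."""
    clipped = np.asarray(winsorize(values, limits=[limit, limit]))
    q1, q3 = np.percentile(clipped, [25, 75])
    iqr = q3 - q1
    lower_bound = round(q1 - 1.5 * iqr, 3)
    upper_bound = round(q3 + 1.5 * iqr, 3)
    return int(np.sum((clipped < lower_bound) | (clipped > upper_bound)))


def scan_limit(values, start=0.01, step=0.01, max_limit=0.5):
    """Fixed-step while loop: raise the limit until no IQR outliers remain."""
    n_steps = 0
    limit = start
    while count_iqr_outliers(values, limit) > 0:
        n_steps += 1
        # Recompute from the step counter to avoid accumulating float error.
        limit = round(start + n_steps * step, 10)
        if limit > max_limit:
            raise ValueError("no outlier-free limit found up to max_limit")
    return limit


def bisect_limit(values, lo=0.0, hi=0.5, tol=1e-6):
    """Bisection: shrink [lo, hi] around the point where the count hits zero.

    Assumes the count is zero at `hi` and stays zero above the transition.
    """
    if count_iqr_outliers(values, hi) > 0:
        raise ValueError("upper end of the bracket still has outliers")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if count_iqr_outliers(values, mid) == 0:
            hi = mid   # transition is at or below mid
        else:
            lo = mid   # transition is above mid
    return hi


age = np.asarray([1,1,2,5,5,2,5,4,8,2,5,1,41,2,1,4,4,1,1,4,1,2,15,21,5,1,8,22,
                  1,5,2,5,256,5,6,2,2,8,452])

i = scan_limit(age)
print('At limit =', i, 'there are no more outliers in the age variable.')
print('Bisection estimate of the transition:', bisect_limit(age))
```

On the sample data the fixed-step scan should stop at 0.16, the first multiple of 0.01 above the transition, while the bisection lands close to the root_scalar result of about 0.1538. The fixed step is simpler to read; the bisection needs fewer evaluations for a given precision.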
