scipy: Apply a function to a DataFrame column based on combinations of values in other columns

13z8s7eq  posted on 2022-11-10  in  Other

I have a DataFrame that looks like this:
| Region | Country | Product | Year | Price |
| --- | --- | --- | --- | --- |
| Africa | South Africa | ABC | 2016 | 500 |
| Africa | South Africa | ABC | 2017 | 400 |
| Africa | South Africa | ABC | 2018 | 15 |
| Africa | South Africa | ABC | 2019 | 450 |
| Africa | Uganda | ABC | 2016 | 750 |
| Africa | Uganda | ABC | 2017 | 670 |
| Africa | Uganda | ABC | 2018 | 1300 |
| Africa | Uganda | ABC | 2019 | 890 |
| Asia | Japan | DEF | 2016 | 500 |
| Asia | Japan | DEF | 2017 | 420 |
| Asia | Japan | DEF | 2018 | 415 |
| Asia | Japan | DEF | 2019 | 0 |

import pandas as pd

data = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','Uganda','Uganda','Uganda','Uganda','Japan','Japan','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
         'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
         'Price': [500, 400, 15,450,750,670,1300,890,500,420,415,0]}
df = pd.DataFrame(data)

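I want to compute the interquartile range (IQR) to identify outliers and extract the index positions of the potential outliers.
I wrote a function, but I am having trouble applying it to the Price column based on combinations of the Region and Product columns.
My function is as follows: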

def tukeys_method(df, variable, iterable1, iterable2):
    itr1 = df[iterable1].unique() #create list of unique values for iterable 1
    itr2 = df[iterable2].unique() #create list of unique values for iterable 2
    for (i,j) in zip(itr1, itr2):

        #Takes two parameters: dataframe & variable of interest as string
        q1 = df.groupby([iterable1,iterable2])[variable].quantile(0.25) #calculate quantiles
        q3 = df.groupby([iterable1,iterable2])[variable].quantile(0.75) #calculate quantiles
        iqr = q3-q1
        inner_fence = 1.5*iqr
        outer_fence = 3*iqr

        #inner fence lower and upper end
        inner_fence_le = q1-inner_fence
        inner_fence_ue = q3+inner_fence

        #outer fence lower and upper end
        outer_fence_le = q1-outer_fence
        outer_fence_ue = q3+outer_fence

        outliers_prob = []
        outliers_poss = []
        for index, x in enumerate(df.groupby([iterable1,iterable2])[variable]):
            if x <= outer_fence_le or x >= outer_fence_ue:
                outliers_prob.append(index)
        for index, x in enumerate(df.groupby([iterable1,iterable2])[variable]):
            if x <= inner_fence_le or x >= inner_fence_ue:
                outliers_poss.append(index)
        return outliers_prob, outliers_poss

probable_outliers_tm, possible_outliers_tm = tukeys_method(df, "Price",'Region','Product')

Running the function produces the following error:

ValueError: operands could not be broadcast together with shapes (570,) (2,)

Does anyone know what I can do to fix this?
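
For context, a likely cause of the broadcasting error is that `quantile()` called on a groupby returns one value per group rather than a scalar, while iterating over a groupby yields `(key, values)` pairs rather than individual prices. The sketch below rebuilds only the three relevant columns of the sample data to show both shapes. The shapes in the error message (570 and 2) presumably come from the asker's full dataset, which is not shown here.

```python
import pandas as pd

# Reduced copy of the sample data from the question (only the columns needed here)
df = pd.DataFrame({
    "Region": ["Africa"] * 8 + ["Asia"] * 4,
    "Product": ["ABC"] * 4 + ["XYZ"] * 4 + ["DEF"] * 4,
    "Price": [500, 400, 15, 450, 750, 670, 1300, 890, 500, 420, 415, 0],
})

grouped = df.groupby(["Region", "Product"])["Price"]

# quantile() on a groupby returns a Series with one entry per group, not a scalar
q1 = grouped.quantile(0.25)
print(type(q1).__name__, q1.shape)

# Iterating over a groupby yields (group_key, group_values) pairs,
# so `x` in the question's loop is a 2-tuple, not a single price
first_item = next(iter(grouped))
group_key, group_values = first_item
print(group_key, len(group_values))
```

Comparing such a 2-tuple against a Series of per-group fences is what would make NumPy try to broadcast a length-2 operand against one value per group.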

fkaflof6  1#

I finally figured it out. In case anyone is interested, the solution is as follows:


# Identify outliers using Tukey's method.

import itertools

def outliers_tukey(df, variable, iterable1, iterable2):
    # df: DataFrame; variable: column to test; iterable1/iterable2: grouping columns
    outliers_prob = []
    outliers_poss = []
    for (i, j) in itertools.product(df[iterable1].unique(), df[iterable2].unique()):

        # Values of `variable` for this (iterable1, iterable2) combination only
        group = df.loc[(df[iterable1] == i) & (df[iterable2] == j), variable]
        if group.empty:
            continue  # this combination does not occur in the data

        q1 = group.quantile(0.25)
        q3 = group.quantile(0.75)
        iqr = q3 - q1
        inner_fence = 1.5 * iqr
        outer_fence = 3 * iqr

        # inner fence lower and upper end
        inner_fence_le = q1 - inner_fence
        inner_fence_ue = q3 + inner_fence

        # outer fence lower and upper end
        outer_fence_le = q1 - outer_fence
        outer_fence_ue = q3 + outer_fence

        # Compare each value with the fences of its own group;
        # `index` is the row's index label in df
        for index, x in group.items():
            if x <= outer_fence_le or x >= outer_fence_ue:
                outliers_prob.append(index)
            if x <= inner_fence_le or x >= inner_fence_ue:
                outliers_poss.append(index)

    # Return only after every combination has been processed
    return outliers_prob, outliers_poss

probable_outliers_tm, possible_outliers_tm = outliers_tukey(df, "Price", 'Region', 'Product')
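
As a possible alternative to looping over every Region/Product combination, the per-group fences can be computed in one pass with `groupby(...).transform`, which broadcasts each group's quartile back onto its own rows. This is only a sketch under a few assumptions: the function name `tukey_outliers_by_group` and its `group_cols` parameter are hypothetical, the comparisons are inclusive as in the code above, and the returned values are index labels of `df` (equal to row positions only for a default index).

```python
import pandas as pd

def tukey_outliers_by_group(df, variable, group_cols):
    """Return (probable, possible) outlier index labels using per-group Tukey fences."""
    grouped = df.groupby(group_cols)[variable]

    # transform() repeats each group's quartile on every row of that group,
    # so q1, q3 and iqr are aligned row by row with df[variable]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1

    values = df[variable]
    # Inner fences (1.5 * IQR) mark possible outliers, outer fences (3 * IQR) probable ones
    possible = (values <= q1 - 1.5 * iqr) | (values >= q3 + 1.5 * iqr)
    probable = (values <= q1 - 3.0 * iqr) | (values >= q3 + 3.0 * iqr)
    return df.index[probable].tolist(), df.index[possible].tolist()

# Sample data as defined in the question's code
df = pd.DataFrame({
    "Region": ["Africa"] * 8 + ["Asia"] * 4,
    "Country": ["South Africa"] * 4 + ["Uganda"] * 4 + ["Japan"] * 4,
    "Product": ["ABC"] * 4 + ["XYZ"] * 4 + ["DEF"] * 4,
    "Year": [2016, 2017, 2018, 2019] * 3,
    "Price": [500, 400, 15, 450, 750, 670, 1300, 890, 500, 420, 415, 0],
})

probable, possible = tukey_outliers_by_group(df, "Price", ["Region", "Product"])
print(probable, possible)
```

With this sample data, the prices 15 (row 2) and 0 (row 11) fall below their groups' inner fences, so they should be reported as possible outliers, and no row lies beyond an outer fence.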
