I have a DataFrame that looks like this:
| Region | Country | Product | Year | Price |
| --- | --- | --- | --- | --- |
| Africa | South Africa | ABC | 2016 | 500 |
| Africa | South Africa | ABC | 2017 | 400 |
| Africa | South Africa | ABC | 2018 | 15 |
| Africa | South Africa | ABC | 2019 | 450 |
| Africa | Uganda | ABC | 2016 | 750 |
| Africa | Uganda | ABC | 2017 | 670 |
| Africa | Uganda | ABC | 2018 | 1300 |
| Africa | Uganda | ABC | 2019 | 890 |
| Asia | Japan | DEF | 2016 | 500 |
| Asia | Japan | DEF | 2017 | 420 |
| Asia | Japan | DEF | 2018 | 415 |
| Asia | Japan | DEF | 2019 | 0 |
import pandas as pd

data = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
        'Country': ['South Africa','South Africa','South Africa','South Africa','Uganda','Uganda','Uganda','Uganda','Japan','Japan','Japan','Japan'],
        'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
        'Year': [2016, 2017, 2018, 2019, 2016, 2017, 2018, 2019, 2016, 2017, 2018, 2019],
        'Price': [500, 400, 15, 450, 750, 670, 1300, 890, 500, 420, 415, 0]}
df = pd.DataFrame(data)
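I want to calculate the interquartile range (IQR) to identify outliers and extract the index positions of the potential outliers.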
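For reference, the fence calculation itself can be sketched on a single column before any grouping is involved. The helper name `tukey_fences` and the multiplier parameter `k` are illustrative assumptions, not part of the original code; the four prices are the South Africa rows from the table above.

```python
import pandas as pd

def tukey_fences(s: pd.Series, k: float = 1.5):
    """Return the (lower, upper) Tukey fences of a numeric Series.

    k=1.5 gives the inner fences, k=3 the outer fences.
    `tukey_fences` is a hypothetical helper used only for illustration.
    """
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# South Africa prices for 2016-2019, taken from the table above
prices = pd.Series([500, 400, 15, 450])

lower, upper = tukey_fences(prices)  # inner fences
# Index labels of the values lying on or outside the inner fences
outlier_index = prices[(prices <= lower) | (prices >= upper)].index.tolist()
print(lower, upper, outlier_index)
```

With pandas' default linear interpolation, Q1 is 303.75 and Q3 is 462.5, so the inner fences are 65.625 and 700.625, and only the price of 15 (index 2) is flagged.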
I wrote a function, but I am having trouble applying it to the Price column for each combination of the Region and Product columns.
My function is as follows:
def tukeys_method(df, variable, iterable1, iterable2):
    itr1 = df[iterable1].unique() #create list of unique values for iterable 1
    itr2 = df[iterable2].unique() #create list of unique values for iterable 2
    for (i,j) in zip(itr1, itr2):
        #Takes two parameters: dataframe & variable of interest as string
        q1 = df.groupby([iterable1,iterable2])[variable].quantile(0.25) #calculate quantiles
        q3 = df.groupby([iterable1,iterable2])[variable].quantile(0.75) #calculate quantiles
        iqr = q3-q1
        inner_fence = 1.5*iqr
        outer_fence = 3*iqr
        #inner fence lower and upper end
        inner_fence_le = q1-inner_fence
        inner_fence_ue = q3+inner_fence
        #outer fence lower and upper end
        outer_fence_le = q1-outer_fence
        outer_fence_ue = q3+outer_fence
        outliers_prob = []
        outliers_poss = []
        for index, x in enumerate(df.groupby([iterable1,iterable2])[variable]):
            if x <= outer_fence_le or x >= outer_fence_ue:
                outliers_prob.append(index)
        for index, x in enumerate(df.groupby([iterable1,iterable2])[variable]):
            if x <= inner_fence_le or x >= inner_fence_ue:
                outliers_poss.append(index)
    return outliers_prob, outliers_poss

probable_outliers_tm, possible_outliers_tm = tukeys_method(df, "Price",'Region','Product')
Running the function produces the following error:
ValueError: operands could not be broadcast together with shapes (570,) (2,)
Does anyone know what I should do to fix this?
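One possible direction, sketched under assumptions and not taken from the answer below: iterating over a pandas groupby object yields `(key, group)` tuples rather than individual prices, and `q1`/`q3` computed with `groupby(...).quantile(...)` hold one value per group. Comparing such a 2-element tuple against a per-group Series may well be what triggers the broadcast error, with `(570,)` presumably being the number of groups in the full dataset. Note also that `zip(itr1, itr2)` pairs the unique values by position rather than enumerating combinations, and `i` and `j` are never used. The sketch below avoids the loops by using `transform` to broadcast each group's quartiles back onto its rows; the function name `tukeys_method_grouped` and its `group_cols` parameter are hypothetical.

```python
import pandas as pd

data = {'Region': ['Africa'] * 8 + ['Asia'] * 4,
        'Country': ['South Africa'] * 4 + ['Uganda'] * 4 + ['Japan'] * 4,
        'Product': ['ABC'] * 4 + ['XYZ'] * 4 + ['DEF'] * 4,
        'Year': [2016, 2017, 2018, 2019] * 3,
        'Price': [500, 400, 15, 450, 750, 670, 1300, 890, 500, 420, 415, 0]}
df = pd.DataFrame(data)

def tukeys_method_grouped(df, variable, group_cols):
    """Hypothetical per-group variant of the asker's tukeys_method.

    Returns two lists of index labels: probable outliers (on or outside
    the outer fences) and possible outliers (on or outside the inner fences).
    """
    grouped = df.groupby(group_cols)[variable]
    # transform broadcasts each group's scalar quartile back to that
    # group's rows, so q1 and q3 share the index of df
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    values = df[variable]
    # Same <= / >= comparisons as in the original function
    probable = (values <= q1 - 3 * iqr) | (values >= q3 + 3 * iqr)
    possible = (values <= q1 - 1.5 * iqr) | (values >= q3 + 1.5 * iqr)
    return values[probable].index.tolist(), values[possible].index.tolist()

probable_outliers_tm, possible_outliers_tm = tukeys_method_grouped(
    df, 'Price', ['Region', 'Product'])
print(probable_outliers_tm, possible_outliers_tm)
```

On the sample data this flags no probable outliers and two possible ones: index 2 (price 15) and index 11 (price 0). Because the comparisons are inclusive, a group whose prices are all identical has an IQR of zero and would have every row flagged; strict inequalities avoid that if it is not wanted.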
1 Answer
fkaflof61#
I finally figured it out. In case anyone is interested, the solution is as follows: