pandas: Displaying outliers using the any() function

Posted by zysjyyx4 on 2022-11-27 in Other


I created a DataFrame with 5 columns and 500 rows, filled with random integer values, by running the following Python code:

import numpy as np
import pandas as pd
RandomValues = pd.DataFrame(np.random.randint(0, 100, size=(500, 5)),
                 columns=['Name', 'State', 'Age', 'Experience', 'Annual Income'])

The DataFrame looks like this:

     Name  State  Age  Experience  Annual Income
0      85     10   16          56             89
1      94      1   87          61             37
2      51      7   37          18             92
3      15      1   62          72             60
4      84     88    1          43             14
..    ...    ...  ...         ...            ...
495    66     33   67          84              7
496    81      2   55          87             59
497    38     50   40          77             36
498    68     45   37          55             90
499    13     82   84          98             35

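I am using the standard deviation to find outliers in the 'Annual Income' column: any value more than three standard deviations from the mean counts as an outlier.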

upper_limit = RandomValues['Annual Income'].mean() + 3 * RandomValues['Annual Income'].std()
lower_limit = RandomValues['Annual Income'].mean() - 3 * RandomValues['Annual Income'].std()

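How can I use the any() method to find the outliers in the 'Annual Income' column of the RandomValues DataFrame? Thanks for your help.

I tried using the where() method, along with the Python code below, but it did not resolve the issue:

highOutliers = RandomValues['Annual Income'] > upper_limit
lowOutliers = RandomValues['Annual Income'] < lower_limit

print(highOutliers)
print(lowOutliers)

Second, I tried the following, but got an empty Series for each: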

highOutliers = RandomValues.loc[RandomValues['Annual Income'] > upper_limit, 'Annual Income']
lowOutliers = RandomValues.loc[RandomValues['Annual Income'] < lower_limit, 'Annual Income']

print(highOutliers)
print(lowOutliers)

Output:
Series([], Name: Annual Income, dtype: int64)
Series([], Name: Annual Income, dtype: int64)

pandas

Source: https://stackoverflow.com/questions/74512328/displaying-outliers-using-the-any-function

1 Answer


toiithl6 #1

When you make a comparison like this, you create a boolean Series with the same shape as the 'Annual Income' column, but containing True/False values:

highOutliers_locations = RandomValues['Annual Income'] > upper_limit
lowOutliers_locations = RandomValues['Annual Income'] < lower_limit

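This is a useful step toward finding the outliers, but you have not yet subset the data.
To actually subset the DataFrame down to only those outliers, use indexing, for example .loc: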

highOutliers = RandomValues.loc[highOutliers_locations, 'Annual Income']
lowOutliers = RandomValues.loc[lowOutliers_locations, 'Annual Income']

Or, in one step:

highOutliers = RandomValues.loc[
    RandomValues['Annual Income'] > upper_limit, 'Annual Income'
]
lowOutliers = RandomValues.loc[
    RandomValues['Annual Income'] < lower_limit, 'Annual Income'
]
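
Since the question asks about any() specifically: calling any() on a boolean Series like the ones above reduces it to a single True/False that says whether at least one value falls outside the limits. It does not return the outliers themselves, so it is best used as a check before subsetting. The following is a minimal sketch, not part of the original answer. It assumes a fixed seed and one deliberately planted extreme value (1000) so that there is something to detect. The names `income`, `is_outlier`, `has_outliers` and `outliers` are illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed (not in the original) so the run is repeatable.
rng = np.random.default_rng(0)
RandomValues = pd.DataFrame(
    rng.integers(0, 100, size=(500, 5)),
    columns=['Name', 'State', 'Age', 'Experience', 'Annual Income'],
)
# Assumption: plant one extreme value so the mask has something to flag.
RandomValues.loc[0, 'Annual Income'] = 1000

income = RandomValues['Annual Income']
upper_limit = income.mean() + 3 * income.std()
lower_limit = income.mean() - 3 * income.std()

# Boolean mask: True where a value lies outside either limit.
is_outlier = (income > upper_limit) | (income < lower_limit)

# any() collapses the mask to one bool: is there at least one outlier?
has_outliers = bool(is_outlier.any())

# Subset with the mask; an empty Series results when nothing is flagged.
outliers = income[is_outlier]

print(has_outliers)
print(outliers)
```

Combining the two comparisons with `|` keeps the high and low outliers in one mask. Keeping them separate, as in the snippets above, works the same way with any() called on each mask.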

For more information and examples, see the pandas guide on indexing and selecting data.
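
As for the empty Series in the question: that is most likely the correct result and not an indexing problem. Integers drawn uniformly from 0 to 99 have a mean of about 49.5 and a standard deviation of about 28.9, so the three-sigma limits land near -37 and 136, outside the range the data can take. The sketch below, which assumes a fixed seed and uses a standalone `income` Series for brevity, checks this directly:

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed (not in the original) so the run is repeatable.
rng = np.random.default_rng(0)
income = pd.Series(rng.integers(0, 100, size=500), name='Annual Income')

upper_limit = income.mean() + 3 * income.std()
lower_limit = income.mean() - 3 * income.std()

# The data can only span 0..99, but the limits sit outside that range.
print(income.min(), income.max())
print(lower_limit, upper_limit)

# Neither comparison is True anywhere, so any() is False for both masks
# and a .loc selection with either mask returns an empty Series.
print((income > upper_limit).any(), (income < lower_limit).any())
```

With real income data, which is typically skewed rather than uniform, the same limits would be expected to flag some rows.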

2022-11-27
